What’s That? Problems with Automated Image Detection


Technology companies such as Google and Microsoft have developed software algorithms that recognize images and automatically generate descriptions for them. Computer-generated descriptions are improving, but human-generated image descriptions are still the gold standard for creating accessible content. People know their own content, and they can determine the essential details about photographs that are needed when writing descriptions of the images.

The software algorithms compare images against a large dataset of photos. In this post, I discuss two instances of image recognition in action:

• Identifying images for blind people

• Cataloging photographs in a large collection of images produced by a museum

In the first case, screen readers that provide voice output for blind people identify images that aren’t associated with a human-generated label. The descriptions are simplistic, such as “photo may contain person, tree”. Facebook identifies photos of my friend Joe as “glasses”. The automated image detection can’t describe his other physical characteristics.

People can easily describe Joe with more words than “glasses”. They might refer to his blond hair, or to the color of his shirt, or describe the scene in the background of the picture.

“Alt text,” short for alternative text, is an HTML attribute that labels an image with a description. Alt text can be read by the screen reader (voice output) software used by people who are blind or have low vision.
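To make this concrete, here is a minimal sketch in Python that scans a fragment of HTML and lists the images that have no alt text. It uses only the standard library's html.parser module. The sample markup, the file names, and the names MissingAltFinder and find_images_missing_alt are made up for illustration; they are not taken from any real website or tool. Note that an empty alt attribute is sometimes used on purpose for purely decorative images, so the results are a list of images to review, not a list of errors.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Collects the src of every <img> whose alt text is absent or empty."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        alt = attributes.get("alt")
        # alt is None when the attribute is absent or was written without a value.
        if alt is None or not alt.strip():
            self.missing.append(attributes.get("src") or "(no src)")


def find_images_missing_alt(html):
    """Return the src values of images that still need a human review."""
    finder = MissingAltFinder()
    finder.feed(html)
    finder.close()
    return finder.missing


# Hypothetical page fragment; the file names are assumptions for this example.
SAMPLE_HTML = """
<img src="cannon.jpg" alt="A deep green colored bronze cannon sits on a wooden carriage with two small wooden wheels.">
<img src="joe.jpg">
<img src="points.jpg" alt="">
"""

if __name__ == "__main__":
    for src in find_images_missing_alt(SAMPLE_HTML):
        print("Needs a description:", src)
```

A check like this can only tell you that a description is missing. It cannot tell you whether an existing description is accurate, which still takes a person who knows the content.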

Despite the ongoing efforts of blind people and our allies, there are millions of undescribed images on the web and in social media posts. Companies like Google and Microsoft advertise automated image detection as a temporary solution to this problem.

In the second case, specialized software algorithms are used to identify images in museum collections. To be effective, the software algorithms require a custom-made training set of images that show items found in the museum’s collection.

As the example of Joe’s photos shows, automated image detection is not accurate for image description. It is even less accurate for complex images containing unique objects.

Here are the automated image descriptions and the human-generated captions for two photos. One photo shows 3D-printed replicas in a museum exhibit, and the other shows a cannon at a historic fort.

13 3D-printed replica stone points, with 3D-printed replicas of projectile points attached by lanyard to QR codes. QR code coins.

The first photo shows 3D-printed replicas of stone projectile points (spear tips and arrowheads) that are found in the collections of the Maryland Archaeological and Conservation Laboratory. The replica projectile points are triangular in shape, and they vary in length.

When I added the photo to a Microsoft Word document, the algorithm described it as a mountain range. It focused on the triangular shape of the projectile points and the variation in length.

The alt text that we wrote describes a set of 3D-printed replicas of stone projectile points.

Different software algorithms may produce different descriptions of the same image. I will use this photo of a cannon as an example, and then I will compare the automated descriptions with the accurate human-generated description.

bronze cannon at Fort Ticonderoga Museum

In Google Chrome, the automated image description for this cannon is “green and brown metal pipe”. The Google algorithm recognized that the barrel of the cannon is hollow, but it identified the object as a pipe.

The Microsoft algorithm described the same image as “a close-up of a sword”. It did not recognize that the cannon’s barrel is hollow.

The algorithms developed by Google and Microsoft focused on the long metal shape in the photo, but they ignored the cannon’s wheels.

The algorithm developed by Apple described the photo as “sports equipment”. It correctly identified the parts of the cannon, namely the wheels and the long metal component, but it could not classify the metal component as a barrel.

Here is the human-generated caption for this photo: “A deep green colored bronze cannon sits on a wooden carriage with two small wooden wheels.” Photo credit: Fort Ticonderoga Museum Collection.

The algorithms developed by Google, Microsoft, and Apple are designed for general use. They are built to identify common objects that occur in photos, such as people, clothing, and trees. It is not surprising that commercially available software algorithms cannot recognize images that contain unique objects like a historic cannon.

In conclusion, always check automated image descriptions, or better yet, follow best practices for accessibility and write accurate descriptions (alt text) for images before publishing them.

Further Reading

Adding alt text to images is one of the 10 Best Practices of Accessible Museum Websites  

We wrote this article about 3D-printed replicas.

Designing a portable museum display of Native American stone projectile points (arrowheads) to ensure accessibility and tactile quality.

This article discusses case studies of image recognition from art and science museums.

This is not an apple! Benefits and challenges of applying computer vision to museum collections

Here are some posts from technology companies about automated image description. These links are being provided for informational purposes only; they do not constitute an endorsement of products by MuseumSenses.

October 11, 2019.

Using AI to give people who are blind the “full picture”

October 14, 2020.

What’s that? Microsoft’s latest breakthrough, now in Azure AI, describes images as well as people do

January 19, 2021.

How Facebook is using AI to improve photo descriptions for people who are blind or visually impaired

March 14, 2022.

Look up what’s in a photo with your iPhone or iPad