Using GPT to generate descriptions of photos based on photo analysis

So I had this idea: ChatGPT seems to understand code fairly well, and it seems to understand the concepts of content and context. It also seems to be pretty good at generating a description or story based on very little input data. So with that in mind I decided to see what would happen if I took the output data from one of the many photo analysis tools that are out there and let ChatGPT generate a description of the photo.

It sounds like a really good idea, doesn’t it? Take a photo of something, let some online service analyse it and provide me with a JSON containing the objects and persons in the photo, and let ChatGPT massage that JSON into a description of the photo. For the visually impaired this would be awesome.

Great idea, worth testing, and so I did. There are several APIs available out there for photo analysis, including services from Google, Microsoft, Amazon, api4ai, Clarifai and SentiSight.ai. To make it easy for myself I used Eden AI’s playground, where I can test one photo with all those services at once, both to save some time and to get an idea of their capabilities.

As it turned out they were all pretty lousy, but in the end Amazon’s service seemed to have the least lousy average track record. I’m actually quite surprised how badly they all performed, and there is obvious room for improvement.

Once I had my JSON I created my prompt:

Don't mention the json or what anything is based on. This json describes persons and items and properties in a photo: Based on the json describe the persons in the picture and other items and where they are located in the picture. Describe the weather and location based on the label properties in the json. Use the coordinates in the json to determine where the different objects are located in the photograph and how big they are. if a person or object is unusually small compared to other objects in the photo then it is probably in the background and if unusually large it is probably in the foreground. When you describe where an object is located describe it like in this example "the ball is located in the lower left region of the photo". If you find properties that describes a probably scenario such as beach and summer you might draw the conclusion that the photo is shot at the beach. properties with x and y values set to null are not items in the picture they describe the general area and surroundings.
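The region wording that the prompt asks for can also be computed directly from the coordinates. A minimal sketch, assuming each item is a dict with label, x_min, x_max, y_min and y_max, normalized to 0..1 with the origin in the top-left corner. The field names follow the ones quoted in ChatGPT's answer further down; the origin convention, the sample data and the helper names are assumptions for illustration:

```python
def region_of(item):
    """Return a phrase such as 'lower left' for a normalized bounding box."""
    # Centre of the box; coordinates are assumed to run 0..1 from the top-left corner.
    cx = (item["x_min"] + item["x_max"]) / 2
    cy = (item["y_min"] + item["y_max"]) / 2
    horizontal = "left" if cx < 1 / 3 else "right" if cx > 2 / 3 else "center"
    vertical = "upper" if cy < 1 / 3 else "lower" if cy > 2 / 3 else "middle"
    if horizontal == "center" and vertical == "middle":
        return "center"
    return f"{vertical} {horizontal}"


def split_items(items):
    """Separate located objects from scene labels whose coordinates are null."""
    located = [i for i in items if i.get("x_min") is not None]
    scene = [i["label"] for i in items if i.get("x_min") is None]
    return located, scene


# Hypothetical sample data, not taken from any real service response.
items = [
    {"label": "Ball", "x_min": 0.05, "x_max": 0.25, "y_min": 0.70, "y_max": 0.95},
    {"label": "Beach", "x_min": None, "x_max": None, "y_min": None, "y_max": None},
]
located, scene = split_items(items)
for item in located:
    print(f"the {item['label'].lower()} is located in the {region_of(item)} region of the photo")
print("scene labels:", scene)
```

Splitting the frame into thirds is just one possible choice; any grid would do as long as the wording stays consistent.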

There is certainly room for improvement, but it is a start. So I let ChatGPT massage it into text, and I was again disappointed:

Based on the JSON, we can see that there are two main objects in the photo: a person and an animal. The person takes up the entire photo as their x_min and y_min values are 0, and their x_max and y_max values are 1. This indicates that the person is front and center in the photo and is likely the main subject of the photo.

The other object in the photo is an animal, which is described as a "Wildlife" and "Amphibian", specifically a "Salamander". The confidence score for this label is 0.827, which indicates a moderate level of certainty in the label.

There are also several labels that describe body parts, such as "Body Part", "Finger", and "Hand". These labels do not provide any additional information about the location or context of the photo.

The JSON does not provide any information about the location or weather in the photo. There are some properties with x and y values set to null, which describe the general area and surroundings but do not provide any specific information.

Since there is only one person in the photo and they take up the entire frame, it is difficult to determine the location or context of the photo beyond the presence of the animal. The salamander could be a pet or could be found in the wild.

The JSON does not provide any information about the size of the objects in the photo, as the x_min, x_max, y_min, and y_max values for most labels are null. Therefore, we cannot determine the relative sizes or positions of the objects in the photo.

Completely worthless for my use case. I tried several versions and it kept performing badly in so many ways. So, in pure desperation, I turned to OpenAI’s playground and tested it on the text-davinci-003 model, which is a bit newer than the one that ChatGPT uses. And finally I had a result:

The photo depicts a person standing in a beach setting with a salamander perched on their finger. The person is located in the center of the photo, wearing a white t-shirt and shorts. The salamander is located on the person's finger in the lower right region of the photo. The weather in the photo appears to be sunny and warm, with a bright blue sky and white, fluffy clouds. The location of the photo is likely a beach, as evidenced by the sand, water, and palm trees in the background.

Much better, except that it makes up things that are neither in the photo nor in the JSON, in this case fluffy clouds and palm trees. So I turned the temperature setting down to zero. The temperature setting decides how much randomness the language model uses in its response. With a temperature of zero I got a better response, truer to the actual photo:
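What temperature does can be sketched in a few lines. This is a toy illustration of temperature-scaled sampling in general, not OpenAI's actual implementation; the function name and the token scores are made up:

```python
import math
import random


def sample_token(logits, temperature, rng=random):
    """Pick a token from {token: score} after temperature scaling."""
    if temperature == 0:
        # No randomness left: always take the highest-scoring token.
        return max(logits, key=logits.get)
    # Divide scores by the temperature, then apply a softmax.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    top = max(scaled.values())
    weights = {tok: math.exp(s - top) for tok, s in scaled.items()}
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]


# Made-up scores for the word following "The location is a ...".
logits = {"field": 2.0, "beach": 1.5, "palm": 0.2}
print(sample_token(logits, temperature=0))    # always "field"
print(sample_token(logits, temperature=1.0))  # usually "field", sometimes another token
```

The higher the temperature, the flatter the distribution becomes and the more often an unlikely token gets picked.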

The photo shows a person standing in a grassy field with a salamander in their hand. The person is wearing a light colored shirt and shorts and is standing in the lower right region of the photo. The salamander is located in the person's hand and is unusually small compared to the person, indicating that it is in the background. The weather is sunny and the location is a grassy field.

It’s not perfect but it is better. GPT’s ability to determine the size of objects seems to be really bad even when I included a description in the prompt about how the coordinates in the JSON worked.
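The arithmetic GPT was expected to do here is simple. A minimal sketch, assuming the coordinates are normalized to 0..1, as the person's values in ChatGPT's answer suggest; the helper name and the second box are hypothetical:

```python
def area_fraction(box):
    """Share of the frame covered by a normalized bounding box (0.0 to 1.0)."""
    width = box["x_max"] - box["x_min"]
    height = box["y_max"] - box["y_min"]
    return width * height


# One box spanning the whole frame, like the person in the example, and one small hypothetical object.
person = {"x_min": 0.0, "x_max": 1.0, "y_min": 0.0, "y_max": 1.0}
small = {"x_min": 0.5, "x_max": 0.75, "y_min": 0.5, "y_max": 0.75}
print(area_fraction(person))  # 1.0
print(area_fraction(small))   # 0.0625
```

Computing such numbers in code and putting them into the prompt might be more reliable than asking the model to work them out, though that is untested here.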

CONCLUSIONS

Available photo analysis is surprisingly bad.
ChatGPT and GPT are really bad at logic, so they mess up relations between objects. This wasn’t really a surprise.
The text-davinci-003 model is more refined and provides better answers than the one in ChatGPT.
The temperature setting should not be ignored.