GPT-4 with Vision vs Custom Vision in Anomaly Detection

Businesses today are generating data at an unprecedented rate. Automated processing of that data is essential for deriving results quickly and efficiently. Some data is harder to process than others: whilst structured data has been harnessed and lends itself to automated ETL/ELT processing, unstructured data presents some challenges. Businesses might store photographs taken during the production process that need to be checked for defects, e.g. images showing the completion of a stage in a construction process, or images used to monitor objects for defects that develop over time. All this data needs to be reviewed and classified, which can be a time-consuming process. Wouldn’t it be great if there was an automated tool that could read the images and classify them, streamlining the manual review process and quickly flagging up any issues requiring closer examination?

GPT-4 with Vision and Custom Vision are tools that show great potential to solve the above problems. I have tried out both and will showcase the results in this blog.

Tools of the trade

First rolled out in November 2023, GPT-4 with Vision (GPT-4V) is another addition to the large multimodal model (LMM) family. Unlike earlier versions of GPT (GPT-3, GPT-3.5, etc.), it accepts images and will answer detailed questions about them. You can find out more information about it here.

Custom Vision, on the other hand, is an image recognition service where you can upload your images, tag them, and train a model to either classify an image or detect a particular object. It also attaches a probability to each tag.

Setting the stage

To experiment with these two tools I have used images of high-voltage power lines from a publicly available source:

Power line networks consist of pylons that carry high-voltage overhead cables used to transport electricity. An important part of monitoring the network is taking drone images of the insulator glass caps, which can look like this:

These caps ensure the cables are insulated and the pylon doesn’t become live itself, so it’s important to keep them in good shape and replace any missing or damaged caps. Drones generate a lot of these images, of varying clarity and quality, and they need to be reviewed for potential anomalies. This involves a substantial amount of effort, so I am looking to employ the latest AI to accelerate the process. My goal here is to use both tools to identify anomalies present in the images, document the results, and draw conclusions about their suitability, noting any quirks along the way.

GPT-4 with Vision

I have used the GPT-4 with Vision preview model. Just like GPT-3 and GPT-3.5, GPT-4V is a prompt-based model, which means that the response it gives is very sensitive to how the prompt is constructed. GPT-4V uses the Chat Completions API, which has system, user, and assistant components. More on prompt components and prompt engineering can be found here.

For testing GPT-4V I used 340 images: 170 of them contained missing glass caps and 170 had no anomalies.

I did initial testing in Azure AI Studio where you can set system messages and upload images:

Figure 1 – Azure AI Studio

Prompt Engineering

I went through several iterations of engineering different prompts; they produced varied, verbose results that were difficult to process automatically. Eventually, I settled on a specific, prescriptive system message:

You are an AI assistant that is trained to identify regular patterns in images and identify irregularities that might be present. You will be presented with an image of high voltage power line showing glass insulator caps. Caps should be evenly spaced out and any gaps represent an anomaly. You will check the image and respond ‘Anomaly present’ if there are any gaps. If there are no gaps, respond ‘No anomaly found’. If not sure, respond ‘Not sure’.

I ran two tests using the Chat Completions API in Python: one with temperature (t) set to 0 and one with t = 1.

Figure 2 – Python script calling GPT4 with Vision
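As a rough sketch of the shape of that script (the file name, endpoint, key, and deployment name below are placeholders, not the values I used), the request pairs the system message above with a base64-encoded image:

```python
import base64

# The prescriptive system message from the prompt-engineering section above.
SYSTEM_MESSAGE = (
    "You are an AI assistant that is trained to identify regular patterns in "
    "images and identify irregularities that might be present. You will be "
    "presented with an image of high voltage power line showing glass "
    "insulator caps. Caps should be evenly spaced out and any gaps represent "
    "an anomaly. You will check the image and respond 'Anomaly present' if "
    "there are any gaps. If there are no gaps, respond 'No anomaly found'. "
    "If not sure, respond 'Not sure'."
)

def build_messages(image_path: str) -> list:
    """Build the Chat Completions message list for a single image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return [
        {"role": "system", "content": SYSTEM_MESSAGE},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Check this image for anomalies."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        },
    ]

# The actual call (requires an Azure OpenAI endpoint, key, and deployment):
# from openai import AzureOpenAI
# client = AzureOpenAI(api_key="<key>", azure_endpoint="<endpoint>",
#                      api_version="2023-12-01-preview")
# response = client.chat.completions.create(
#     model="<gpt-4v-deployment>",        # placeholder deployment name
#     messages=build_messages("pylon.jpg"),
#     temperature=0,
# )
# print(response.choices[0].message.content)
```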


There was little difference in GPT-4V’s accuracy between the runs with temperature set to 0 and 1. In both cases, however, the model was more accurate at recognising pictures with anomalies present.

Figure 3 – GPT4V accuracy with parameter t = 0 and t = 1


Overall, the accuracy of the model was ~74% in both cases, with temp = 0 and temp = 1:

Figure 4 – GPT4V accuracy with parameter t = 0 and t = 1 combined


Tweaking the temperature parameter didn’t make any noticeable difference. This isn’t surprising, as the system message sets out three possible responses, leaving no room for alternatives.
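With only three possible responses, scoring the results becomes a simple tally. A minimal sketch of how that accuracy figure can be computed (the response and ground-truth lists here are made-up examples, not my actual data):

```python
def accuracy(responses: list, truth: list) -> float:
    """Fraction of images where GPT-4V's verdict matches the ground truth.

    truth[i] is True when image i genuinely contains an anomaly;
    'Not sure' responses are counted as incorrect.
    """
    correct = sum(
        (r == "Anomaly present" and t) or (r == "No anomaly found" and not t)
        for r, t in zip(responses, truth)
    )
    return correct / len(responses)

# Example: 3 of the 4 verdicts match the ground truth.
print(accuracy(
    ["Anomaly present", "No anomaly found", "Not sure", "Anomaly present"],
    [True, False, False, True],
))  # 0.75
```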

Custom Vision

As described above, Custom Vision lets you upload and tag images, then train a model to classify images or detect particular objects. You can find out more about this tool here.

Setting up the project can be done programmatically; however, I have opted to use the Custom Vision portal for this:

Figure 5 – Custom Vision portal

From here, you can select your project and its type:

Figure 6 – Custom Vision portal, new project

I set out to see how Custom Vision is going to perform in classifying the same 340 images compared to GPT-4V.

There are two types of project you can set up: classification and object detection. The first allows you to tag whole images, whilst the second allows multiple tags per image and aims to find the location of those tagged elements within it. I have attempted both for detecting anomalies.

Object Detection

I started this type of project anticipating that my model might be able to identify the location of a missing glass cap in the picture. Being a human, obviously, and not an AI model, I thought: how do I teach this model to recognise a faulty gap in a line of glass caps when there are many other, confusing elements in the picture? What instructions, i.e. tags, do I use? I decided that to identify a gap, the model needs to recognise 4 basic things in the picture (equivalent to the tags I created):

  • Abnormal gap
  • Normal gap
  • Insulator glass cap
  • End of power line.

Then I proceeded to tag the 198 images used for training with these 4 tags. This was a tedious and time-consuming task, but I was determined to see the outcome of this experiment, so some of the pictures ended up with lots of overlapping tags, like this one:

Figure 7 – Custom Vision tagged image

Some images were not straightforward to tag with many overlapping elements and tagging fatigue soon started to set in. This highlighted the risk of human error that can occur during the process. I was certainly getting tired of tagging endless glass caps.

I also discovered that the accuracy of tagging depends on domain experience. I found myself unsure about some of the images, like the one below: is the area circled in red a fault or a normal connection? High-voltage power lines don’t feature in my life on a regular basis, so I wasn’t quite sure.

Figure 8 – Custom Vision tagged image with ambiguous connection

After tagging the 198 images, I trained the model and ran prediction on the same 340 images used for GPT-4V. The results were messy and difficult to interpret: a single image could have several tags, each assigned multiple times with its own probability. Only 2 images with anomalies had an anomaly identified with a probability of > 90%. It was clear that this model was overcomplicated.
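To make sense of that output I filtered detections by probability, something along these lines (the Prediction tuple is a stand-in mirroring the tag_name and probability attributes the Custom Vision SDK returns; the values are illustrative):

```python
from collections import namedtuple

# Stand-in for the SDK's prediction objects, which expose
# tag_name and probability attributes.
Prediction = namedtuple("Prediction", ["tag_name", "probability"])

def confident_anomalies(predictions, threshold=0.9):
    """Keep only 'Abnormal gap' detections above the probability threshold."""
    return [p for p in predictions
            if p.tag_name == "Abnormal gap" and p.probability > threshold]

preds = [
    Prediction("Abnormal gap", 0.95),
    Prediction("Abnormal gap", 0.42),
    Prediction("Insulator glass cap", 0.99),
]
print(len(confident_anomalies(preds)))  # 1
```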

After this, I opted for what seemed a very simplistic approach: uploading the testing images into a Classification project. This type of project allows you to tag the whole image with a single tag. I decided to trust the AI to learn and make sense of the images I provided.

Classification model

For training, I used 140 images and tagged them with either ‘gap’ or ‘no gap’. I then trained the model a few times until precision and recall were both high, at 96.4%. From here, I used the same 340 images I had fed earlier into GPT-4V for prediction.
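Running the prediction can be scripted with the Custom Vision Python SDK; a sketch along these lines (the endpoint, key, project ID, and published model name are placeholders, and the image values in the example are illustrative):

```python
from collections import namedtuple

# The SDK call, for reference (requires real credentials and IDs):
#
#   from msrest.authentication import ApiKeyCredentials
#   from azure.cognitiveservices.vision.customvision.prediction import (
#       CustomVisionPredictionClient,
#   )
#   credentials = ApiKeyCredentials(in_headers={"Prediction-key": "<key>"})
#   predictor = CustomVisionPredictionClient("<endpoint>", credentials)
#   with open("pylon.jpg", "rb") as image:
#       results = predictor.classify_image(
#           "<project-id>", "<published-model-name>", image.read())
#   # results.predictions is a list of objects with tag_name / probability

# Stand-in mirroring the SDK's prediction objects.
Prediction = namedtuple("Prediction", ["tag_name", "probability"])

def top_tag(predictions):
    """Pick the tag the model is most confident about for one image."""
    best = max(predictions, key=lambda p: p.probability)
    return best.tag_name, best.probability

print(top_tag([Prediction("gap", 0.97), Prediction("no gap", 0.03)]))
# ('gap', 0.97)
```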

Custom Vision did not disappoint on this occasion; the results of the prediction are shown below.


Each image was classified by the model using the two tags, each with an attached probability. I looked at the accuracy of the results based on the probabilities assigned to the tags.

Figure 9 – Custom Vision accuracy in tags with high probabilities

The model performed well, assigning the correct tag with high probability in 96% or more of cases.

These results are promising and could be improved further by feeding more images to the model and retraining it. The images supplied to the model were of varying quality; some were close-up shots, others showed bigger sections of high-voltage power lines with more elements in the picture:

Figure 10 – Images of high-voltage power lines

Confusion Matrix

The confusion matrix is used to assess the classification model’s performance, as it shows the true positives, true negatives, false positives, and false negatives produced by the model.
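The four counts can be derived directly from the ground-truth and predicted tags; a minimal sketch with made-up labels (not my actual data):

```python
def confusion_counts(actual, predicted, positive="gap"):
    """Return (TP, FP, TN, FN) for a binary classifier."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    return tp, fp, tn, fn

actual    = ["gap", "gap", "no gap", "no gap", "gap"]
predicted = ["gap", "no gap", "no gap", "gap", "gap"]
tp, fp, tn, fn = confusion_counts(actual, predicted)
print(tp, fp, tn, fn)                 # 2 1 1 1
print((tp + tn) / len(actual))        # accuracy = (TP + TN) / total = 0.6
```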

Figure 11 – Confusion Matrix

Whilst both GPT-4V and Custom Vision performed well in identifying gaps, the latter performed much better at identifying images containing no gaps. It was more accurate overall.

Custom Vision’s performance can be expected to improve further if it is continuously trained on a more diverse set of images. It has great potential to speed up business processes when it comes to classifying unstructured visual data.
