discussion / AI for Conservation / 31 October 2024

What AI object detectors really see

After processing millions of images and examining countless false positives, a number of these provide interesting insights into what computer vision is really learning. The following example is a classic.
 

False positive


In the beginning I saw a lot of these; as I used better models, their incidence dropped off somewhat. The problem is the spider-web streak (the same thing happens with rain streaks). If you look, you can start to see what's going on: the model has learnt that a human is a somewhat vertical thing with a lump on top and often angled bits sticking out the side. When a streak leaves an angular line across the image, the model sees it as an arm. In this case the confidence score is not that high, but I have seen plenty of ridiculous false positives with very high confidence scores, even from good models.
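For anyone who wants to triage these, here is a minimal sketch of flagging borderline "person" detections for manual review. The ultralytics package and a COCO-pretrained YOLOv8-L are stand-ins only (the detector discussed here, YOLOv6, lives in its own repo), and the folder path and thresholds are made up for illustration.

```python
# Minimal sketch: flag borderline "person" detections for manual review.
# ultralytics + YOLOv8-L are stand-ins; any detector that returns a class
# and a confidence per box will do. Paths and thresholds are made up.
from ultralytics import YOLO

model = YOLO("yolov8l.pt")  # large COCO-pretrained model, ~44M parameters

results = model("camera_trap_images/", conf=0.25)  # deliberately low threshold

for r in results:
    for box in r.boxes:
        cls_name = model.names[int(box.cls)]
        conf = float(box.conf)
        if cls_name == "person" and conf < 0.6:
            # Borderline "person" hits are where streak artefacts tend to land;
            # queue them for a human check rather than trusting the model.
            print(f"{r.path}: person @ {conf:.2f} -> review")
```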

The training process is lazy: it only learns as much as it needs to pass the tests set by the training data. If the features it mostly learns are that people are vertical things with lumps in various places (on top, for instance) and bits sticking out the side, and it can score highly with that, then as far as training is concerned the job is done. This points to a way to improve the training: we need to penalize it for that sort of poor decision. Something I will do in future for my training sets is add a lot of composed images made up of vertically segmented things like bags, with angular things down the side, like broom handles. This should force the training process to look for more defining features.
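A rough sketch of how such composed hard negatives could be generated, assuming a folder of empty background frames and a folder of cut-out distractor objects (bags, broom handles) with transparency; the paths, counts and sizes here are made up for illustration.

```python
# Rough sketch: build composite "hard negative" frames by pasting vertical,
# person-like objects (bags, broom handles) onto empty background frames.
# Folder names, counts and sizes are illustrative only.
import random
from pathlib import Path
from PIL import Image

backgrounds = list(Path("backgrounds").glob("*.jpg"))   # empty camera-trap frames
distractors = list(Path("distractors").glob("*.png"))   # cut-out bags, brooms, posts (RGBA)

out_dir = Path("hard_negatives")
out_dir.mkdir(exist_ok=True)

for i in range(500):
    bg = Image.open(random.choice(backgrounds)).convert("RGB")
    obj = Image.open(random.choice(distractors)).convert("RGBA")

    # Scale the distractor to roughly person height and drop it in at a random spot.
    scale = random.uniform(0.3, 0.8) * bg.height / obj.height
    obj = obj.resize((max(1, int(obj.width * scale)), max(1, int(obj.height * scale))))
    x = random.randint(0, max(0, bg.width - obj.width))
    y = random.randint(0, max(0, bg.height - obj.height))
    bg.paste(obj, (x, y), obj)  # use the alpha channel as the paste mask

    # Saved with no "person" label, so any person detection here counts against the model.
    bg.save(out_dir / f"neg_{i:04d}.jpg")
```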

Now I should point out that this model is a very good one: YOLOv6, the large variant with around 59 million parameters. A lot of the on-the-edge models sit at around the 1 million parameter mark. The number of parameters represents, in principle, the capacity the model has to "potentially" learn new defining features, so long as doing so improves its score.
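If you want to see where your own model sits, the parameter count is easy to check in PyTorch; the loading line below is only an example (YOLOv5-L via torch.hub), so substitute however you load your own detector.

```python
# Quick check of a model's parameter count in PyTorch. The loading line is an
# example only; swap in however you load your own detector (YOLOv6, edge model, ...).
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5l")  # YOLOv5-L, roughly 46M parameters

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f} M parameters")
```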

So if you are looking for deer and there are other animals that could potentially be mistaken for deer, make sure you include those in the training data as well, so the model learns to tell the difference.
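As a sketch, here is an ultralytics-style dataset config, generated with a few lines of Python, that keeps the look-alike species as their own classes so training has to find what actually separates them; the class names and paths are made up for illustration.

```python
# Sketch: an ultralytics-style dataset config that keeps look-alike species as
# separate classes, so training has to find what distinguishes them from deer.
# Class names and paths are made up for illustration. Requires PyYAML.
import yaml

data_cfg = {
    "path": "datasets/deer_project",
    "train": "images/train",
    "val": "images/val",
    "names": {
        0: "deer",
        1: "goat",       # confuser classes: animals easily mistaken for deer
        2: "sheep",
        3: "wild_boar",
    },
}

with open("deer_vs_lookalikes.yaml", "w") as f:
    yaml.safe_dump(data_cfg, f, sort_keys=False)
```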




Lars Holst Hansen
@Lars_Holst_Hansen
Aarhus University
Biologist and Research Technician working with ecosystem monitoring and research at Zackenberg Research Station in Greenland

Interesting observation, Kim!

I finally located the reference I talked about the other day at the Variety Hour After Hours:

The post covers the following paper:

The paper examines how deep convolutional neural networks (DCNNs) recognize objects, focusing on their reliance on texture rather than global shape. The key points include:

  • Human vs. AI Recognition: Humans use global shape for recognition, while DCNNs rely more on local features.
  • Experimental Findings: DCNNs performed well with local contours but struggled with animal recognition when global shapes were disrupted.
  • Conclusion: DCNNs do not effectively utilize global shape information, highlighting a critical difference between artificial and human vision systems.

This figure from the paper illustrates that we humans can quickly recognize the left image as a bear, but not so quickly the right one:

[Figure: simplified bear shape (left) and scrambled image of a bear (right)]

The paper is admittedly quite old (2018) and the DCNNs were even older (AlexNet 2012 and VGG-19 2014). 

It would be really interesting to find out how modern models, such as the recent YOLO versions, would behave in similar experiments.
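In case anyone wants to give it a quick try, here is a rough sketch of that kind of test with a recent COCO-pretrained YOLO via the ultralytics package; the two image files are placeholders for an intact silhouette and a contour-scrambled version.

```python
# Rough sketch: run a recent COCO-pretrained detector on an intact silhouette
# and on a contour-scrambled version, and compare what it reports.
# ultralytics / YOLO11-L is a stand-in; the image files are placeholders.
from ultralytics import YOLO

model = YOLO("yolo11l.pt")

for img in ["bear_silhouette.png", "bear_scrambled.png"]:
    result = model(img, conf=0.1)[0]  # low threshold so weak detections still show up
    labels = [(model.names[int(b.cls)], round(float(b.conf), 2)) for b in result.boxes]
    print(img, labels or "no detections")
```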

Cheers, Lars

Rob Appleby
@Rob_Appleby  | He/him
Wild Spy
Whilst I love everything about WILDLABS and the conservation tech community, I am mostly here for the badges!!

I am certainly a somewhat vertical thing with a lump on top, so I get it...