
【Computer Vision(1)】What makes the computer vision tasks hard?

In our last note post, we discussed our product disciplines and how we define UI/UX in AquaAge's terms.

In this post, we briefly discuss the tasks in Computer Vision (CV), what makes them hard, and what kinds of methods are used to solve them.

What makes CV tasks hard

In order to understand the limitations of CV, it's important to understand what makes it so hard. Even though vision seems easy for us, it is not easy for an algorithm, a computer, or a model.

Knowledge of objects

Human brains are amazing at vision, and at so many other tasks like speech, reading comprehension, and strategic thinking (well, some people at least), but generally everyone can do vision quite well. That's because we have a very well-developed visual cortex, which beats robots any day, at least for now. Our eyes are very good at general-purpose vision: a huge amount of information is fed to them, and we just naturally understand it. It isn't so for computers. They have to learn every object from scratch, because they have no background knowledge of objects.

Visual Cortex: involved in awareness, processing, analyzing, and recognition of visual stimuli.[1]

Physical camera limitations

Moreover, physical cameras often have limitations such as noise, granularity, and resolution. If a tiny camera is used, faraway things come out quite blurry. This matters especially when we detect facial details from images: a blurry smartphone image leads to a poor skin-analysis result. At AquaAge Inc., we ran several analyses on skin images, and camera quality proved critical to the results.
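The resolution point can be illustrated with a tiny numpy sketch (a toy example, not our actual pipeline): block-averaging stands in for a coarse sensor, and fine detail that fits inside one block is simply lost.

```python
import numpy as np

# Toy sketch: simulate a low-resolution sensor by averaging
# each factor x factor block of pixels into one pixel.
def downsample(img: np.ndarray, factor: int) -> np.ndarray:
    """Block-average the image by the given factor."""
    h, w = img.shape
    h2, w2 = h // factor, w // factor
    img = img[: h2 * factor, : w2 * factor]
    return img.reshape(h2, factor, w2, factor).mean(axis=(1, 3))

# A fine checkerboard: detail finer than one sensor block.
fine = np.indices((8, 8)).sum(axis=0) % 2 * 255.0
coarse = downsample(fine, 2)
print(coarse)  # every 2x2 block averages to 127.5: the pattern is gone
```

Once detail is averaged away like this, no downstream model can recover it, which is why camera quality bounds analysis quality.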

Viewpoint variations

Beyond those limitations, there are viewpoint variations. We naturally understand that this is the Statue of Liberty seen from different angles, or that this is the same object just rotated, but a computer vision model might not. Humans have learned to infer objects from different angles; a computer still needs to model the object from all perspectives.

Statue of Liberty at different angles
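A toy numpy example makes the point concrete: after a simple 90-degree rotation, almost every raw pixel value changes, even though the "object" is the same.

```python
import numpy as np

# A toy "image": the same object from two viewpoints
# (here, just a 90-degree rotation).
img = np.arange(9).reshape(3, 3)
rotated = np.rot90(img)

# To us these depict the same thing; to a model comparing
# raw pixels, most positions disagree.
diff = np.mean(img != rotated)
print(diff)  # 8 of 9 pixel positions differ
```

This is why models need either training data covering many viewpoints or architectures that build in some invariance.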

Changes in lighting conditions

Detecting objects under different lighting conditions is also hard for computers, and sometimes even for humans. In some scenarios we need to account for lighting variation to increase detection accuracy.

Incremental changes of lighting conditions in the Oxford10k dataset[2]

Scaling issues

Then there are scaling issues. The Taj Mahal seen up close looks completely different, at least to a computer vision model, from the Taj Mahal seen from far away. When you scale back out, the scene changes drastically, as these comparisons show.

The Taj Mahal from different distances

Deformations

There are also natural non-rigid deformations: a dog, like a human, has many different poses. Dogs can be sitting, standing, or crouched, so you have to consider everything an object like this can do, and use other characteristics to identify it, which our brains learn naturally. A computer vision algorithm, however, has to be fed many different instances of a dog or a horse to understand that it isn't the fact that the animal is standing a certain way that makes it a dog; it's the fur, the facial shape, and so on.

Dogs in deformed poses: humans understand them easily, but this is hard for computers.

Occlusion

There's also occlusion, which means that part of the object is blocked by another object. The human brain can fill in the missing part to infer the object type, but computers are not good at this.

Detection and handling of occlusion in an object detection system[3]
How To Detect Partially-Occluded Objects Using Temporal Context[4]
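A tiny numpy sketch shows why occlusion breaks naive matching (a toy flat "object", not a real detector): covering even a quarter of the object defeats an exact comparison, although most of it is still visible.

```python
import numpy as np

# Sketch: occlusion as masking. A naive exact-match check fails
# as soon as any part of the object is covered.
obj = np.full((4, 4), 7.0)
scene = obj.copy()
scene[:2, :2] = 0.0  # another object covers the top-left quarter

exact_match = np.array_equal(obj, scene)
visible_fraction = np.mean(scene == obj)
print(exact_match, visible_fraction)  # False 0.75
```

Robust detectors therefore score partial evidence (part-based models, or temporal context as in [4]) instead of requiring the whole object to be visible.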

Clutter

Camouflage is one extreme form of clutter: a scene where it's hard to detect the octopus against its background. A crowded street is another: it shows just how much clutter a detector has to handle, along with the many poses and large variations in appearance of the upper and lower body.

Pedestrian detection is challenging due to background clutter, poses, and large variations of appearance of the upper and lower body[5]

Optical illusion

The following figure is actually a flat 2D image. However, it looks like a truly 3D object.

Is this a vase or two faces? Who knows?

So as you can see, there are a number of ways to trick vision systems. In fact, there's a whole subfield of computer vision built on adversarial training, where you create models to beat other models, effectively.

3D Trick Art Optical Illusion[6]
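The adversarial idea can be sketched in a few lines on a toy linear classifier (this is a simplified illustration in the spirit of gradient-sign attacks, not an attack on any real model): nudge each input feature a tiny amount in the direction that hurts the score most, and the prediction flips.

```python
import numpy as np

# Toy linear model: score = w . x, score > 0 means class "A".
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.2, 0.1, 0.4])

# Adversarial step: move x slightly against the weight signs,
# which is the direction that lowers the score fastest.
eps = 0.3
x_adv = x - eps * np.sign(w)

print(w @ x > 0, w @ x_adv > 0)  # True False: a tiny change flips the label
```

The perturbation is bounded by eps per feature, which is small, yet it is enough to flip the decision, the same fragility that adversarial examples exploit in deep networks.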


Pictures like these are easy for us, but quite hard for a computer vision algorithm trying to make sense of them.
Even so, we've trained CV models to do remarkable things, such as medical image analysis, face recognition, and so on.

Reference

[1] 

[2]

https://www.researchgate.net/publication/321962804_Incremental_Adversarial_Domain_Adaptation/figures?lo=1

[3]

[4]

[5]

https://www.researchgate.net/publication/286176572_Switchable_Deep_Network_for_Pedestrian_Detection

[6]
https://www.youtube.com/watch?v=gJLoLgFvA4o

