Teaching a Computer to See

Explaining and using Object Detection with YOLO (You Only Look Once) Computer Vision Algorithm

đź’ˇ Posted on November 2nd 2019, Revised on October 14th 2020

We all take our vision for granted. It is the product of hundreds of millions of years of evolution, from a light-sensitive patch of cells to the complex optical extravaganza called the human eye.

Throughout humanity's existence, our survival has depended on our ability to understand and interpret our environment. While roaming the African plains, primitive humans needed to recognize and react to threats immediately.

Even today, this remains the case.

The ability to see the world we live in allowed Christopher Columbus to navigate the oceans, Leonardo da Vinci to paint the Mona Lisa, and little Timmy here to ride a unicycle. Great job, Timmy.

Since the emergence of Homo sapiens, nothing has come close to matching human sight. That won't be the case for much longer, though, because humans are very lazy. First, humans thought:

“Hey walking is tiring, let’s domesticate these big quadrupedal animals and ride them around instead.”

Then we thought:

“Hey, feeding and maintaining these horses is too much effort, let’s build these big four-wheeled vehicles to move around in.”

And now we think:

“Hey, driving these cars is too much effort, let’s build complicated AI algorithms to do that for us.”

And thus autonomous cars were created, born out of our laziness to even drive a car.

But how exactly did we manage to create a car that could drive itself? Simply hooking up a computer with some cameras to a car isn’t enough, because computers don’t see images in the same way that we do. Take a look:

We see elaborate shapes and edges that our big brains piece together to make up a face. But to a computer, it is all just an array of many, many numbers.

For a long time, humans had no idea how to teach computers to see. Just think about it for a second:

If you were given a grid full of numbers, what exactly would you do with it? How would you ever "see" anything in it?

Not to mention, the picture of Abraham Lincoln that I used above is in black and white. Normal colored images are in RGB (Red Green Blue) and have 3 separate color channels rather than 1, meaning that an image is not just a 2-dimensional array of numbers but a 3-dimensional one. How does a computer recognize patterns in these numbers to see objects? Simply having a bunch of if-else statements isn’t going to cut it this time.
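To make this concrete, here is a small sketch (using NumPy, with made-up pixel values) of what a grayscale image and its RGB counterpart look like to a computer:

```python
import numpy as np

# A tiny 4x4 grayscale "image": one brightness value (0-255) per pixel.
gray = np.array([
    [ 12,  50, 200, 255],
    [  0,  80, 180, 240],
    [ 30, 100, 150, 220],
    [ 60, 120, 140, 210],
], dtype=np.uint8)
print(gray.shape)  # (4, 4) -- a 2-dimensional grid of numbers

# The same image in RGB: three color channels stacked together.
rgb = np.stack([gray, gray, gray], axis=-1)
print(rgb.shape)  # (4, 4, 3) -- a 3-dimensional array
```

A real photo works exactly the same way, just with millions of these numbers instead of sixteen.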

Enter Neural Networks.

Neural Networks are inspired by the way the human brain works, consisting of many layers of interconnected neurons working together. After all, if our brain can learn to see, why can't an artificial brain do the same? Neural Networks allow computers to teach themselves to recognize the complex patterns necessary for a specific task. In this case, the goal is to teach a computer to see and understand the environment that it is in.

The best way to teach a car to see is to use a special type of neural network called a Convolutional Neural Network.

Convolutional Neural Networks (CNNs) are used in Computer Vision because of their amazing ability to understand spatial information. This means if I had a picture of a human and moved it around within the frame, the CNN would still recognize that it is a human; with enough varied training data, it can also cope with the picture being rotated, squeezed, or stretched.

The key behind the power of CNNs is that they use special layers called convolutional layers that extract features from the image. The initial convolutional layers recognize low-level features like edges and corners. As the features progress through more convolutional layers, they become far more complex. For example, the layers used to detect people may go from edges to shapes to limbs to people as a whole. You can read more about Convolutional Neural Networks in this article.
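The core operation is simpler than it sounds: slide a small grid of weights (a kernel) across the image and take a dot product at each position. Here is a minimal sketch of that idea, using a hand-picked vertical-edge kernel on a toy image (`convolve2d` is an illustrative helper, not from any particular library; real CNNs learn their kernels during training instead of having them hand-picked):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` and take dot products (no padding)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# An image with a vertical edge: dark on the left, bright on the right.
image = np.array([
    [0, 0, 0, 10, 10, 10],
    [0, 0, 0, 10, 10, 10],
    [0, 0, 0, 10, 10, 10],
    [0, 0, 0, 10, 10, 10],
], dtype=float)

# A vertical-edge kernel: responds where left and right neighbors differ.
kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

features = convolve2d(image, kernel)
print(features)  # non-zero only near the edge, zero in the flat regions
```

The output map lights up exactly where the edge is, which is precisely the kind of low-level feature the first layers of a CNN extract.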

The architecture of Convolutional Neural Networks

Convolutional Neural Networks by themselves are mainly used for image classification: given an image, the network assigns it a class. For example, a CNN trained on a skin lesion dataset can learn to diagnose different skin cancers from a given image.

But what if I wanted to know more than just which class an image belongs to, and specifically where the object is located? Additionally, what if there are multiple objects of different classes within an image? A self-driving car can’t just know that there are 5 cars and 20 people in the area; it needs to know where they are relative to itself in order to navigate safely.

This is where object detection comes in. By spicing up our Convolutional Neural Network, we can repurpose its amazing classification properties to also locate where the classes are in the image. We can do so through an algorithm called YOLO (You Only Look Once), which performs real-time object detection, perfect for autonomous vehicles. YOLO is incredibly fast: the original network uses 24 convolutional layers and runs at 45 frames per second, while a smaller variant, Fast YOLO, can process up to 155 frames per second. This makes it easily implementable in a self-driving car. So how does it work?

YOLO Explained

As the research paper describes it, in YOLO:

A single convolutional neural network simultaneously predicts multiple bounding boxes and class probabilities for those boxes

YOLO uses features from the entire image to predict every bounding box and its class simultaneously. Much like humans, YOLO can recognize where and what objects are within a given image almost immediately.

When running on an image, YOLO first divides the image into an S by S grid.

Within each grid cell, YOLO predicts the locations, sizes, and confidence scores of a predetermined number of bounding boxes, essentially predicting the class and potential place where an object might be. If the center of an object falls into a grid cell, then the bounding boxes of that grid cell are responsible for accurately locating and predicting that object.
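The "which cell is responsible" rule is just arithmetic on the object's center point. A minimal sketch (the function name `grid_cell` is mine; S = 7 and the 448×448 input size match the original YOLO paper):

```python
def grid_cell(cx, cy, img_w, img_h, S=7):
    """Return the (row, col) of the SxS grid cell containing (cx, cy).

    (cx, cy) is an object's center in pixels. The min() guards keep a
    center lying exactly on the right/bottom edge inside the grid.
    """
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col

# An object centered at (320, 240) in a 448x448 image (YOLO's input size)
print(grid_cell(320, 240, 448, 448))  # (3, 5)
```

Only the boxes predicted by that one cell are trained to fit this object; every other cell's boxes are pushed toward low confidence for it.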

YOLO bounding boxes in action

Each bounding box comes with 5 predictions: x coordinate, y coordinate, width, height, and confidence. The confidence score indicates how confident the model is that the box contains an object, and how well it thinks the box fits that object, which is measured with a metric called Intersection over Union.

Intersection over Union (IoU) is used in object detection because it compares the ground-truth bounding box with the predicted bounding box. By dividing the area of overlap by the area of the union, we get a function that rewards heavy overlap and penalizes inaccurate bounding box predictions. The bounding box’s goal is ultimately to confine the object within the image as tightly as possible, and IoU is a great metric for measuring this.
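IoU fits in a few lines of code. A sketch (boxes here use corner coordinates `(x1, y1, x2, y2)`, one common convention; the function name is mine):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (if any).
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175, about 0.14
print(iou((0, 0, 10, 10), (0, 0, 10, 10)))  # 1.0 -- perfect overlap
```

An IoU of 1 means the predicted box matches the ground truth exactly; 0 means the boxes don't touch at all.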

After an image is run through YOLO, it outputs its predictions as an S x S x (B * 5 + C) tensor: each grid cell predicts the locations and confidence scores of B bounding boxes along with C class probabilities. Ultimately, we end up with a lot of bounding boxes, most of which are irrelevant.
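Plugging in the values from the original paper (S = 7, B = 2, C = 20 classes from Pascal VOC) shows just how many numbers one forward pass produces:

```python
S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (original YOLO paper)

per_cell = B * 5 + C   # 5 numbers per box, plus C class probabilities
total = S * S * per_cell
print(per_cell)        # 30 predictions per grid cell
print(total)           # 1470 values, describing 7 * 7 * 2 = 98 candidate boxes
```

Ninety-eight candidate boxes for a single image explains why the next filtering step is necessary.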

To filter out the correct boxes, only bounding boxes whose predicted class meets a certain confidence score are kept, and overlapping duplicates of the same object are then pruned away in a step called non-maximum suppression. This allows us to isolate all relevant objects within the image.
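Both filtering steps can be sketched together. This is a simplified, single-class version (the function name `filter_boxes` and the threshold values are mine; boxes are `(x1, y1, x2, y2, score)` tuples):

```python
def filter_boxes(boxes, score_thresh=0.5, iou_thresh=0.5):
    """Keep confident boxes, then drop near-duplicates of the same object.

    The duplicate removal is a simple non-maximum suppression: keep the
    highest-scoring box, discard any remaining box that overlaps it heavily.
    """
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union else 0.0

    # Confidence threshold first, then sort best-first for suppression.
    candidates = sorted((b for b in boxes if b[4] >= score_thresh),
                        key=lambda b: b[4], reverse=True)
    kept = []
    for box in candidates:
        if all(iou(box, k) < iou_thresh for k in kept):
            kept.append(box)
    return kept

boxes = [
    (0, 0, 10, 10, 0.9),    # confident detection
    (1, 1, 11, 11, 0.8),    # near-duplicate of the first box
    (50, 50, 60, 60, 0.3),  # below the confidence threshold
]
print(filter_boxes(boxes))  # only the first box survives
```

Out of three candidates, one clean detection remains, which is exactly the kind of output a self-driving car can act on.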

In essence, YOLO locates and classifies objects within an image or video. Object detection algorithms like YOLO, combined with the many other sensors on a self-driving car like LiDAR, allow us to build fully autonomous cars that can drive faster, safer, and better than any human can. If you are interested in diving deeper into self-driving cars, I highly recommend reading this article.

Now that you understand what YOLO does, let’s see it in action.

Running YOLO

To try out YOLO on your own computer, clone my GitHub repository to your computer.

Then download the pre-trained weights from this link and move them into the folder titled “yolo-coco”.

In your terminal, navigate to the cloned repository, install the required dependencies and libraries, and enter the following command to run YOLO on any video:

python yolo_video.py --input <INPUT PATH> --output <OUTPUT PATH> --yolo yolo-coco

After running the command, YOLO will process the input video and save the final product to the output path. Here is something that I ran through YOLO:

There you go! You understood and ran an object detection algorithm! It’s incredible how close we are to a world dominated by AI drivers.