In this tutorial we will use OpenCV to combine a YOLOv3 detector with a tracking system to identify and track objects from 80 classes on video. To follow along with this tutorial you will need a video recording of your own. Code and further instructions are available in a dedicated repository. Lights, camera, action!
Note: Apparently some browsers display the code without indentation. For better readability I recommend using Chrome or Firefox.
Computer vision is practically everywhere – summoned whenever you unlock your phone, check in at the airport or ride in an autonomous vehicle. In industry, it is revolutionising fields ranging from precision agriculture to AI-assisted medical imaging. Many such applications are based on object detection, one of the key topics of this tutorial, to which we will turn our attention next.
We have seen how convolutional neural networks (CNNs) can be used for image classification. In this setting, the CNN classifier returns a fixed number of class probabilities per input image. Object detection, on the other hand, attempts to identify and locate any number of class instances by extending CNN classification to a variable number of region proposals, such as those captured by bounding boxes.
Object detectors form two major groups – one-stage and two-stage detectors. One-stage detectors, such as You Only Look Once (YOLO) [1], are based on a single CNN, whereas two-stage detectors such as Faster R-CNN [2] decouple region proposal and object detection into two separate CNN modules. One-stage detectors are generally faster though less accurate than their two-stage counterparts. Let us now briefly introduce YOLO.
The YOLO detector was first developed in 2015 using the Darknet framework, and various updates have come out since. As illustrated below, YOLO leverages the CNN receptive field to divide the image into an S x S grid. For each cell in the grid,
- it estimates the centre (x, y), size (w, h) and objectness score for each of B bounding boxes per cell (bounding boxes + confidence)
- it emits the probabilities of all C object classes (class probability map)
For a given input image, this large search space yields a three-dimensional tensor of size S x S x (B x 5 + C). Arriving at the final detections requires filtering the high-confidence predictions, followed by non-maximum suppression (NMS), which keeps only the highest-scoring box among those that overlap one another beyond a certain threshold.
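To make the filtering and NMS steps concrete, here is a minimal NumPy sketch of greedy NMS. It is an illustration only, not the OpenCV implementation used later in this tutorial; the function names `iou` and `nms` are hypothetical, and boxes are assumed to be `[x, y, w, h]` with positive area.

```python
import numpy as np

def iou(box, boxes):
    """Intersection over union between one [x, y, w, h] box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[0] + box[2], boxes[:, 0] + boxes[:, 2])
    y2 = np.minimum(box[1] + box[3], boxes[:, 1] + boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = box[2] * box[3] + boxes[:, 2] * boxes[:, 3] - inter
    return inter / union

def nms(boxes, scores, score_thresh, iou_thresh):
    """Return indices of the boxes kept, highest score first."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    # Step 1: keep high-confidence predictions only, sorted by decreasing score
    order = [int(i) for i in np.argsort(-scores) if scores[i] > score_thresh]
    keep = []
    # Step 2: repeatedly keep the best box and drop those overlapping it too much
    while order:
        best = order.pop(0)
        keep.append(best)
        if not order:
            break
        overlaps = iou(boxes[best], boxes[order])
        order = [i for i, o in zip(order, overlaps) if o <= iou_thresh]
    return keep

boxes = [[0, 0, 10, 10], [1, 1, 10, 10], [50, 50, 10, 10]]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores, score_thresh=0.5, iou_thresh=0.4)
```

In this toy example the second box overlaps the first with an IoU of about 0.68 and is suppressed, whereas the third box does not overlap either and is kept.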
In this tutorial we will use YOLOv3 [3], the 2018 model update with the architecture represented below, inspired by feature pyramid networks. This version adds residual blocks to the backbone and extends object detection to three different scales, each of which is responsible for predicting three bounding boxes per cell. The model takes RGB images with 416 x 416 resolution as input and returns three tensors of size S x S x 3 x (5 + C), one per detection scale, where S is one of 52, 26 or 13. Furthermore, the model is trained to minimise the error between the bounding box coordinates (regression), class probabilities (multi-label classification) and objectness scores (logistic regression) of observed and predicted boxes.
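As a quick sanity check on these sizes, the short sketch below works out the output dimensions per scale. Each of the three boxes per cell carries four coordinates, one objectness score and C class probabilities. The helper name and its arguments are hypothetical; the strides of 8, 16 and 32 are the downsampling factors that turn a 416 x 416 input into 52, 26 and 13 cells per side.

```python
def yolov3_output_shapes(input_size=416, n_classes=80, boxes_per_cell=3,
                         strides=(8, 16, 32)):
    """Return (S, S, channels) per detection scale."""
    shapes = []
    for stride in strides:
        s = input_size // stride
        # 4 box coordinates + 1 objectness score + class probabilities, per box
        shapes.append((s, s, boxes_per_cell * (5 + n_classes)))
    return shapes

shapes = yolov3_output_shapes()
# Total number of candidate boxes across the three scales
n_boxes = sum(s * s * 3 for s, _, _ in shapes)
```

With the 80 COCO classes this gives 255 channels per cell and 10,647 candidate boxes per image. As far as I can tell, OpenCV returns each scale flattened to one row of 85 values per candidate box, which is the layout the code further down relies on.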
The YOLOv3 detector was originally trained with the Common Objects in Context (COCO) dataset, a large object detection, segmentation and captioning compendium released by Microsoft in 2014 [4]. The dataset features a total of 80 object classes that YOLOv3 learned to identify and locate. To give a perspective of their diversity, here is a graphical representation of a random sample.
The resulting detector enjoyed so much success that following its release, it became widely used for inference based on the COCO classes and transfer learning to solve different detection problems.
At a processing rate of ~35 FPS, one of the tasks at which this detector excels is object detection on video. However, running detection on every successive frame is computationally intensive and oblivious to transitions between successive predictions, and may furthermore fail due to occlusion or changes in appearance. In this context, devising a framework that alternates between object detection and tracking can alleviate these issues.
Following object detection, various methods, including MIL, KCF, CSRT, GOTURN and Median Flow can be used to carry out object tracking. For tracking of multiple objects using any such method, OpenCV supplies multi-tracker objects to carry out frame-to-frame tracking of a set of bounding boxes until further action or failure.
For the purpose of this tutorial we will use Median Flow, a simple, fast and scalable tracking method that works best provided there is little to no occlusion [5]. Under the hood, Median Flow initialises points inside a bounding box, tracks the points using the Lucas-Kanade algorithm, estimates the forward-backward tracking error, discards the 50% of points with the largest error and updates the bounding box coordinates using the median displacement of the remaining, consistent trajectories. The process is then repeated over a sequence of frames. Here is an insightful, interactive visualisation of Median Flow in action.
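The filtering and update steps can be sketched in a few lines of NumPy. This is a simplified illustration rather than the OpenCV implementation: it assumes the points have already been tracked forwards and backwards (for example with Lucas-Kanade), and it leaves out the scale update that Median Flow also performs. The function names are hypothetical.

```python
import numpy as np

def median_flow_step(pts_prev, pts_next, fb_error):
    """Drop the least reliable half of the points and return the median shift."""
    pts_prev = np.asarray(pts_prev, dtype=float)
    pts_next = np.asarray(pts_next, dtype=float)
    fb_error = np.asarray(fb_error, dtype=float)
    # Keep the points whose forward-backward error is at or below the median
    keep = fb_error <= np.median(fb_error)
    # Median displacement (dx, dy) of the consistent trajectories
    shift = np.median(pts_next[keep] - pts_prev[keep], axis=0)
    return shift, keep

def move_box(bbox, shift):
    """Translate an (x, y, w, h) box by the estimated shift."""
    x, y, w, h = bbox
    return (x + shift[0], y + shift[1], w, h)

pts_prev = [[0, 0], [1, 0], [0, 1], [1, 1]]
pts_next = [[2, 3], [3, 3], [2, 4], [9, 9]]   # the last point drifted away
fb_error = [0.1, 0.2, 0.1, 5.0]               # and has a large tracking error
shift, keep = median_flow_step(pts_prev, pts_next, fb_error)
new_box = move_box((10, 10, 4, 4), shift)
```

Because the update relies on a median, a single stray point such as the last one above has no influence on the estimated shift.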
Having introduced this much, you should now be able to follow the different steps we will take next. I nonetheless highly encourage reading more about YOLO and Median Flow. If you are ready, grab a coffee and get ready to code!
Let’s get started with Python
Prior to Python coding we need to set up a few things. After creating a MOV video recording, for example using an iPhone, move it to your working directory. With formats other than MOV you will need to make the necessary changes to the code below. Then, simply run a full workspace setup with the terminal command ./init.sh <PATH_TO_MOV>. Let us have a closer look into what this Bash script does.
First, it creates the subdirectories yolov3/, input/ and output/, which will contain the YOLOv3 dependencies, the input video and the output video, respectively.
Second, it converts your MOV video file to MP4 using FFmpeg. This conversion will generate input/input.mp4 using the following options:
-vcodec h264, to select the H.264 (MPEG-4 AVC) codec used for MP4 conversion
-vf scale=720:-2,setsar=1:1, to resize the output video to a width of 720 pixels, let FFmpeg pick the matching even-numbered height that preserves the display aspect ratio, and set the sample aspect ratio to 1:1
-an, to discard the audio channel, since we do not need it
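If you prefer to run the conversion from Python rather than Bash, the options above can be assembled as follows. This is a sketch under assumptions: the helper name `ffmpeg_args` is hypothetical, the exact command inside init.sh may differ, and FFmpeg must be installed and on your PATH for the commented-out call to work.

```python
def ffmpeg_args(src, dst='input/input.mp4', width=720):
    """Build the FFmpeg argument list for the MOV to MP4 conversion."""
    return ['ffmpeg', '-i', src,
            '-vcodec', 'h264',                       # H.264 video codec
            '-vf', f'scale={width}:-2,setsar=1:1',   # fixed width, even height, square pixels
            '-an',                                   # drop the audio channel
            dst]

args = ffmpeg_args('clip.mov')
# To actually run it:
# import subprocess
# subprocess.run(args, check=True)
```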
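Third, it downloads three files that together provide all 80 COCO class labels, the network configuration and the network weights from training with the COCO dataset – these are, in respective order, coco.names, yolov3.cfg and yolov3.weights, all stored under yolov3/. Note that the first two are small text files whereas the weights file is a large binary.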
Model loading and configuration
Switching to Python, we import the few modules needed and set three inference parameters in advance – the thresholds for objectness score, object class probabilities and NMS overlap. It is also advisable to seed the analysis if, for example, you set out to compare different configurations.
Next, we load the 80 COCO class labels and assign each of them a random colour. This will enable the identification of the corresponding class probabilities in the YOLOv3 output tensors, and facilitate the distinction among different object types in the output video. Then, we load YOLOv3 by passing the configuration and weight files to cv2.dnn.readNetFromDarknet(), and extract the output layer names to more easily access predictions during inference.
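A minimal sketch of this setup could look as follows. The constant names match those used in the code further down, but the values and the seed are placeholders of mine, not necessarily the ones used to produce the results in this tutorial.

```python
import numpy as np

# Placeholder values (assumptions): tune them for your own footage
OBJ_THRESH = 0.5   # minimum objectness score
P_THRESH = 0.5     # minimum class probability
NMS_THRESH = 0.4   # maximum overlap tolerated by NMS

# Seed NumPy so that the random class colours are reproducible across runs
SEED = 42
np.random.seed(SEED)
```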
#%% Load YOLOv3 COCO weights, configs and class IDs
import cv2
import numpy as np
# Import class names
with open('yolov3/coco.names', 'rt') as f:
    classes = f.read().rstrip('\n').split('\n')
# Assign one random colour per class
colors = np.random.randint(0, 255, (len(classes), 3))
# Give the configuration and weight files for the model and load the network using them
cfg = 'yolov3/yolov3.cfg'
weights = 'yolov3/yolov3.weights'
# Load model
model = cv2.dnn.readNetFromDarknet(cfg, weights)
# Extract names from output layers
# (flatten() copes with OpenCV versions that return an Nx1 array of indices)
layersNames = model.getLayerNames()
outputNames = [layersNames[i - 1] for i in np.array(model.getUnconnectedOutLayers()).flatten()]
#%% Define function to extract object coordinates if successful in detection
def where_is_it(frame, outputs):
    frame_h = frame.shape[0]
    frame_w = frame.shape[1]
    bboxes, probs, class_ids = [], [], []
    for preds in outputs: # different detection scales
        # Each row holds x, y, w, h, objectness and then one probability per class
        hits = np.any(preds[:, 5:] > P_THRESH, axis=1) & (preds[:, 4] > OBJ_THRESH)
        # Save prob and bbox coordinates if both objectness and probability pass respective thresholds
        for i in np.where(hits)[0]:
            pred = preds[i, :]
            # Coordinates are relative to the frame size, so rescale to pixels
            center_x = int(pred[0] * frame_w)
            center_y = int(pred[1] * frame_h)
            width = int(pred[2] * frame_w)
            height = int(pred[3] * frame_h)
            left = int(center_x - width / 2)
            top = int(center_y - height / 2)
            # Append all info
            bboxes.append([left, top, width, height])
            probs.append(float(np.max(pred[5:])))
            class_ids.append(int(np.argmax(pred[5:])))
    return bboxes, probs, class_ids
At last we have all pieces in place to begin the processing of the MP4 input video. Here is mine for reference, showing my living room and featuring a famous cat.
Ahead of processing, we must set up both video capture and writing – the latter conforming to the same FPS rate, width and height as the former. Together, these two OpenCV utilities enable looping over one frame at a time, running detection or tracking accordingly and storing it with the overlaid results in output/output.mp4.
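#%% Load video capture and init VideoWriter
vid = cv2.VideoCapture('input/input.mp4')
vid_w = int(vid.get(cv2.CAP_PROP_FRAME_WIDTH))
vid_h = int(vid.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter('output/output.mp4', cv2.VideoWriter_fourcc(*'mp4v'),
                      vid.get(cv2.CAP_PROP_FPS), (vid_w, vid_h))
# Check if capture started successfully
# (the original check is not shown here; raising an error is one way to do it)
if not vid.isOpened():
    raise IOError('Could not open input/input.mp4')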
Now, assuming you have a basic knowledge of Python, I will summarise how processing unfolds.
We will perform detection every 60 frames and object tracking in between. If no high-confidence boxes are predicted we repeat detection in the next frame; likewise, if tracking fails we switch back to detection. The processing of the input video will be monitored in real time using a cv2.namedWindow() instance. As long as the video capture is open and feeding frames, we check whether detection or tracking should take place and proceed accordingly:
- For detection, we first pass the current frame to the loaded YOLOv3 model after appropriate preprocessing. Preprocessing comprises scaling pixel intensities to the 0-1 range, resizing the input frame to 416 x 416 and reordering the BGR channels to RGB. Next, a forward pass with the preprocessed frame outputs the model predictions, from which we filter high-confidence boxes using the custom function where_is_it(). Lastly, any filtered boxes are subjected to NMS and the resulting final detections, along with the current frame, are used to create a multi-tracker object. If otherwise no boxes are returned, a red message indicating that detection failed is printed on the top-left corner of the frame.
- For tracking, we pass the current frame to the existing multi-tracker object. If tracking is successful, we extract the new box coordinates, with which we draw rectangles around the previously detected objects and print the corresponding class labels on the current frame.
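To illustrate what the preprocessing amounts to, here is a NumPy-only sketch of the channel reordering, scaling and layout change. It is not a replacement for cv2.dnn.blobFromImage(): the helper name `to_blob` is hypothetical and, to keep the example free of dependencies, it assumes the frame has already been resized to 416 x 416.

```python
import numpy as np

def to_blob(frame_bgr):
    """Turn a uint8 H x W x 3 BGR frame into a 1 x 3 x H x W float32 RGB blob."""
    rgb = frame_bgr[:, :, ::-1]                 # reorder BGR channels to RGB
    scaled = rgb.astype(np.float32) / 255.0     # scale intensities to the 0-1 range
    chw = np.transpose(scaled, (2, 0, 1))       # channels first, as the network expects
    return chw[np.newaxis, ...]                 # add the batch dimension

# A pure blue frame in BGR order: blue is the first channel
frame = np.zeros((416, 416, 3), dtype=np.uint8)
frame[:, :, 0] = 255
blob = to_blob(frame)
```

After the conversion, blue ends up in the last of the three channels and the blob has shape (1, 3, 416, 416).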
As a result, tracked objects will be highlighted in successive frames, and these in turn will be added to the output video file. Once the capture is exhausted, we release the output writer. Note that cv2.waitKey() allows for breaking the loop by pressing ESC – this can be helpful for debugging.
#%% Initiate processing
# Init count
count = 0
# Create new window (the window name is arbitrary)
cv2.namedWindow('stream')
while vid.isOpened():
    # Perform detection every 60 frames
    perform_detection = count % 60 == 0
    ok, frame = vid.read()
    if ok:
        if perform_detection: # perform detection
            blob = cv2.dnn.blobFromImage(frame, 1 / 255, (416, 416), [0,0,0], 1, crop=False)
            # Pass blob to model
            model.setInput(blob)
            # Execute forward pass
            outputs = model.forward(outputNames)
            bboxes, probs, class_ids = where_is_it(frame, outputs)
            if len(bboxes) > 0:
                # Init multitracker
                # (from OpenCV 4.5.1 the tracker constructors used here live under cv2.legacy)
                mtracker = cv2.MultiTracker_create()
                # Apply non-max suppression and pass boxes to the multitracker
                idxs = cv2.dnn.NMSBoxes(bboxes, probs, P_THRESH, NMS_THRESH)
                # Flatten, since some OpenCV versions return an Nx1 array of indices
                idxs = np.array(idxs).flatten()
                for i in idxs:
                    bbox = [int(v) for v in bboxes[i]]
                    x, y, w, h = bbox
                    # Use median flow
                    mtracker.add(cv2.TrackerMedianFlow_create(), frame, (x, y, w, h))
                # Increase counter
                count += 1
            else: # declare failure
                cv2.putText(frame, 'Detection failed', (20, 80),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.75, (0,0,255), 2)
        else: # perform tracking
            is_tracking, bboxes = mtracker.update(frame)
            if is_tracking:
                for i, bbox in enumerate(bboxes):
                    x, y, w, h = [int(val) for val in bbox]
                    class_id = classes[class_ids[idxs[i]]]
                    col = [int(c) for c in colors[class_ids[idxs[i]], :]]
                    # Mark tracking frame with corresponding color, write class name on top
                    cv2.rectangle(frame, (x, y), (x+w, y+h), col, 2)
                    cv2.putText(frame, class_id, (x, y - 15),
                                cv2.FONT_HERSHEY_SIMPLEX, 0.5, col, 2)
                # Increase counter
                count += 1
            else:
                # If tracking fails, reset count to trigger detection
                count = 0
        # Display the resulting frame and add it to the output video
        cv2.imshow('stream', frame)
        out.write(frame)
        # Press ESC to exit
        if cv2.waitKey(25) & 0xFF == 27:
            break
    else:
        # Break if capture read does not work
        print('Exhausted video capture.')
        break
# Release capture and writer, close the window
vid.release()
out.release()
cv2.destroyAllWindows()
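The interplay between the frame counter, detection and tracking can be easier to follow in isolation. The sketch below simulates only the counter logic, with no video or model involved; `simulate`, `detection_ok` and `tracking_ok` are hypothetical names, and the two callbacks simply state whether detection or tracking succeeds on a given frame.

```python
def simulate(n_frames, detection_ok, tracking_ok, interval=60):
    """Return one letter per frame: 'D' for detection, 'T' for tracking."""
    count, modes = 0, []
    for frame_idx in range(n_frames):
        if count % interval == 0:
            modes.append('D')
            if detection_ok(frame_idx):
                count += 1
            # On failure the count is left unchanged, so detection repeats next frame
        else:
            modes.append('T')
            if tracking_ok(frame_idx):
                count += 1
            else:
                count = 0  # reset to trigger detection on the next frame
    return ''.join(modes)

always = lambda frame_idx: True
# A short interval of 3 frames keeps the output readable
regular = simulate(5, always, always, interval=3)
missed = simulate(4, lambda i: i != 0, always, interval=3)
lost = simulate(4, always, lambda i: i != 1, interval=3)
```

Here `regular` alternates one detection with two tracked frames, `missed` repeats detection after a failure on the first frame, and `lost` falls back to detection right after tracking fails on the second frame.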
Here is my output video
In this tutorial we built an OpenCV-based framework with which to identify and track objects on video. We have seen how most highlighted objects – however few in number – were accurately identified and tracked over successive frames. Here are some suggestions to improve on this prototype:
- Tweak the objectness score, class probability and NMS thresholds. There is nothing special about the thresholds set at the beginning of this exercise. Experiment with these and work out your precision-recall ‘sweet spot’ for the first two – raising either threshold will lead to lower recall and higher precision, and vice versa. The detection interval, set here to every 60 frames, is just as easily adjustable.
- Do not take Median Flow for a walk. Object tracking with Median Flow worked well because there was little movement in my living room; this framework would almost certainly fail in a livelier scene. In that case, alternative methods that cope better with occlusion, such as KCF and CSRT, will render tracking more stable at the expense of more computation.
- Go find the state-of-the-art. Believe it or not, the techniques presented here aged quite rapidly. This is particularly obvious for YOLOv3, as object detection has been advancing fast in recent years; in fact, the performance of convolutional methods is now rivalled by vision transformers, which are inherently capable of multimodal self-supervised learning.
- Try out processing from a live stream. With minor changes to the code above you can perform live detection and tracking, for example using a webcam. Live processing might motivate you to create a lighter framework to boost the FPS rate.
- Keep a modest video resolution. Keeping a larger input video resolution might be expected to improve predictions but, alas, it does not. Frames are invariably resized to 416 x 416 before inference, thus preserving a higher resolution will at best render a higher resolution output video. Differences in the contribution of bicubic (FFmpeg) and bilinear (OpenCV) interpolations to downscaling should be unimportant.
- Up your game with transfer learning. For a more advanced object detection tutorial, transfer learning with YOLO might be the perfect fit. This requires a convolutional backbone, a list of custom class names, a modified network configuration template and ground-truth boxes and labels. With a single line of code you can then train the model to solve your own detection problem. If you are interested, check out the two Colab notebooks I wrote outlining the fine-tuning of YOLOv3 and YOLOv4 to identify bare and mask-wearing faces.
I hope you had fun implementing object detection and tracking to explore our surroundings. Please leave your comments below, I always appreciate some feedback. Until next time!