Integrating OpenCV Object Detection with Twilio Programmable Video

January 20, 2021
Written by
Muhammad Nasir
Contributor
Opinions expressed by Twilio contributors are their own
Reviewed by
Paul Kamp
Twilion
Phil Nash
Twilion

OpenCV Meanshift

This article is for reference only. We're not onboarding new customers to Programmable Video. Existing customers can continue to use the product until December 5, 2024.


We recommend migrating your application to the API provided by our preferred video partner, Zoom. We've prepared this migration guide to assist you in minimizing any service disruption.

Video conferencing doesn’t have to be as basic as just conveying packets of data between users. Using machine learning, we can interpret what those packets of data represent in the real world, and manipulate them in a way to create a more human-centered experience.

Today we’ll learn how to use OpenCV to do some simple object-detection with Twilio’s Programmable Video. This will allow you to add object detection to your video streams and open the pathway to many more image processing techniques using OpenCV!

Let’s get started.

Twilio Programmable Video app with OpenCV object detection

Prerequisites

Before we can build our OpenCV integration, you’ll first need a few things: a Twilio account (a free trial account works), Node.js and npm installed on your machine, and a webcam for testing.

Integrate OpenCV and Twilio

First off, let’s clone Twilio’s Quickstart Video application. Open up a console and run:

git clone https://github.com/twilio/video-quickstart-js.git

Great! Now we need to initialize our Twilio application variables.

Let’s start by copying the .env.template into our own .env file.

cp .env.template .env

Now we need to initialize three variables in our .env file: your Twilio Account SID, an API Key SID, and the corresponding API Key Secret, all of which you can find or create in the Twilio Console.
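
For reference, a filled-in .env looks roughly like the sketch below. The variable names are the ones used in the quickstart’s .env.template at the time of writing, so double-check them against your copy; the values come from the Twilio Console.

TWILIO_ACCOUNT_SID=ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
TWILIO_API_KEY_SID=SKXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
TWILIO_API_KEY_SECRET=your_api_key_secret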

Now let’s install our dependencies.

npm install

We should be all set now to run our base application. Let’s start the app!

npm start

Initializing and Installing OpenCV

Now that we have our quickstart app working, we need to install OpenCV.

To do this, you will first need to find the latest release from here. Then download the file from https://docs.opencv.org/{VERSION_NUMBER}/opencv.js, substituting the latest release version for {VERSION_NUMBER}.

For example, at the time of this writing the latest release is 4.5.1, so I will download https://docs.opencv.org/4.5.1/opencv.js and save it in a file called opencv.js. Copy this file to the /quickstart/public directory.
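
If you prefer the command line, something like the following should work when run from the root of the cloned repository (adjust the version number to match the release you found):

curl https://docs.opencv.org/4.5.1/opencv.js -o quickstart/public/opencv.js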

We’re going to base our tutorial on OpenCV’s Meanshift walk-through, found here.

The next step is to load this script in our webpage. Open up quickstart/public/index.html and add this line before the closing body tag of the page:

<script src="opencv.js" type="text/javascript"></script>

And just like that, we have OpenCV installed in our application.

How OpenCV Works

Before we get into the code, it’s important to understand how OpenCV works. OpenCV provides us with functions to read from an image, manipulate that image somehow, and then draw it back. In most cases you will be binding a <video /> element with the library, and reading however many frames you want per second and drawing them back on a canvas object.
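
As a rough, minimal sketch of that loop (assuming opencv.js is already loaded, and that the page contains a <video id="videoInput"> with explicit width and height attributes plus a <canvas id="canvasOutput">, the same ids we’ll use later in this tutorial):

// Bind OpenCV to the <video> element, then repeatedly read, process, and draw frames.
const video = document.getElementById('videoInput');
const cap = new cv.VideoCapture(video);
const frame = new cv.Mat(video.height, video.width, cv.CV_8UC4);
const FPS = 30;

function drawLoop() {
  const begin = Date.now();
  cap.read(frame);                   // grab the current frame from the video element
  // ...manipulate `frame` here: blur it, detect objects, draw rectangles, and so on...
  cv.imshow('canvasOutput', frame);  // paint the (possibly modified) frame onto the canvas
  // schedule the next frame so we process roughly FPS frames per second
  setTimeout(drawLoop, Math.max(0, 1000 / FPS - (Date.now() - begin)));
}

setTimeout(drawLoop, 0);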

So, what you might do is read a frame from a video such as the one below, then run some face detection using Haar feature-based cascade classifiers, and then redraw the same frame with boxes highlighting the woman’s facial features. If your video is 30 frames per second, then you need to do this 30 times a second on your canvas.

Processing a video to find a woman’s facial features

(Image from OpenCV documentation)

In this tutorial we won’t be doing facial recognition; instead, we’ll demonstrate the concept with simpler object detection. The tutorial will still show you what you need to expand the implementation on your own.

Coding Object Detection

We’re finally ready to code our meanshift object-detection filter. First, plop this function into your quickstart/src/joinroom.js file. The algorithm comes from OpenCV’s Meanshift tutorial, found here.

Here it is in full for you to paste:

function initOpenCV() {
  let video = document.getElementById('videoInput');
  let cap = new cv.VideoCapture(video);

  // take first frame of the video
  let frame = new cv.Mat(video.height, video.width, cv.CV_8UC4);
  cap.read(frame);

  // hardcode the initial location of window
  let trackWindow = new cv.Rect(150, 60, 63, 125);

  // set up the ROI for tracking
  let roi = frame.roi(trackWindow);
  let hsvRoi = new cv.Mat();
  cv.cvtColor(roi, hsvRoi, cv.COLOR_RGBA2RGB);
  cv.cvtColor(hsvRoi, hsvRoi, cv.COLOR_RGB2HSV);
  let mask = new cv.Mat();
  let lowScalar = new cv.Scalar(30, 30, 0);
  let highScalar = new cv.Scalar(180, 180, 180);
  let low = new cv.Mat(hsvRoi.rows, hsvRoi.cols, hsvRoi.type(), lowScalar);
  let high = new cv.Mat(hsvRoi.rows, hsvRoi.cols, hsvRoi.type(), highScalar);
  cv.inRange(hsvRoi, low, high, mask);
  let roiHist = new cv.Mat();
  let hsvRoiVec = new cv.MatVector();
  hsvRoiVec.push_back(hsvRoi);
  cv.calcHist(hsvRoiVec, [0], mask, roiHist, [180], [0, 180]);
  cv.normalize(roiHist, roiHist, 0, 255, cv.NORM_MINMAX);

  // delete useless mats.
  roi.delete(); hsvRoi.delete(); mask.delete(); low.delete(); high.delete(); hsvRoiVec.delete();

  // Set up the termination criteria: either 10 iterations or the window moves by at least 1 pt
  let termCrit = new cv.TermCriteria(cv.TERM_CRITERIA_EPS | cv.TERM_CRITERIA_COUNT, 10, 1);

  let hsv = new cv.Mat(video.height, video.width, cv.CV_8UC3);
  let dst = new cv.Mat();
  let hsvVec = new cv.MatVector();
  hsvVec.push_back(hsv);

  const FPS = 30;
  function processVideo() {
    try {
      // if (!streaming) {
      //   // clean and stop.
      //   frame.delete(); dst.delete(); hsvVec.delete(); roiHist.delete(); hsv.delete();
      //   return;
      // }
      let begin = Date.now();

      // start processing.
      cap.read(frame);
      cv.cvtColor(frame, hsv, cv.COLOR_RGBA2RGB);
      cv.cvtColor(hsv, hsv, cv.COLOR_RGB2HSV);
      cv.calcBackProject(hsvVec, [0], roiHist, dst, [0, 180], 1);

      // Apply meanshift to get the new location. It also returns the number of
      // iterations meanShift took to converge, which we don't need in this demo.
      [, trackWindow] = cv.meanShift(dst, trackWindow, termCrit);

      // Draw it on image
      let [x, y, w, h] = [trackWindow.x, trackWindow.y, trackWindow.width, trackWindow.height];
      cv.rectangle(frame, new cv.Point(x, y), new cv.Point(x+w, y+h), [255, 0, 0, 255], 2);
      cv.imshow('canvasOutput', frame);

      // schedule the next one.
      let delay = 1000/FPS - (Date.now() - begin);
      openCVInterval = setTimeout(processVideo, delay);
    } catch (err) {
      console.log(err);
    }
  };

  // schedule the first one.
  openCVInterval = setTimeout(processVideo, 0); // saved so we can clear it later to short-circuit processing
}

In the above block of code, here’s what’s happening:

  1. Set up our OpenCV instance with our Twilio video stream as the input.
  2. Take the first frame of the video.
  3. Create our region-of-interest histogram, the other scalar matrices, and so on.

Now we enter a loop that runs 30 times every second. Each time we enter the loop:

  1. Take a frame from the video.
  2. Use OpenCV’s meanshift algorithm to calculate the position of the moving object.
  3. Draw a rectangle around said position.
  4. Output the result to the canvas element.

In this function, you can work on the algorithm and tweak it to match your own use case. There are tons of examples on the internet and algorithms that you can mostly just copy and paste right into your code.

Scheduling frame processing

Notice that since OpenCV works frame by frame, we schedule the next frame using setTimeout() when we’re done with the current one. To be able to short-circuit the processing, we save the result of that setTimeout() call in openCVInterval so we can clear it later. Look back at the processVideo code above to see where it is assigned.

Now we need to declare this variable at the top of the quickstart/src/joinroom.js file.

let openCVInterval = null;

At the end of the setActiveParticipant function we will add these lines of code to short-circuit any previous invocation of initOpenCV (by clearing its pending setTimeout) and schedule a new run of initOpenCV to process the new participant’s video.

  if (openCVInterval) {
    clearTimeout(openCVInterval);
  }
  setTimeout(initOpenCV, 5000);

The 5-second timeout is generous, but some delay is required: there’s a gap between when the participantConnected event fires, which lets our application know that a new participant has joined, and when their video is actually rendered on screen.

The idea is that we wait for the video to render on screen before we start processing it; otherwise, OpenCV throws errors because it sees an empty video element. In a real application we might have a button or some other trigger that starts the OpenCV processing, in which case this delay would not be necessary.
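
As a rough sketch of that alternative (it isn’t part of the original quickstart), you could start processing when the participant’s video element reports that its first frame has loaded, rather than waiting a fixed five seconds. The videoInput id below is the one we add to index.html in the next section:

// Assumption: start OpenCV processing once the <video id="videoInput"> element has data,
// instead of relying on a fixed 5-second delay.
const videoEl = document.getElementById('videoInput');
if (videoEl) {
  videoEl.addEventListener('loadeddata', () => initOpenCV(), { once: true });
}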

UI Styling And Final Touches

In your quickstart/public/index.html file, look at this part of the DOM:

<div class="participant main">
  <video autoplay playsinline muted></video>
</div>

We are going to change it to this:

<div class="participant main">
  <video id="videoInput" width="320" height="240" autoplay playsinline muted></video>
  <canvas id="canvasOutput" width="320" height="240"></canvas>
</div>

We did two important things here: we created our canvas element, and we gave it and the video element the same width and height.

Finally, add these styles to the quickstart/public/index.css file.

#canvasOutput {
 position: absolute;
 width: 96.5%;
 height: 720px;
}

#videoInput {
 object-fit: fill;
}

Run and Test OpenCV

Great work – you’re now ready to check that everything is working. Run the app using:

npm start

Now when you join a room you should see a moving red rectangle around an object you put in frame! Here’s a demo:

Integrating OpenCV with Twilio Programmable Video

There you go – now you have some basic object detection in your Programmable Video app! You’ll now be able to use OpenCV to understand more – programmatically – about what a video stream is depicting, track moving objects, recognize facial expressions, etc. You’ll definitely be able to build cool stuff around that concept.

Now that you have OpenCV and Twilio working together, check out our Video blog posts for more ideas on how to develop your app. If you already know what you’re building, our Programmable Video docs have what you need.

We can’t wait to see what you build.

Muhammad Nasir is a Software Developer. He's currently working with Webrtc.ventures. He can be reached at muhammad [at] webrtc.ventures.