ESP32 cam Object Detection


Object detection is the task of detecting an object of interest inside an image. Until a couple of years ago, this task was the exclusive domain of full-blown computers due to the complexity of the models and the prohibitive number of math operations required.

Thanks to platforms like Edge Impulse, however, the entry barrier for beginners has become much lower and it is now possible to:

  1. easily train an object detection model in the cloud
  2. (not so easily) deploy this model to the Esp32 camera

Sadly, the ESP32 camera doesn't seem to be a first-class citizen on the Edge Impulse platform, and the (little) documentation and (few) tutorials available online miss a lot of fundamental pieces.

The purpose of this post is to help you develop and deploy your very own object detection model to your ESP32 camera with detailed, easy-to-follow steps, even if you're a beginner with Edge Impulse or Arduino programming.

Let's start!

Hardware requirements

This project works with S3 and non-S3 boards. An S3 board is highly recommended though.

Software requirements

EloquentEsp32Cam >= 2.2

You will need a free account on Edge Impulse.

If you've never used Edge Impulse, I suggest you watch a couple of video tutorials online before we start, otherwise you may get lost later on (there is an official tutorial playlist on YouTube).


Let's start with the end result.

Object Detection using ESP32(S3) Camera Quickstart

To perform object detection on our ESP32 camera board, we will follow these steps:

  1. Collect images from the camera
  2. Use Edge Impulse to label the images
  3. Use Edge Impulse to train the model
  4. Use Edge Impulse to export the model into an Arduino library
  5. Use EloquentEsp32Cam to run the model

Before this post existed, steps 1. and 5. were not as easy as they should've been.

I invite you to stop reading this post for a minute and search Google for "ESP32-cam object detection": read the first 4-5 tutorials in the results and tell me if you're able to deploy such a project.

I was not.

And I bet you won't be either.

Enough words, it's time to start tinkering.

CAUTION: there are 8 steps to follow from start to finish in this tutorial. Some of them only take 1 minute to execute, so don't get scared. But remember that it is crucial for you to follow them one by one, in the exact order they're written!

Collect images from the ESP32 camera

This is the first part where things get weird in the other tutorials. Many of them suggest that you use images from Google/the internet to train your model. Others suggest collecting data with your smartphone. How random images from the web or from a 40 MP camera are supposed to help train a good model that will run on $2 camera hardware is beyond me.

That said, I believe that to get good results, we need to collect images directly from our own Esp32 camera board.

This used to be hard.

But now, thanks to the tools I created for you, you will complete this task in a matter of minutes.

[1/8] Upload the "Collect_Images_for_EdgeImpulse" Sketch to your Board

After you installed the EloquentEsp32Cam library, navigate to File > Examples > EloquentEsp32Cam > Collect_Images_for_EdgeImpulse and upload the sketch to your board.

collect-images-for-fomo-example.png
In case you want to know what's inside, here's the code.


Filename: Collect_Images_for_EdgeImpulse.ino

/**
 * Collect images for Edge Impulse image
 * classification / object detection
 *
 * BE SURE TO SET "TOOLS > CORE DEBUG LEVEL = INFO"
 * to turn on debug messages
 */

// if you define WIFI_SSID and WIFI_PASS before importing the library, 
// you can call connect() instead of connect(ssid, pass)
//
// If you set HOSTNAME and your router supports mDNS, you can access
// the camera at http://{HOSTNAME}.local

#define WIFI_SSID "SSID"
#define WIFI_PASS "PASSWORD"
#define HOSTNAME "esp32cam"


#include <eloquent_esp32cam.h>
#include <eloquent_esp32cam/extra/esp32/wifi/sta.h>
#include <eloquent_esp32cam/viz/image_collection.h>

using eloq::camera;
using eloq::wifi;
using eloq::viz::collectionServer;


void setup() {
    delay(3000);
    Serial.begin(115200);
    Serial.println("___IMAGE COLLECTION SERVER___");

    // camera settings
    // replace with your own model!
    camera.pinout.wroom_s3();
    camera.brownout.disable();
    // Edge Impulse models work on square images
    // face resolution is 240x240
    camera.resolution.face();
    camera.quality.high();

    // init camera
    while (!camera.begin().isOk())
        Serial.println(camera.exception.toString());

    // connect to WiFi
    while (!wifi.connect().isOk())
      Serial.println(wifi.exception.toString());

    // init image collection http server
    while (!collectionServer.begin().isOk())
        Serial.println(collectionServer.exception.toString());

    Serial.println("Camera OK");
    Serial.println("WiFi OK");
    Serial.println("Image Collection Server OK");
    Serial.println(collectionServer.address());
}


void loop() {
    // server runs in a separate thread, no need to do anything here
}

Don't forget to replace SSID and PASSWORD with your WiFi credentials!

Don't forget to set the pinout for your own camera model if it differs from the one in the sketch (which uses the ESP32-S3 WROOM).
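For example, if you have the classic AI Thinker ESP32-CAM, you only need to swap one line; both variants appear in this post's sketches:

// ESP32-S3 WROOM (as configured in the sketch above)
camera.pinout.wroom_s3();

// ...or, for the classic AI Thinker ESP32-CAM
// camera.pinout.aithinker();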

Once the upload is done, open the Serial Monitor and take note of the IP address of the camera.

Open a browser and navigate either to http://esp32cam.local or the IP address of the camera.

[2/8] Prepare your environment

The OV2640 sensor that comes with your ESP32 camera is pretty good for its price, really. Nevertheless, it's still a ~$1, 2 MP sensor, so you can't expect great results in every scenario.

Do yourself a favor and put some effort into creating a proper shooting environment:

  1. use proper illumination (either bright sun or artificial light)
  2. de-clutter your environment (use a desk and a flat, monochrome background - e.g. a wall)

[3/8] Collect images in a web browser

Go back to the webpage at http://esp32cam.local (or the camera's IP address).

To collect the images, you simply need to click Start collecting. To stop, click Stop.

Please follow these EXACT steps to collect a good quality dataset (so you won't run into trouble later on):

  1. put nothing in front of the camera. Collect 15-20 images. Pause, download and clear
  2. put the first object in front of the camera. Collect 20-30 images while moving the camera a bit around. Try to capture different angles and positions of the object. Pause, download and clear
  3. repeat 2. for each object you have

After you finish, you will have one zip of images for each object class + one for "no object / background".

Extract the zips and move to the next step.

collect-images.png

[4/8] Use Edge Impulse to Label the Images

For object detection to work, we need to label the objects we want to recognize.

There are a few labelling tools online, but Edge Impulse has one built in that works well enough for our project. Register a free account on edgeimpulse.com if you don't have one already.

Create a new project, name it something like esp32-cam-object-detection, then choose Images > Classify multiple objects > Import existing data.

Now follow these EXACT steps to speed up the labelling:

  1. click Select files and select all the images in the "no object / background" folder; check Automatically split between training and testing; then hit Begin upload
  2. close the modal and click Labelling queue in the bar on top
  3. always click Save labels without doing anything! (since there's no object in these images, we'll use them for background modelling)
  4. once done for all the images, go back to Upload data in the bar on top
  5. click Select files and select all the images in the first object folder. Only upload the images of a single folder at a time!!
  6. Check Automatically split between training and testing; then hit Begin upload
  7. go to Labelling queue in the top bar and draw the box around the object you want to recognize. On the right, make sure Label suggestions: Track objects between frames is selected
  8. label all the images. Make sure to fix the bounding box to fit the object while leaving a few pixels of padding
  9. repeat steps 4-8 for each remaining object

If you upload all the images at once, the labelling queue will mix the different objects and you will waste a lot more time drawing the bounding boxes.

Be smart!

At this point, you will have all the data you need to train the model.

penguin-bbox.png

[5/8] Use Edge Impulse to Train the Model

If you've ever used Edge Impulse, you know this part is going to be pretty easy.

  1. Navigate to Impulse design on the left menu
  2. enter 48 as both image width and image height
  3. select Squash as resize mode
  4. Add the Image processing block
  5. Add the Object detection learning block
  6. save the impulse

48 is a value that I found works pretty well with non-S3 cameras: it generates a model small enough to fit in memory, yet produces usable results. If you have an S3 board, you can raise this value to 60 or even 96.

create-impulse.png

  1. navigate to Impulse design > Image on the left
  2. select Grayscale as color depth
  3. click on Save parameters
  4. click on Generate features

It should take less than a minute to complete, depending on the number of images.

If you only have a single class of object (as I do in the image), the plot on the right will have little meaning, just ignore it.

generate-features.png

What about RGB images?

Edge Impulse also works with RGB images. The problem is that RGB models are slower than grayscale ones, so I suggest you start with the grayscale version, test it on your board, and "upgrade" to the RGB one only if you find it isn't performing well.


  1. navigate to Impulse design > Object detection
  2. set the number of training cycles to 35
  3. set learning rate to 0.005
  4. click on Choose a different model right below the FOMO block
  5. select FOMO (Faster Objects, More Objects) MobileNetV2 0.1

This is the smallest model we can train and the only one that fits comfortably on non-S3 camera boards.

fomo-train.png

Hit Start training and wait until it completes. It can take 4-5 minutes depending on the number of images.

confusion-matrix.png

To get accurate estimates of inferencing time and memory usage as the above image shows, be sure to select "Espressif ESP-EYE" as target board in the top-right corner of the Edge Impulse page.

If you're satisfied with the results, move to the next step. If you're not, you have to:

  1. collect more / better images. Good input data and accurate labelling are critical in Machine Learning
  2. increase the number of training cycles (don't go over 50, it's almost useless)
  3. decrease the learning rate to 0.001. It may help, or not

Now that all looks good, it's time to export the model.

[6/8] Export the Model to an Arduino Library

This is the shortest step.

  1. Navigate to Deployment on the left menu
  2. select Arduino Library
  3. scroll down and hit Build.

A zip containing the model library will download.

fomo-export.png

[7/8] Sketch deployment

Before I worked on this myself, as a beginner with Edge Impulse, I couldn't find a single good tutorial on the entire web on how to deploy an object detection model to the ESP32 camera!

That's why I'm keeping this tutorial free instead of making it a premium chapter of my paid eBook.

  1. open the Arduino IDE and create a new sketch
  2. select ESP32 Dev Module or ESP32S3 Dev Module (depending on your board) as the board model
  3. enable external PSRAM (for S3 boards, select OPI PSRAM)
  4. navigate to Sketch > Include library > Add .zip library and select the zip downloaded from Edge Impulse.
  5. copy the sketch below inside your project

Don't forget to change the name of the Edge Impulse library include if your project's name is different!
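For example, if you named your Edge Impulse project esp32-cam-object-detection as suggested earlier, the include should look roughly like this (the exact header name may differ, so double-check the .h file inside the zip):

#include <esp32-cam-object-detection_inferencing.h>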


Filename: EdgeImpulse_FOMO_NO_PSRAM.ino

/**
 * Run Edge Impulse FOMO model.
 * It works on both PSRAM and non-PSRAM boards.
 * 
 * The difference from the PSRAM version
 * is that this sketch only runs on 96x96 frames,
 * while PSRAM version runs on higher resolutions too.
 * 
 * The PSRAM version can be found in my
 * "ESP32S3 Camera Mastery" course
 * at https://dub.sh/ufsDj93
 *
 * BE SURE TO SET "TOOLS > CORE DEBUG LEVEL = INFO"
 * to turn on debug messages
 */
#include <your-fomo-project_inferencing.h>
#include <eloquent_esp32cam.h>
#include <eloquent_esp32cam/edgeimpulse/fomo.h>

using eloq::camera;
using eloq::ei::fomo;


/**
 * 
 */
void setup() {
    delay(3000);
    Serial.begin(115200);
    Serial.println("__EDGE IMPULSE FOMO (NO-PSRAM)__");

    // camera settings
    // replace with your own model!
    camera.pinout.aithinker();
    camera.brownout.disable();
    // NON-PSRAM FOMO only works on 96x96 (yolo) RGB565 images
    camera.resolution.yolo();
    camera.pixformat.rgb565();

    // init camera
    while (!camera.begin().isOk())
        Serial.println(camera.exception.toString());

    Serial.println("Camera OK");
    Serial.println("Put object in front of camera");
}


void loop() {
    // capture picture
    if (!camera.capture().isOk()) {
        Serial.println(camera.exception.toString());
        return;
    }

    // run FOMO
    if (!fomo.run().isOk()) {
      Serial.println(fomo.exception.toString());
      return;
    }

    // how many objects were found?
    Serial.printf(
      "Found %d object(s) in %dms\n", 
      fomo.count(),
      fomo.benchmark.millis()
    );

    // if no object is detected, return
    if (!fomo.foundAnyObject())
      return;

    // if you expect to find a single object, use fomo.first
    Serial.printf(
      "Found %s at (x = %d, y = %d) (size %d x %d). "
      "Proba is %.2f\n",
      fomo.first.label,
      fomo.first.x,
      fomo.first.y,
      fomo.first.width,
      fomo.first.height,
      fomo.first.proba
    );

    // if you expect to find many objects, use fomo.forEach
    if (fomo.count() > 1) {
      fomo.forEach([](int i, bbox_t bbox) {
        Serial.printf(
          "#%d) Found %s at (x = %d, y = %d) (size %d x %d). "
          "Proba is %.2f\n",
          i + 1,
          bbox.label,
          bbox.x,
          bbox.y,
          bbox.width,
          bbox.height,
          bbox.proba
        );
      });
    }
}

[8/8] Run

This is going to be the most rewarding step of all this tutorial.

Save the sketch, hit upload, open the Serial Monitor and watch the predictions text magically appearing while you put your objects in front of the camera.
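With a single object in front of the camera, the output will look roughly like this (the labels, timings and coordinates below are purely illustrative; the format comes from the printf calls in the sketch above):

Found 0 object(s) in 158ms
Found 1 object(s) in 161ms
Found penguin at (x = 8, y = 16) (size 16 x 16). Proba is 0.93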

Object Detection at higher resolutions

So far, we've been collecting images at YOLO resolution (96x96) to save RAM and perform little-to-no rescaling on the source image. This allowed us to fit the model on the little space available on non-S3 boards.

But what if you own an S3 board with plenty of RAM?

Why not leverage all that extra memory?

In fact, you can capture images at whatever resolution you prefer (within the RAM limits, of course) and still have room to run the FOMO model.

The following sketch is very similar to the previous one, but with one huge difference: instead of using YOLO resolution with RGB565 encoding, it uses JPEG encoding at VGA resolution (640 x 480).

This source code is only available to paying users.

What else you will get:

  • Advanced motion detection
  • Face detection
  • Advanced Edge Impulse FOMO
  • Edge Impulse FOMO Pan-Tilt
  • Self-driving car

You may ask what difference it makes.

Let's say you're building a wildlife camera to photograph bears in the forest. You may want to store the frames on an SD card whenever such an animal is recognized.

With the non-PSRAM version, you can't do that in any useful way, since the image is captured at 96x96 resolution and would be too small to be of any use. With this version, instead, you have access to the full-resolution image.

Event-Driven object detection

In the Quickstart sketch, we saw how easy and straightforward it is to run object detection: it only requires a few lines of code. Nevertheless, the loop() function gets pretty lengthy because it has to continuously check whether one or more objects are present in the frame.

In this section, I'm going to show you how to move all the logic into the setup() function instead. We will exploit a style of programming called event-driven (or reactive). Event-driven programming consists of registering a listener function that runs when an event of interest happens. In our case, the event of interest is one or more objects being detected.

Why is this useful?

First, because it allows for leaner loop() code, where you can focus on the other tasks that need to run in parallel with object detection. Second, event listeners help isolate specific functionality (object detection handling) into its own routine, visually de-cluttering the rest of your code.
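If the event-driven style is new to you, here is a tiny, generic illustration of the pattern in plain C++ (this is not the library's actual API, just the concept): you register a listener once, and some other piece of code invokes it whenever the event fires.

#include <functional>

// a toy event source (illustrative only, not EloquentEsp32Cam code)
std::function<void(const char*)> onObjectDetected;

// register a listener (you would call this once, e.g. in setup())
void whenYouSeeSomething(std::function<void(const char*)> listener) {
    onObjectDetected = listener;
}

// called by the background task whenever an object is found
void notifyObjectDetected(const char *label) {
    if (onObjectDetected)
        onObjectDetected(label);
}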

Here's the updated sketch.

This source code is only available to paying users.

What else you will get:

  • Advanced motion detection
  • Face detection
  • Advanced Edge Impulse FOMO
  • Edge Impulse FOMO Pan-Tilt
  • Self-driving car

The configuration part is exactly the same as before. The new entry is the daemon object, which does 2 things:

  1. accepts event listeners to be run on events of interest
  2. runs the object detection code in the background

There are 3 types of listeners you can register:

  1. whenYouDontSeeAnything: this runs when no object is detected
  2. whenYouSeeAny: this runs whenever an object (any object) is detected
  3. whenYouSee(label): this runs only when an object with the given label is detected

Object detection to MQTT

In your specific project, detecting an object may only be the first step in a larger system.

Maybe you want to log how many objects were detected in a database, or get a notification on your phone, or an email... whatever. There are many ways to accomplish this goal. One of the most popular in the maker community is to use the MQTT protocol as a means of communication between systems.

The EloquentEsp32Cam library has first-party integration with MQTT.

In the following sketch, you will have to replace the test.mosquitto.org broker with your own and (if required) add proper authentication. Besides that, the sketch will work out of the box.
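For reference, this is where the broker and credentials go in a generic PubSubClient setup (a minimal sketch, not taken from the paid code; the broker address, client id and credentials are placeholders):

#include <WiFi.h>
#include <PubSubClient.h>

WiFiClient wifiClient;
PubSubClient mqtt(wifiClient);

void connectToBroker() {
    // replace with your own broker address and port
    mqtt.setServer("test.mosquitto.org", 1883);

    while (!mqtt.connected()) {
        // anonymous connection...
        if (mqtt.connect("esp32cam-client"))
            break;
        // ...or, if your broker requires authentication:
        // mqtt.connect("esp32cam-client", "username", "password");
        delay(1000);
    }
}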

Software requirements

EloquentEsp32Cam >= 2.2
PubSubClient >= 2.8 

This source code is only available to paying users.

What else you will get:

  • Advanced motion detection
  • Face detection
  • Advanced Edge Impulse FOMO
  • Edge Impulse FOMO Pan-Tilt
  • Self-driving car

What is the payload that will be published? It is a JSON description of all the objects found in the frame.

[{"label": "penguin", "x": 8, "y": 16, "w": 32, "h": 32, "proba": 0.8}, ...]
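If you consume this payload on another microcontroller, a minimal parser based on ArduinoJson might look like the following (purely illustrative; ArduinoJson is not a requirement of this tutorial, and the field names match the payload above):

#include <ArduinoJson.h>

// parse the JSON payload published by the camera (illustrative sketch)
void handleDetections(const char *payload) {
    StaticJsonDocument<512> doc;

    if (deserializeJson(doc, payload) != DeserializationError::Ok)
        return;

    for (JsonObject obj : doc.as<JsonArray>()) {
        Serial.printf(
            "%s at (%d, %d), size %d x %d, proba %.2f\n",
            obj["label"].as<const char*>(),
            obj["x"].as<int>(),
            obj["y"].as<int>(),
            obj["w"].as<int>(),
            obj["h"].as<int>(),
            obj["proba"].as<float>()
        );
    }
}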

Object detection streaming

So far, we've only been able to debug object detection in the Serial Monitor. It would be much better if we could debug it visually, watching the real-time video stream from the camera at the same time.

As far as I know (as of December 2023), this feature does not exist anywhere else on the web. It took a lot of work to make this happen, but it was worth it, since the result is stunning.

When the object that you trained your FOMO model on is recognized, you will see a dot on its centroid.

The streaming may be laggy if your WiFi is poor. Keep in mind that this sketch is mainly for debugging purposes, not for real-time execution.


Here's the code.

This source code is only available to paying users.

What else you will get:

  • Advanced motion detection
  • Face detection
  • Advanced Edge Impulse FOMO
  • Edge Impulse FOMO Pan-Tilt
  • Self-driving car
