Level up your TinyML skills

2D Convolution benchmarking

Real world numbers on 2D convolution optimization strategies

2D Convolution benchmarking

2D Convolution is a common operation in today neural networks used for image and audio classification.

In the context of Tiny ML, we need consider any kind of optimization possible to reduce the computation time it takes to run a Conv2D.

This post runs the same convolution in 4 different flavors and reports the execution time on 2 different chips: the ARM Cortex M4F (Arduino Nano 33 BLE Sense) and the Tensilica Xtensa LX6 (Esp32).

At the end of the post, you'll see how it's possible to gain a speedup of ???x with easy to implement optimizations.

Problem statement

Given a source matrix (eg. an image) and a kernel filter (usally 3x3 or 5x5), the 2D convolution multiplies each pixel and its neighbors by the corresponding values in the filter.

Take a look at the animation below for a better understanding.

Conv2D animation

The most basic implementation of such convolution is the following.


/**
 * Basic implementation of Conv2D
 */
void conv2d_basic() {
    for (uint16_t y = 1; y < HEIGHT - 1; y++) {
        for (uint16_t x = 1; x < WIDTH - 1; x++) {
            float acc = 0;

            for (uint8_t i = 0; i < KERNEL_SIZE; i++) {
                for (uint8_t j = 0; j < KERNEL_SIZE; j++) {
                    acc += source[y + i - 1][x + j - 1] * kernel2D[i][j];
                }
            }

            dest[y][x] = acc;
        }
    }
}
            

This is the gerenal-purpose, reference implementation of Conv2D.

For a source array of 50x50 elements, it takes 3880 us on average to run on an ARM Cortex M4F chip.

Now we begin to inspect simple optimizations that can give quite a boost to the execution time.

Optimization no. 1: Exploding kernel multiplications

If you're working with kernels of size 3 or 5 (as in most of the cases), you can easily get a speed boost by exploding the kernel multiplication, instead of using the for loops.

Here's an example with a 3x3 kernel exploded.


/**
 * Apply Conv2D exploding the kernel multiplications
 */
void conv2d_exploded() {
    for (uint16_t y = 1; y < HEIGHT - 1; y++) {
        for (uint16_t x = 1; x < WIDTH - 1; x++) {
            dest[y][x] =
                source[y - 1][x - 1] * kernel2D[0][0] +
                source[y - 1][x] * kernel2D[0][1] +
                source[y - 1][x + 1] * kernel2D[0][2] +
                source[y][x - 1] * kernel2D[1][0] +
                source[y][x] * kernel2D[1][1] +
                source[y][x + 1] * kernel2D[1][2] +
                source[y + 1][x - 1] * kernel2D[2][0] +
                source[y + 1][x] * kernel2D[2][1] +
                source[y + 1][x + 1] * kernel2D[2][2];
        }
    }
}
            

How faster is this code?

On the same chip, it takes 1880 us instead of 3880 us.

That's a 2x speedup!

Optimization n.2: Separable kernels

Separable kernels are 2D filters that can be decomposed into 2 1D kernels.

Some well known kernels (eg. the Sobel operator) are separable.

Why is this important?

Because this reduces the computational cost of the convolution from KERNEL_SIZE ^ 2 to KERNEL_SIZE * 2. With a 3x3 kernel, you have to perform 6 multiplications per pixel instead of 9: 33% less computations. With a 5x5 kernel, the reduction is even greater!


void conv1d_basic() {
    // horizontal convolution
    for (uint16_t y = 1; y < HEIGHT - 1; y++) {
        for (uint16_t x = 1; x < WIDTH - 1; x++) {
            for (uint8_t i = 0; i < KERNEL_SIZE; i++) {
                dest[y][x] = source[y][x + i - 1] * kernel_x_1D[i];
            }
        }
    }

    // vertical convolution
    for (uint16_t y = 1; y < HEIGHT - 1; y++) {
        for (uint16_t x = 1; x < WIDTH - 1; x++) {
            for (uint8_t i = 0; i < KERNEL_SIZE; i++) {
                dest[y][x] += source[y + i - 1][x] * kernel_y_1D[i];
            }
        }
    }
}
            

This is about 30% faster than the original implementation, yet slower than the exploded kernel optimization.

What if we merge the two optimizations?

Optimization n. 3: Separable, exploded kernels

This is simply the merge of Optimization n.1 and Optimization n.2: we expect to get the greatest speedup!


void conv1d_exploded() {
    // horizontal concvolution
    for (uint16_t y = 1; y < HEIGHT - 1; y++) {
        for (uint16_t x = 1; x < WIDTH - 1; x++) {
            dest[y][x] =
                source[y][x - 1] * kernel_x_1D[0] +
                source[y][x] * kernel_x_1D[1] +
                source[y][x + 1] * kernel_x_1D[2];
        }
    }

    // vertical convolution
    for (uint16_t y = 1; y < HEIGHT - 1; y++) {
        for (uint16_t x = 1; x < WIDTH - 1; x++) {
            dest[y][x] +=
                source[y - 1][x] * kernel_y_1D[0] +
                source[y][x] * kernel_y_1D[1] +
                source[y + 1][x] * kernel_y_1D[2];
        }
    }
}
            

The result?

1530 us on the M4F CPU vs 3880 us of the basic implementation.

That's a 2.5x speedup!

And that was with very little modifications to the code.

How to convert a 2D kernel filter into 1D filters?

Unless the 2D filter is exactly separable, you need to approximate the separation.

I'll soon release a Python library that does this for you, so subscribe to the newsletter to stay updated.

Conclusions

When implementing math-heavy operations, going the naive way is doomed to fail. Embedded devices require you to think hard about any optimization you can apply to speed up the computation.

Of course, this operations should not be left to you as an end user, but should be take into consideration by the library you use for Machine Learning training and exporting.

Many of them rely on vendor-specific optimized instructions (eg. ARM CMSIS), but some may not.

Always check the quality of the code you're running before deploying!

Or hire an expert who can take care of this for you!

Benchmark results

3x3 kernel

Board CPU Image size Convolution type Execution time Speedup
Arduino Nano Cortex M4F @ 64 MHz 50 x 50 2D kernel basic 3880
2D kernel exploded 1880 2x
1D kernel basic 2720 1.4x
1D kernel exploded 1530 2.5x
Esp32 Tensilica Xtensa LX6 @ 240 MHz 50 x 50 2D kernel basic 3030
2D kernel exploded 760 4x
1D kernel basic 1740 1.7x
1D kernel exploded 600 5x
Arduino Portenta H7 STM32H747 @ 480 Mhz 50 x 50 2D kernel basic 1080
2D kernel exploded 470 2.3x
1D kernel basic 660 1.6x
1D kernel exploded 370 2.9x

5x5 kernel

Board CPU Image size Convolution type Execution time Speedup
Arduino Nano Cortex M4F @ 64 MHz 50 x 50 2D kernel basic 22310
2D kernel exploded 9100 2.4x
1D kernel basic 8250 2.7x
1D kernel exploded 4210 5.3x
Esp32 Tensilica Xtensa LX6 @ 240 MHz 50 x 50 2D kernel basic 6600
2D kernel exploded 2180 3x
1D kernel basic 2030 3.2x
1D kernel exploded 850 7.8x
Arduino Portenta H7 STM32H747 @ 480 Mhz 50 x 50 2D kernel basic 2480
2D kernel exploded 1140 2.2x
1D kernel basic 900 2.7x
1D kernel exploded 520 4.8x

Get monthly updates

Do not miss the next posts on TinyML and Esp32 camera. No spam, I promise

We use Mailchimp as our marketing platform. By submitting this form, you acknowledge that the information you provided will be transferred to Mailchimp for processing in accordance with their terms of use. We will use your email to send you updates relevant to this website.

Having troubles? Ask a question

Need a one to one call? Book now!

© Copyright 2022 Eloquent Arduino. All Rights Reserved.