2D Convolution benchmarking
Real-world numbers on 2D convolution optimization strategies

2D convolution is a common operation in today's neural networks, used for image and audio classification.
In the context of TinyML, we need to consider every possible optimization to reduce the time it takes to run a Conv2D.
This post runs the same convolution in 4 different flavors and reports the execution time on three different chips: the ARM Cortex M4F (Arduino Nano 33 BLE Sense), the Tensilica Xtensa LX6 (Esp32) and the STM32H747 (Arduino Portenta H7).
At the end of the post, you'll see how it's possible to gain a speedup of up to 7.8x with easy-to-implement optimizations.
Problem statement
Given a source matrix (e.g. an image) and a kernel filter (usually 3x3 or 5x5), the 2D convolution multiplies each pixel and its neighbors by the corresponding values in the filter and sums the results.
Take a look at the animation below for a better understanding.
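In formulas, for a kernel of size K (K odd), each output pixel is the weighted sum of its K x K neighborhood:

$$
dest[y][x] = \sum_{i=0}^{K-1} \sum_{j=0}^{K-1} source\Big[y + i - \tfrac{K-1}{2}\Big]\Big[x + j - \tfrac{K-1}{2}\Big] \cdot kernel2D[i][j]
$$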
The most basic implementation of such a convolution is the following.
/**
 * Basic implementation of Conv2D.
 * Border pixels of dest are left untouched.
 */
void conv2d_basic() {
    for (uint16_t y = 1; y < HEIGHT - 1; y++) {
        for (uint16_t x = 1; x < WIDTH - 1; x++) {
            float acc = 0;
            // accumulate the weighted neighborhood of (x, y)
            // (the -1 offset centers a 3x3 kernel on the current pixel)
            for (uint8_t i = 0; i < KERNEL_SIZE; i++) {
                for (uint8_t j = 0; j < KERNEL_SIZE; j++) {
                    acc += source[y + i - 1][x + j - 1] * kernel2D[i][j];
                }
            }
            dest[y][x] = acc;
        }
    }
}
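For reference, the snippets in this post assume a few global buffers along these lines (the exact declarations aren't shown here, so treat them as illustrative):

// Illustrative globals assumed by all the snippets in this post
const uint16_t HEIGHT = 50;     // source image height
const uint16_t WIDTH = 50;      // source image width
const uint8_t KERNEL_SIZE = 3;  // 3 for the 3x3 benchmarks, 5 for the 5x5 ones
float source[HEIGHT][WIDTH];    // input image
float dest[HEIGHT][WIDTH];      // convolution output
float kernel2D[KERNEL_SIZE][KERNEL_SIZE];  // 2D filter
float kernel_x_1D[KERNEL_SIZE];            // horizontal 1D filter (used later)
float kernel_y_1D[KERNEL_SIZE];            // vertical 1D filter (used later)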
This is the general-purpose, reference implementation of Conv2D.
For a source array of 50x50 elements, it takes 3880 us on average to run on an ARM Cortex M4F chip.
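If you want to reproduce these timings on your own board, here's a minimal sketch of a harness built on Arduino's micros() (illustrative only; the exact harness used for the numbers in this post isn't shown):

// NOTE: illustrative harness, not the exact one used for these numbers.
// Averages the runtime of conv2d_basic() over N_RUNS repetitions.
const uint16_t N_RUNS = 100;

unsigned long benchmark_conv2d_basic() {
    unsigned long start = micros();
    for (uint16_t n = 0; n < N_RUNS; n++)
        conv2d_basic();
    return (micros() - start) / N_RUNS;  // average us per call
}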
Let's now look at a few simple optimizations that can give quite a boost to the execution time.

Optimization no. 1: Exploding kernel multiplications
If you're working with kernels of size 3 or 5 (as in most cases), you can easily get a speed boost by exploding the kernel multiplications instead of using the for loops.
Here's an example with a 3x3 kernel exploded.
/**
 * Apply Conv2D exploding the kernel multiplications
 */
void conv2d_exploded() {
    for (uint16_t y = 1; y < HEIGHT - 1; y++) {
        for (uint16_t x = 1; x < WIDTH - 1; x++) {
            dest[y][x] =
                source[y - 1][x - 1] * kernel2D[0][0] +
                source[y - 1][x]     * kernel2D[0][1] +
                source[y - 1][x + 1] * kernel2D[0][2] +
                source[y][x - 1]     * kernel2D[1][0] +
                source[y][x]         * kernel2D[1][1] +
                source[y][x + 1]     * kernel2D[1][2] +
                source[y + 1][x - 1] * kernel2D[2][0] +
                source[y + 1][x]     * kernel2D[2][1] +
                source[y + 1][x + 1] * kernel2D[2][2];
        }
    }
}
How much faster is this code?
On the same chip, it takes 1880 us instead of 3880 us.
That's a 2x speedup! With the loops gone, there's no loop-counter overhead and every array offset is a compile-time constant that the compiler can optimize aggressively.
Optimization no. 2: Separable kernels
Separable kernels are 2D filters that can be decomposed into two 1D kernels.
Some well-known kernels (e.g. the Sobel operator) are separable.
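For example, the horizontal Sobel kernel is exactly the outer product of a column vector and a row vector:

$$
\begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}
=
\begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix}
\begin{bmatrix} -1 & 0 & 1 \end{bmatrix}
$$

so convolving with the column vector vertically and the row vector horizontally produces the same result as the full 2D convolution.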
Why is this important?
Because it reduces the computational cost of the convolution from KERNEL_SIZE ^ 2 to KERNEL_SIZE * 2 multiplications per pixel.
With a 3x3 kernel, you have to perform 6 multiplications per pixel instead of 9: 33% fewer computations.
With a 5x5 kernel, the reduction is even greater: 10 multiplications instead of 25, or 60% fewer.
/**
 * Apply Conv2D as two 1D convolutions (separable kernel)
 */
float temp[HEIGHT][WIDTH];  // intermediate buffer for the horizontal pass

void conv1d_basic() {
    // horizontal convolution: source -> temp
    // (runs on every row, since only x needs a 1-pixel margin)
    for (uint16_t y = 0; y < HEIGHT; y++) {
        for (uint16_t x = 1; x < WIDTH - 1; x++) {
            float acc = 0;
            for (uint8_t i = 0; i < KERNEL_SIZE; i++) {
                acc += source[y][x + i - 1] * kernel_x_1D[i];
            }
            temp[y][x] = acc;
        }
    }
    // vertical convolution: temp -> dest
    for (uint16_t y = 1; y < HEIGHT - 1; y++) {
        for (uint16_t x = 1; x < WIDTH - 1; x++) {
            float acc = 0;
            for (uint8_t i = 0; i < KERNEL_SIZE; i++) {
                acc += temp[y + i - 1][x] * kernel_y_1D[i];
            }
            dest[y][x] = acc;
        }
    }
}
This is about 30% faster than the original implementation, yet slower than the exploded kernel optimization.
What if we merge the two optimizations?
Optimization no. 3: Separable, exploded kernels
This is simply the combination of Optimization no. 1 and Optimization no. 2: we expect to get the greatest speedup!
/**
 * Apply the two 1D convolutions exploding the kernel multiplications
 */
void conv1d_exploded() {
    // horizontal convolution: source -> temp
    for (uint16_t y = 0; y < HEIGHT; y++) {
        for (uint16_t x = 1; x < WIDTH - 1; x++) {
            temp[y][x] =
                source[y][x - 1] * kernel_x_1D[0] +
                source[y][x]     * kernel_x_1D[1] +
                source[y][x + 1] * kernel_x_1D[2];
        }
    }
    // vertical convolution: temp -> dest
    for (uint16_t y = 1; y < HEIGHT - 1; y++) {
        for (uint16_t x = 1; x < WIDTH - 1; x++) {
            dest[y][x] =
                temp[y - 1][x] * kernel_y_1D[0] +
                temp[y][x]     * kernel_y_1D[1] +
                temp[y + 1][x] * kernel_y_1D[2];
        }
    }
}
The result?
1530 us on the M4F CPU vs 3880 us for the basic implementation.
That's a 2.5x speedup!
And that was with only minor modifications to the code.
How to convert a 2D kernel filter into 1D filters?
Unless the 2D filter is exactly separable, you need to approximate the separation.
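One standard way to do that (though not necessarily what the upcoming library will use) is a rank-1 approximation via the singular value decomposition of the kernel matrix:

$$
K = U \Sigma V^T \approx \sigma_1 u_1 v_1^T,
\qquad k_y = \sqrt{\sigma_1}\, u_1,
\qquad k_x = \sqrt{\sigma_1}\, v_1^T
$$

If the kernel is exactly separable it has rank 1 and the decomposition is exact; otherwise this gives the closest separable kernel in the least-squares sense.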
I'll soon release a Python library that does this for you, so subscribe to the newsletter to stay updated.
Conclusions
When implementing math-heavy operations, going the naive way is doomed to fail. Embedded devices require you to think hard about any optimization you can apply to speed up the computation.
Of course, these optimizations should not be left to you as an end user: they should be taken care of by the library you use for Machine Learning training and exporting.
Many of them rely on vendor-specific optimized instructions (e.g. ARM CMSIS), but some may not.
Always check the quality of the code you're running before deploying!
Or hire an expert who can take care of this for you!
Benchmark results
3x3 kernel
| Board | CPU | Image size | Convolution type | Execution time (us) | Speedup |
| --- | --- | --- | --- | --- | --- |
| Arduino Nano | Cortex M4F @ 64 MHz | 50 x 50 | 2D kernel basic | 3880 | |
| | | | 2D kernel exploded | 1880 | 2x |
| | | | 1D kernel basic | 2720 | 1.4x |
| | | | 1D kernel exploded | 1530 | 2.5x |
| Esp32 | Tensilica Xtensa LX6 @ 240 MHz | 50 x 50 | 2D kernel basic | 3030 | |
| | | | 2D kernel exploded | 760 | 4x |
| | | | 1D kernel basic | 1740 | 1.7x |
| | | | 1D kernel exploded | 600 | 5x |
| Arduino Portenta H7 | STM32H747 @ 480 MHz | 50 x 50 | 2D kernel basic | 1080 | |
| | | | 2D kernel exploded | 470 | 2.3x |
| | | | 1D kernel basic | 660 | 1.6x |
| | | | 1D kernel exploded | 370 | 2.9x |
5x5 kernel
| Board | CPU | Image size | Convolution type | Execution time (us) | Speedup |
| --- | --- | --- | --- | --- | --- |
| Arduino Nano | Cortex M4F @ 64 MHz | 50 x 50 | 2D kernel basic | 22310 | |
| | | | 2D kernel exploded | 9100 | 2.4x |
| | | | 1D kernel basic | 8250 | 2.7x |
| | | | 1D kernel exploded | 4210 | 5.3x |
| Esp32 | Tensilica Xtensa LX6 @ 240 MHz | 50 x 50 | 2D kernel basic | 6600 | |
| | | | 2D kernel exploded | 2180 | 3x |
| | | | 1D kernel basic | 2030 | 3.2x |
| | | | 1D kernel exploded | 850 | 7.8x |
| Arduino Portenta H7 | STM32H747 @ 480 MHz | 50 x 50 | 2D kernel basic | 2480 | |
| | | | 2D kernel exploded | 1140 | 2.2x |
| | | | 1D kernel basic | 900 | 2.7x |
| | | | 1D kernel exploded | 520 | 4.8x |