# 2D Convolution benchmarking

Real world numbers on 2D convolution optimization strategies

2D Convolution is a common operation in today neural networks used for image and audio classification.

In the context of Tiny ML, we need consider any kind of optimization possible to reduce the computation time it takes to run a Conv2D.

**This post runs the same convolution in 4 different flavors and reports the execution time on 2 different chips:
the ARM Cortex M4F (Arduino Nano 33 BLE Sense) and the Tensilica Xtensa LX6 (Esp32)**.

At the end of the post, you'll see how it's possible to gain a speedup of ???x with easy to implement optimizations.

## Problem statement

Given a source matrix (eg. an image) and a kernel filter (usally 3x3 or 5x5), the 2D convolution multiplies each pixel and its neighbors by the corresponding values in the filter.

Take a look at the animation below for a better understanding.

The most basic implementation of such convolution is the following.

```
/**
* Basic implementation of Conv2D
*/
void conv2d_basic() {
for (uint16_t y = 1; y < HEIGHT - 1; y++) {
for (uint16_t x = 1; x < WIDTH - 1; x++) {
float acc = 0;
for (uint8_t i = 0; i < KERNEL_SIZE; i++) {
for (uint8_t j = 0; j < KERNEL_SIZE; j++) {
acc += source[y + i - 1][x + j - 1] * kernel2D[i][j];
}
}
dest[y][x] = acc;
}
}
}
```

This is the gerenal-purpose, reference implementation of Conv2D.

For a source array of 50x50 elements, it takes **3880 us** on average to run on an ARM Cortex M4F chip.

Now we begin to inspect simple optimizations that can give quite a boost to the execution time.

## Optimization no. 1: Exploding kernel multiplications

If you're working with kernels of size 3 or 5 (as in most of the cases), you can easily get a speed boost
by *exploding* the kernel multiplication, instead of using the `for`

loops.

Here's an example with a 3x3 kernel exploded.

```
/**
* Apply Conv2D exploding the kernel multiplications
*/
void conv2d_exploded() {
for (uint16_t y = 1; y < HEIGHT - 1; y++) {
for (uint16_t x = 1; x < WIDTH - 1; x++) {
dest[y][x] =
source[y - 1][x - 1] * kernel2D[0][0] +
source[y - 1][x] * kernel2D[0][1] +
source[y - 1][x + 1] * kernel2D[0][2] +
source[y][x - 1] * kernel2D[1][0] +
source[y][x] * kernel2D[1][1] +
source[y][x + 1] * kernel2D[1][2] +
source[y + 1][x - 1] * kernel2D[2][0] +
source[y + 1][x] * kernel2D[2][1] +
source[y + 1][x + 1] * kernel2D[2][2];
}
}
}
```

How faster is this code?

On the same chip, it takes **1880 us** instead of 3880 us.

That's a **2x speedup!**

## Optimization n.2: Separable kernels

Separable kernels are 2D filters that can be decomposed into 2 1D kernels.

Some well known kernels (eg. the Sobel operator) are separable.

Why is this important?

Because this reduces the computational cost of the convolution from `KERNEL_SIZE ^ 2`

to `KERNEL_SIZE * 2`

.
With a 3x3 kernel, you have to perform 6 multiplications per pixel instead of 9: 33% less computations.
With a 5x5 kernel, the reduction is even greater!

```
void conv1d_basic() {
// horizontal convolution
for (uint16_t y = 1; y < HEIGHT - 1; y++) {
for (uint16_t x = 1; x < WIDTH - 1; x++) {
for (uint8_t i = 0; i < KERNEL_SIZE; i++) {
dest[y][x] = source[y][x + i - 1] * kernel_x_1D[i];
}
}
}
// vertical convolution
for (uint16_t y = 1; y < HEIGHT - 1; y++) {
for (uint16_t x = 1; x < WIDTH - 1; x++) {
for (uint8_t i = 0; i < KERNEL_SIZE; i++) {
dest[y][x] += source[y + i - 1][x] * kernel_y_1D[i];
}
}
}
}
```

This is about **30% faster** than the original implementation, yet slower than the exploded kernel optimization.

What if we merge the two optimizations?

## Optimization n. 3: Separable, exploded kernels

This is simply the merge of Optimization n.1 and Optimization n.2: we expect to get the greatest speedup!

```
void conv1d_exploded() {
// horizontal concvolution
for (uint16_t y = 1; y < HEIGHT - 1; y++) {
for (uint16_t x = 1; x < WIDTH - 1; x++) {
dest[y][x] =
source[y][x - 1] * kernel_x_1D[0] +
source[y][x] * kernel_x_1D[1] +
source[y][x + 1] * kernel_x_1D[2];
}
}
// vertical convolution
for (uint16_t y = 1; y < HEIGHT - 1; y++) {
for (uint16_t x = 1; x < WIDTH - 1; x++) {
dest[y][x] +=
source[y - 1][x] * kernel_y_1D[0] +
source[y][x] * kernel_y_1D[1] +
source[y + 1][x] * kernel_y_1D[2];
}
}
}
```

The result?

**1530 us** on the M4F CPU vs 3880 us of the basic implementation.

That's a 2.5x speedup!

And that was with very little modifications to the code.

## How to convert a 2D kernel filter into 1D filters?

Unless the 2D filter is exactly separable, you need to approximate the separation.

I'll soon release a Python library that does this for you, so subscribe to the newsletter to stay updated.

## Conclusions

When implementing math-heavy operations, going the naive way is doomed to fail. Embedded devices require you to think hard about any optimization you can apply to speed up the computation.

Of course, this operations should not be left to you as an end user, but should be take into consideration by the library you use for Machine Learning training and exporting.

Many of them rely on vendor-specific optimized instructions (eg. ARM CMSIS), but some may not.

Always check the quality of the code you're running before deploying!

Or hire an expert who can take care of this for you!

## Benchmark results

### 3x3 kernel

Board | CPU | Image size | Convolution type | Execution time | Speedup |
---|---|---|---|---|---|

Arduino Nano | Cortex M4F @ 64 MHz | 50 x 50 | 2D kernel basic | 3880 | |

2D kernel exploded | 1880 | 2x | |||

1D kernel basic | 2720 | 1.4x | |||

1D kernel exploded | 1530 | 2.5x | |||

Esp32 | Tensilica Xtensa LX6 @ 240 MHz | 50 x 50 | 2D kernel basic | 3030 | |

2D kernel exploded | 760 | 4x | |||

1D kernel basic | 1740 | 1.7x | |||

1D kernel exploded | 600 | 5x | |||

Arduino Portenta H7 | STM32H747 @ 480 Mhz | 50 x 50 | 2D kernel basic | 1080 | |

2D kernel exploded | 470 | 2.3x | |||

1D kernel basic | 660 | 1.6x | |||

1D kernel exploded | 370 | 2.9x |

### 5x5 kernel

Board | CPU | Image size | Convolution type | Execution time | Speedup |
---|---|---|---|---|---|

Arduino Nano | Cortex M4F @ 64 MHz | 50 x 50 | 2D kernel basic | 22310 | |

2D kernel exploded | 9100 | 2.4x | |||

1D kernel basic | 8250 | 2.7x | |||

1D kernel exploded | 4210 | 5.3x | |||

Esp32 | Tensilica Xtensa LX6 @ 240 MHz | 50 x 50 | 2D kernel basic | 6600 | |

2D kernel exploded | 2180 | 3x | |||

1D kernel basic | 2030 | 3.2x | |||

1D kernel exploded | 850 | 7.8x | |||

Arduino Portenta H7 | STM32H747 @ 480 Mhz | 50 x 50 | 2D kernel basic | 2480 | |

2D kernel exploded | 1140 | 2.2x | |||

1D kernel basic | 900 | 2.7x | |||

1D kernel exploded | 520 | 4.8x |