SIMD-Accelerated Computer Vision On The ESP32-S3

One of the fun parts of the ESP32-S3 microcontroller is that it got upgraded to the newer Cadence Xtensa LX7 processor core, which turns out to have a set of SIMD instructions that can significantly speed up a range of tasks. [Shranav Palakurthi] recently used these to speed up the processing of video frames to detect corners using the FAST method. By moving the operations that benefit from SIMD over to an optimized version written in LX7 ASM, the algorithm’s throughput rose to about 220% of the original, from 5.1 MP/s to 11.2 MP/s (a gain of roughly 120%), albeit with some caveats.
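For readers who have not met FAST before, here is a minimal scalar sketch of its segment test in plain C. It is not [Shranav]'s code: the function names, the `threshold` and `n` parameters, and the bitmask approach are all assumptions chosen for illustration. A pixel counts as a corner when at least `n` contiguous pixels on a 16-pixel ring around it are all brighter or all darker than the center by more than `threshold`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Offsets of the 16-pixel ring of radius 3, clockwise starting from the top. */
static const int8_t CIRCLE_DX[16] = { 0,  1,  2,  3, 3, 3, 2, 1, 0, -1, -2, -3, -3, -3, -2, -1};
static const int8_t CIRCLE_DY[16] = {-3, -3, -2, -1, 0, 1, 2, 3, 3,  3,  2,  1,  0, -1, -2, -3};

/* True if the 16-bit ring mask contains at least n consecutive set bits,
 * counting runs that wrap from bit 15 back to bit 0. Valid for 1 <= n <= 16. */
static bool has_arc(uint16_t mask, int n)
{
    uint32_t m = mask;
    uint32_t doubled = m | (m << 16);   /* unroll the ring so wrapped runs become linear */
    uint32_t run = doubled;
    for (int i = 1; i < n; i++) {
        run &= doubled >> i;            /* bit k survives only if bits k..k+i are all set */
    }
    return run != 0;
}

/* Hypothetical helper: segment test for the pixel at (x, y).
 * The caller must keep (x, y) at least 3 pixels away from every image edge. */
static bool fast_is_corner(const uint8_t *img, int stride, int x, int y,
                           int threshold, int n)
{
    int center = img[y * stride + x];
    uint16_t brighter = 0;
    uint16_t darker = 0;

    for (int i = 0; i < 16; i++) {
        int p = img[(y + CIRCLE_DY[i]) * stride + (x + CIRCLE_DX[i])];
        if (p > center + threshold) brighter |= (uint16_t)(1u << i);
        if (p < center - threshold) darker   |= (uint16_t)(1u << i);
    }
    return has_arc(brighter, n) || has_arc(darker, n);
}
```

The sixteen threshold comparisons per pixel are the kind of work that maps naturally onto byte-wide SIMD lanes, which is presumably where most of the speedup comes from.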

The problem with the SIMD instructions in the LX7, other than their being very poorly documented unless you sign an NDA with Cadence, is that many instructions that would be really useful are missing. For [Shranav], the lack of direct misaligned reads and of comparisons on unsigned 8-bit numbers were hurdles, but both could be worked around, and the results are available on GitHub.
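One well-known way around a missing unsigned compare, which may or may not be the one used in this project, is to flip the top bit of both operands and then use the signed compare. The sketch below models a single 8-bit lane in plain C; `cmp_gt_s8` is a hypothetical stand-in for a signed-only vector compare, not a real LX7 instruction or intrinsic.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for one lane of a signed-only SIMD compare: all ones if a > b. */
static inline uint8_t cmp_gt_s8(int8_t a, int8_t b)
{
    return (a > b) ? 0xFF : 0x00;
}

/* XOR with 0x80 maps 0..255 onto -128..127 while preserving order. */
static inline int8_t flip_sign_bit(uint8_t v)
{
    uint8_t flipped = v ^ 0x80u;   /* what a vector XOR with 0x80 does per lane */
    int8_t out;
    memcpy(&out, &flipped, 1);     /* reinterpret the same bits as a signed byte */
    return out;
}

/* Unsigned a > b, built only from the signed compare. */
static inline uint8_t cmp_gt_u8(uint8_t a, uint8_t b)
{
    return cmp_gt_s8(flip_sign_bit(a), flip_sign_bit(b));
}
```

In a vectorized loop the cost is one extra XOR per operand, and for a value that stays constant across many comparisons the flipped version can be computed once up front.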

Much of the groundwork for this SIMD implementation was laid by [Larry Bank], who reverse-engineered the SIMD instructions from available documentation and code samples, finding that the ESP32-S3 misses quite a few common SIMD instructions, including various shifts and unaligned reads and writes. Still, it’s good enough for quite a few tasks, as long as you can make it work with the available instructions.
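To give an idea of what "making it work" looks like for unaligned reads, the usual trick is to load the two aligned 16-byte blocks that straddle the wanted address and splice the right bytes out of the pair. The following is a plain C model of that idea, not actual ESP32-S3 code; `vec128`, `load_aligned`, `funnel_bytes`, and `load_unaligned` are made-up names for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Model of a 128-bit vector register as 16 bytes. */
typedef struct { uint8_t b[16]; } vec128;

/* Stand-in for an aligned 16-byte vector load. */
static vec128 load_aligned(const uint8_t *p)
{
    vec128 v;
    memcpy(v.b, p, 16);
    return v;
}

/* Treat lo:hi as one 32-byte run and take 16 bytes starting at byte `shift` (0..15). */
static vec128 funnel_bytes(vec128 lo, vec128 hi, unsigned shift)
{
    vec128 out;
    for (unsigned i = 0; i < 16; i++) {
        unsigned j = i + shift;
        out.b[i] = (j < 16) ? lo.b[j] : hi.b[j - 16];
    }
    return out;
}

/* Emulated unaligned load. Note that it touches up to 15 bytes before p and
 * up to 16 bytes past p + 15, so the buffer needs padding on both sides. */
static vec128 load_unaligned(const uint8_t *p)
{
    unsigned shift = (unsigned)((uintptr_t)p & 15u);
    const uint8_t *base = p - shift;           /* round down to a 16-byte boundary */
    vec128 lo = load_aligned(base);
    vec128 hi = load_aligned(base + 16);
    return funnel_bytes(lo, hi, shift);
}
```

When walking along an image row, the `hi` block of one step becomes the `lo` block of the next, so the extra cost can be close to one aligned load plus one splice per 16 pixels.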

9 thoughts on “SIMD-Accelerated Computer Vision On The ESP32-S3”

  1. For someone interested in ESP32 SHA256 acceleration this article promises a lot but says very little. The article contains no immediately useful information or clear links to actual implementations.

      1. Assembler is a platform-specific obsolete software technology.

        1 Write a gcc C program which calls your machine language subprogram.
        2 Use the gcc C compiler as much as possible to try to do what your machine-specific code needs to do.
        3 Look at the gcc C code disassembly.
        4 Modify 3 as needed to access your platform-specific machine code.
        5 Do this in a char array.
        6 Write the machine code char array to a file.
        7 In your gcc C main program read the machine code into a char array.
        8 Read gcc C labels as values.
        9 Then do an indirect jump into your machine code.
        10 You must pass argument references from your C program to your machine code. Example: a = b+c. The b and c arguments must be passed to your machine code.
        11 Your machine code must extract the argument values and place them into platform registers.
        12 Issue your platform instruction.
        13 Place the instruction return onto the argument stack … modifying TOS, of course.
        14 Your gcc C program must place the C return address [supplied by the label value].
        15 Do an indirect jump to the return stack value … decrement the return stack TOS pointer.
        16 And hope you see the correct answer in your gcc C calling program, of course.

  2. Note that the SIMD in the ESP32-S3 is not a Cadence thing but an Espressif thing: that is why the entire instruction set is documented in the ESP32-S3 TRM. You could make the point that it could benefit from some examples on how to use it – as the author rightfully mentioned, the current approach is to look at esp-dsp.

  3. Looking at the [code](https://github.com/shraiwi/simd-fast-esp32s3/blob/232008ee45abe622d1f9a61943f2cf3270b33c41/lib/simd_fast/simd_fast.c#L308) it seems the author doesn't know about the && operator in C/C++ nor about cyclomatic complexity. It doesn't really impress me that the C code is running so slow. Or maybe it's because the SIMD functions are all called "simd_fast_something". I guess the compiler does a better job when the functions' names contain "fast". It must think that if the user called the function "fast" it should be faster than the rest of the code.

        1. In that case, the generator is bullshit. I can understand that making a truth table for numerous boolean conditions is painful for a human, but it’s dead simple for an algorithm. The whole function above can be converted from probably 1700 LOC to only 30 LOC by any human programmer (and probably fewer via condition optimizations). Geez, maybe even compiling that stuff with a C compiler at -O3 and decompiling the result back to C would give better and more human-readable results!
