Going from a microcontroller blinking an LED to one that blinks the LED in response to voice commands, using a neural network you trained on your own data set, is a “now draw the rest of the owl” problem. Lucky for us, Shawn Hymel walks us through the entire process during his Tiny ML workshop from the 2020 Hackaday Remoticon. The video has just been published and can be viewed below.
This is truly an end-to-end Hello World for getting machine learning up and running on a microcontroller. Shawn covers the process of collecting and preparing the audio samples, training a model on that data set, and getting it all onto the microcontroller. At the end of two hours, he’s able to show the STM32 recognizing and responding to two different spoken words. Along the way he pauses to discuss the context of what’s happening in every step, which will help you go back and expand on those areas later to suit your own project needs.
The hardware used in this demonstration is the STM32 Nucleo-L476RG board, but you can use the same techniques on a wide range of ARM boards and other suitably high-performing chips.
Hardware requirements are spelled out on the workshop project page. Shawn has put together some epic documentation on his GitHub repo, including slides for the workshop. In the time since the video was recorded, he’s even made a demo using the Arduino Nano 33 BLE Sense board, which uses a Nordic nRF52840 chip.
The bulk of the workshop time is spent working through the labyrinth of software platforms and settings used to train the model. An interesting demonstration of Jupyter notebooks collects and curates 120 minutes’ worth of 1-second audio samples for training. There’s another 20 minutes’ worth of test data — these samples were not present in the training set and will be used to verify that previously unknown input can be successfully classified.
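If you want a feel for what that curation step looks like before diving into the video, here’s a rough Python sketch of the idea: trim the silence from raw recordings, cut them into 1-second clips, and hold a slice back as a test set. The file layout, sample rate, and split fraction are our own assumptions for illustration, not a copy of Shawn’s notebook.

```python
# Hypothetical sketch: slice raw recordings into 1-second keyword samples.
# Assumes 16 kHz mono WAV files sitting in raw/ -- the actual notebook in
# Shawn's repo handles collection and curation differently.
import os
import glob
import random

import librosa
import soundfile as sf

SAMPLE_RATE = 16000          # 16 kHz is typical for keyword spotting
CLIP_LEN = SAMPLE_RATE       # 1-second clips
TEST_FRACTION = 20 / 140     # roughly 20 of 140 total minutes held out

clips = []
for path in glob.glob("raw/*.wav"):
    audio, _ = librosa.load(path, sr=SAMPLE_RATE, mono=True)
    # Drop leading/trailing silence so the clips are mostly speech
    audio, _ = librosa.effects.trim(audio, top_db=20)
    # Cut the remainder into non-overlapping 1-second windows
    for start in range(0, len(audio) - CLIP_LEN + 1, CLIP_LEN):
        clips.append(audio[start:start + CLIP_LEN])

random.shuffle(clips)
n_test = int(len(clips) * TEST_FRACTION)

for split, subset in (("test", clips[:n_test]), ("train", clips[n_test:])):
    os.makedirs(split, exist_ok=True)
    for i, clip in enumerate(subset):
        sf.write(f"{split}/clip_{i:05d}.wav", clip, SAMPLE_RATE)
```

Keeping the test clips out of the training folder entirely is the important part: it’s the only way to know the model generalizes rather than memorizes.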
The training process itself is run on a platform called Edge Impulse. This provides a graphical web interface for pulling together the parameters used to train the model. In this case, the audio samples are converted to Mel-frequency cepstral coefficients (MFCCs) and fed into a Keras neural network. (Late in the workshop Shawn touches on how to tweak the Keras code once you begin to get your feet under you with the entire setup.) The microcontroller converts each incoming audio sample to an MFCC in real time, which is then compared against the trained model spit out as a C package by Edge Impulse.
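To make the MFCC-plus-Keras part of the pipeline a bit more concrete, here’s a minimal stand-in in Python. Edge Impulse generates its own MFCC block and model architecture, so treat this purely as a sketch of the technique, with librosa doing the feature extraction and a small dense network classifying the two keywords.

```python
# Illustrative stand-in for the Edge Impulse pipeline: compute MFCCs with
# librosa and classify them with a small Keras network. The parameters and
# architecture Edge Impulse actually generates will differ.
import numpy as np
import librosa
import tensorflow as tf

SAMPLE_RATE = 16000
N_MFCC = 13

def to_mfcc(clip):
    # clip: 1-second mono waveform at 16 kHz -> (frames, coefficients)
    mfcc = librosa.feature.mfcc(y=clip, sr=SAMPLE_RATE, n_mfcc=N_MFCC)
    return mfcc.T.astype(np.float32)

def build_model(input_shape, n_classes=2):
    # Tiny fully-connected classifier over the flattened MFCC matrix
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])

# Example usage, assuming `clips` and integer `labels` from the curation step:
# features = np.stack([to_mfcc(c) for c in clips])
# model = build_model(features.shape[1:])
# model.compile(optimizer="adam",
#               loss="sparse_categorical_crossentropy",
#               metrics=["accuracy"])
# model.fit(features, np.array(labels), epochs=30, validation_split=0.1)
```

The reason MFCCs show up everywhere in keyword spotting is that they compress each 1-second clip into a few hundred numbers describing how the spectrum’s shape changes over time, which keeps the classification small enough to run on a microcontroller.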
He makes it look pretty easy, and you should definitely give it a try. At the same time it’s a perfect example of why documenting your project, even if it’s just for personal use, is so important. We’re very happy to have the step-by-step from Shawn, but even he references it when getting into the weeds with importing the data set into the STM32CubeIDE software.
Cool! And I was just reading about MFCCs for a project that I need to implement that includes detecting spoken commands…
tl;dr grab an audio sample, trim periods of silence, make a low-res spectrogram of remaining audio data, train your ANN to recognize two classes of images (two spectrograms). That’s it.
Nice sponsored ad for fancy “cloud” services but all in all it’s early 90s tech that we used to run on Pentium MMX PCs in our university.
Crikey, how about a new acronym TL;DW (too long, didn’t watch).
At almost 2 hours, that is longer than most feature films!
I keep saying it – YouTube/video is about the worst form of technical documentation imaginable.
It would seem a good many HaD writers don’t bother, but to Mike’s credit (without watching the video), it seems as though he has watched it and pulled out some of the more interesting points – the sort of things that actually do tempt me to watch it.
I agree with your statement completely: video, when it is used as the main (or worse, only) form of documentation, is exceptionally lousy (or should that be ‘lazy’).