Bare-Metal STM32: Using The I2C Bus In Master-Transceiver Mode

As one of the most popular buses today for on- and inter-board communication within systems, there’s a good chance you’ll end up using it with an embedded system. I2C offers a variety of speeds while requiring only two wires (clock and data), which makes it significantly easier to handle than alternatives, such as SPI. Within the STM32 family of MCUs, you will find at least one I2C peripheral on each device.

As a shared, half-duplex medium, I2C uses a rather straightforward call-and-response design, where one device controls the clock, and other devices simply wait and listen until their fixed address is sent on the I2C bus. While configuring an STM32 I2C peripheral entails a few steps, it is quite painless to use afterwards, as we will see in this article. Continue reading “Bare-Metal STM32: Using The I2C Bus In Master-Transceiver Mode”

Data Alignment Across Architectures: The Good, The Bad And The Ugly

Even though a computer’s memory map looks pretty smooth and very much byte-addressable at first glance, the same memory on a hardware level is a lot more bumpy. An essential term a developer may come across in this context is data alignment, which refers to how the hardware accesses the system’s random access memory (RAM). This and others are properties of the RAM and memory bus implementation of the system, with a variety of implications for software developers.

For a 32-bit memory bus, the optimal access type for some data would be a four bytes, aligned exactly on a four-byte border within memory. What happens when unaligned access is attempted – such as reading said four-byte value aligned halfway into a word – is implementation defined. Some hardware platforms have hardware support for unaligned access, others throw an exception that the operating system (OS) can catch and fallback to an unaligned routine in software. Other platforms will generally throw a bus error (SIGBUS in POSIX) if you attempt unaligned access.

Yet even if unaligned memory access is allowed, what is the true performance impact? Continue reading “Data Alignment Across Architectures: The Good, The Bad And The Ugly”

AVX-512: When The Bits Really Count

For the majority of workloads, fiddling with assembly instructions isn’t worth it. The added complexity and code obfuscation generally outweigh the relatively modest gains. Mainly because compilers have become quite fantastic at generation code and because processors are just so much faster, it is hard to get a meaningful speedup by tweaking a small section of code. That changes when you introduce SIMD instructions and need to decode lots of bitsets fast. Intel’s fancy AVX-512 SIMD instructions can offer some meaningful performance gains with relatively low custom assembly.

Like many software engineers, [Daniel Lemire] had many bitsets (a range of ints/enums encoded into a binary number, each bit corresponding to a different integer or enum). Rather than checking if just a specific flag is present (a bitwise and), [Daniel] wanted to know all the flags in a given bitset. The easiest way would be to iterate through all of them like so:

while (word != 0) {
  result[i] = trailingzeroes(word);
  word = word & (word - 1);
  i++;
}

The naive version of this look is very likely to have a branch misprediction, and either you or the compiler would speed it up by unrolling the loop. However, the AVX-512 instruction set on the latest Intel processors has some handy instructions just for this kind of thing. The instruction is vpcompressd and Intel provides a handy and memorable C/C++ function called _mm512_mask_compressstoreu_epi32.

The function generates an array of integers and you can use the infamous popcnt instruction to get the number of ones. Some early benchmark testing shows the AVX-512 version uses 45% fewer cycles. You might be wondering, doesn’t the processor downclock when wide 512-bite registers are used? Yes. But even with the downclocking, the SIMD version is still 33% faster. The code is up on Github if you want to try it yourself.

Creating An Image Format For Embedded Hardware

Whether its one of those ubiquitous little OLED displays or a proper LCD panel, once you’ve got something a bit more capable than the classic 16×2 character LCD wired up to your microcontroller, there’s an excellent chance you’ll want to start displaying some proper images. Generally speaking that means you’ll be working with bitmap files, but as you might expect when pushing a decades-old file format into an application it was never intended for, things can get a little messy. Which is why [gfcwfzkm] has created the Portable Image File (PIF) format.

This low-overhead image format is designed specifically for microcontrollers, and can be decoded on devices with at least 60 bytes of free RAM. Images stored with PIF not only require fewer computational resources to process, but equally important, take up less space on flash. The format supports both color and monochrome images, and the GitHub repo even includes a graphical Python 3.10 tool that lets you convert your images to either .pif files or a .h header file for embedding directly into your C code.

[gfcwfzkm] has provided some source code to show you how to get the PIF library up and running, but as of the time of this writing, there isn’t any example code for using PIF within the Arduino environment. That’s no big deal for the old hands in the audience, but we’re interested in seeing how the community can make use of this file format once it’s available in a bit more beginner-friendly package. It’s one of the final unchecked items on the todo list though, so it shouldn’t be long now.

Of course nothing isĀ wrong with using bitmaps to display images in your microcontroller projects, and there’s a certain advantage to fiddling around with the well-known image format. But if a new file type is all it takes to speed up access times and cram a few more images onto the chip, we’re definitely ready to upgrade.

Reflecting On A Queueing Prism Leads To Unexpected Results

Computers are difficult enough to reason about when there’s just a single thread doing one task. There are dozens of cores in today’s modern processor world, and your program might try to take advantage of using more than just one. Things happening concurrently makes the number of states and interactions explode in to a mess we as humans are likely going to have trouble understanding. So, like [Hillel], you might turn to the computer to try and model those interactions.

The model in question is a task queue. Things are added to the pile, and “workers” grab one from the pile and process it. There are two metrics used to measure the effectiveness of a task queue: throughput and latency. Throughput is the number of things you can do per second (like this maximum throughput 3d printer), while latency is the amount of time it takes to finish one thing. Continue reading “Reflecting On A Queueing Prism Leads To Unexpected Results”

Bringing Zelda Classic To The Browser

Finding a device or app that isn’t a web browser doesn’t seem easy. These days, it is either connected to the web (looking at you ESP32) or is just a web browser pretending to be something else (a la electron, PWAs, or React Native). So, of course, it is on us to create more and more exciting things to browse. [Connor Clark] is one of those people, and he brought Zelda Classic to the browser.

Zelda Classic (ZC) isn’t an official Zelda game. Instead, it’s an old engine designed to run the world in the OG Legend of Zelda and be easily modified to support hundreds of different games. To date, there are over 600 games submitted by a large community. ZC is an Allegro-based Windows-only game, so the first step was to bust out Emscripten to start tweaking the C++ code to support a web environment. Rather than completely port the huge codebase over from Allegro, [Connor] made the jump from Allegro 4 to 5. Allegro 5 has SDL as a backend and adds support for Emscripten.

Unfortunately, the 4 to 5 wasn’t as simple as changing the dependency. The API was wholly re-written, and there is a handy adapter known as Allegro Legacy to help transition a project from one to another. After squashing a multitude of bugs, it was a relatively painless procedure. After a quick detour getting music and level data working, [Connor] faced his next challenge: multi-threading. Efforts to move the main loop off of the browser thread and into a web worker ran into issues with having to yield in loops, deadlocks, and recursive mutexes. Finally, he added music and gamepad support after fixing several bugs in SDL and Allegro.

It’s an incredible journey with many tips and tricks for debugging seemingly intractable bugs. The code is up on GitHub, or jump in and start playing if you’re interested. Why not check out this browser-based OpenSCAD as well?

Arduino And Git: Two Views

You can’t do much development without running into Git, the version control management system. Part of that is because so much code lives on GitHub which uses Git, although you don’t need to know anything about that if all you want to do is download code. [Dr. Torq] has a good primer on using Git with the Arduino IDE, if you need to get your toes wet.

You might think if you develop by yourself you don’t need something like Git. However, using a version control system is a great convenience, especially if you use it correctly. There’s a bug out in the field? What version of the firmware? You can immediately get a copy of the source code at that point in time using Git. A feature is broken? It is very easy to see exactly what changed. So even if you don’t work in a team, there are advantages to having source code under control.

Continue reading “Arduino And Git: Two Views”