Writing An Optimizing Tensor Compiler From Scratch

Not everyone will write their own optimizing compiler from scratch, but those who do sometimes roll into it during the course of ever-growing project scope creep. People like [Michael Moroz], who wrote up a long and detailed article on the why and how. Specifically, a ‘small library’ involving a few matrix operations for a Unity-based project turned into a static optimizing tensor compiler, called TensorFrost, with a Python front-end and a shader-like syntax, all of which is available on GitHub.

The Python-based front-end implements low-level NumPy-like operations, with development still ongoing. As for why Yet Another Tensor Library had to be developed, the reasons were that most existing libraries are heavily focused on machine learning tasks and scale poorly otherwise, that dynamic control flow is hard to implement in them, and that they require writing custom kernels in e.g. CUDA.

Above all, [Michael] wanted to use a high-level language instead of pure shader code, and have something that can output graphical data in real-time. After taking the gamble, and leaning on LLVM for some parts, he now has a functional implementation, albeit with a lot of work still ahead.

Block Devices In User Space

Your new project really could use a block device for Linux. File systems are easy to do with FUSE, but that’s sometimes too high-level. But a block driver can be tough to write and debug, especially since bugs in the kernel’s space can be catastrophic. [Jiri Pospisil] suggests Ublk, a framework for writing block devices in user space. This works using the io_uring facility in recent kernels.

This opens the block device field up. You can use any language you want (we’ve seen FUSE used with some very strange languages). You can use libraries that would not work in the kernel. Debugging is simple, and crashing is a minor inconvenience.

Another advantage? Your driver won’t depend on the kernel code. There is a kernel driver, of course, named ublk_drv, but that’s not your code. That’s what your code talks to.


BASIC On A Calculator Again

We are always amused that we can run emulations or virtual copies of yesterday’s computers on our modern computers. In fact, there is so much power at your command now that you can run, say, a DOS emulator on a Windows virtual machine under Linux, and the resulting DOS prompt would probably still perform better than an old 4.77 MHz PC. Remember when you could get calculators that ran BASIC? Well, [Calculator Clique] shows off BASIC running on a decidedly modern HP Prime calculator. The trick? It’s running under Python. Check it out in the video below.

Think about it. The HP Prime has an ARM processor inside. In addition to its normal programming system, it has MicroPython as an option. So that’s one interpreter. Then PyBasic provides a nice classic BASIC interpreter that runs on Python. We’ve even ported it to one or two of the Hackaday Superconference badges.


Optimizing Software With Zero-Copy And Other Techniques

An important aspect of software engineering is the ability to distinguish between premature, unnecessary, and necessary optimizations. A strong case can be made that the initial design benefits massively from optimizations that prevent well-known issues later on, while unnecessary optimizations are those that simply do not make any significant difference either way. Meanwhile ‘premature’ optimizations are harder to define, with Knuth’s often-quoted-out-of-context statement about these being ‘the root of all evil’ causing significant confusion.

We can find Donald Knuth’s full quote deep in the 1974 article Structured Programming with go to Statements, which at the time was a contentious optimization topic. On page 268, along with the cited quote, we see that it’s a reference to making presumed optimizations without understanding their effect, and without a clear picture of which parts of the program really take up most processing time. Definitely sound advice.

And unlike back in the 1970s, today we have many easy ways to analyze application performance and to quantify bottlenecks. This makes it rather inexcusable to spend more time today vilifying the goto statement than optimizing one’s code with simple techniques like zero-copy and binary message formats.
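
As a rough illustration of what zero-copy parsing of a binary message format can look like, here is a minimal Python sketch. The wire format (a 16-bit message type, a 32-bit payload length, then the payload) and the names `HEADER`, `pack_message` and `iter_messages` are invented for this example and are not taken from the article. The idea is simply that slicing a `memoryview` hands out windows into the original buffer instead of duplicating the payload bytes.

```python
import struct

# Hypothetical wire format for this sketch only: little-endian u16 message
# type, u32 payload length, followed by that many payload bytes.
HEADER = struct.Struct("<HI")


def pack_message(msg_type: int, payload: bytes) -> bytes:
    """Serialize one message as header + payload."""
    return HEADER.pack(msg_type, len(payload)) + payload


def iter_messages(buffer):
    """Yield (msg_type, payload_view) pairs without copying payload bytes."""
    view = memoryview(buffer)
    offset = 0
    while offset < len(view):
        # unpack_from reads the header in place at the given offset.
        msg_type, length = HEADER.unpack_from(view, offset)
        start = offset + HEADER.size
        # Slicing a memoryview creates a new view onto the same memory.
        yield msg_type, view[start:start + length]
        offset = start + length


stream = bytearray(pack_message(1, b"hello") + pack_message(2, b"zero-copy"))
messages = list(iter_messages(stream))
```

Each payload in `messages` still refers to the memory of `stream`, so a receiver can inspect or forward it without an intermediate copy. Whether this pays off in a given program is exactly what a profiler should confirm first.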


Making Code A Hundred Times Slower With False Sharing

The cache hierarchy of the 2008 Intel Nehalem x86 architecture. (Source: Intel)

Writing good, performant code depends strongly on an understanding of the underlying hardware. This is especially the case in scenarios like those involving embarrassingly parallel processing, which at first glance ought to be a cakewalk. With multiple threads doing their own thing without having to nag the other threads about anything it seems highly doubtful that even a novice could screw this up. Yet as [Keifer] details in a recent video on so-called false sharing, this is actually very easy, for a variety of reasons.

In a multi-core and/or multi-processor system, each core has its own local cache that contains a reflection of the current values in system RAM. If any core modifies data in one of its cache lines, this automatically invalidates the copies of that line held in the other cores’ caches, resulting in a cache miss for those cores and forcing a refresh from system RAM. This is the case even if the modified data isn’t something that another core was going to use, as long as it lives in the same cache line, with an obvious impact on performance. As a cache line is a contiguous block of data with a size and source alignment of 64 bytes on x86, it’s easy enough to get some kind of overlap here.

The worst-case scenario, as detailed and demonstrated using the Google Benchmark sample projects, involves a shared global data structure, with a recorded hundredfold reduction in performance. Also noticeable is the impact on scaling performance, with the cache misses becoming more severe as more threads are running.

A less obvious cause of performance loss here is due to memory alignment and how data fits in the cache lines. Making sure that your data is aligned in e.g. data structures can prevent more unwanted cache invalidation events. With most applications being multi-threaded these days, it’s a good thing to not only know how to diagnose false sharing issues, but also how to prevent them.
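
To make the layout side of this concrete, here is a small Python sketch using `ctypes` to show which 64-byte cache line each field of a structure would land in. The structure and helper names are made up for illustration and do not come from the video. The sketch also assumes the structure itself starts on a cache line boundary, which `ctypes` does not guarantee. It only does the offset arithmetic, since the actual slowdown is something you would measure in native multi-threaded code.

```python
import ctypes

CACHE_LINE = 64  # bytes; the x86 cache line size mentioned above


class PackedCounters(ctypes.Structure):
    # Two per-thread counters placed back to back (hypothetical layout).
    _fields_ = [
        ("thread0", ctypes.c_uint64),
        ("thread1", ctypes.c_uint64),
    ]


class PaddedCounters(ctypes.Structure):
    # Same counters, with padding that pushes thread1 onto the next line.
    _fields_ = [
        ("thread0", ctypes.c_uint64),
        ("_pad", ctypes.c_char * (CACHE_LINE - ctypes.sizeof(ctypes.c_uint64))),
        ("thread1", ctypes.c_uint64),
    ]


def cache_line_of(struct_type, field):
    """Cache line index of a field, assuming the struct starts on a line boundary."""
    return getattr(struct_type, field).offset // CACHE_LINE


def shares_line(struct_type, first, second):
    """True if both fields would sit in the same cache line."""
    return cache_line_of(struct_type, first) == cache_line_of(struct_type, second)


packed_shared = shares_line(PackedCounters, "thread0", "thread1")
padded_shared = shares_line(PaddedCounters, "thread0", "thread1")
```

In the packed layout both counters fall into the same line, so a write by one thread would invalidate the line the other thread is using. In the padded layout they do not, at the cost of a larger structure. In C++ the same effect is usually achieved with `alignas(64)` on each counter.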


A UI-Focused Display Library For The ESP32

If you’re building a project on your ESP32, you might want to give it a fancy graphical interface. If so, you might find a display library from [dejwk] to be particularly useful.

Named roo_display for unclear reasons, the library is Arduino-compatible, and suits a wide range of ESP32 boards out in the wild. It’s intended for use with common SPI-attached display controllers, like the ILI9341, SSD1327, ST7789, and more. It’s performance-oriented, without skimping on feature set. It’s got all kinds of fonts in different weights and sizes, and a tool for importing more. It can do all kinds of shapes if you want to manually draw your UI elements, or you can simply have it display JPEGs, PNGs, or raw image data from PROGMEM if you so desire. If you’re hoping to create a touch interface, it can handle that too. There’s even a companion library for doing more complex work under the name roo_windows.

If you’re looking to create a simple and responsive interface, this might be the library for you. Of course, there are others out there too, like the Adafruit GFX library which we’ve featured before. You could even go full VGA if you wanted, and end up with something that looks straight out of Windows 3.1. Meanwhile, if you’re cooking up your own graphics code for the popular microcontroller platform, you should probably let us know on the tipsline!

Thanks to [Daniel] for the tip!

NPAPI And The Hot-Pluggable World Wide Web

In today’s Chromed-up world it can be hard to remember an era where browsers could be extended with not just extensions, but also with plugins. Although for those of us who use traditional Netscape-based browsers like Pale Moon the use of plugins has never gone away, for the rest of the WWW’s users the choice has been limited to increasingly restrictive browser extensions, with Google’s Manifest V3 taking the cake.

Although most browsers stopped supporting plugins due to “security concerns”, this did nothing to address the need for executing code in the browser faster than the sedate snail’s pace possible with JavaScript, or the convenience of not having to port native code to JavaScript in the first place. This led to various approaches that ultimately have culminated in the WebAssembly (WASM) standard, which comes with its own set of issues and security criticisms.

Other than Netscape’s Plugin API (NPAPI) being great for making even 1990s browsers ready for 2026, there are also very practical reasons why WASM and JavaScript-based approaches simply cannot do certain basic things.
