Benchmarking Chinese CPUs

When it comes to PCs, Westerners are most familiar with x86/x64 processors from Intel and AMD, with Apple Silicon taking up a significant market share, too. However, in China, a relatively new CPU architecture is on the rise. A fabless semiconductor company called Loongson has been producing chips with its LoongArch architecture since 2021. These chips remain rare outside China, but some in the West have been benchmarking them.

[Daniel Lemire] has recently blogged about the performance of the Loongson 3A6000, which debuted in late 2023. The chip was put through a range of simple benchmarking tests involving float processing and string transcoding operations. [Daniel] compared it to the Intel Xeon Gold 6338 from 2021, noting that the Intel chip performed better pretty much across the board, hardly a surprise given its higher clock rate. Meanwhile, the gang over at [Chips and Cheese] ran even more exhaustive tests on the same chip last year, putting the Loongson through typical tasks like compressing archives and encoding video. The outlet concluded that the chip was a little weaker than older CPUs like AMD’s Zen 2 line and Intel’s 10th-generation Core chips. It’s also limited to four cores, while modern Intel and AMD lines often start at six.
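For a flavor of what such microbenchmarks look like, here’s a minimal Python sketch along the same lines. To be clear, this is purely illustrative: [Daniel]’s actual tests are native code, and the `bench` helper here is our own invention.

```python
# A minimal sketch of float-processing and transcoding microbenchmarks,
# purely illustrative of the kind of test described above.
import time

def bench(label, fn, reps=5):
    # Keep the best of several runs to shave off scheduling noise.
    best = float("inf")
    for _ in range(reps):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    print(f"{label}: {best * 1000:.2f} ms")

floats = [0.1 * i for i in range(1_000_000)]
utf8 = ("héllo wörld " * 100_000).encode("utf-8")  # non-ASCII forces real work

bench("float sum", lambda: sum(floats))
bench("utf-8 -> utf-16 transcode", lambda: utf8.decode("utf-8").encode("utf-16"))
```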

If you find yourself interested in Loongson’s products, don’t get too excited. They’re not exactly easy to lay your hands on outside of China, and even the company’s own website is difficult to access from beyond those shores. You might try reaching out to Loongson-oriented online communities if you seek such hardware.

Different CPU architectures have perhaps never been more relevant, particularly as we see the x86 stalwarts doing battle with the rise of desktop and laptop ARM processors. If you’ve found something interesting regarding another obscure kind of CPU, don’t hesitate to let the tips line know!

Musings On A Good Parallel Computer

Until the late 1990s, the concept of a 3D accelerator card was something generally associated with high-end workstations. Video games and kin would run happily on the CPU in one’s desktop system, with later extensions like MMX, 3DNow!, and SSE providing a significant performance boost for games that supported them. As 3D accelerator cards (colloquially called graphics processing units, or GPUs) became prevalent, they took over almost all SIMD vector tasks, but one thing they’re not good at is being a general-purpose parallel computer. This really ticked [Raph Levien] off, and it inspired him to write up his grievances.

Although the interaction between CPUs and GPUs has become tighter over the decades, with PCIe in particular being a big improvement over AGP and PCI, GPUs are still terrible at running arbitrary computing tasks, and even PCIe links remain glacial compared to communication within the GPU and CPU dies. With the introduction of asynchronous graphics APIs, this divide has become even more pronounced. [Raph]’s proposal is to invert this relationship.

There’s precedent for this already, with Intel’s Larrabee and IBM’s Cell processor merging CPU and GPU characteristics on a single die, though developers struggled with such a new kind of architecture; Sony’s PlayStation 3 was ultimately forced to add a discrete GPU because of these issues. There is also the DirectStorage API in DirectX, which bypasses the CPU when loading assets from storage, effectively adding CPU-like capabilities to GPUs.

As [Raph] notes, so-called AI accelerators share these characteristics too, often sporting multiple SIMD-capable, CPU-like cores. Maybe the future is Cell after all.


Register Renaming: The Art Of Parallel Processing

In the quest for faster computing, modern CPUs have turned to innovative techniques to optimize instruction execution. One such technique, register renaming, is a crucial component behind the impressive multitasking abilities of modern processors. If you’re keen on hacking or tinkering with how CPUs manage tasks, this is one concept you’ll want to understand. Here’s a breakdown of how it works, and you can watch the video below.

In a nutshell, register renaming allows CPUs to bypass the restrictions imposed by a limited number of architectural registers. Consider a scenario where two operations need to use the same register at once: without renaming, the CPU would be stuck waiting for one task to complete before starting another. Enter the renaming trick: registers are reassigned on the fly, so different tasks can target the same logical register while their values physically reside in different slots. This drastically reduces idle time and boosts parallelism. Of course, you also have to ensure that the register you are using has the correct contents at the time you are using it, but there are many ways to solve that problem. The basic technique dates back to IBM System/360 machines like the Model 91 and other high-performance mainframes.
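To see the trick in miniature, here’s a simplified Python sketch of a rename stage (an illustration, not any particular CPU’s implementation): every write to an architectural register is handed a fresh physical register from a free list, so back-to-back writes to the same register name no longer collide.

```python
# A simplified sketch of a rename stage. Each write to an architectural
# register gets a fresh physical register, so write-after-write and
# write-after-read hazards disappear. Recycling registers at retirement
# is omitted for brevity.

free_list = [f"p{i}" for i in range(8)]  # available physical registers
rename_table = {}                        # architectural name -> physical name

def rename(dest, src1, src2):
    # Sources read whichever physical register currently holds the value.
    s1 = rename_table.get(src1, src1)
    s2 = rename_table.get(src2, src2)
    # The destination gets a brand-new physical register.
    d = free_list.pop(0)
    rename_table[dest] = d
    return d, s1, s2

# Two back-to-back writes to r1 land in different physical registers,
# so they can be in flight at the same time.
print(rename("r1", "r2", "r3"))  # ('p0', 'r2', 'r3')
print(rename("r1", "r4", "r5"))  # ('p1', 'r4', 'r5')
print(rename("r6", "r1", "r2"))  # ('p2', 'p1', 'r2'), reads the newest r1
```

Real hardware also has to recycle physical registers once instructions retire and recover the rename table after branch mispredictions, but the core idea is just this level of indirection.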

Register renaming isn’t the only way to solve this problem. There’s a lot that goes into a superscalar CPU.

Continue reading “Register Renaming: The Art Of Parallel Processing”

What Happens If You Speedrun Making A CPU?

Usually, designing a CPU is a lengthy process, especially so if you’re making a new ISA too. This is something that can take months or even years before you first get code to run. But what if it didn’t have to be that way? What if one were to try to make a CPU as fast as humanly possible? That’s what I asked myself a couple of weeks ago.

Relative ROM size. Left: Stovepipe, center: [Ben Eater]’s, right: GR8CPU Rev. 2
Enter the “Stovepipe” CPU (I don’t have an explanation for that name other than that I “needed” one). Stovepipe’s hardware was made in under 4 hours, excluding a couple of small bug fixes. I started by designing the ISA, which is the simplest ISA I’ve ever made. Instead of continuously adding things to make it more useful, I removed things that weren’t strictly necessary until I was satisfied. Eventually, all that was left were 8 major opcodes and a mere 512 bits to represent it all. That is far less than GR8CPU (8192 bits), my previous CPU in this class, and still less than [Ben Eater]’s breadboard CPU (2048 bits), which is actually less flexible than Stovepipe. All that while taking orders of magnitude less time to create than either larger CPU. How does that compare to other CPUs? And how is that possible?
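As a back-of-the-envelope sanity check on those totals, a microcoded decoder ROM’s size is simply its address count times its control-word width. The opcode/step/width splits in the Python sketch below are illustrative guesses, not necessarily the real layouts, but plausible combinations land exactly on all three figures.

```python
# Decoder-ROM sizing: total bits = 2^(opcode bits + step bits) * control lines.
# The example splits below are illustrative, not the actual layouts.
def rom_bits(opcode_bits, step_bits, control_lines):
    return (2 ** (opcode_bits + step_bits)) * control_lines

print(rom_bits(3, 2, 16))  # 512  -> Stovepipe-sized: 8 opcodes, 4 steps
print(rom_bits(4, 3, 16))  # 2048 -> [Ben Eater]-sized: 16 opcodes, 8 steps
print(rom_bits(5, 4, 16))  # 8192 -> GR8CPU-sized: 32 opcodes, 16 steps
```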
Continue reading “What Happens If You Speedrun Making A CPU?”

Why X86 Needs To Die

As I’m sure many of you know, x86 architecture has been around for quite some time. It has its roots in Intel’s early 8086 processor, the first in the family. Indeed, even the original 8086 inherits a small amount of architectural structure from Intel’s 8-bit predecessors, dating all the way back to the 8008. But the 8086 evolved into the 186, 286, 386, 486, and then they got names: Pentium would have been the 586.

Along the way, new instructions were added, but the core of the x86 instruction set was retained. And a lot of effort was spent making the same instructions faster and faster. This has become so extreme that, even though the 8086 and modern Xeon processors can both run a common subset of code, the two CPUs architecturally look about as far apart as they possibly could.

So here we are today, with even the highest-end x86 CPUs still supporting the archaic 8086 real mode, where the CPU can address memory directly, without any redirection. Having this level of backwards compatibility can cause problems, especially with respect to multitasking and memory protection, but it was a feature of previous chips, so it’s a feature of current x86 designs. And there’s more!
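To make “directly” concrete: in real mode, a 16-bit segment register is shifted left four bits and added to a 16-bit offset, yielding a 20-bit physical address with no paging or protection in between. A quick Python illustration:

```python
# 8086 real-mode address calculation: physical = segment * 16 + offset,
# wrapping at 1 MB just as the original 8086's 20-bit address bus did.
def real_mode_address(segment: int, offset: int) -> int:
    return ((segment << 4) + offset) & 0xFFFFF

# Many segment:offset pairs alias the same physical byte.
print(hex(real_mode_address(0xB800, 0x0000)))  # 0xb8000, text-mode video buffer
print(hex(real_mode_address(0xB000, 0x8000)))  # 0xb8000, same location
```

That aliasing, and the total absence of protection, is exactly why real mode and multitasking don’t mix.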

I think it’s time to put a lot of the legacy of the 8086 to rest, and let the modern processors run free. Continue reading “Why X86 Needs To Die”

Minecraft In Minecraft On The CHUNGUS II

Minecraft is a simple video game. Well, it’s a simple video game that also contains all of the logic components you’d need to build a computer. And building CPUs in Minecraft is by now a long-standing tradition.

Enter CHUNGUS II. The Computational Humongous Unconventional Number and Graphics Unit by [Sammyuri] is the biggest and baddest Minecraft computer that we’ve ever seen. So big, in fact, that it was finally reasonable to think about porting a stripped-down version of Minecraft to the computer itself. Yes, that’s right, Minecraft running in Minecraft. (Video embedded below.) Writing the compiler and programming the game brought two more hackers to the party, [Uwerta] and [StackDoubleFlow], and quite honestly, we’re amazed that a team as small as three people pulled this off.

Anyway, once you’ve picked your jaw up off the floor, also check out [Sammyuri]’s video on just the CHUNGUS II computer itself. (Also embedded below.) Seeing the architecture is interesting, even if you don’t speak Redstone as fluently as our heroes here. We love that the assembler creates a block of ROM – out of Minecraft blocks – that you can then cut/paste into the game’s reality.

For a “simple” game about breaking blocks and punching trees, Minecraft has inspired hackers to make the game better both inside and outside of the virtual world. For instance, for the latest in performant open-source Minecraft servers, check out Folia. Maybe, one day, they’ll build CHUNGUS II in the real world. It could happen.

Thanks [dbcdr] for the tip!

Continue reading “Minecraft In Minecraft On The CHUNGUS II”

History Of The SPARC CPU Architecture

[RetroBytes] nicely presents the curious history of the SPARC processor architecture. SPARC, short for Scalable Processor Architecture, defined some of the most commercially successful RISC processors of the 1980s and 1990s. SPARC was initially developed by Sun Microsystems, the company most of us associate with it, but while most computer architectures are controlled by a single company, SPARC was championed by dozens of players. The history of SPARC is not simply the history of Sun.

A Reduced Instruction Set Computer (RISC) design is based on an Instruction Set Architecture (ISA) with a limited number of simple instructions, while a Complex Instruction Set Computer (CISC) is based on an ISA comprising more, and more complex, instructions. Because RISC instructions are simpler, it generally takes a longer sequence of them to complete the same task that a CISC machine handles with a few complex instructions (see the sketch below). The trade-off is that the simple RISC instructions usually run faster (at a higher clock rate) and in a highly pipelined fashion. Our overview of the modern ISA battles presents how the days of CISC are essentially over. Continue reading “History Of The SPARC CPU Architecture”
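To make that instruction-count trade-off concrete, here’s a toy Python sketch of a made-up machine (hypothetical, not any real ISA, and certainly not SPARC itself): one CISC-style memory-to-memory add does the same work as a four-instruction RISC-style load/add/store sequence.

```python
# A toy illustration of the RISC vs. CISC instruction-count trade-off on
# a made-up machine. Both versions compute mem[0x30] = mem[0x10] + mem[0x20].

mem = {0x10: 5, 0x20: 7, 0x30: 0}  # tiny data memory
regs = {}                          # register file

# CISC style: one memory-to-memory instruction does everything.
def add_mem(dst, a, b):
    mem[dst] = mem[a] + mem[b]

# RISC style: memory is touched only by loads and stores,
# and arithmetic happens between registers.
def load(rd, addr):  regs[rd] = mem[addr]
def store(rs, addr): mem[addr] = regs[rs]
def add(rd, ra, rb): regs[rd] = regs[ra] + regs[rb]

add_mem(0x30, 0x10, 0x20)          # CISC: 1 instruction
print(mem[0x30])                   # 12

load("r1", 0x10)                   # RISC: 4 simpler instructions
load("r2", 0x20)
add("r3", "r1", "r2")
store("r3", 0x30)
print(mem[0x30])                   # 12
```

The RISC sequence is longer, but every step is trivial to decode, which is exactly what makes a high clock rate and deep pipelining practical.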