Usually, designing a CPU is a lengthy process, especially so if you’re making a new ISA too. This is something that can take months or even years before you first get code to run. But what if it wasn’t? What if one were to try to make a CPU as fast as humanly possible? That’s what I asked myself a couple weeks ago.
Enter the “Stovepipe” CPU (I don’t have an explanation for that name other than that I “needed” one). Stovepipe’s hardware was made in under 4 hours, excluding a couple of small bugfixes. I started by designing the ISA, which is the simplest ISA I’ve ever made. Instead of continuously adding things to make it more useful, I removed things that weren’t strictly necessary until I was satisfied. Eventually, all that was left were 8 major opcodes and a mere 512 bits to represent it all. That is far less than GR8CPU (8192 bits), my previous CPU in this class, and still less than [Ben Eater]’s breadboard CPU (2048 bits), which is actually less flexible than Stovepipe. All that while taking orders of magnitude less time to create than either of those larger CPUs. How does that compare to other CPUs? And how is that possible?
Stovepipe was made at break-neck speeds
Like I said earlier, Stovepipe’s hardware was finished after a mere 4 hours. Add another 2 hours for the assembler I made afterwards, for a total of 6 including the programs, spread over one week. I estimate GR8CPU was originally designed in just over a year, including tooling, in occasional afternoons after school. That timespan is over 50 times longer than the week that Stovepipe was spread over. In a similar light, Boa³²’s minimum viable product (RV32I) was completed in almost exactly two months, or 8-ish weeks: still 8 times as long as Stovepipe took to make. I have no concrete numbers, of course, but I believe the real time spent in hours is even worse for both GR8CPU and Boa³²: almost certainly more than 50x and 8x Stovepipe’s 6 hours respectively (so 300 and 48 hours at the very bare minimum). How is that possible?
Because it is a simple CPU
Part of it is, of course, experience. GR8CPU, which appeared on Hackaday long before I was a writer, was my second ever microarchitecture and [Ben Eater] didn’t exactly start studying CPUs immediately after his YouTube series like I did. However, Stovepipe is also an exercise in minimalism; unlike both GR8CPU and [Ben Eater]’s, the only user-accessible register is the accumulator, and every calculation with a second operand has to go through memory. It has 256 bytes of RAM, on par with GR8CPU, but no I/O ports of any kind; all I/O must be memory-mapped. Stovepipe instructions take 1 cycle to fetch and 1-3 to run (except NOP, which takes 0 cycles to run). Like both GR8CPU and [Ben Eater]’s, it has a carry-out flag and a zero flag.
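To make that timing concrete, here is a rough Python sketch of the cycle accounting. Only the 1-cycle fetch, the 1-3 cycle execute range and the 0-cycle NOP come from the description above; the other mnemonics (`OP_A`, `OP_B`, `OP_C`) are placeholders made up for illustration, not Stovepipe’s real opcodes.

```python
FETCH_CYCLES = 1  # every instruction takes 1 cycle to fetch

# Execute-cycle counts. NOP = 0 is from the article; OP_A/OP_B/OP_C are
# hypothetical stand-ins for instructions that take 1, 2 and 3 cycles to run.
EXECUTE_CYCLES = {
    "NOP": 0,
    "OP_A": 1,
    "OP_B": 2,
    "OP_C": 3,
}


def instruction_cycles(mnemonic):
    """Total cycles for one instruction: fetch plus execute."""
    return FETCH_CYCLES + EXECUTE_CYCLES[mnemonic]


def program_cycles(program):
    """Total cycles for a straight-line list of mnemonics (no branching)."""
    return sum(instruction_cycles(m) for m in program)


if __name__ == "__main__":
    # 1 (NOP) + 2 (OP_A) + 4 (OP_C) cycles
    print(program_cycles(["NOP", "OP_A", "OP_C"]))
```

Under this model, every instruction costs between 1 and 4 cycles in total.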
Compare this to my most recent previous CPU, Boa³² (a RISC-V implementation), which is larger by a seemingly extreme amount despite being only about as powerful as modern microcontrollers. It’s 32-bit, has 31 general-purpose registers (3 of which are usually used for special purposes), a full 4GiB address space (512KiB of which contains RAM), hardware multiply/divide and atomics, etc. Most importantly, it is pipelined and has separate address and data busses, unlike Stovepipe, GR8CPU and [Ben Eater]’s, all of which are multi-cycle single-bus architectures with a dedicated address register.
But how does it perform?
Let’s compare two programs, computing the Fibonacci sequence and multiplying an 8-bit number, across three CPUs: Stovepipe, GR8CPU and Boa³². I will write them in assembly for all three, ignoring Boa³²’s hardware multiply to keep it fair. Let’s dust off the old projects for a short moment, shall we?
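For readers who haven’t written a software multiply before: the assembly itself isn’t shown here, but a common way to multiply without a hardware multiplier is shift-and-add. The Python below is a sketch of that general technique under my own assumptions (unsigned 8-bit operands, a fixed 8 passes), not a transcription of the actual benchmark code; the function names are made up for illustration. A simple 8-bit Fibonacci loop is included for the second benchmark.

```python
def multiply_u8(a, b):
    """Multiply two unsigned 8-bit values using only shifts and adds."""
    if not (0 <= a <= 0xFF and 0 <= b <= 0xFF):
        raise ValueError("operands must fit in 8 bits")
    product = 0
    for _ in range(8):   # one pass per bit of the multiplier
        if b & 1:        # low bit set: add the shifted multiplicand
            product += a
        a <<= 1          # multiplicand moves up one bit position
        b >>= 1          # expose the next multiplier bit
    return product


def fibonacci_u8():
    """Return every Fibonacci number that fits in 8 bits."""
    values = []
    a, b = 0, 1
    while a <= 0xFF:
        values.append(a)
        a, b = b, a + b
    return values


if __name__ == "__main__":
    print(multiply_u8(13, 11))
    print(fibonacci_u8())
```

The conditional add is one plausible reason a multiply loop iteration could take a variable number of cycles, though that is a guess on my part.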
CPU | Multiply set-up (cycles) | Multiply loop (cycles) | Fibonacci set-up (cycles) | Fibonacci loop (cycles) |
---|---|---|---|---|
GR8CPU | 27 | 22-38 | 24 | 40 |
Boa³² | 2 | 7-8 | 3 | 8 |
Stovepipe | 18 | 22-29 | 15 | 27 |
To my surprise, GR8CPU actually performs significantly worse than Stovepipe, mainly because it needs 3 cycles to load an instruction compared to Stovepipe’s 1. On the other hand, to absolutely nobody’s surprise, Boa³² wipes the floor with both Stovepipe and GR8CPU because of its 32 registers and pipelined nature. It executes most instructions in a single cycle spread over its 5-stage pipeline.
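To put numbers on those gaps, here is a small Python sketch that turns the loop columns of the table into ratios. The cycle counts are copied from the table; using the upper bound of each range as the "worst case" is my own choice, and the dictionary keys and function name are made up for illustration. These are cycle counts only and say nothing about achievable clock speed.

```python
# (min, max) cycles per loop iteration, taken from the table above.
CYCLES = {
    "GR8CPU":    {"multiply_loop": (22, 38), "fibonacci_loop": (40, 40)},
    "Boa32":     {"multiply_loop": (7, 8),   "fibonacci_loop": (8, 8)},
    "Stovepipe": {"multiply_loop": (22, 29), "fibonacci_loop": (27, 27)},
}


def worst_case_ratio(slower, faster, column):
    """How many times more cycles `slower` needs than `faster`, worst case."""
    return CYCLES[slower][column][1] / CYCLES[faster][column][1]


if __name__ == "__main__":
    for column in ("multiply_loop", "fibonacci_loop"):
        gr8 = worst_case_ratio("GR8CPU", "Stovepipe", column)
        boa = worst_case_ratio("Stovepipe", "Boa32", column)
        print(f"{column}: GR8CPU/Stovepipe = {gr8:.2f}x, "
              f"Stovepipe/Boa32 = {boa:.2f}x")
```

By this measure GR8CPU needs roughly 1.3 to 1.5 times as many cycles per loop iteration as Stovepipe, and Stovepipe needs roughly 3.4 to 3.6 times as many as Boa³².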
Conclusion
Trying to speedrun making a CPU was clearly a success given the scope; in merely 4 total hours, I made a CPU that outperforms my old 8-bit CPU while being much smaller. The whole exercise shows that simpler is sometimes better, though not always, because the speed-optimized Boa³² beats the size-optimized Stovepipe in a landslide performance victory. Stovepipe, however, completely demolishes most CPUs I know in terms of size; [Ben Eater]’s, GR8CPU and better-known CPUs like the 8086, 6502, Z80, etc. are all easily defeated by Stovepipe in this respect. That’s not a world record, though; I believe that [olofk]’s SERV CPU is smaller than Stovepipe, though I cannot make a direct comparison because Stovepipe exists only in a logic simulator.
By the way: If I do ever do a Stovepipe 2, I’ll record the entire time with an actual speedrun timer ;)
FPGA?
I would say yes. Breadboarding even a simple CPU like this one on TTLs would definitely take more time.
Good, now make this CPU able to work in a parallel fashion.
The article reads like the CPU wasn’t implemented, which is fine, but the title reads like it was. Comparing cycle count for specific programs isn’t super fair, as one could design a CISC CPU with arbitrarily long delay paths that would be either unroutable or have arbitrarily slow clock speeds. I imagine this isn’t the case here due to Stovepipe’s simplicity, but it would be great to see if it can place&route reasonably well.
Maybe there’ll be a follow-up article! I’d also love to see Olof’s SERV core thrown into the mix ;)
Stovepipe was made in Logisim, a digital logic simulator by cburch. I have no plans to port Stovepipe to Verilog nor do I have plans to build it out of TTL because those are for a new (non-speedrun) size-optimized CPU called Faerie.
okay, cool, no doubt, seriously, I like these kinds of projects (fast or slow).
But “Stovepipe’s hardware was made in under 4 hours” and no pictures… I’d like to see some pictures or is there no physical hardware, if so then I’m confused. I assume this is the first article in a series?
Please tell us more…
My bad, I added a screenshot of it for you ;)
Ok, now synthesize it, let me know what frequency it can run at, and its performance relative to something else for some basic benchmarks.
It’s just another paper cpu.