It isn’t for everyone, but if you work much with computers at a low level, you’ll probably sooner or later entertain the idea of creating your own CPU. There was a time when that was a giant undertaking, but with today’s tools and FPGAs it is… well, not easy, but certainly easier. If you have the urge to try your own, you might have a look at [Simply Explained’s] video series called “Building Scott’s CPU.”
The 11 videos cover everything from basic transistor logic to sequential circuits, then move on to things like ALUs, clock units, and how jump instructions work.
We are guessing more videos are forthcoming. However, these 11 videos are already over two hours of content, and that’s plenty to get you started. Of course, everyone who does this usually focuses on one type of architecture by necessity, but there are many ways you can design a CPU. Many homebrew designs are simple multiple-clocks-per-instruction designs. A few use pipelining to get one instruction per clock once the pipeline fills up. Modern CPUs do lots of tricks to execute, on average, multiple instructions per clock cycle, but that complicates a lot of things.
Then there are the non-traditional architectures like single-instruction computers or asynchronous CPUs. The point is, once you know how a basic CPU works, there is still plenty of room to innovate in your own design.
We’ve been through this exercise more than once and — in our opinion — the hardest part isn’t creating the CPU. It is building all the ancillary tools you need to do anything useful. There are some hacks to make that easier. On the other hand, it is possible to do everything from A to Z.
Another excellent series on the topic by Ben Eater: https://www.youtube.com/watch?v=HyznrdDSSGM&list=PLowKtXNTBypGqImE405J2565dvjafglHU
Seconded. This is a great series.
The long-abandoned “Lets Build 8 Bit Computer” is also very helpful:
https://www.youtube.com/playlist?list=PLNUL7DzXzp_J4TztZYOVtayTfp0DV1z5H
Designing one’s own architecture is surprisingly simple and surprisingly hard at the same time.
It is simple to design a Turing-complete architecture; not much is needed.
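To put a number on “not much”: a SUBLEQ machine (subtract and branch if the result is less than or equal to zero), the classic single-instruction computer, is Turing complete, and an interpreter for it fits in a few lines. This is my own toy illustration in Python, not anything from the videos above:

def subleq(mem, pc=0):
    """Run a SUBLEQ program until it branches to a negative address."""
    while pc >= 0:
        a, b, c = mem[pc], mem[pc + 1], mem[pc + 2]
        mem[b] -= mem[a]                    # the single operation
        pc = c if mem[b] <= 0 else pc + 3   # conditional branch
    return mem

# Demo: add mem[9] into mem[10] via a scratch cell at mem[11], then halt
# by branching to -1. Layout: three instructions, then the data.
mem = [9, 11, 3,    # tmp -= A   (tmp becomes -7, branch just falls through)
       11, 10, 6,   # Z   -= tmp (so Z += 7)
       11, 11, -1,  # tmp -= tmp (zero it) and halt
       7, 5, 0]     # A=7, Z=5, tmp=0
subleq(mem)
assert mem[10] == 12

Everything else (addition, copying, jumps) is built from that one instruction, which is exactly why Turing completeness alone says nothing about practicality.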
It is likewise simple to design an “in theory” high-performance architecture if one sets unrealistic requirements on instruction completion times, or an unrealistically high clock speed. A lot of people get lured astray by not thinking about the fact that everything needs to be built in hardware for it to be practical. (As an Intel engineer said at an anniversary of the 4004, roughly: “If you can build a branch prediction unit that is 99.9999% accurate, it still isn’t of interest unless it also falls within the transistor budget, or at least gets close.” I don’t remember the exact quote, though.)
Then it is also really simple to rebuild the 6502 with some minor amendments and call it “one’s own,” since the 6502, among a few other architectures, is quite a popular textbook example of how CPUs work. But there are many ways to skin a cat.
I myself have, just for the fun of it, tried designing very non-traditional architectures, usually focusing on a few aspects of computing and exploring that segment.
For example: what happens if one builds a CPU with a scratchpad instead of a cache, where execution happens exclusively out of scratchpad memory while main RAM is fetched from manually, mostly relegating caching to the software level? This has some pros and cons. The main downside is that memory management becomes more important for application performance, and there is a fair bit of overhead associated with software caching. However, SRAM is a lot cheaper per MB than cache, so the freed-up transistors can be spent on more core performance to compensate for the computing overhead.
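As a toy illustration of what “caching at the software level” means in practice (my own sketch with invented names, not the commenter’s actual design), here the program itself must stage blocks between main RAM and a small scratchpad before computing on them:

SCRATCHPAD_WORDS = 256

class ScratchpadCore:
    def __init__(self, main_ram):
        self.ram = main_ram
        self.spad = [0] * SCRATCHPAD_WORDS

    def fetch_block(self, ram_addr, spad_addr, n):
        """Software 'cache fill': an explicit copy from RAM to scratchpad."""
        self.spad[spad_addr:spad_addr + n] = self.ram[ram_addr:ram_addr + n]

    def writeback_block(self, spad_addr, ram_addr, n):
        """Software 'eviction': an explicit copy back to RAM."""
        self.ram[ram_addr:ram_addr + n] = self.spad[spad_addr:spad_addr + n]

    def scale(self, spad_addr, n, k):
        """A compute step: only ever touches the scratchpad."""
        for i in range(spad_addr, spad_addr + n):
            self.spad[i] *= k

# The management burden mentioned above: the *program* decides what lives
# in scratchpad and when, instead of a hardware cache deciding for it.
core = ScratchpadCore(main_ram=list(range(1024)))
core.fetch_block(ram_addr=512, spad_addr=0, n=64)
core.scale(spad_addr=0, n=64, k=3)
core.writeback_block(spad_addr=0, ram_addr=512, n=64)
assert core.ram[512] == 512 * 3

Every fetch_block/writeback_block pair is the software-caching overhead the comment refers to; get the staging wrong and performance falls off a cliff.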
There are also other aspects of hardware design.
For instance, how does one handle registers in an out-of-order architecture executing multiple instructions in parallel? There isn’t one specific best way to do this; the road is littered with different approaches, each with its own pros and cons.
There is also the question of how the decoder should output instruction information to the rest of the core. Here the out-of-order queue will usually act as a holdover, and register renaming has honestly been the only sane implementation I have figured out so far. Another way is to just sprinkle in extra registers, skip register renaming, and ignore the fact that register contents can end up incorrect from a chronological perspective; i.e., don’t read from a register close to where one writes to it. But this really throws a wrench into serial instruction rate, so unless the code is exceptionally parallel, this approach is garbage.
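For the renaming approach, here is a minimal sketch of the bookkeeping involved (illustrative only, not the commenter’s implementation; real hardware does this with parallel table lookups and a free list): a rename table maps architectural registers to physical ones, so two writes to the same architectural register stop colliding.

class Renamer:
    def __init__(self, n_arch, n_phys):
        self.table = list(range(n_arch))         # arch reg -> phys reg
        self.free = list(range(n_arch, n_phys))  # unallocated phys regs

    def rename(self, dst, srcs):
        """Rename one decoded instruction: sources first, then the dest."""
        phys_srcs = [self.table[s] for s in srcs]  # read current mappings
        new_dst = self.free.pop(0)                 # fresh physical register
        self.table[dst] = new_dst                  # later readers see this
        return new_dst, phys_srcs

r = Renamer(n_arch=4, n_phys=16)
# Two back-to-back writes to r1 (a write-after-write hazard in program
# order) get distinct physical registers, so both can be in flight at once:
d1, s1 = r.rename(dst=1, srcs=[2, 3])    # r1 = r2 + r3
d2, s2 = r.rename(dst=1, srcs=[1, 2])    # r1 = r1 + r2 (reads d1's mapping)
assert d1 != d2 and s2[0] == d1

The price is exactly the resource cost mentioned above: a physical register file larger than the architectural one, plus logic to recycle physical registers once no in-flight instruction can still read them.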
But in general, every architecture follows the same basic principle for decoding and executing instructions (a toy sketch in code follows this list):
1. Read an instruction from somewhere. (Usually from main memory, at an address dictated by a program counter.)
2. Figure out what it means. (A lookup table is the simplest approach, and even some advanced architectures are that simple.)
3. Fetch the resources needed to execute it. (Data is usually read from registers, or hard-coded as part of the instruction; architectures seldom read operands directly from main memory, since access times are long.)
4. Execute the instruction. (x = f(a, b), but more terms can be had if desired; it is your architecture.)
5. Place the result somewhere. (Usually back into registers, but throwing data into main memory works too, since latency isn’t really important here.)
6. Repeat from 1.
Handling IF statements is more or less just step 4 amending the next step 1, usually by conditionally squeezing a new value into the program counter.
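To make the six steps concrete, here is a minimal sketch in Python of a toy CPU; the (opcode, dst, a, b) instruction format and the JNZ branch are invented for illustration, not from any particular architecture:

def run(program, regs):
    alu = {"ADD": lambda a, b: a + b,       # 2. decode: a simple lookup table
           "SUB": lambda a, b: a - b}
    pc = 0
    while True:
        op, dst, a, b = program[pc]         # 1. read the instruction at the PC
        if op == "HALT":
            return regs
        if op == "JNZ":                     # branch: step 4 amending step 1
            pc = dst if regs[a] != 0 else pc + 1
            continue
        x, y = regs[a], regs[b]             # 3. fetch operands from registers
        result = alu[op](x, y)              # 4. execute: x = f(a, b)
        regs[dst] = result                  # 5. place the result in a register
        pc += 1                             # 6. repeat from step 1

# Count r0 down from 3 to 0 by repeatedly subtracting r1:
regs = run([("SUB", 0, 0, 1),    # r0 = r0 - r1
            ("JNZ", 0, 0, 0),    # while r0 != 0, jump back to instruction 0
            ("HALT", 0, 0, 0)],
           regs=[3, 1])
assert regs[0] == 0

Note that the branch really is just a conditional write to pc; everything else in the loop is the same five steps every time.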
One can of course make things more complex if desired.
My own hobby architecture sends decoded instructions first into a fetch queue, and then into the main out-of-order queue, before they are processed. Register contents are provided continuously to both queues. The fetch queue mainly traps memory reads, letting them get handled there so their data can be pushed to registers. The reason these don’t go into the out-of-order queue is that every available position in the out-of-order queue is quite resource-expensive: we want to be able to dispatch from any available queue position into any available execution unit (actually an oversimplification, since things are grouped a bit based on function, into so-called ports), while also making sure that every queue slot can state and access its dependencies, handle instruction prioritization due to branches, and support a few other performance-enhancing features.
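A drastically simplified toy of that two-queue idea (the cycle model, ready flags, and issue policy here are my inventions for illustration, not the actual design described above): loads resolve in the fetch queue and never occupy an out-of-order slot, while the out-of-order queue issues any instruction whose operands are ready, in whatever order that happens.

from collections import deque

def simulate(decoded, regs, ram, load_latency=3):
    ready = {r: True for r in regs}   # which registers hold valid data
    pending = []                      # in-flight loads: (finish, dst, value)
    fetch_q, ooo_q = deque(decoded), []
    issued = []                       # order the ALU ops actually ran in
    for cycle in range(32):           # toy fixed-length simulation
        # retire any loads whose (pretend) memory latency has elapsed
        for load in [p for p in pending if p[0] <= cycle]:
            _, dst, val = load
            regs[dst], ready[dst] = val, True
            pending.remove(load)
        # decode one instruction per cycle; the fetch queue traps memory
        # reads so they never occupy an expensive out-of-order queue slot
        if fetch_q:
            op = fetch_q.popleft()
            if op[0] == "LOAD":                   # ("LOAD", dst, address)
                _, dst, addr = op
                ready[dst] = False
                pending.append((cycle + load_latency, dst, ram[addr]))
            else:                                 # ("ADD", dst, a, b)
                ready[op[1]] = False
                ooo_q.append(op)
        # out-of-order issue: run ANY queued op whose operands are ready
        for op in list(ooo_q):
            _, dst, a, b = op
            if ready[a] and ready[b]:
                regs[dst], ready[dst] = regs[a] + regs[b], True
                ooo_q.remove(op)
                issued.append(dst)
    return issued

regs, ram = {"r0": 0, "r1": 5, "r2": 0, "r3": 0}, [37]
order = simulate([("LOAD", "r0", 0),           # slow memory read
                  ("ADD", "r2", "r0", "r1"),   # depends on the load
                  ("ADD", "r3", "r1", "r1")],  # independent: issues first
                 regs, ram)
assert order == ["r3", "r2"] and regs["r2"] == 42 and regs["r3"] == 10

The independent ADD overtakes the one stuck waiting on memory, which is the whole point of the out-of-order queue; all the expensive per-slot machinery described above exists to make that overtaking safe.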
In the end, there are many ways to work with homebrew computer architectures.
To a degree, it can be more fun and interesting to look at a problem in a vacuum, so to speak, instead of reading about how someone else did it. The reading one can do later, after giving the problem an iteration or three oneself; sometimes one finds a better solution for one’s own architecture that way instead of copy-pasting something less suitable.
There is a game on Steam called Turing Complete. You start at basic logic gates and work your way up to building and programming a computer. I have no relation to the game other than being someone who bought it and found it fun.