Because I often work with students, I’m always on the lookout for a simple CPU, preferably in Verilog, in the Goldilocks zone. That is, not too easy and not too hard. I had high hopes for this 16-bit RISC processor presented by [fpga4student], but without some extra work, it probably isn’t usable for its intended purpose.
The CPU itself is pretty simple and fits on a fairly long web page. However, the details about it are a bit sparse. This isn’t always a bad thing. You can offer students too much help. Then again, you can also offer too little. Worse, one of the modules needed to get it to work was missing! You might argue it was an exercise left to the reader, but if so, it probably should have been labeled that way.
At first, I was ready to delete the bookmark and move on. Then I decided that the process of fixing this design and doing a little analysis on it might actually be more instructive than just studying a fully working design. So I decided to share my fix with you and look inside the architecture a bit more. On top of that, I’ll show you how to get the thing to run in an online simulator so you can experiment with no software installation. Of course, if you are comfortable with a Verilog toolchain (like the ones from Xilinx or Altera, or even free ones like Icarus or CVer) you should have no problem making that work, either. This time I’ll focus on how the CPU works and next time I’ll show you how to simulate it with some free tools.
Let’s start with a block diagram of the CPU. It isn’t much different from other RISC architectures, especially any that don’t use a pipeline. A program counter (PC) drives the instruction memory. There’s a dedicated adder to add two to the PC for each instruction because each 16-bit instruction is two bytes. A mux lets you load the PC with the address of the next instruction or with a jump target (which can be an absolute jump, a computed branch, or a return address). There’s another dedicated adder for the computed branches.
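To make the PC logic concrete, here is a minimal sketch of how the two adders and the mux might fit together. This is not the original design's code. The module name, the port names, and the `INSTR_BYTES` parameter are all my own inventions for illustration, and I've left out the return address path to keep it short.

```verilog
// Sketch of next-PC selection for a non-pipelined CPU.
// All names here are illustrative, not taken from the original design.
module next_pc #(
    parameter INSTR_BYTES = 2          // assumed: one 16-bit instruction = 2 bytes
) (
    input  wire [15:0] pc,             // current program counter
    input  wire [15:0] branch_offset,  // byte offset for a computed branch
    input  wire [15:0] jump_target,    // absolute jump address
    input  wire        take_branch,    // high when a branch condition is met
    input  wire        take_jump,      // high for an absolute jump
    output wire [15:0] pc_next         // value to load into the PC
);
    // Dedicated adder: address of the instruction that follows this one.
    wire [15:0] pc_plus   = pc + INSTR_BYTES;
    // Second dedicated adder: target of a computed branch.
    wire [15:0] pc_branch = pc_plus + branch_offset;

    // The mux: a jump wins, then a taken branch, otherwise fall through.
    assign pc_next = take_jump   ? jump_target :
                     take_branch ? pc_branch   :
                                   pc_plus;
endmodule
```

Giving the jump priority over the branch is an arbitrary choice here. In a real control unit, only one of the two select signals should be active for any given instruction.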
The processing occurs in an arithmetic logic unit (ALU) that performs different operations. The destination can be main memory or one of the registers. The register file uses an old trick to avoid a common problem. Suppose you can read one register per cycle. If you only allow one register in an instruction, that’s fine. But if you allow an instruction to do something like add two registers, you’ll have trouble loading both of them unless you stretch out the instruction time. That’s why the register file has two output ports.
The truth is, the register file is at least one spot where the design would not synthesize to real hardware as well as it could. For one thing, there’s a for loop in the initial block to zero the registers. Most synthesis tools would just throw that away. You’d be better off with a reset signal. The other possible issue depends on what exact FPGA you will target and what tools you use.
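As a rough idea of what a reset-based register file could look like, here is a sketch. I'm assuming eight 16-bit registers and a synchronous, active-high reset. Neither detail comes from the original post, and the module and port names are made up.

```verilog
// Register file that clears on reset instead of relying on an initial block.
// Register count, reset style, and all names are assumptions for illustration.
module reg_file_reset (
    input  wire        clk,
    input  wire        rst,      // synchronous, active-high reset (assumed)
    input  wire        we,       // write enable
    input  wire [2:0]  waddr,    // write address
    input  wire [15:0] wdata,    // write data
    input  wire [2:0]  raddr1,   // read port 1 address
    input  wire [2:0]  raddr2,   // read port 2 address
    output wire [15:0] rdata1,
    output wire [15:0] rdata2
);
    reg [15:0] regs [0:7];
    integer i;

    always @(posedge clk) begin
        if (rst) begin
            // Same loop as before, but now it describes real reset hardware.
            for (i = 0; i < 8; i = i + 1)
                regs[i] <= 16'd0;
        end else if (we) begin
            regs[waddr] <= wdata;
        end
    end

    // Two read ports looking at the same storage.
    assign rdata1 = regs[raddr1];
    assign rdata2 = regs[raddr2];
endmodule
```

One trade-off to keep in mind: clearing every register in a single cycle generally forces the tools to build the file out of flip-flops rather than RAM cells.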
The designer provides two read ports to the registers, but the underlying storage is the same. This would make it difficult to use specialized RAM cells if they were available. Another common technique is to simply use two separate register blocks, one for each read port. A write will send data to both blocks so from the outside you can’t tell the difference. Frequently, though, this will result in a faster and more compact design.
It would be interesting (and not very difficult) to rewrite the register file to do this. However, if you aren’t going to build down to hardware you probably won’t notice any difference.
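If you want to try it, a rewrite along these lines might look something like the sketch below. Again, the eight-register size and all the names are my assumptions, not the original author's code.

```verilog
// Two identical register banks, one per read port.
// Every write goes to both banks, so they always hold the same data.
// Sizes and names are assumptions for illustration.
module reg_file_dual (
    input  wire        clk,
    input  wire        we,       // write enable
    input  wire [2:0]  waddr,    // write address
    input  wire [15:0] wdata,    // write data
    input  wire [2:0]  raddr1,   // read port 1 address
    input  wire [2:0]  raddr2,   // read port 2 address
    output wire [15:0] rdata1,
    output wire [15:0] rdata2
);
    reg [15:0] bank_a [0:7];     // serves read port 1 only
    reg [15:0] bank_b [0:7];     // serves read port 2 only

    always @(posedge clk) begin
        if (we) begin
            bank_a[waddr] <= wdata;
            bank_b[waddr] <= wdata;
        end
    end

    // Each bank now needs just one read port and one write port.
    assign rdata1 = bank_a[raddr1];
    assign rdata2 = bank_b[raddr2];
endmodule
```

There is no initial block or reset here, so in simulation the registers start out as unknown (x) until something writes them. Whether this maps to dedicated RAM cells depends on your FPGA and tools, and some synthesis tools will do this duplication for you anyway.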
Like most similar CPUs, the whole control works out to muxes selecting what data gets sent where. In particular, there are four muxes in the processor’s data path:
- PCSrc – Routes the “next” PC value to the program counter
- RegDst – Selects what register to write from two fields in the instruction (the diagram shows three inputs, but that appears to be an error)
- BSrc – Selects the second argument to the ALU (either an immediate value or a register value)
- WBSrc – The “write back” mux selects what data is sent back to the registers for writing
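Three of those muxes are simple two-way selections, which you could sketch like this (the PC mux is the odd one out because it has more sources). The select signals use the names from the design's control unit, but the module name, the data port names, and the polarity of `reg_dst` are my guesses for illustration.

```verilog
// The three two-input data path muxes as continuous assignments.
// Data port names and the reg_dst polarity are assumptions.
module datapath_muxes (
    input  wire [2:0]  field_a,     // hypothetical register field in the instruction
    input  wire [2:0]  field_b,     // hypothetical second register field
    input  wire        reg_dst,     // RegDst select
    input  wire [15:0] reg_data2,   // second register file read port
    input  wire [15:0] imm_ext,     // immediate value, already extended to 16 bits
    input  wire        alu_src,     // BSrc select
    input  wire [15:0] alu_result,  // ALU output
    input  wire [15:0] mem_data,    // data read from memory
    input  wire        mem_to_reg,  // WBSrc select
    output wire [2:0]  write_reg,   // register number to write
    output wire [15:0] alu_b,       // second ALU operand
    output wire [15:0] write_back   // data written to the register file
);
    // RegDst: which instruction field names the destination register.
    assign write_reg  = reg_dst    ? field_b  : field_a;
    // BSrc: immediate value or register value.
    assign alu_b      = alu_src    ? imm_ext  : reg_data2;
    // WBSrc: memory data or ALU result.
    assign write_back = mem_to_reg ? mem_data : alu_result;
endmodule
```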
The rest of the design shows the thirteen instructions, the five instruction formats, and the control signals required for each of the formats. The nuances of the instructions in each category depend on what the ALU is set to do. In other words, an add instruction and a subtract instruction are exactly the same except for what the ALU does. As you might imagine, the ALU takes two inputs and an operation code and produces an output.
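A bare-bones ALU in that style might look like the following sketch. The operation encodings, the choice of operations beyond add and subtract, and the `zero` flag output are all assumptions on my part and don't necessarily match the original.

```verilog
// Minimal ALU: two inputs, an operation code, one result.
// Encodings and the zero flag are assumptions for illustration.
module alu_sketch (
    input  wire [15:0] a,
    input  wire [15:0] b,
    input  wire [2:0]  op,       // hypothetical operation code
    output reg  [15:0] result,
    output wire        zero      // high when the result is all zeros
);
    localparam OP_ADD = 3'd0,
               OP_SUB = 3'd1,
               OP_AND = 3'd2,
               OP_OR  = 3'd3;

    always @(*) begin
        case (op)
            OP_ADD:  result = a + b;
            OP_SUB:  result = a - b;   // same data path as add, different op
            OP_AND:  result = a & b;
            OP_OR:   result = a | b;
            default: result = 16'd0;   // unused codes produce zero
        endcase
    end

    assign zero = (result == 16'd0);
endmodule
```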
The original post doesn’t actually say which instructions are in which category, but it is pretty easy to puzzle out. The Load and Store instructions are in the memory access formats. The Branch on Equal and Not Equal instructions are in the branch category. The Jump instruction has its own format. All the other instructions are “data processing.” One table shows a “Hamming distance” op code, but this doesn’t appear anywhere else, including in the code, so I suspect it is a cut and paste error.
The two tables do a good job of summarizing the operations needed to make the CPU work. There are nine distinct control signals:
- RegDst – This corresponds to the mux in the diagram of the same name and selects if the destination is a register (shows up as reg_dst in the code)
- ALUSrc – Selects the source of the ALU argument (same as the BSrc mux in the diagram, and shows up as alu_src in the code)
- MemtoReg – Active when a memory to register transfer occurs (mem_to_reg in the code)
- RegWrite – Set when the write should go to a register (reg_write in the code)
- MemRead – Set when a memory read is the source data for the instruction (mem_read in the code)
- MemWrite – Set when memory is the write destination (mem_write in the code)
- Branch – Active when a branch is in progress (combination of beq and bne signals in the code)
- ALUOp – Combined with part of the instruction, selects the operation to perform in the ALU (alu_op in the code)
- Jump – Active when a jump is in progress
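As an example of how the split branch signals could be put to use, here is one plausible way to decide whether a branch is taken. I'm assuming the ALU provides a `zero` flag after comparing the two registers. That flag and the module itself are hypothetical, and the original may wire this differently.

```verilog
// Deciding whether to take a branch from the beq/bne control signals.
// The zero flag input is an assumption about how the comparison is reported.
module branch_decide (
    input  wire beq,          // instruction is Branch on Equal
    input  wire bne,          // instruction is Branch on Not Equal
    input  wire zero,         // high when the compared registers are equal
    output wire take_branch
);
    // Take BEQ when the operands match, BNE when they don't.
    assign take_branch = (beq & zero) | (bne & ~zero);
endmodule
```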
The table corresponds directly to Verilog in the control unit except for the name changes, which is unfortunate as it makes the table a little harder to follow. For example, here is the code for a data processing instruction with opcode 0010:
    4'b0010: // data_processing
      begin
        reg_dst = 1'b1;
        alu_src = 1'b0;
        mem_to_reg = 1'b0;
        reg_write = 1'b1;
        mem_read = 1'b0;
        mem_write = 1'b0;
        beq = 1'b0;
        bne = 1'b0;
        alu_op = 2'b00;
        jump = 1'b0;
      end
Compare that to the table in the original post and you’ll see it maps directly. In English, the instruction is a read from two registers that writes back to the registers with an ALU operation code of 0 and it isn’t a jump or a branch.
Inexplicably, this block is duplicated for all the data processing instructions even though it shouldn’t be necessary. Luckily, for simulation, it won’t really matter and most synthesis tools will figure it out and merge the identical code for you.
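If the duplication bothers you, Verilog lets several case labels share one block. Here is a sketch of the idea, with default assignments up front so each case only lists what differs. Only opcode 0010 comes from the code shown in this article. The second opcode is a placeholder, and the module name is made up.

```verilog
// Control decode with shared case labels and default assignments.
// Only 4'b0010 is taken from the article; 4'b0011 is a placeholder.
module control_sketch (
    input  wire [3:0] opcode,
    output reg        reg_dst,
    output reg        alu_src,
    output reg        mem_to_reg,
    output reg        reg_write,
    output reg        mem_read,
    output reg        mem_write,
    output reg        beq,
    output reg        bne,
    output reg  [1:0] alu_op,
    output reg        jump
);
    always @(*) begin
        // Safe defaults: write nothing, no branch, no jump.
        reg_dst    = 1'b0;
        alu_src    = 1'b0;
        mem_to_reg = 1'b0;
        reg_write  = 1'b0;
        mem_read   = 1'b0;
        mem_write  = 1'b0;
        beq        = 1'b0;
        bne        = 1'b0;
        alu_op     = 2'b00;
        jump       = 1'b0;

        case (opcode)
            4'b0010,            // data processing (from the article)
            4'b0011: begin      // hypothetical second data processing opcode
                reg_dst   = 1'b1;
                reg_write = 1'b1;
            end
            default: ;          // other formats omitted from this sketch
        endcase
    end
endmodule
```

The defaults also guarantee every output is assigned on every path, which keeps the tools from inferring latches.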
In the next installment, I’ll show you how to load the design into one of my favorite quick design tools, EDA Playground. There was a missing file and some massaging necessary to get it to work with the online tool. However, the CPU does work as promised, once you figure out a few peculiarities. If you want a sneak peek at the simulation, you can check out the video, below.