Hundreds of variations of open-source CPUs written in an HDL seem to float around the internet these days (and that’s a great thing). Many are RISC-V, an open-source instruction set (ISA), and are small toy processors useful for learning and small tasks. However, if you’re [Paul Campbell], you go for a high-end super-scalar, out-of-order, speculative, 8 IPC monster of a RISC-V CPU known as VRoom!.
That might seem a bit like word soup to the uninitiated in the processor design world (which is admittedly relatively small), but what makes this different from VexRiscv is the scale and complexity. Rather than executing one instruction at a time sequentially, it executes multiple instructions, completing them concurrently in whatever order it can handle. VexRiscv is a good 32-bit modular design that can run Linux. It pulls a solid 1.57 DMIPS/MHz with everything turned on. VRoom already clocks in at a mighty 6.5 DMIPS/MHz, with more performance gains expected. It peaks at 8 instructions every clock cycle with a dual register file and a clever committing system to keep up.
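To make "whatever order it can handle" a little more concrete, here is a toy Python model of out-of-order issue. It is only a sketch under our own assumptions, not how VRoom!'s scheduler is built: the `Instr` type, the `issue_cycles` function, the register names, and the latencies are all made up for illustration. Each instruction issues as soon as the instructions it truly depends on have produced their results, with at most `issue_width` instructions issuing per cycle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    dest: str            # register written (hypothetical names)
    srcs: tuple          # registers read
    latency: int = 1     # cycles until the result is usable (assumed >= 1)


def issue_cycles(program, issue_width):
    """Return {instruction name: cycle it issues} for a toy out-of-order core."""
    # Link each source register to the most recent earlier writer.
    # This stands in for register renaming, so only true dependencies remain.
    last_writer = {}
    deps = []
    for index, ins in enumerate(program):
        deps.append([last_writer[s] for s in ins.srcs if s in last_writer])
        last_writer[ins.dest] = index

    done_at = {}                      # instruction index -> cycle its result is ready
    issued = {}
    pending = list(range(len(program)))
    cycle = 0
    while pending:
        slots = issue_width
        for index in list(pending):   # scan oldest first
            if slots == 0:
                break
            # Ready only if every producer has already finished.
            if all(done_at.get(d, float("inf")) <= cycle for d in deps[index]):
                issued[program[index].name] = cycle
                done_at[index] = cycle + program[index].latency
                pending.remove(index)
                slots -= 1
        cycle += 1
    return issued


program = [
    Instr("load", "r1", ("r0",), latency=3),
    Instr("add", "r2", ("r1",)),   # must wait for the load
    Instr("mul", "r3", ("r4",)),   # independent of the load
    Instr("sub", "r5", ("r3",)),   # depends only on the mul
]
result = issue_cycles(program, issue_width=2)
print(result)
```

With an issue width of 2, `mul` issues alongside `load` in cycle 0 and `sub` follows in cycle 1, while `add` sits waiting for the slow load until cycle 3. The younger instructions overtake the older one, which is the whole idea.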
VRoom is written in SystemVerilog to leverage Verilator (a handy linting and simulation framework), and while there is some C that generates different files, we’d wager it is pretty run-of-the-mill compared to a TypeScript-based project. VRoom currently boots Linux thanks to an AWS-FPGA instance (a Xilinx VU9P Ultrascale), though it has to be trimmed to fit. [Paul] has big plans, working his way up to a server-class chip with lots of cores and a huge cache.
It’s all on GitHub under a GPLv3 license; go check it out! [Paul] also has a talk with lots of great details. If you’re interested in getting into RISC-V but a server-class chip isn’t your speed, we heard Espressif is starting to use RISC-V cores in their ever-popular ESP series.
Paul Campbell here – AMA
A minor nit – I’m currently building in Verilator rather than Icarus (I’ve started using System Verilog Interfaces) – I’d love to still be using Icarus (I’m a big fan of ‘X’ values in simulations for finding bugs).
I released some minor changes (noted in the blog): DMips/MHz is up to 6.5. I’m in the middle of architectural tuning, and I expect it to continue to increase.
Thanks for highlighting you’re keeping a blog.
I’d love to subscribe, but I cannot find RSS or Atom feed for it.
Rats I thought I’d fixed that – I’ll go and poke at it a bit more tomorrow – sorry
I quite liked your explanation of why x values are handy for both verification and synthesis. Good luck finding funding/hardware to keep developing this on.
I wonder if there’s any good open source tooling that splits a design across multiple FPGAs. The stuff I’m familiar with is all proprietary, and professionally I switched to firmware a few years ago.
I think that splitting something across chip boundaries is a tough problem. If you want real speed you likely need to pipeline the die crossings with flops at each end, and to minimise the number of wires. This is probably something you are going to need to actually build into your architecture.
You could try SV unions. They are about as good as interfaces. Modports are nice in theory, but generally not worth the hassle.
The big thing I’m after is arbitrary sized arrays of structured things – if I could pass arrays of structures I’d be happy. The main problem I’m trying to deal with using interfaces is how to pass N instances of an interface that can be parameterised at compile time (for example, building a system with 6 address units and a 6-read-port TLB for simulation, and a 4 address unit system with 4 TLB read ports for the FPGA testing environment). This design is heavily parameterised so I can do quick architectural exploration, but simple Verilog makes that hard in some areas.
Thank you! Made a slight tweak to reflect this. There was a ~ in front of 6.5 in your notes, so I didn’t want to misquote you if you were still playing around with it. The Icarus is just a mistake on my part, sorry.
No problem – since I’d just written a note about having to give up Icarus I wanted to give credit where it’s due – Icarus is still a great simulator. The ~6.5 (really 6.49) was only announced in yesterday’s blog post – it’s still a moving target – next big change will be a couple of weeks out.
Have you heard of the work being done by LibreSoC?
That’s pretty cool! Do you see this going into FPGAs or ASICs more? The picture you have inside a Xilinx/AMD FPGA is pretty cool too. That seems pretty performant vs arm cores, and love that it’s open source.
This is more something that one would build as an ASIC – for this sort of thing FPGAs are more a tool to get lots of testing done – it really needs actual datapaths and an ASIC.
It’s rather large for FPGA use. You’d want a small and efficient core design for an FPGA project, not one that takes up most of a 2.5M LE Virtex (in cut down form, for testing).
I wondered whether, because you’re starting from the ground up, you were able to avoid the security exposures in speculative execution.
Out of order execution and speculative execution aren’t mutually inclusive.
Out of order is about executing instructions that have all their data available.
Speculative execution, by contrast, runs instructions before it is known whether they should run at all, typically with regard to branches and branch prediction.
Though a lot of the issues with speculative execution are not about executing a branch that shouldn’t have been taken, but rather about the fact that a lot of CPUs just didn’t check whether the thread was allowed to read what it asked for in this edge case, likely for overall performance reasons.
Ok, and? That’s not relevant to this discussion. We know that Vroom has speculative execution. It’s stated in the 3rd sentence of this article, and in the side blurb on the Vroom website.
I believe the point here was that it’s out-of-order execution rather than speculative execution.
But it is speculative execution. This is explicitly said many times. At no point was Vroom being capable of speculative execution a question.
The security problems were with speculative fetches not checking for access permissions before the fetch. Speculative execution isn’t bad per se, you just need to ensure that the security checks also happen in the correct sequence.
There’s more to it than just not speculating past privilege (though that’s important). You can potentially leak information by doing clock-level timing of how long things took, and therefore discover whether or not something is in a cache, or whether a test in another privilege level succeeded or not (did it hit in the BTC?). That leaks a bit of data, which is why VRoom! spends the gates to have multiple BTCs for each priv mode.
There’s actually a slide on that in the architectural talk – it’s particularly important if you’re aiming for something really big running VMs.
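As a rough illustration of why splitting the BTC by privilege mode helps, here is a minimal Python sketch. It assumes "BTC" means a branch target cache and reads "multiple BTCs" as one separate table per mode; the class names, mode names, and addresses are hypothetical and say nothing about the actual VRoom! hardware. A real attacker would infer the hit from timing rather than call a lookup function, but the visibility question is the same.

```python
class SharedBTC:
    """One branch target table used by every privilege mode."""

    def __init__(self):
        self.table = {}

    def train(self, mode, pc, target):
        self.table[pc] = target          # mode is ignored: everyone shares state

    def lookup(self, mode, pc):
        return self.table.get(pc)


class PerModeBTC:
    """A separate branch target table for each privilege mode."""

    def __init__(self, modes=("user", "supervisor", "machine")):
        self.tables = {mode: {} for mode in modes}

    def train(self, mode, pc, target):
        self.tables[mode][pc] = target   # only this mode's table changes

    def lookup(self, mode, pc):
        return self.tables[mode].get(pc)


def user_sees_supervisor_branch(btc):
    """Train a branch in supervisor mode, then probe the same PC from user mode."""
    btc.train("supervisor", 0x8000_1000, 0x8000_2000)
    return btc.lookup("user", 0x8000_1000) is not None


shared_leaks = user_sees_supervisor_branch(SharedBTC())
per_mode_leaks = user_sees_supervisor_branch(PerModeBTC())
print(shared_leaks, per_mode_leaks)
```

In the shared table, user mode can observe that supervisor code took a particular branch; with per-mode tables that one bit never becomes visible across the boundary, at the cost of extra storage.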
It’s a REALLY hard problem. I’ve been able to learn from others’ mistakes: for example, we won’t speculate past a TLB miss (or fault) and fill a cache line. Those things tend to leave performance on the floor. The other thing I’ve been doing is trying to muddy the signal: caches with lots of associative sets, or even fully associative, and random replacement algorithms (again, that leaves a little performance behind). RISC-V has an architectural cycle-accurate counter; removing access to that for VMs and/or just user mode is another step in that direction (I think there’s a move for a standard way to do this).
This is just some of the stuff I’ve been doing – it’s a continual issue.
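For anyone wondering how random replacement "muddies the signal", here is a small Python sketch. It is a toy model under stated assumptions (a single 4-way set, an attacker who touches exactly as many new lines as there are ways, made-up function names), not a description of VRoom!'s caches. Under LRU the attacker's accesses always push the victim's line out, so a later timing probe gives a clean answer; under random replacement the outcome is only probabilistic.

```python
import random
from collections import OrderedDict


def lru_victim_survives(ways):
    """Fill one set with the victim line first, then let an attacker touch `ways` new lines."""
    cache = OrderedDict()
    cache["victim"] = True                    # oldest entry, so first in LRU order
    for i in range(ways - 1):
        cache[f"fill{i}"] = True
    for i in range(ways):
        cache.popitem(last=False)             # evict the least recently used line
        cache[f"attacker{i}"] = True
    return "victim" in cache


def random_victim_survives(ways, rng):
    """Same access pattern, but each fill replaces a uniformly random way."""
    cache = ["victim"] + [f"fill{i}" for i in range(ways - 1)]
    for i in range(ways):
        cache[rng.randrange(ways)] = f"attacker{i}"
    return "victim" in cache


rng = random.Random(0)                        # seeded only to make runs repeatable
trials = 1000
survived = sum(random_victim_survives(4, rng) for _ in range(trials))
print("LRU victim survives:", lru_victim_survives(4))
print(f"random replacement: victim survived {survived} of {trials} trials")
```

In this model the victim line survives each random eviction with probability 3/4, so it outlasts four attacker fills about (3/4)^4, or roughly 32%, of the time. The attacker can no longer be sure a miss means what they think it means.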
Really impressive work!! The blog is fascinating :-)
It’s a good thing he was able to rent time on a VU9P. Just the chip alone is over US$60,000.
And it’s only $1.5 an hour. $60k is probably too high; in reality you can buy boards with VU9P sized chips for ~$5-10k. When the bitcoin bubble bursts they’ll likely be at reasonable prices for mere mortals.
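A quick back-of-the-envelope check on rent versus buy, using only the figures quoted in these comments ($1.5 an hour, boards at roughly $5-10k, and the $60k chip price). The function name is made up, and the sketch ignores power, hosting, and everything else you would pay for when owning a board.

```python
HOURLY_RATE = 1.50   # rental price per hour, as quoted in the comment above


def break_even_hours(purchase_price, hourly_rate=HOURLY_RATE):
    """Hours of rental that cost the same as buying outright."""
    return purchase_price / hourly_rate


for price in (5_000, 10_000, 60_000):
    hours = break_even_hours(price)
    years = hours / 24 / 365
    print(f"${price:,}: {hours:,.0f} hours (~{years:.1f} years nonstop)")
```

At the $60k figure you would need 40,000 rented hours, or about 4.6 years of round-the-clock use, before buying came out ahead.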