Now The V In RISC-V Stands For VRoom

Hundreds of variations of open-source CPUs written in an HDL seem to float around the internet these days (and that’s a great thing). Many are RISC-V, an open-source instruction set architecture (ISA), and most are small toy processors useful for learning and small tasks. However, if you’re [Paul Campbell], you go for a high-end superscalar, out-of-order, speculative, 8 IPC monster of a RISC-V CPU known as VRoom!.

That might seem a bit like word soup to the uninitiated in the processor design world (which is admittedly relatively small), but what makes this different from VexRiscv is the scale and complexity. Rather than executing one instruction at a time sequentially, it executes multiple instructions at once, completing them concurrently in whatever order it can handle. The VexRiscv chip is a good 32-bit modular design that can run Linux, pulling a solid 1.57 DMIPS/MHz with everything turned on. VRoom! already clocks in at a mighty 6.5 DMIPS/MHz, with more performance gains expected. It peaks at eight instructions every clock cycle, with a dual register file and a clever commit system to keep up.

VRoom! is written in System Verilog to leverage Verilator (a handy linting and simulation framework), and while there is some C that generates different files, we’d wager it is pretty run-of-the-mill compared to a TypeScript-based project. VRoom! currently boots Linux thanks to an AWS-FPGA instance (a Xilinx VU9P UltraScale+), though it has to be trimmed to fit. [Paul] has big plans, working his way up to a server-class chip with lots of cores and a huge cache.

It’s all on GitHub under a GPLv3 license; go check it out! [Paul] also has a talk with lots of great details. If you’re interested in getting into RISC-V but a server-class core isn’t your speed, we heard Espressif is starting to use RISC-V cores in their ever-popular ESP series.

24 thoughts on “Now The V In RISC-V Stands For VRoom”

  1. Paul Campbell here – AMA

    A minor nit – I’m currently building in Verilator rather than Icarus (I’ve started using System Verilog Interfaces) – I’d love to still be using Icarus (I’m a big fan of ‘X’ values in simulations for finding bugs – a quick sketch of why below).

    I released some minor changes (noted in the blog) – DMIPS/MHz is up to 6.5. I’m in the middle of architectural tuning, and I expect it to continue to increase.
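    A quick sketch of what those X values buy you (illustrative only, not VRoom! code): a register with a missing reset comes up as X in a 4-state simulator like Icarus, and the X propagates to the output where it’s hard to miss, while a 2-state simulator quietly starts it at zero and hides the bug.

    ```systemverilog
    // Illustrative sketch – hypothetical module, not VRoom! code.
    // 'state' is never reset, so a 4-state simulator starts it at X;
    // the X propagates through the increment and the compare, making
    // the missing reset obvious at 'done'.
    module x_demo (
        input  logic clk,
        input  logic rst_n,
        output logic done
    );
        logic [3:0] state;  // BUG: rst_n is never used

        always_ff @(posedge clk) begin
            // should be: if (!rst_n) state <= '0; else state <= state + 1'b1;
            state <= state + 1'b1;
        end

        assign done = (state == 4'd15);  // stays X until 'state' resolves
    endmodule
    ```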

    1. I quite liked your explanation of why X values are handy for both verification and synthesis. Good luck finding funding/hardware to keep developing this.

      I wonder if there’s any good open-source tooling that splits a design across multiple FPGAs. The stuff I’m familiar with is all proprietary, and professionally I switched to firmware a few years ago.

      1. I think that splitting something across chip boundaries is a tough problem – if you want real speed you likely need to pipeline the die crossings with flops at each end, and to minimise the number of wires. Probably this is something you are going to need to actually build into your architecture.
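        A rough sketch of that flop-at-each-end idea (hypothetical names, nothing from an actual design): a launch flop on the sending die and a capture flop on the receiving die, so the inter-die wire only has to close timing flop-to-flop, at the cost of two cycles of latency.

        ```systemverilog
        // Sketch only: pipelined die-to-die crossing with a flop at each end.
        module die_crossing #(parameter W = 64) (
            input  logic         clk,
            input  logic [W-1:0] tx_data,  // data on the sending die
            output logic [W-1:0] rx_data   // data on the receiving die
        );
            logic [W-1:0] launch_q;   // flop at the edge of die 0
            logic [W-1:0] capture_q;  // flop at the edge of die 1

            always_ff @(posedge clk) launch_q  <= tx_data;   // die 0
            always_ff @(posedge clk) capture_q <= launch_q;  // the wire between
                                                             // these two flops is
                                                             // the die crossing
            assign rx_data = capture_q;                      // die 1
        endmodule
        ```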

      1. The big thing I’m after is arbitrarily sized arrays of structured things – if I could pass arrays of structures I’d be happy. The main problem I’m trying to deal with using interfaces is how to pass N instances of an interface that can be parameterised at compile time (for example, building a system with 6 address units and a 6-read-port TLB for simulation, and a 4 address unit system with 4 TLB read ports for the FPGA testing environment). This design is heavily parameterised so I can do quick architectural exploration, but simple Verilog makes that hard in some areas.
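        For readers, a sketch of the kind of parameterisation being described (hypothetical names, not the VRoom! source) – arrays of interface instances are legal SystemVerilog, but parameterised interfaces in arrays are exactly where tool support tends to get shaky:

        ```systemverilog
        // Sketch: N address units sharing an N-read-port TLB, with N set
        // per build (6 for simulation, 4 for the FPGA environment).
        interface tlb_rd_if #(parameter VA_W = 48, PA_W = 44);
            logic            req;
            logic [VA_W-1:0] va;
            logic            hit;
            logic [PA_W-1:0] pa;
        endinterface

        module tlb #(parameter N_PORTS = 6) (
            input logic clk,
            tlb_rd_if   rd [N_PORTS]  // one read port per address unit
        );
            // lookup logic elided
        endmodule

        module top;
            localparam N = 6;           // 6 in simulation, 4 on the FPGA
            logic clk = 0;
            tlb_rd_if rd_ports [N] (); // array of interface instances
            tlb #(.N_PORTS(N)) u_tlb (.clk(clk), .rd(rd_ports));
        endmodule
        ```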

    2. Thank you! Made a slight tweak to reflect this. There was a ~ in front of 6.5 in your notes, so I didn’t want to misquote you if you were still playing around with it. The Icarus mention was just a mistake on my part, sorry.

      1. No problem – since I’d just written a note about having to give up Icarus I wanted to give credit where it’s due – Icarus is still a great simulator. The ~6.5 (really 6.49) was only announced in yesterday’s blog post – it’s still a moving target – next big change will be a couple of weeks out.

  2. That’s pretty cool! Do you see this going into FPGAs or ASICs more? The picture you have of it inside a Xilinx/AMD FPGA is pretty cool too. It seems pretty performant vs. ARM cores, and I love that it’s open source.

    1. It’s rather large for FPGA use. You’d want a small and efficient core design for an FPGA project, not one that takes up most of a 2.5M-LE Virtex (in cut-down form, for testing).

    1. Out-of-order execution and speculative execution don’t necessarily imply each other.

      Out-of-order execution is about executing instructions that have all their data available.

      Speculative execution, meanwhile, runs instructions before one has determined whether they should run at all, typically in regard to branches and branch prediction.

      Though a lot of the issues with speculative execution aren’t about executing a branch that shouldn’t have been taken, but rather the fact that a lot of CPUs just didn’t check whether the thread was allowed to read what it asked for in the first place in this edge-case scenario, likely for overall performance reasons.

      1. Ok, and? That’s not relevant to this discussion. We know that VRoom has speculative execution. It’s stated in the third sentence of this article, and in the side blurb on the VRoom website.

    2. The security problems were with speculative fetches not checking for access permissions before the fetch. Speculative execution isn’t bad per se; you just need to ensure that security is observed in the correct sequence as well.

      1. There’s more to it than just not speculating past privilege (though that’s important) – you can potentially leak information by doing clock-level timing of how long things took, and therefore discover whether or not something is in a cache, or whether a test in another privilege level succeeded or not (did it hit in the BTC?). That leaks a bit of data (it’s why VRoom! spends the gates to have multiple BTCs, one for each priv mode).
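        A sketch of the multiple-BTC idea (hypothetical structure, not the actual VRoom! design): one independent copy of the branch target cache per privilege mode, so a branch trained at one privilege level can’t be probed for hits from another. The update path is elided for brevity.

        ```systemverilog
        // Sketch: per-privilege-mode branch target caches.
        module btc #(parameter ENTRIES = 512, TAG_W = 20, TGT_W = 48) (
            input  logic                       clk,
            input  logic [1:0]                 priv,   // RISC-V mode: U=0, S=1, M=3
            input  logic [$clog2(ENTRIES)-1:0] index,  // hashed PC bits
            input  logic [TAG_W-1:0]           tag,
            output logic                       hit,
            output logic [TGT_W-1:0]           target
        );
            typedef struct packed {
                logic             valid;
                logic [TAG_W-1:0] tag;
                logic [TGT_W-1:0] target;
            } entry_t;

            // an independent table per privilege mode: extra gates, but one
            // mode's branch history can't be timed from another
            entry_t tables [4][ENTRIES];

            always_ff @(posedge clk) begin
                hit    <= tables[priv][index].valid &&
                          (tables[priv][index].tag == tag);
                target <= tables[priv][index].target;
            end
        endmodule
        ```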

    3. There’s actually a slide on that in the architectural talk – it’s particularly important if you’re aiming for something really big running VMs.

      It’s a REALLY hard problem, but I’ve been able to learn from others’ mistakes – for example, we won’t speculate past a TLB miss (or fault) and fill a cache line. Those things tend to leave performance on the floor. The other thing I’ve been doing is trying to muddy the signal – caches with lots of associative sets, or even fully associative, and random replacement algorithms (again, that leaves a little performance behind – sketched below). RISC-V has an architectural cycle-accurate counter; removing access to that for VMs and/or just user mode is another step in that direction (I think there’s a move for a standard way to do this).

      These are just some of the things I’ve been doing – it’s a continual issue.
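      A sketch of the random-replacement idea mentioned above (an illustration, not the VRoom! implementation): picking the victim way with a small LFSR instead of LRU gives up a little hit rate, but gives an attacker far less control over which lines get evicted.

      ```systemverilog
      // Sketch: choose a cache victim way pseudo-randomly with a 16-bit
      // maximal-length Fibonacci LFSR (taps 16, 15, 13, 4).
      module random_victim #(parameter WAYS = 8) (
          input  logic                    clk,
          input  logic                    rst_n,
          input  logic                    alloc,      // a fill needs a victim
          output logic [$clog2(WAYS)-1:0] victim_way
      );
          logic [15:0] lfsr;

          always_ff @(posedge clk) begin
              if (!rst_n)
                  lfsr <= 16'hACE1;  // any non-zero seed
              else if (alloc)
                  lfsr <= {lfsr[14:0],
                           lfsr[15] ^ lfsr[14] ^ lfsr[12] ^ lfsr[3]};
          end

          assign victim_way = lfsr[$clog2(WAYS)-1:0];
      endmodule
      ```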
