Interview: New Mill CPU Architecture Explanation For Humans

Hackaday had an amazing opportunity to sit down with [Ivan Godard], who discussed the Mill CPU which his company — Out of the Box Computing — has been developing for about a decade. The driving force behind Mill development is that optimizations to existing architectures can only get you so far. At some point you need to come up with a new processor that builds on the successes and failures of its predecessors.

[Ivan]’s team has put out several lecture videos, linked from their site, that dig really deep into the inner workings that give the Mill an advantage over currently available chips. We covered one of them recently, which prompted [Ivan] to reach out to us. But what if you aren’t working on your advanced degree in semiconductor design? Our interview certainly isn’t for the layman, but any engineering enthusiast should find this a refreshing and delightful conversation. After the jump you can see the first two installments of the four-part interview.

If you don’t want to take the plunge on watching the whole thing, start with this greatest hits clip from the interview. It’s around eight minutes and covers three questions. The first is a discussion on how Mill uses and budgets power in a way that mimics DSP chips but still offers the versatility of a traditional processor. The next snippet is a discussion of the usefulness of Mill for things like cryptanalysis. [Ivan] explains that when the crypto algorithm is known, an ASIC will outperform Mill. When tasked with testing to discover patterns which may be used for cracking crypto, though, Mill performs remarkably well. And the final segment of the “greatest hits” reel covers a discussion of the automated build tools the team developed which will dynamically add operations to the Mill. Within about twenty minutes you can see that operation in action via the simulator.

Don’t stop with this tasty teaser or you’ll miss an epic discussion. In fact, the second clip of the four-part interview (don’t worry, they’re all pretty short) includes one of the greatest quotes we’ve heard. [Ivan] is discussing the difference between a software startup and a company that is designing a new processor architecture. He jokes that software companies can be founded with a couple of “hot” guys and a wad of cash, but: “Heavy semi doesn’t work like that. Heavy semi is like steel mills and railroads.” Classic!

Work your way through these interviews today. We’ll be posting a follow-up with the other two parts tomorrow.

48 thoughts on “Interview: New Mill CPU Architecture Explanation For Humans”

  1. — Copied from the previous coverage http://hackaday.com/2013/08/02/the-mill-cpu-architecture/#comment-1035542
    Having watched the two videos (and one of the interviews), here are a few of my thoughts:

    1) As with all of these sorts of projects, it will be interesting to see if they have real-world results which match their theoretical results. Here’s to hoping it works out!

    2) Since the belt length isn’t fixed, the number of results you can “address” changes depending on your processor, and you’ll have to shove data in and out of scratch more or less frequently if you swap your hard drive into a different machine. It seems like it would make more sense to fix the length of the belt at some long but reasonable level for the “general” level they’re aiming for, then essentially make the promise to only increase the length. This would make binary compatibility within a specific generation possible, and then you could increase the belt length when you released a new generation if it enabled some great performance increase.

    3) Putting fixed latency instructions into the set is always kind of scary. I don’t like the idea that if someone invents a signed multiply that runs in one cycle it couldn’t be put into newer versions of the processor without breaking old versions of the code (belt positions wouldn’t match). I get that they’re doing it because you either get stalls or limit belt utilization to some sub-optimal level, but it still seems crappy. I would much prefer stalls to having pipeline-latency-aware compilation.

    4) Toward the end of the decoding section he revealed that they restricted certain types of decode operations to specific decoders. It sounds like this was done to limit the maximum bit size of the instruction set (i.e. instruction 0x1A may mean ADD on one decoder and LD on the other decoder), simplify the decode logic on each specific decoder, and limit the number of interconnects. But I’m curious what this means for instructional blocks: does this mean that you’ll have to interleave the type of operations in each block? If I only perform one type of operation for a given interval does that mean that the other decoder will be idle? Does the compiler have to create an empty instructional block?

    It sounded like the balance was 17 ops in the non-load decoder, so the chance that I would have something to the effect of 34 arithmetic operations with 0 loads interleaved is hilariously small, but I’m curious how the case is handled.

    This also means that if something goes wrong and you start feeding the blocks into the wrong decoder the behavior will be incredibly mysterious.

    5) I’m curious what defines the order for belt insertion within a single instructional block (i.e. who wins when I dispatch multiple adds simultaneously). This must be well defined, but I would still like to know. I also did not wholly follow the whole multiple-latency register thing; there must be some tagging involved, but I didn’t hear it mentioned.

    I’m sure I missed a few questions I had over the course of the nearly three hours I was watching these, but these were the ones which stuck out when I was done.

    I wish they had finished their patents so that they could answer more of the questions!

    1. Responses by question number:

      1) “Real world” is a long time and many bucks for a CPU. In the meantime, all we can do is to explain how it works and let the reader apply his design and implementation knowledge to figure out the benefit, from first principles rather than from running systems. We ourselves did the same as the design evolved. Many skilled CPU people have been through the Mill, and we’ve come to recognize the response – at first doubt, followed by several “Oh!”s and occasionally a burst of giggles, and eventually “It’s sure different – but you know, it will work, and I could build it!”

      2) Mill code is portable across family members at the load module level but not at the bit-encoding level. When a program is moved, if the bits for the new platform are not already in the load module’s cache, then the install step invokes the “Specializer” system software to generate the bit encoding, which is added to the load module cache. We did not invent this system – it’s been used in the IBM AS/400 family, and others, for a long time and works well. Part of the job the specializer does is to deal with belt length variations.

      While ordinary programs port transparently, any program (such as a JIT) that generates binary must know what family member is the target. We provide a member-independent interface to the specializer that a JIT can use, which removes this dependency, but programs that do depend on the exact platform must know the host. One can see this as a security feature – code-mangling exploits that penetrate the Mill security will only work on a single Mill family member. Mill security will be covered in a future talk in the series; sign up for announcements of talks at ootbcomp.com/mailing-list.

      3) The specializer knows the latencies when it generates binary for the current target member. If that member has a one-cycle signed multiply then the code will use it; there will not be any delays. There’s quite a bit of latency variation over even the family members we have configured so far. Indeed, whether an operation even exists in native hardware at all is member-dependent. For example, the Tin low-end member has no floating point hardware. The specializer encodes floating-point operations from the member-independent representation as calls to emulation software (an in-line call, not a trap).

      4) The two halves of an instruction are disjoint in memory; all the A halves are contiguous, and so are all the B halves, just someplace else; there is no interleave.

      If the decoders get lost then the bit stream will make no sense and an invalidInstruction fault is in your near future. The same is true when decoders get lost on any machine – on an x86 for example, try coding a jump into the middle of an instruction (not on the correct instruction phase boundary) and see what happens :-)

      The Mill ISA assigns opcodes to the two sides so as to roughly balance the entropy load (number of bits consumed) between the two sides; this optimizes cache traffic. Operations on the “Flow” side include the ones that are physically bigger in the encoding – loads, stores, and LEAs that have to carry the bits of an address mode; flow of control ops with offsets; and the “con” operation that provides big literal constants. Because the individual operations are bigger than things like adds, the Flow side has a smaller part of the opcode space and a smaller typical number of operations per instruction. The Gold member mentioned in the encoding video (ootbcomp.com/docs/encoding) has slots for eight operations on the Flow side and 25 on the exu side, but the bit load is roughly equal in typical code.

      You can ask technical questions at contact@ootbcomp.com. Also, there’s been quite a bit of talk about the Mill on the comp.arch newsgroup; lurking (or contributing) there may be rewarding.

      Ivan
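
To make the install-time specialization described in responses 2) and 3) above a bit more concrete, here is a minimal sketch in Python. The member names Tin and Gold come from the talks; every other name, type, and function is a hypothetical illustration, not the actual Mill toolchain or Specializer API. The point is only that belt length, operation latencies, and the presence or absence of native hardware are per-member parameters, and that operations a member lacks get lowered to in-line emulation calls.

```python
# Hypothetical sketch of "specialize on install"; not the actual Mill toolchain.
from dataclasses import dataclass, field

@dataclass
class Member:
    """Per-family-member configuration the specializer consults."""
    name: str
    belt_length: int   # short on a low-end member, longer on a big one; a real
                       # specializer would also rewrite belt references to fit this
    latencies: dict    # op name -> cycles, for natively supported ops only

@dataclass
class LoadModule:
    """Member-independent program plus a cache of member-specific encodings."""
    generic_ops: list                              # abstract, member-independent ops
    encodings: dict = field(default_factory=dict)  # member name -> specialized code

def specialize(module: LoadModule, member: Member) -> list:
    """Return (and cache) code specialized for one family member."""
    if member.name in module.encodings:            # already installed for this member
        return module.encodings[member.name]
    code = []
    for op in module.generic_ops:
        if op in member.latencies:
            # Native hardware exists: schedule with this member's real latency,
            # so the static schedule (and belt positions) match this member.
            code.append((op, member.latencies[op]))
        else:
            # No native hardware (e.g. floating point on a low-end member):
            # emit an in-line call to emulation software instead of trapping.
            code.append(("call_emul_" + op, None))
    module.encodings[member.name] = code
    return code

# Usage: the same load module installs on two hypothetical members.
tin  = Member("Tin",  belt_length=8,  latencies={"add": 1, "mul": 3})
gold = Member("Gold", belt_length=32, latencies={"add": 1, "mul": 3, "mulf": 4})
prog = LoadModule(generic_ops=["add", "mul", "mulf"])
print(specialize(prog, tin))    # mulf becomes an emulation call on Tin
print(specialize(prog, gold))   # mulf uses the native 4-cycle multiply on Gold
```

In this toy model the same load module installs cleanly on both members; only the cached per-member encoding differs, which is the portability property the reply describes.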

  2. The system he describes sounds a bit like the ARC processor approach from 10 years ago: the ability to specify custom instructions and automatically generate the software tools and hardware system. Although the CPU architecture is likely rather different for the Mill CPU.

    1. Sounds like you’re referring to the ARC cores from Synopsys, correct? (Located here: http://www.synopsys.com/IP/ProcessorIP/Pages/default.aspx) These are interesting for embedded applications with some dedicated DSP capability on-chip. I believe these are scalar architectures with some added DSP functions intended for things like accessing two operands at a time (their so-called XY architecture), single and dual channel memory access, address generators, and some dedicated DSP instructions. Still, I wonder how these differ (and I’ll need to read up on them more before I comment) from what’s being proposed here? Seems to me that they have really taken the approach of having DSP features ‘sitting shotgun’ rather than implementing DSP-like capability across the whole fetch & execute cycle for all instructions. Still, they are suggesting some pretty impressive power-performance for embedded applications. I wonder how that scales across general purpose computing? I’ll need to dig a bit deeper.

  3. Wow, another CPU design that “Works in RTL simulations” and has good performance in “RTL simulations”. I’ve got news for you. ANYTHING works in RTL.

    Want to multiply in 1 clock cycle at 100 GHz? Guess what, it works in RTL!

    Want to decode 10 levels of logic in RTL at 100 GHz? Well, that works too!

    Let’s see what performance this gets at, say, 65 nm, a commonly available process node that’s not too expensive to manufacture in.

    Let me know when it has a compiler, a debugger, an operating system and other things any modern cpu needs. Let alone a memory interface, DMA, IRQ, and other little details.

    1. Not sure I follow. Firstly – he didn’t mention RTL sim, he mentioned sim. I won’t speak for Ivan but that likely includes a host of different levels of modeling – including RTL but also behavioral and I’d suspect mathematical modeling. And this is a completely – and without-a-doubt 100% necessary – step in developing anything for silicon. Yes, you can create a lot of stuff that doesn’t transfer to the real world, just as you can 3D print something you can’t injection mold or CNC, but to discount ‘things working in simulation’ is hasty… As someone who’s done a fair bit of simulation, it’s an absolutely necessary step in securing funding to build a device. How else would you propose raising the sorts of $$$ needed to bootstrap a semiconductor startup? Drawing pictures? PPT Slides? Microarchitecture diagrams? The deeper technical overview (and yes, I watched it patiently and admittedly skeptically) does a good job of dispelling my fears that this was “Just another gaggle of guys touting the next big CPU”.

      One thing I sensed in the discussion with both gentlemen I met with was just how deep their knowledge of this area was and how pragmatic they were. Someone commented on how long they’ve been working on the concept and that’s fair. But I’d suspect that’s a reflection as much of picking something up and putting it down, then up, then down… as anything else. ‘Designing a CPU is not building a weekend web startup’ was the line (if you don’t mind my paraphrasing a bit). Especially when you have to factor in that so much legacy code has to run on that CPU. One comment the guys made during the interview was that every ounce of code they test has to come from the world as it is / was. You shouldn’t expect C programmers to change their coding style when adopting this device any more than you would expect them to do so moving between x86 and DSPs.

          1. I’ll concede – like any “futurist” – he’s made a bunch of predictions that didn’t necessarily come true (and some downright weird), but his work on OCR, TTS, music, AI, and a huge range of other things is all worthy of praise (enough to get him hired by Google!). And his list of books, awards, grants, etc. is downright impressive.

      1. Patents are the first step in locking away knowledge and hampering human advancement. If you care more about technology than ego or money, open up everything and let the whole world move forward.

  4. Will Parts 3 & 4 talk about the technology itself? So far there’s only been discussion about the processor market. I was hoping for a technical overview a little shorter and more concise than the multi-hour presentations they gave at Google.

    They don’t have it running in an FPGA yet? If so, I think any discussion is probably premature. Parallax has FPGA images of the still-in-development “Propeller 2” available to anyone brave enough to try it, providing a real test environment for people to get started playing with the system. Granted, Chip Gracey is an amazingly talented guy, but I think that if this Mill CPU idea has merit they should be able to get it running on an FPGA similarly instead of spending time talking about it.

      1. It’d prove the thing actually works, though. I’d be surprised nowadays if people made full-custom silicon without testing it in an FPGA first.

        Often chip makers give out demo hardware that only works at a fraction of the speed to potential users; games consoles are one big example of that. Especially if you know exactly what fraction of the design speed your kludge is working at, you can test lots of possible uses and benchmark them proportionally.

      2. While testing in an FPGA (or things like the Palladium emulation system from Cadence) doesn’t allow direct power measurements (to extract µW/MHz), it will demonstrate a working implementation (i.e. that it is possible to build an entire processor) and allow extraction of performance figures in an ASIC-like setup (“MIPS/MHz”). (While it’s possible to run real code in simulation, something as simple as booting an OS will take days to complete, so faster platforms, such as FPGAs, are ideal for testing larger code runs.)
        Approximate power analysis can be done post-layout using real cell libraries, but it really needs an ASIC to be made, which, if they’ve been working on this for 10 years, should have been possible by now.

    1. They’ve held several lectures, including some at Google, IBM, and TI, regarding more technical aspects of the Mill architecture. If I recall correctly, one of the presentations at Google lasted about an hour and covered primarily ‘the belt’, as they call it, but also several different functions. A quick YouTube search turned up this link: http://www.youtube.com/watch?v=QGw-cy0ylCc. There were several other such talks.

  5. I recall quite some time ago spending hours reading over and watching lectures on the Mill architecture. It’s really quite fascinating stuff, and while ‘the belt’ isn’t entirely new/original, it still seems to have its merits. Definitely worth the time.

  6. “The first is a discussion on how Mill uses and budgets power in a way that mimics DSP chips but still offers the versatility of a traditional processor.”

    Reminds me of Texas Instruments DSPs; the C6000 family is pretty good at “traditional processor oriented” tasks.

    Looks like a really interesting topic, I’ll have to make room to watch the videos.

  7. Decades ago there were postulates of someday using the whole wafer for system-on-wafer. SoC arguably devolves from those discussions. FPGA as “routing” between processors, I/O, and memory might leverage these evolving architectures by empowering swifter iterations of trials.

    Build a reconfigurable test bed on a wafer so untried designs like this can be wrung out quicker.

    NOTHING substitutes for running on live silicon.

  8. Oh god, TriMedia… it was awful. I knew people that worked there and they would agree: VLIW was nearly impossible to program, and writing the compilers was virtually impossible. I can’t imagine they are trying this again; I would take EVERYTHING said here with a grain of salt.
