The Mill CPU architecture

belt

There are basically two ways to compute data. The first is with a DSP, a chip that performs very specialized functions on a limited set of data. These are very cheap and have amazing performance per watt, but can't do general computation at all. If you'd like to build a general-purpose computer, you'll have to go with a superscalar processor – an x86, PowerPC, or any one of the other really beefy CPU architectures out there. Superscalars are great for general-purpose computing, but their performance per watt and per dollar is abysmal in comparison to a DSP.

A lot of people have looked at this problem and come up with nothing. This may change, though, if [Ivan Godard] of Out-of-the-Box Computing is able to produce The Mill – a ground-up rethink of current CPU architectures.

Unlike DSPs, the superscalar processors you'd find in your desktop have an enormous number of registers, and most of these are rename registers – places where the CPU stores a value temporarily. Combine this with the fact that connecting hundreds of these temporary registers to places where they'll eventually be used eats up about half the power budget in a CPU, and you'll see why DSPs are so much more efficient than the x86 sitting in your laptop.

[Ivan]'s solution to this problem is replacing the registers in a CPU with something called a 'belt' – basically a weird combination of a stack and a shift register. The CPU can take data from any position on the belt, perform an operation, and place the result at the front of the belt. Any data that isn't used simply falls off the end of the belt; this isn't a problem, as most data used in a CPU is used only once.
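As a rough mental model only – the talks stress that the belt is a programmer-visible abstraction, not how the hardware actually moves data – the belt behaves like a fixed-length queue addressed from the newest end. A minimal sketch, with the belt length and method names invented for illustration:

```python
from collections import deque

class Belt:
    """Toy model of the Mill's belt: a fixed-length queue of results.

    Operands are named by position relative to the newest entry
    (position 0), not by register name. This is a conceptual sketch;
    real hardware does not physically shift values around.
    """
    def __init__(self, length=8):
        # Oldest values silently fall off when the belt is full.
        self.slots = deque(maxlen=length)

    def push(self, value):
        # Every operation drops its result at the front of the belt.
        self.slots.appendleft(value)

    def read(self, pos):
        # Any position on the belt can be read as an operand.
        return self.slots[pos]

belt = Belt()
belt.push(3)                             # belt: [3]
belt.push(4)                             # belt: [4, 3]
belt.push(belt.read(0) + belt.read(1))   # an "add": belt is now [7, 4, 3]
```

Note there is no pop: unlike a stack, reading a belt position consumes nothing, and old values disappear only by aging off the end.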

On paper, it’s a vastly more efficient means of general purpose computation. Unfortunately, [Ivan] doesn’t quite have all the patents in for The Mill, so his talks (two available below) are a little compartmentalized. Still, it’s one of the coolest advances in computer architecture in recent memory and something we’d love to see become a real product.

Comments

  1. BartGrantham says:

    Wow. This writeup… wow.

    First off, there are either many ways to compute data if you want to talk architecture, or just one: a Turing machine. How many ways there are to compute is a matter of what level of detail we're discussing.

    Superscalar has nothing to do with CPUs vs. DSPs. There really isn't a hard distinction between the two, and if there were, it wouldn't hinge on superscalar features. There are DSPs with superscalar execution, for example.

    I'll watch the videos, but my kneejerk reaction after reading this writeup is that you've been taken in by a crank with a weird new architecture.

    • Th3BadWolf says:

      This "belt" kinda looks like a bit shift right/left if you want my guess. It's a cool way to implement it and I've never actually seen it in processors, but hey, no need to rename a well-known process =P

      • Alex says:

        The belt is a concept, it is not an implementation. One could perhaps implement it using a large shift register, but that would be inefficient and it is not how it is implemented in the Mill.

        “=P”

  2. Jake Mercer says:

    2 hours of footage! That’s my night sorted! This sounds like it’s gonna be interesting.

  3. Obakemono says:

    Well, our x86 processors don't have hundreds of registers; a modern i7 has 8 architectural registers in 32-bit mode and 16 in 64-bit mode.

    The design of a modern processor is a compromise between the amount/complexity of the functional units and the complexity of the control unit. In the video they say they'll cover decoding in another talk, which I'm looking forward to seeing; so despite this being a neat hack, it surely adds some complexity to the CU.

    It could be interesting to try, but I don't think this is going to be a big improvement – maybe interesting for mobile devices, as it seems to have a nice cost per calculation.

  4. Rob says:

    Holy crap, it’s like a geek version of Willie Nelson! Seriously though, what a fascinating idea!

    • Chris C. says:

      Heh. I was thinking Santa Claus.

      I’m pretty sure I’ve seen the same scheme used for optimizing some kind of virtual machine, at least from the brief description. But I’m not hardcore into cores enough to watch two hours of video! At least now I’ll know who’s responsible if the design ever hits the market.

  5. six677 says:

    There are so many errors in that article. It should say DSP or CPU, not DSP or superscalar. Non-superscalar CPUs are still perfectly capable of running regular software, etc. Superscalar is just a technique which can increase the throughput of a CPU; it is not a given that all general-purpose CPUs are superscalar (however, most modern ones are).

    • David says:

      Agreed, the number of errors per sentence is amazing. For one thing, of course you can do general computation on a DSP. Also, we have heard of stack architectures before. They are slow, which is why all successful CPU architectures have register files. However, cranks love them…

      • Dan says:

        So is this guy’s stuff not revolutionary? Can you elaborate on why/why not?

        • Ren says:

          Well, some EE PhD (candidate) students are required to design and implement a new CPU architecture (or at least implement a working subsystem of that CPU), so there are probably "hundreds" of "revolutionary" designs out there, just sitting on breadboards or prototype silicon (or germanium or gallium arsenide wafers), documented in manila folders and abandoned after the degree is awarded.

          In the past, math PhDs were required to invent a new "math" for their degree.

        • Alex says:

          David doesn’t know what he’s talking about. It’s not a stack. The instruction operands aren’t pre-defined (top of stack and next on stack), they must be specified as positions on the belt. The second video explains it much better than I can. If you’re impatient, skip to 17m or so where he starts explaining the problem with having lots of registers and what they do to solve it.

      • Nicholas says:

        What currently performs better than DSPs and superscalar-type CPUs, respectively, at the kinds of workloads discussed in the article?

      • vpoko says:

        Stack-based architectures are also good for virtual machines, since it's easier to map a virtual stack to a native register set but harder to map a number of virtual registers to a different number of native registers.
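A toy illustration of the point above, with an assumed register-file size and invented names (r0..r7, 8-byte spill slots – not any real ABI): a stack slot's depth directly names a native register, whereas renumbering N virtual registers onto M native ones requires a genuine register allocator.

```python
NATIVE_REGS = 8  # assumed size of the native register file

def stack_slot_to_native(depth):
    """A virtual stack maps trivially: stack depth d lives in native
    register d, and slots deeper than the register file spill to memory.
    All names here are illustrative only."""
    if depth < NATIVE_REGS:
        return f"r{depth}"
    # Spill slot: 8 bytes per value past the register file.
    return f"[sp+{8 * (depth - NATIVE_REGS)}]"
```

Mapping N virtual registers onto M native ones, by contrast, has no such direct rule – it is the graph-coloring-style allocation problem compilers solve.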

  6. Bj says:

    Sounds like another stack based CPU.

    • Munch says:

      Did you watch the video? You’d know that a stack expands and contracts, while the belt simply treads along. It’s closer in concept to a transport-triggered architecture than it is to a stack architecture.

  7. roboman2444 says:

    “so much more efficient than the x86 sitting in your laptop.”
    implying i have a laptop
    implying i use x86 processors

  8. lol says:

    Show me the cheap DSP’s please..
    Just quit writing articles, you are just the worst person for this job..

  9. ataradov says:

    Something similar has been around for years – the counterflow pipeline. It never took off, but on paper it has much better performance as well. The problem with taking data from and placing data on the belt is that when you turn it into an actual chip, it results in a lot of slow logic.

  10. Tom says:

    I don’t know why… but reading this made me angry.

  11. Anne Nonymous says:

    Ah yes, on paper it would be faster. On what workloads? How about running a modern OS? For which compilers?

    Sounds like another pie-eyed RTL designer thinking they know everything.

    The RTL is done! It simulates! everything is perfect! In wonderful simulation land, everything is always perfect. Everything is always fast, everything always meets imaginary timing.

    Sure, I can run at 4 GHz! It passes RTL simulations!

    I’ll skip all the practical problems that this wonderful design has, and leave it in the imaginary happy land where everything always works.

  12. Erik says:

    I feel compelled to set the historical record straight – people have done general-purpose computing on DSPs at least since 1991. They were being used then to control the relatively new-fangled tunnelling electron microscopes. Or for a computational physics application, Google "QCDSP" – the forerunner to QCDOC, later renamed "Blue Gene" by IBM.

  13. Anonymous says:

    Ok, the editor is clearly not qualified to talk about CPU technology. It's ok – usually Hackaday's editors seem very competent, but CPU technology is a bit of a special field, and it's no wonder even Hackaday struggles. Reading this write-up, I kinda felt the same way I usually feel when reading technology or science news in general media.

  14. Galane says:

    This seems awfully similar to how the TI-99/4A handled registers in RAM.

  15. ada says:

    “Thou shalt not pass” gandalf

    From an architecture viewpoint the packing complexity would simply flip the space time relationship.

    From a functional viewpoint, how is re-patenting Transputer algorithms ethical?
    http://en.wikipedia.org/wiki/Transputer

    From a comedic perspective, it's funny seeing the Google crowd going "WTF?" at leaving the details to hardware designers. Patents are dumb, as anything that's truly valuable is stolen anyway…

    DSP chips cannot be classified as a separate architecture these days, and this is especially true for some multicore MIPS machines.

    • Sweeney says:

      Sorry, what they are patenting is nothing like the Transputer. They make no bones about using existing ideas and technologies, but have come up with their own way of implementing things that is patentable (don’t forget that patents are about “a device/method to do X” rather than the basic idea of X).
      The Belt for example is only how programmers see it, under the hood it’s based on a fusion of registers and CAM. When you call a subroutine/function you get a new Belt, containing ONLY the passed parameters. When you return you can return more than one result to the original belt. Any operation that was in mid flight before the call drops its results AFTER the return.
      I’m not sure how it manages things like context switches, but a general problem like the Unix Word Count utility is quoted as running at 25 operations and 1 input character read per clock, so that’s 1.2 GB/sec (if you can feed it data that fast).
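The call behavior Sweeney describes can be sketched in a few lines of Python – each callee sees a fresh belt holding only its arguments, and a return can drop several results onto the front of the caller's belt. This is purely a conceptual model of the description above (the function names and list representation are invented), nothing like the actual hardware:

```python
def mill_call(func, caller_belt, arg_positions):
    """Model a Mill-style call: the callee gets a fresh belt containing
    ONLY the passed arguments; its results land at the front of the
    caller's belt on return. Illustrative sketch only."""
    callee_belt = [caller_belt[p] for p in arg_positions]
    results = func(callee_belt)      # callee never sees caller state
    for r in reversed(results):      # results land at the front, in order
        caller_belt.insert(0, r)
    return caller_belt

def divmod_op(belt):
    # Returning more than one result is fine in this model.
    a, b = belt[0], belt[1]
    return [a // b, a % b]

caller = [17, 5, 99]   # 99 is caller state the callee never sees
caller = mill_call(divmod_op, caller, [0, 1])
# caller is now [3, 2, 17, 5, 99]
```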

  16. Logan says:

    First of all, you all need to chill about the article. In-depth knowledge of CPU design is very rare and even if he spent a few hours researching, he’d still likely come up with errors due to misinformation/dated terminology/general unfamiliarity. Cut him some slack. I wouldn’t envy his position.

    That said, I watched the second video and found it very informative. You all have been a very hypercritical bunch about it, but it looks promising. Not sure it will succeed in becoming much of anything, but from what I can tell it really diverges and brings something new to the table.

    • Amaury says:

      I entirely agree with Logan: most of the comments seem to be from people who just read the misleading article and didn't care to watch the videos. The videos are very interesting, and if you know how a CPU works you can only admit that the guy knows what he is talking about. Sure, not everything is new, and he is the first to admit it, so what's the point of being so dismissive with "yet another XX-like CPU"? You are not being better than the editor here.
      I think it is promising, even though it is nowhere near production, and I'm eager to see the other videos. Wait and see.

  17. J Howe says:

    >Combine this with the fact that connecting hundreds of these temporary registers to places where they’ll eventually be used eats up about half the power budget in a CPU

    I do CPU design for a living. The rename register file is simply a memory, and consumes little power. "Bypassing" registers does consume some power, but the big spenders are often the caches and ALUs.

    Also, power is not the same as energy, and most people nowadays care about battery life – energy. It's generally better for energy to have an out-of-order or otherwise faster core that might take more power than a 'simpler' lower-power architecture that runs longer (as the former can be powered down sooner).

    The presentations I can find focus on the ALU and instruction decode, but frankly these don't matter much; for general-purpose applications the bottleneck is memory latency, not how the ALU etc. work. That's one of the prime reasons why DSPs do better – they can pipeline memory references, which is usually impossible with general code. Out-of-order execution (along with prefetching) cuts some of that latency out, which is why it is in common use.

    Since the 1980s we've known how to scientifically evaluate new architectures – take industry-accepted benchmarks, compile them onto a model of the proposal, and see the performance and energy use. I couldn't find these numbers.

  18. It sounds good.
    I think it's better to let users experiment with this processor architecture via FPGAs, if it's really as good as it is on paper. :-)

  19. Kit Scuzz says:

    Having watched the two videos here are a few of my thoughts:

    1) As with all of these sorts of projects, it will be interesting to see if they have real-world results which match their theoretical results. Here’s to hoping it works out!

    2) Since the belt length isn't fixed, the number of results you can "address" changes depending on your processor, and you'll have to shove data in and out of scratch more or less frequently if you swap your hard drive into a different machine. It seems like it would make more sense to fix the length of the belt at some long but reasonable level for the "general" level they're aiming for, then essentially make the promise to only increase the length. This would make binary compatibility within a specific generation possible, and then you could increase the belt length when you released a new generation if it enabled some great performance increase.

    3) Putting fixed latency instructions into the set is always kind of scary. I don’t like the idea that if someone invents a signed multiply that runs in two cycles it couldn’t be put into newer versions of the processor without breaking old versions of the code (belt positions wouldn’t match). I get that they’re doing it because you either get stalls or limit belt utilization to some sub-optimal level, but it still seems crappy. I would much prefer stalls to having pipeline-latency-aware compilation.

    4) I’m curious what defines the order for belt insertion within a single instructional block (i.e. who wins when I dispatch multiple adds simultaneously). This must be well defined, but I would still like to know. I also did not wholly follow the whole multiple latency register thing, there must be some tagging involved, but it wasn’t explicitly mentioned so I remain in the dark.

    5) It’ll be interesting to see what happens with the presentations on protection, exception handling, memory hierarchy, etc. I wish they had finished their patents so that they could answer more of the questions!

    I’m sure I missed a few questions I had over the course of the nearly three hours I was watching these, but these were the ones which stuck out when I was done.

    • Kit Scuzz says:

      Oh! I remember one of my other comments: toward the end of the decoding section he revealed that they restricted certain types of decode operations to specific decoders. It sounds like this was done to limit the maximum bit size of the instruction set (i.e. instruction 0x1A may mean ADD on one decoder and LD on the other decoder), simplify the decode logic on each specific decoder, and limit the number of interconnects. But I’m curious what this means for instructional blocks: does this mean that you’ll have to interleave the type of operations in each block? If I only perform one type of operation for a given interval does that mean that the other decoder will be idle? Does the compiler have to create an empty instructional block?

      It sounded like the balance was 17 ops in the non-load decoder, so the chances I would have something to the effect of 34 arithmetic operations with 0 loads interleaved is hilariously unlikely, but I’m curious how the case is handled.

      This also means that if something goes wrong and you start feeding the blocks into the wrong decoder the behavior will be incredibly mysterious.

  20. Analog says:

    First of all, despite all the hate this article received earlier, I personally really enjoyed both talks and appreciate this article, which got me thinking – one of the reasons I always come back to HaD. Thanks Brian.

  21. Vasileios Anagnostopoulos says:

    I hope that when it is released it is combined from the start with TI's octa-core DSPs in a power-efficient, performant combination.

  22. wm tanksley says:

    Great article and video. A lot of people here are making some bad assumptions based on the article due to their own lack of knowledge — I’m guessing they didn’t watch the video.

    I agree with the ones who don’t know whether it’ll actually work.

    This actually reminds me of a single core of the GA144 chip (which has 144 cores); it has a hardware stack which is like a belt, except that the ALU only and always reads the top 2 items on the stack (unlike the belt, where you can read any element).


  24. It is not FIFO, it is FILO. That's why it is new.

  25. I enjoyed this … but why no reference to the early computers EDVAC and EDSAC, which used mercury delay lines for storage, and thus had a "belt" (programmers had to "time" their code to pick up the right value as it flew past)?
