Weird Processing Unit Only Has 4 Instructions

[Tomáš], a.k.a. [Frooxius], is playing around with computational theory and processor architectures (a strange hobby in itself, we know) and has created the strangest CPU we’ve ever seen described.

The Weird Processing Unit, or WPU, isn’t designed like the Intel or ARM CPU in your laptop or phone. Instead, the WPU is a thought experiment in computer design that falls somewhere between being weird for the sake of being weird and throwing stuff at the wall to see what sticks.

The WPU has only four instructions, called attoinstructions. Three of them change the state of one of the computer’s 64 wires (set it to logical 1, set it to logical 0, or invert its current state), and the fourth halts. The operation is encoded in two bits and the operand (i.e. which of the 64 wires to act on) in another six bits, so each attoinstruction fits in eight bits.
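
To make the encoding concrete, here is a minimal Python sketch of an 8-bit attoinstruction and its effect on the 64 wires. The article only gives the field widths (two bits of operation, six bits of wire index); putting the opcode in the top two bits, the numeric opcode assignments, and the names `Op`, `encode`, `decode`, and `run` are all our own assumptions. The sketch also steps straight through a list of instructions, ignoring the program counter and jump bit that the real WPU keeps on its buses.

```python
from enum import IntEnum


class Op(IntEnum):
    # Assumed opcode numbering; the article does not specify it.
    SET = 0     # set the wire to logical 1
    CLEAR = 1   # set the wire to logical 0
    TOGGLE = 2  # invert the wire's current state
    HALT = 3    # stop execution


def encode(op: Op, wire: int) -> int:
    """Pack an operation and a 6-bit wire index into one byte (opcode in the top two bits, by assumption)."""
    if not 0 <= wire < 64:
        raise ValueError("wire index must fit in 6 bits")
    return (int(op) << 6) | wire


def decode(byte: int) -> tuple[Op, int]:
    """Split a byte back into (operation, wire index)."""
    return Op((byte >> 6) & 0b11), byte & 0b111111


def run(program: list[int], state: int = 0) -> int:
    """Apply attoinstructions in order to a 64-bit wire state until HALT or the list ends."""
    for byte in program:
        op, wire = decode(byte)
        mask = 1 << wire
        if op is Op.HALT:
            break
        elif op is Op.SET:
            state |= mask
        elif op is Op.CLEAR:
            state &= ~mask
        else:  # Op.TOGGLE
            state ^= mask
    return state & ((1 << 64) - 1)


prog = [encode(Op.SET, 3), encode(Op.TOGGLE, 5), encode(Op.CLEAR, 3),
        encode(Op.HALT, 0), encode(Op.SET, 7)]
print(run(prog))  # 32: only wire 5 is left high; SET 7 comes after HALT
```

Note that even "halt" costs a full byte here, with its six operand bits unused.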

These 64 wires are divided into several buses: eight-bit address and control buses make up the lowest 16 bits, a 32-bit data bus works much like a register, and a 16-bit ‘Quick aJump bus’ provides the program counter and attocode memory. The highest bit on the WPU is a ‘jump bit’, used for unconditional jumps in code.
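
One way to picture the bus split is as bit fields of a single 64-bit word. The widths below come from the article; the order of the fields (address lowest, then control, data, and Quick aJump) is an assumption, as are the helper names `read_bus` and `write_bus`. Because the four buses already add up to 64 bits, the jump bit has to coincide with the top wire of whichever bus sits highest; in this assumed layout that is bit 63, inside the Quick aJump bus.

```python
# Assumed layout: (lowest bit, width) for each bus. The widths come from
# the article; the ordering is a guess for illustration only.
BUSES = {
    "address": (0, 8),
    "control": (8, 8),
    "data": (16, 32),
    "quick_ajump": (48, 16),
}
JUMP_BIT = 63  # "the highest bit on the WPU"

assert sum(width for _, width in BUSES.values()) == 64


def read_bus(state: int, name: str) -> int:
    """Extract one bus's value from the 64-bit wire state."""
    lsb, width = BUSES[name]
    return (state >> lsb) & ((1 << width) - 1)


def write_bus(state: int, name: str, value: int) -> int:
    """Return a copy of the wire state with one bus overwritten by `value`."""
    lsb, width = BUSES[name]
    mask = ((1 << width) - 1) << lsb
    return (state & ~mask) | ((value << lsb) & mask)


state = write_bus(0, "data", 0xDEADBEEF)
state = write_bus(state, "quick_ajump", 0x8000)
print(hex(read_bus(state, "data")), (state >> JUMP_BIT) & 1)  # 0xdeadbeef 1
```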

We’re not even sure the WPU can be considered a computer. We realize, though, that’s probably not the point: [Tomáš] created the WPU simply to do something out of the ordinary. It’s not meant to be a real, or even useful, CPU; it’s a thought experiment to see what is possible by twiddling bits around.

Tip o’ the hat to [Adam] for sending this one in.

34 thoughts on “Weird Processing Unit Only Has 4 Instructions”

  1. I still don’t have my head around this concept, but wouldn’t a four-instruction processor have an advantage over a one-instruction processor? An advantage in programming (fewer jumps? more control?), yet not as, oh, how do I say it, confusing as CISC?

    1. Not really, as it depends on what the instruction(s) do(es). As one example, a three-address instruction set with three instructions can be relatively easy to program compared to this four-instruction machine.

      (operation/memory width is 1 bit)
      NAND dst, src1, src2
      mem[dst]=mem[src1] nand mem[src2]

      SKIP0 src1
      if mem[src1]=0 skip next instruction

      BRA target
      branch to target address

      Even if the whole system state apart from the program itself is unknown at start-up, it is pretty easy to program as a general-purpose machine.
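
      The three-instruction machine above is small enough to interpret in a few lines. This Python sketch is our own: memory cells are named by strings rather than numbers, branch targets are program indices, and execution stops when the program counter runs past the last instruction, since the listed instruction set has no halt. The sample program `MUX` (a hypothetical example) uses SKIP0 and BRA to copy `b` to `out` when `a` is 1 and `c` otherwise. It never reads `t` or `out` before writing them, so it works whatever those cells held at start-up.

```python
def run_nand_machine(program, mem, max_steps=10_000):
    """Interpret NAND/SKIP0/BRA on 1-bit memory cells until the PC leaves the program."""
    pc = 0
    for _ in range(max_steps):
        if pc >= len(program):
            return mem
        op, *args = program[pc]
        if op == "NAND":
            dst, src1, src2 = args
            mem[dst] = 1 - (mem[src1] & mem[src2])
            pc += 1
        elif op == "SKIP0":
            (src,) = args
            pc += 2 if mem[src] == 0 else 1  # skip the next instruction if the bit is 0
        elif op == "BRA":
            (target,) = args
            pc = target
        else:
            raise ValueError(f"unknown instruction {op!r}")
    raise RuntimeError("step limit reached")


# out = b if a else c, using control flow rather than logic.
MUX = [
    ("SKIP0", "a"),          # 0: a == 0 -> skip the branch
    ("BRA", 5),              # 1: a == 1 -> go copy b
    ("NAND", "t", "c", "c"), # 2: t = not c
    ("NAND", "out", "t", "t"),  # 3: out = c
    ("BRA", 7),              # 4: jump past the end (done)
    ("NAND", "t", "b", "b"), # 5: t = not b
    ("NAND", "out", "t", "t"),  # 6: out = b
]

print(run_nand_machine(MUX, {"a": 1, "b": 0, "c": 1, "t": 1, "out": 1})["out"])  # 0
```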

      Not: NAND dst, src, src

      And: NAND tmp, src1, src2
      NAND dst, tmp, tmp

      Or: NAND tmp1, src1, src1
      NAND tmp2, src2, src2
      NAND dst, tmp1, tmp2

      Xor: NAND tmp1, src1, src2
      NAND tmp2, src1, tmp1
      NAND tmp1, src2, tmp1
      NAND dst, tmp2, tmp1
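
      The gate sequences can be checked by brute force. This sketch (our own, with the hypothetical helpers `nand_seq` and `truth_table`) executes each straight-line NAND sequence for all four input combinations and returns the `dst` column of the truth table; the cell names mirror the listing, except that NOT reads `src1`. The in-place XOR step `NAND tmp1, src2, tmp1` is safe because both sources are read before the destination is written.

```python
def nand_seq(seq, mem):
    """Run straight-line (dst, src1, src2) NAND instructions on 1-bit cells in `mem`."""
    for dst, s1, s2 in seq:
        mem[dst] = 1 - (mem[s1] & mem[s2])  # both sources read before dst is written
    return mem


NOT = [("dst", "src1", "src1")]
AND = [("tmp", "src1", "src2"), ("dst", "tmp", "tmp")]
OR = [("tmp1", "src1", "src1"), ("tmp2", "src2", "src2"), ("dst", "tmp1", "tmp2")]
XOR = [("tmp1", "src1", "src2"), ("tmp2", "src1", "tmp1"),
       ("tmp1", "src2", "tmp1"), ("dst", "tmp2", "tmp1")]


def truth_table(seq):
    """dst for inputs (src1, src2) = (0,0), (0,1), (1,0), (1,1)."""
    table = []
    for a in (0, 1):
        for b in (0, 1):
            mem = {"src1": a, "src2": b, "tmp": 0, "tmp1": 0, "tmp2": 0, "dst": 0}
            table.append(nand_seq(seq, mem)["dst"])
    return table


for name, seq in [("NOT", NOT), ("AND", AND), ("OR", OR), ("XOR", XOR)]:
    print(name, truth_table(seq))
# NOT [1, 1, 0, 0]
# AND [0, 0, 0, 1]
# OR [0, 1, 1, 1]
# XOR [0, 1, 1, 0]
```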

      Some PLC systems in the real world have been just a bit more complicated.

    1. It’s not so much about the number of instructions as about the architecture as a whole. The description of the architecture here is somewhat taken out of context (it doesn’t mention other important parts).

  2. If you can code an AND-or-Invert function with it, it’s Turing complete. If you can’t, it isn’t.

    A quick scan of the documentation doesn’t show anything that looks like a conditional, though there’s some funkiness in the interaction between the address and control buses that might do the job. If ‘look at the data and decide what to do’ has been abstracted that far away from anything immediately visible, coding an adder or multiplier will be Fun in the Dwarf Fortress sense of the term.
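
    Strictly, building an AND-OR-Invert gate shows functional completeness for combinational logic; a complete machine also needs state and some form of conditional control flow, which is what the comment goes on to look for. The combinational half is easy to demonstrate: with its OR input tied to 0, a two-input AOI (called `aoi21` here, our name) is a NAND, and an adder follows from there. This Python sketch builds a ripple-carry adder out of nothing but `aoi21` calls; it is an illustration, not anything from the WPU documentation.

```python
def aoi21(a: int, b: int, c: int) -> int:
    """AND-OR-Invert on single bits: NOT((a AND b) OR c)."""
    return 1 - ((a & b) | c)


# Tying the OR input to constant 0 turns AOI21 into NAND,
# which is enough to build every other Boolean function.
def nand(a: int, b: int) -> int:
    return aoi21(a, b, 0)


def inv(a: int) -> int:
    return aoi21(0, 0, a)


def xor(a: int, b: int) -> int:
    t = nand(a, b)
    return nand(nand(a, t), nand(b, t))


def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """One-bit full adder; every gate underneath is an aoi21 call."""
    p = xor(a, b)
    s = xor(p, cin)
    # carry = (a AND b) OR (cin AND p); AOI computes its inverse directly.
    cout = inv(aoi21(a, b, inv(nand(cin, p))))
    return s, cout


def add(x: int, y: int, width: int = 8) -> tuple[int, int]:
    """Ripple-carry addition of two unsigned `width`-bit numbers -> (sum, carry out)."""
    carry, result = 0, 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry


print(add(23, 19))    # (42, 0)
print(add(200, 100))  # (44, 1): 300 overflows 8 bits
```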

    1. Sounds vaguely like Marvin Minsky’s cellular automaton Turing machine. IIRC the later version of that had something like 7 states.

      You only need about four concepts to get Turing completeness: substitution (if you see pattern A replace it with pattern B), representation (pattern A and pattern B are ‘equal’ in the sense that you can replace either one with the other), scope (rule R only applies within defined limits), and persistence (very loosely, the ability to put information somewhere, forget it, then get it back again later).

      If the substitution, representation, and scope rules are stored in memory the machine can write to, you have a Turing machine.
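
      The “substitution” ingredient is easiest to see in a string-rewriting system in the style of a Markov algorithm, a model known to be Turing complete. The sketch below is our own illustration, not Minsky’s machine: `rewrite` applies the first matching rule at its leftmost occurrence until no rule matches, and the two hypothetical rules in `UNARY_ADD` add unary numbers by shuffling the `+` to the right and then deleting it.

```python
def rewrite(s: str, rules: list[tuple[str, str]], max_steps: int = 10_000) -> str:
    """Apply the first rule whose pattern occurs in s (leftmost occurrence) until none match."""
    for _ in range(max_steps):
        for pattern, replacement in rules:
            if pattern in s:
                s = s.replace(pattern, replacement, 1)
                break
        else:
            return s  # no rule matched: s is in normal form
    raise RuntimeError("no normal form within the step limit")


# Unary addition: "111+11" means 3 + 2.
UNARY_ADD = [
    ("+1", "1+"),  # move a 1 from the right operand across the plus sign
    ("+", ""),     # once nothing follows the plus sign, drop it
]

print(rewrite("111+11", UNARY_ADD))  # 11111
```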

  3. When I started working in industrial automation more than 20 years ago, my company couldn’t use commercial PLCs because they were too slow for their line of business (ceramic tiles), so the machines were “programmed” with 74XX chips.
    Then they developed an in-house PLC, which had a single one-bit register and 5 instructions (load, store, and, or, jump). The “programmer” was a suitcase with a keypad to key in those instructions, a single-line display, and a ZIF socket to program the EPROM. Oh, and the program memory was limited to 1024 instructions.
    This is just to say that this CPU design isn’t as weird as it seems (even if I wouldn’t wish having to use it on my worst enemy).
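
    A PLC like the in-house one described above can be modeled in a few lines. Everything here beyond the five instruction names is an assumption: operands name I/O bits, `JMP` is unconditional, and a jump back to address 0 marks the end of one scan (`scan` and `PROGRAM` are hypothetical names). With no conditional jump, the program is a fixed loop that recomputes every output on each pass, so decisions are written as logic rather than branches.

```python
def scan(program, io, max_steps=10_000):
    """Run one PLC scan with a 1-bit accumulator, from address 0 until a JMP back to 0."""
    acc, pc = 0, 0
    for _ in range(max_steps):
        op, arg = program[pc]
        if op == "LD":
            acc = io[arg]
        elif op == "ST":
            io[arg] = acc
        elif op == "AND":
            acc &= io[arg]
        elif op == "OR":
            acc |= io[arg]
        elif op == "JMP":
            if arg == 0:
                return io  # back to the top: this scan is finished
            pc = arg
            continue
        else:
            raise ValueError(f"unknown instruction {op!r}")
        pc += 1
    raise RuntimeError("scan did not return to address 0")


# out = (in0 AND in1) OR in2, recomputed on every scan.
PROGRAM = [("LD", "in0"), ("AND", "in1"), ("OR", "in2"), ("ST", "out"), ("JMP", 0)]

print(scan(PROGRAM, {"in0": 1, "in1": 1, "in2": 0, "out": 0})["out"])  # 1
```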

  4. While it’s a cool project, I don’t like the implication on his site that he made the first strange computer architecture ever.

    Here’s a challenge: make a CPU that has as small an instruction set as possible, while still being easy to program.

  5. There are probably uses for a 2-bit computer.

    Most CD players use a single 1-bit DAC clocked really fast, with buffer RAM to sync the channels, instead of dual 16-bit DACs running at the exact speed required to decode the audio in real time.

    1. And most other CD players don’t operate at exactly 44.1 kHz, either: oversampling is more the rule than the exception (4x or 8x being common multipliers). Doing so keeps the analog anti-aliasing (reconstruction) filter simple, cheap, and inaudible.
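
      For the 1-bit DAC idea, a first-order delta-sigma modulator is the textbook starting point. This sketch is a generic illustration, not the design of any particular CD player chip (real parts use higher-order modulators and analog filtering). Each input sample in [-1, 1] is held for `osr` ticks (an 8x oversampling ratio by default, which at 44.1 kHz would mean 352.8 kHz), and the modulator emits a +1/-1 bitstream whose average tracks the input. A plain average stands in for the low-pass filter.

```python
def delta_sigma(samples, osr=8):
    """First-order delta-sigma modulator.

    Each input sample (a float in [-1, 1]) is held for `osr` ticks, and every
    tick emits one bit, +1 or -1, chosen so the accumulated error stays bounded.
    """
    integrator = 0.0
    bits = []
    for x in samples:
        for _ in range(osr):
            y = 1 if integrator >= 0 else -1  # the 1-bit quantizer
            integrator += x - y               # accumulate what the output missed
            bits.append(y)
    return bits


def average(bits):
    """Crude stand-in for the analog low-pass filter after a 1-bit DAC."""
    return sum(bits) / len(bits)


bits = delta_sigma([0.5] * 100)
print(len(bits), average(bits))  # 800 0.5
```

Because the integrator stays within ±2, the average of N output bits is within 2/N of a constant input, which is why a fast 1-bit stream can stand in for a slower multi-bit converter.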

    1. Lessee.. 10THz is 10^13 cycles per second, and the speed of light is 3*10^8 meters per second. So at the speed of light, the physical distance between signal wavefronts will be about 30 microns. Actual signal propagation rates are about half that, so let’s call it 15 microns.

      15 microns is 15,000 nanometers, and we have 32 nanometer processes these days. That means you could fit about 450 minimum-aspect features between wavefronts. A minimum-aspect transistor is about 9×16 minimum feature squares, so let’s be generous and say you could fit a 30×40 array of transistors into the space between wavefronts.

      Of course, that assumes that the transistor at the left side will be a full clock cycle ahead of the one at the right side (or vice versa). Transistors will be 180 degrees out of phase within 15-20 transistors of each other, and 90 degrees out of phase within 8-10.

      Conclusion #1: clock synchronization is gonna be a bitch.
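
      The geometry can be rechecked numerically. All constants below are the comment’s own assumptions (a 10 THz clock, on-chip signals at half the speed of light, 32 nm features); the script only does the division, including where 180 and 90 degrees of phase shift land.

```python
SPEED_OF_LIGHT = 3e8   # m/s
CLOCK_HZ = 10e12       # the 10 THz clock under discussion
PROPAGATION = 0.5      # on-chip signals assumed to travel at about half of c
FEATURE_M = 32e-9      # 32 nm process

wavelength_m = SPEED_OF_LIGHT / CLOCK_HZ  # wavefront spacing at light speed
on_chip_m = wavelength_m * PROPAGATION    # wavefront spacing on the chip
features = on_chip_m / FEATURE_M          # minimum features between wavefronts

print(f"wavefront spacing in free space: {wavelength_m * 1e6:.1f} µm")
print(f"wavefront spacing on chip: {on_chip_m * 1e6:.1f} µm")
print(f"32 nm features between wavefronts: {features:.0f}")
# A full cycle spans on_chip_m, so half of it is 180 degrees and a quarter is 90.
print(f"180 degrees at {on_chip_m / 2 * 1e6:.2f} µm, 90 degrees at {on_chip_m / 4 * 1e6:.2f} µm")
```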

      Each transistor has the usual junction, gate, and substrate parasitic capacitances, which are generally measured in the femtofarad-to-picofarad range. Assuming you have to shove 100 femtofarads of capacitance around to turn the transistor on or off, then multiplying by frequency, you’ll need about an amp of current per transistor to make the thing go. Let’s be generous and assume that the logic levels are 100mV apart, so we only get 100mW per transistor. That translates to something like 120W of power consumption out of our 30×40 array of impossible-to-sync transistors.

      Conclusion #2: the energy density of the chip will be well beyond that of the average nuclear reactor core.
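
      The power arithmetic, redone with the comment’s constants. Two models are shown: the comment’s chain (current taken as C·f, which is the current at a 1 V swing, about 1 A, then multiplied by the 100 mV swing) and, for comparison, the usual dynamic switching estimate P = C·V²·f, in which the current also scales down with the swing. Either way the array lands in the tens to hundreds of watts within a patch of silicon roughly 15 microns across.

```python
C_LOAD_F = 100e-15  # 100 fF switched per transistor (the comment's assumption)
CLOCK_HZ = 10e12    # 10 THz
SWING_V = 0.1       # 100 mV between logic levels
N_TRANSISTORS = 30 * 40

# The comment's chain: C*f is the current at a 1 V swing (about 1 A),
# then that current times the 100 mV swing gives the power per transistor.
current_a = C_LOAD_F * CLOCK_HZ
power_each_w = current_a * SWING_V
total_w = power_each_w * N_TRANSISTORS

# Textbook dynamic switching power, P = C * V^2 * f, for comparison.
cv2f_each_w = C_LOAD_F * SWING_V ** 2 * CLOCK_HZ
cv2f_total_w = cv2f_each_w * N_TRANSISTORS

print(f"comment's model: {power_each_w * 1e3:.0f} mW each, {total_w:.0f} W total")
print(f"C*V^2*f model: {cv2f_each_w * 1e3:.0f} mW each, {cv2f_total_w:.0f} W total")
```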

      Copper interconnects suffer 99% electromigration failure within an hour at current densities of 15 million amps per square centimeter of cross section. There are 10^14 square nanometers per square centimeter, and a feature is 32 nanometers wide while the average interconnect is about a micron thick, for a cross section of about 32,000 square nanometers.

      Plugging and chugging, that gives us an MTBF well under an hour for a minimum-feature interconnect carrying more than about 5 milliamps.

      We need an amp to drive the transistor, and let’s be generous and assume the MTBF scales linearly with current. That means the average trace has an MTBF of about 17 seconds.

      Conclusion #3: I don’t know whether to nominate this for a Magic Smoke award or a Darwin award.
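
      The same kind of check for the electromigration estimate. The 15 MA/cm² figure, the 32 nm × 1 µm cross section, the 1 A drive current, and the linear MTBF scaling are the comment’s assumptions. Black’s equation has electromigration lifetime falling with current density to a power n that is commonly taken to be about 2, so linear scaling is indeed the generous choice.

```python
LIMIT_A_PER_CM2 = 15e6         # density giving 99% failure within an hour
NM2_PER_CM2 = 1e14             # (1e7 nm per cm) squared
CROSS_SECTION_NM2 = 32 * 1000  # 32 nm wide by ~1000 nm thick
DRIVE_A = 1.0                  # current the comment says each transistor needs
LIMIT_HOURS = 1.0

limit_a = LIMIT_A_PER_CM2 * CROSS_SECTION_NM2 / NM2_PER_CM2
# Generous assumption from the comment: MTBF falls linearly with current.
mtbf_s = LIMIT_HOURS * 3600 * limit_a / DRIVE_A

print(f"current at the one-hour limit: {limit_a * 1e3:.1f} mA")
print(f"MTBF at {DRIVE_A:.0f} A: {mtbf_s:.0f} s")
```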

  6. I do not think this can be considered a CPU; the whole thing seems to be just a control unit, one part of a CPU. As someone has pointed out, it is a kind of lazy instruction decoder.

    1. That’s not really true. I don’t know why the article only talks about the instruction decoder, because that’s just one (relatively small) part of this architecture.

      It’s a fully featured and highly customizable, modular architecture. I even implemented a playable Pong clone on this architecture, as well as many other examples, but the article doesn’t mention any of this for some reason.

      1. Yes, I agree, but what the article describes can only be considered a control unit.

        I have briefly read the documentation and agree that this is a modular system with up to 256 units. But how the units are actually implemented is not discussed; they are assumed to be “somehow” available. As far as I can see, only addressing and controlling those units is described.
        So in my view the whole innovation lies in how these units can be connected and controlled by the control signals generated by what is referred to here as the “WPU”.
