Although generative language models have found little widespread, profitable adoption outside of putting artists out of work and giving tech companies an easy scapegoat for cutting staff, their underlying technology remains a fascinating area of study. Stepping back to the more innocent time of the late 2010s, before the cultural backlash, we can examine these models in their early stages. Or we can see how even older technology handles these types of machine learning algorithms in order to understand more about their fundamentals. [Damien Boureille] has put a 60s-era IBM as well as a PDP-11 to work training a transformer algorithm in order to take a closer look at it.
For such old hardware, the task [Damien Boureille] is training his transformer to perform is reversing a list of digits. This is a trivial problem for something like a Python program but much more difficult for a transformer. The model relies solely on self-attention and a residual connection. To fit within the 32KB memory limit of the PDP-11, it employs fixed-point arithmetic and lookup tables to replace computationally expensive functions. Training is optimized with hand-tuned learning rates and stochastic gradient descent, achieving 100% accuracy in 350 steps. In practice, this means he was able to get the training time down from hours or days to around five minutes.
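The actual implementation is PDP-11 assembly, but the fixed-point and lookup-table tricks described above can be sketched in Python. Everything here is an illustrative assumption, not the project's code: the Q8.8 format and the function names are hypothetical, and the table-based `exp` stands in for whichever expensive functions the real model replaces.

```python
# Hypothetical sketch of the fixed-point tricks described above.
# The Q8.8 format (8 integer bits, 8 fractional bits) is an assumption.
import math

FRAC_BITS = 8          # Q8.8: stored value = real value * 256
SCALE = 1 << FRAC_BITS

def to_fix(x: float) -> int:
    """Convert a float to Q8.8 fixed point."""
    return int(round(x * SCALE))

def fix_mul(a: int, b: int) -> int:
    """Multiply two Q8.8 numbers. The raw product carries 16
    fractional bits, so shift back down by FRAC_BITS."""
    return (a * b) >> FRAC_BITS

# Replace an expensive function (here exp, as used in e.g. softmax)
# with a table precomputed over a small input range [-4, 0].
EXP_TABLE = [to_fix(math.exp(i / SCALE)) for i in range(-4 * SCALE, 1)]

def fix_exp(x: int) -> int:
    """exp() for Q8.8 inputs in [-4, 0], via table lookup only."""
    return EXP_TABLE[x + 4 * SCALE]
```

On a machine with no floating-point hardware, every call becomes integer adds, shifts, and an indexed load, which is exactly the kind of operation a PDP-11 is fast at.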
Not only does a project like this help understand these tools, but it also goes a long way towards demonstrating that not every task needs a gigawatt datacenter to be useful. In fact, we’ve seen plenty of large language models and other generative AI running on computers no more powerful than an ESP32 or, if you need slightly more computing power, on consumer-grade PCs with or without GPUs.

I would like a much more detailed explanation.
Per local convention, that yellow text in the first paragraph is a hyperlink: Click on that to see a much more comprehensive body of text, with additional hyperlinks to further information.
I wonder if an operating system could be coded so tightly it could not be hacked—not assembler—a 0 and a 1 at a time.
A savant, an Andrew Wiles, or quantum AI might…I hope.
Then that could be hard wired into infrastructure, with AGI like Colossus unable to break into it later on, no matter how sophisticated.
O/T rant over
Long, long time ago, in the galaxy right here, we used to assemble programs by copying and pasting snippets of punch tape from a library of functions, with some code in between written in opcodes by manual punch machine. Then a master punch tape was made, verified using a comparator machine, and handed to the sysop. The tape was run on the computer and it either worked or not. If it worked, the results, in the form of a printout or new punch tape, were returned. No OS, no assembler, not even 0’s, only 1’s in the form of holes…
As for hard-wiring, Apollo Guidance Computer software was literally that, using core rope memory. And if somehow a software bug sneaked into the rope, they had to make a new one. And not once, but multiple times as there were multiple computers that needed a copy for testing and training and redundancy…
Sure, it’s not hard to put the logic in ROM (including mask ROM). If the original code was correct without bugs, it will be without bugs until the hardware itself is destroyed. The main issue is that, for nontrivial logic, this rapidly gets expensive.
The issue is not the tight coding. It’s the non-alterability. Otherwise, no matter how well-coded it is, you can attack it simply by altering it so that it either directly does the “altered task” or so that it now has an external control vector and can be instructed to alter the task on the fly.
Read-only logic is both the solution to this problem and a technically-burdensome challenge (scalability/performance/diespace limits, and still vulnerable to physical attacks like delidding and ablating the die)
It’s not a panacea.
If the machine is designated to perform a specific job, it can be hardwired to do that job. This was done in industrial control well before computers were invented. For example, automata in Disney parks used to be operated using stacks of cams, each cam operating a single function in one of the puppets or scenery. In the 1980s and 1990s we used lots and lots of EPROMs that were UV-wiped. Early microcontrollers also used UV-wiped EPROMs for storing code internally. People used EPROM emulators that stored code in RAM for development and debugging, then they burned the ROM and tested it in a device.
The real problem is when you want a general computer. The OS can be hardwired, but the user code can’t. You can separate user spaces from each other by hardware or software memory controller, but still user code needs to be able to do anything within its memory space. So unless there is a hardware task switching system that can’t be controlled by software, a user’s program could take over all resources and never give them up. That’s why we have administrator accounts, which creates its own set of vulnerabilities, and the more complex the system, the more holes there will be. No matter how many people audit the source code…
On related note: LLMs can’t write secure code because these models are trained on all the source code available online, and most of it is badly written. I don’t think the “quantum AI” will be much better, but it would be faster…
Assembly language is a representation of bit patterns, generally 1:1 for most architectures*, so doing bit-level coding would just be stupid. An assembler, by the way, is the tool that converts assembly language to the binary representation.
Hacking takes advantage of 3 general things: software bugs, hardware bugs, or indirect exposure of privileged data. Your idea wouldn’t help with any of those things.
(* not true for x86, due to the way it was developed; some instructions can have several bit-pattern encodings)
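The 1:1 mapping is easy to see on the PDP-11 from the article itself. A double-operand instruction packs a 4-bit opcode and two 6-bit operand fields into a single 16-bit word, so register-to-register MOV can be assembled by pure bit-shifting. This toy encoder is my own illustration, not part of the project:

```python
# Minimal illustration of the 1:1 assembly-to-bits mapping on the
# PDP-11. Double-operand instructions: 4-bit opcode (bits 15-12),
# 6-bit source field, 6-bit destination field.
MOV = 0o01  # MOV opcode, occupying the top four bits

def encode_mov_reg(src: int, dst: int) -> int:
    """Encode MOV Rsrc,Rdst with register addressing mode 0 on
    both operands (mode bits zero, so the field is just the
    register number)."""
    return (MOV << 12) | (src << 6) | dst

print(oct(encode_mov_reg(0, 1)))  # MOV R0,R1 -> 0o10001
```

Run the mapping in reverse and you have a disassembler; that symmetry is exactly why hand-writing raw bits buys nothing over assembly on such architectures.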
No, because all “hacked” means in this type of context breaks down to doing something the originator didn’t intend. The originator can’t know all the possible intents of anything, even if it’s just flipping the on switch.
“Not only does a project like this help understand these tools, but it also goes a long way towards demonstrating that not every task needs a gigawatt datacenter to be useful.”
DOES it, though? Because, while I agree the project is interesting, and has great educational value, I would argue that it’s very far from useful. If anything, I’d say that it helps make a case FOR those gigawatt data centers. If it takes dedicating an entire PDP-11 just to accomplish a task as trivial as reversing a list of numbers, you’re going to need an awful lot of them before you get to anything approaching real-world “useful”.
Again, I’m not criticizing the project AT ALL. I feel it’s immensely clever, and agree there’s a huge amount to be learned from and through this sort of low-fi approach. But we shouldn’t delude ourselves into thinking it’s anything more than that, and certainly not something that would be described as useful.
Reversing a sequence isn’t trivial for a neural network; it wasn’t achieved until the late 2010s.
Take a look at what the README says:
“Despite its apparent simplicity, reversal is not a trivial task for a neural network: the model must learn to route each token to a position that depends only on its index, with no content-based shortcut. This is the kind of problem that self-attention is designed for, and is in fact one of the algorithmic benchmarks included in Tensor2Tensor, Google’s reference implementation of the original transformer in 2017.”
I’d say the merit of this work isn’t even the code, but the cleverness of the design with the prototypes, the hand-tuned fixed point arithmetic, then the performance tuning on the real hardware. This achieves in 6KB of assembly, 19 KB of memory and 5.5 minutes of training what required a multi-gigabyte framework stack in 2017, and orders of magnitude more resources.
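The README's point about routing each token by index alone can be made concrete with a toy sketch. In a trained transformer the attention weights would emerge from learned positional encodings; here the pattern the model must learn is simply hard-wired, so this is an illustration of the attention structure, not the project's algorithm:

```python
# Toy illustration of why reversal suits self-attention: the
# attention pattern depends only on position, never on content.
def positional_reverse_attention(tokens):
    n = len(tokens)
    out = []
    for i in range(n):
        # A trained model computes these scores from positional
        # encodings; we hard-wire the target pattern instead:
        # output position i attends fully to input position n-1-i.
        weights = [1 if j == n - 1 - i else 0 for j in range(n)]
        # Attention output = weighted sum of the value vectors
        # (here, the tokens themselves).
        out.append(sum(w * v for w, v in zip(weights, tokens)))
    return out

print(positional_reverse_attention([3, 1, 4, 1, 5]))  # [5, 1, 4, 1, 3]
```

Notice there is no content-based shortcut anywhere: the weights never look at the token values, which is exactly the property the README describes.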
Transformer as in an inductive voltage changer, transformer as in the car robot (there, at least, the training makes sense to me), or what kind of transformer?
Bah, yes, the original article explains the whole thing, but even the HAD tag points to articles about inductive voltage changers.
It’s a transformer like the T in ChatGPT’s “Generative Pre-trained Transformer”