Although generative language models have found little widespread, profitable adoption outside of putting artists out of work and giving tech companies an easy scapegoat for cutting staff, their underlying technology remains a fascinating area of study. Stepping back to the more innocent time of the late 2010s, before the cultural backlash, we could examine these models in their early stages. Or we could go even further back and watch much older hardware grind through these machine learning algorithms, to understand more about their fundamentals. [Damien] has put a 60s-era IBM as well as a PDP-11 to work training a transformer algorithm in order to take a closer look at it.
The task [Damien] trains his transformer on is reversing a list of digits. That's trivial for a Python program, but much more difficult for a transformer to learn. The model relies solely on self-attention and a residual connection. To fit within the 32KB memory limit of the PDP-11, it employs fixed-point arithmetic and lookup tables in place of computationally expensive functions. Training uses stochastic gradient descent with hand-tuned learning rates, reaching 100% accuracy in 350 steps. In practical terms, that brought the training time down from hours or days to around five minutes.
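To give a flavor of the fixed-point-plus-lookup-table trick, here is a minimal Python sketch of an integer-only softmax, the kind of computationally expensive function such tables typically replace in an attention layer. The Q4.12 format, table size, and input range below are illustrative assumptions, not details from [Damien]'s implementation.

```python
import math

FRAC_BITS = 12            # Q4.12: 4 integer bits, 12 fractional bits (assumed format)
ONE = 1 << FRAC_BITS      # fixed-point representation of 1.0

def to_fixed(x: float) -> int:
    return int(round(x * ONE))

def from_fixed(x: int) -> float:
    return x / ONE

# Precompute exp() over [-8, 0) once. After subtracting the row maximum,
# softmax only ever needs exp of non-positive arguments, so a small
# one-sided table suffices: 256 two-byte entries is just 512 bytes.
TABLE_SIZE = 256
STEP = (8 * ONE) // TABLE_SIZE
EXP_TABLE = [to_fixed(math.exp(-(i * STEP) / ONE)) for i in range(TABLE_SIZE)]

def fexp(x: int) -> int:
    """exp(x) for fixed-point x <= 0; anything below -8 underflows to zero."""
    idx = (-x) // STEP
    return EXP_TABLE[idx] if idx < TABLE_SIZE else 0

def fixed_softmax(scores: list[int]) -> list[int]:
    """Attention weights computed entirely in integer arithmetic."""
    m = max(scores)
    exps = [fexp(s - m) for s in scores]
    total = sum(exps)
    # Rescale so the weights sum to (approximately) ONE, i.e. 1.0.
    return [(e << FRAC_BITS) // total for e in exps]
```

On a machine with no floating-point hardware, each `exp()` becomes a single table load and each normalization a couple of integer shifts and divides, which goes a long way toward turning hours of training into minutes.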
Not only does a project like this help us understand these tools, it also goes a long way toward demonstrating that not every task needs a gigawatt datacenter to be useful. In fact, we've seen plenty of large language models and other generative AI running on computers no more powerful than an ESP32 or, if you need slightly more computing power, on consumer-grade PCs with or without GPUs.