Train A GPT-2 LLM, Using Only Pure C Code

April 28, 2024

[Andrej Karpathy] recently released llm.c, a project that focuses on LLM training in pure C, once again showing that working with these tools isn’t necessarily reliant on sprawling development environments. GPT-2 may be older but is perfectly relevant, being the granddaddy of modern LLMs (large language models) with a clear heritage to more modern offerings.

LLMs are fantastically good at communicating despite not actually knowing what they are saying, and training them usually relies on the PyTorch deep learning library, which is normally driven from Python. llm.c takes a simpler approach by implementing the neural network training algorithm for GPT-2 directly. The result is highly focused and surprisingly short: about a thousand lines of C in a single file. It is an elegant approach that does the same thing the bigger, clunkier methods accomplish. It can run entirely on a CPU, or it can take advantage of GPU acceleration where available.

This isn’t the first time [Andrej Karpathy] has bent his considerable skills and understanding towards boiling down these sorts of concepts into bare-bones implementations. We previously covered a project of his that is the “hello world” of GPT, a tiny model that predicts the next bit in a given sequence and offers low-level insight into just how GPT (generative pre-trained transformer) models work.

7 thoughts on “Train A GPT-2 LLM, Using Only Pure C Code”

Reluctant Cannibal says:

April 28, 2024 at 3:31 am

Nice work, with plenty of documentation on training. Often these people focus too much on the tricky tech stuff when what I want to start with is ‘What does it do?’, i.e. deployment. I can see something vague in the docs:
step 1/74: train loss 4.367631 (80.639749 ms)
step 2/74: train loss 4.031242 (77.378867 ms)
step 3/74: train loss 4.034144 (77.315861 ms)
step 4/74: train loss 3.859865 (77.357575 ms)
...
step 72/74: train loss 3.085081 (78.850895 ms)
step 73/74: train loss 3.668018 (78.197064 ms)
step 74/74: train loss 3.467508 (78.009975 ms)
val loss 3.516490
generating:
---
?Where will you go? I take you wherefore I can, myself, and must. I cast off my beak, that I may look him up on the point; For on his rock shall he be opencast.
My little nephew: Keep on with me, my
….. but a 2 minute youtube demo video would have been better and then my interest level would be fired up to get into the details.

1. Neverm|nd says:
  
  April 28, 2024 at 5:00 am
  
  Check out his ‘Zero to Hero’ series on YouTube (or specific to this: https://www.youtube.com/watch?v=zduSFxRajkE&list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ&index=9). Though there he is doing it in Python.
  
  1. Reluctant Cannibal says:
    
    April 28, 2024 at 5:41 am
    
    Cheers … I will do that.
    
  2. Reluctant Cannibal says:
    
    April 28, 2024 at 6:03 am
    
    Best starting point is here: https://youtu.be/kCc8FmEb1nY?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ
    
Coenraad Loubser says:

April 28, 2024 at 12:34 pm

But will it run on embedded? A great way to learn is to do! On the issue tracker there are “Good first issues” that anybody should be able to do. The CPU implementation also still, despite all compiler optimizations, runs 6x slower than PyTorch’s CPU implementation. My favorite part is that you can listen to the source code in “electro swing”: https://x.com/dagelf/status/1777563438207631716

Nickey Joe Atchison says:

April 30, 2024 at 3:58 am

Digital cognition uses combinations of inverters and Boolean logic to develop models. Bio-cognition uses stereo-specific chemistry like the immune system and rubber bands to generate memory-based analog models. Both can develop graphics and Nomographs that model things (like an Orrery) and can be used to analyse discrepancies. An Orrery is a self-organizing map (SOM). In the semiconductor industry the cognoscenti use techtonic oriented written in Deep “C” to know what is going on in the FAB. SEE Nickey Joe Atchison’s Texas Instruments and Cypress Semiconductor data analysis patents. The Techtonic Orries actually “KNOW” what is going on.

David Sutherland says:

May 1, 2024 at 2:27 am

Somehow convert that C code into an ASIC and then can we move past GPUs for training?

Is that what grow.com already does?


Hackaday
