How AI Large Language Models Work, Explained Without Math

Large Language Models (LLMs) are everywhere, but how exactly do they work under the hood? [Miguel Grinberg] provides a great explanation of the inner workings of LLMs in simple (but not simplistic) terms, eschewing the low-level mathematics in favor of laying bare what it is they actually do.

At their heart, LLMs are prediction machines that work on tokens (small groups of letters and punctuation) and are, as a result, capable of great feats of human-seeming communication. Most technically minded people understand that LLMs have no idea what they are saying, and this peek at their inner workings will make that abundantly clear.
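
To make that concrete, here is a minimal Python sketch of next-token prediction. The probability table below is a hypothetical stand-in for a trained model; a real LLM derives such probabilities from billions of learned parameters, but the generation loop is the same idea: predict a token, append it, repeat.

```python
import random

# Hypothetical toy "model": for each token, the probabilities of the next token.
# A real LLM computes these from billions of learned parameters.
next_token_probs = {
    "the": {"cat": 0.5, "dog": 0.4, "<end>": 0.1},
    "cat": {"sat": 0.7, "ran": 0.2, "<end>": 0.1},
    "dog": {"ran": 0.6, "sat": 0.3, "<end>": 0.1},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}

def predict_next(token):
    """Sample the next token according to the table's probabilities."""
    choices, weights = zip(*next_token_probs[token].items())
    return random.choices(choices, weights=weights)[0]

# Autoregressive generation: feed each prediction back in as the new input.
token, output = "the", ["the"]
while token != "<end>":
    token = predict_next(token)
    if token != "<end>":
        output.append(token)

print(" ".join(output))  # e.g. "the cat sat"
```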

Be sure to also review an illustrated guide to how image-generating AIs work. And if a peek under the hood of LLMs left you hungry for more low-level details, check out our coverage of training a GPT-2 LLM using pure C code.

24 thoughts on “How AI Large Language Models Work, Explained Without Math”

    1. I’d rather have all maths explained and demonstrated in Python and/or MATLAB. Mathemagicians seem to invent their own symbols and casually interchange them as they go along, in my experience. Very confusing when trying to learn something new.

      1. Yes x 9000!
        I got frustrated by this! I was comparing AI back-propagation math papers, but each had subscripts and superscripts with various interpretations. It was so frustrating that I made a video:

        No 2D subscripts/superscripts. Every term has a unique ID, so there is NO misinterpretation!

        https://youtu.be/DTRNOJBIDMY?si=UuI5kNyYhsyvk1vM

        Honestly, I think the math people WANTED their papers to be confusing!

  1. Hackaday should take a poll.

    How many readers took calculus, how many took calculus for business majors (arm-wave calc without the math), how many are just really bad at math (no calc at all), and how many are kids still working through math class?

    I don’t think this is on point for Hackaday readers, but I accept I could be wrong.
    There are some clearly innumerate posters here.

    1. Calc 1, Calc 2, Multivariable Calc, Differential Equations, Strength of Materials, and Fluid Dynamics. I got all the math… but then I went into finance :P and I couldn’t do those problems without significant brushing up, and maybe not even then (this was nearly 20 years ago).

      1. I didn’t learn how to properly study / memorize in a way that would help me remember until after college.

        The only way I have found to work well is to write down what I want to recall.

        Reciting aloud what I want to remember works less well.

        Any thought exercise involving visualization is utterly useless for me.

        Having a deficient episodic/autobiographical memory means I cannot mentally retake classes.

        1. >a deficient episodic/autobiographical memory

          Most people can’t remember what they did last week, let alone ten years ago. Forgetting is a part of memory processing, and those people who can remember “everything” are typically deficient in much worse ways.

          If you really want to memorize something, the technique is called spaced repetition (or “spaced learning”). Each repetition retains the memory longer, until eventually it “sticks”. The training sessions are spaced further and further apart, each timed for the moment you’re just about to forget, and the brain learns to keep the information; this is sometimes called “overlearning”. (A rough sketch of such a spacing schedule follows at the end of this comment.)

          This is also why certain skills, like knowing which command-line operations to perform to configure a computer program, are difficult, if not impossible, to retain. Since the operation is typically only performed once in a blue moon, you’re starting from scratch every time, and you forget it by the time you need to do it again. People who have over-learned the operation because they have performed it repeatedly find it easy and efficient, and cannot fathom why others find it obtuse.
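
          As a rough illustration of that spacing idea, here is a tiny Python sketch of a review schedule whose gaps grow after each successful recall. The starting interval and growth factor are made-up numbers for illustration, not taken from any particular published method.

```python
def schedule(reviews, first_interval=1.0, growth=2.5):
    """Return day offsets for each review, spaced further and further apart.

    first_interval and growth are illustrative values, not a specific method.
    """
    day, interval, days = 0.0, first_interval, []
    for _ in range(reviews):
        day += interval
        days.append(round(day))
        interval *= growth  # each successful recall earns a longer gap
    return days

print(schedule(5))  # [1, 4, 10, 25, 64]
```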

    2. The vast majority of people in these comments took calculus and are considered “better than average” at math.

      But I agree that, despite my math minor, I prefer the math without the symbolic notation for the quickest comprehension of this.

    1. There’s a tiny bit of calculus in the backpropagation of errors.

      Effectively, during training you grab the node in error, calculate the derivative of the prior nodes’ effects on the error node, then multiply that derivative (i.e., the slope of the amount of effect) by the learning rate, and use that to adjust the weights on those previous nodes. Then you take those previous nodes and repeat the operation to backpropagate the error adjustments further back, until you eventually reach the input nodes and the process stops. (The input nodes have no prior nodes.)
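
      A minimal numeric sketch of that update step in Python, using a made-up two-weight chain (input to hidden to output) with a squared-error loss and no activation functions, so the chain rule stays visible. All values and the learning rate are arbitrary.

```python
# Tiny made-up network: y = w2 * (w1 * x), loss = (y - target)^2
x, target = 1.5, 2.0
w1, w2 = 0.8, 0.5
lr = 0.1                      # learning rate

# Forward pass
h = w1 * x                    # hidden node
y = w2 * h                    # output node
loss = (y - target) ** 2

# Backward pass: chain rule, starting from the error and moving toward the input.
dloss_dy = 2 * (y - target)   # slope of the loss w.r.t. the output
dloss_dw2 = dloss_dy * h      # output weight's effect on the error
dloss_dh = dloss_dy * w2      # propagate the error back to the hidden node
dloss_dw1 = dloss_dh * x      # hidden weight's effect on the error

# Adjust each weight against its gradient, scaled by the learning rate.
w2 -= lr * dloss_dw2
w1 -= lr * dloss_dw1
print(loss, w1, w2)           # the loss shrinks on the next forward pass
```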

      What was interesting to me is that this backprop derivative can be generated automatically by the compiler. Essentially, the programmer codes the forward characteristics of the nodes and the compiler can automatically calculate and generate code for the backprop derivative of those nodes. Then a simple library interface can apply the backprop learning on all nodes by calling the backprop function appropriate to each node.

      Put whatever multiply/divide/sum into a node, and the compiler will generate code for the forward propagation, and code for the derivative/reverse propagation as well.
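
      A sketch of that idea in plain Python, in the spirit of what such a compiler or framework does automatically: the programmer writes only the forward operations, and each operation records how to push a gradient back to its inputs. This is a made-up minimal class for illustration, not any particular library’s API.

```python
class Node:
    """A value plus a closure that knows how to backpropagate its gradient."""

    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, parents, 0.0
        self._backward = lambda grad: None   # leaves have nothing behind them

    def __mul__(self, other):
        out = Node(self.value * other.value, (self, other))
        def backward(grad):                  # d(a*b)/da = b, d(a*b)/db = a
            self.grad += grad * other.value
            other.grad += grad * self.value
        out._backward = backward
        return out

    def __add__(self, other):
        out = Node(self.value + other.value, (self, other))
        def backward(grad):                  # d(a+b)/da = d(a+b)/db = 1
            self.grad += grad
            other.grad += grad
        out._backward = backward
        return out

    def backprop(self):
        """Hand each node's gradient to its parents, walking back to the inputs.

        This simple traversal assumes no node is reused; real libraries
        topologically sort the graph to handle shared nodes correctly.
        """
        self.grad = 1.0
        stack = [self]
        while stack:
            node = stack.pop()
            node._backward(node.grad)
            stack.extend(node.parents)

# The programmer writes only the forward expression...
w, x, b = Node(0.5), Node(1.5), Node(0.1)
y = w * x + b
# ...and the recorded operations produce the derivatives.
y.backprop()
print(y.value, w.grad, x.grad, b.grad)       # 0.85 1.5 0.5 1.0
```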

    2. Yes, as PWalsh mentions, it is important to know that all the major platforms (TensorFlow, PyTorch, etc.) will, once you compile the model, perform the differentiation of your loss function for you during back-prop (GradientTape in the TF case, Autograd in PyTorch).

      Karpathy even has an open-source autograd-style engine you can take a look at if you want to: https://github.com/karpathy/micrograd

      However, this is not just ‘laziness’ or even ‘convenience’. I mean, if your model has 3 layers, yeah, you can work the derivatives out by hand; but if you are working with, say, *100* hidden layers and are constantly adjusting/modifying/tweaking your model structure, doing all the gradients out by hand suddenly becomes a ‘non-trivial’ problem.

      But @carpet, you are correct: model run/train (at least in most models) is all LA, but model construction is Calc/LA.
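
      For concreteness, a minimal PyTorch sketch of that division of labor: autograd differentiates the loss during training, while the forward pass itself, and inference in particular, is just linear algebra. The shapes and values here are arbitrary toy choices.

```python
import torch

# Arbitrary toy sizes: one linear "layer" applied to a batch of 4 inputs.
x = torch.randn(4, 3)
target = torch.randn(4, 2)
W = torch.randn(3, 2, requires_grad=True)   # autograd will track this weight

# Training step: forward pass, loss, automatic differentiation.
pred = x @ W                                # the forward pass is a matrix multiply
loss = ((pred - target) ** 2).mean()
loss.backward()                             # autograd fills in W.grad for us

with torch.no_grad():
    W -= 0.1 * W.grad                       # manual gradient-descent nudge

# Inference: no calculus involved, just the matrix multiply.
with torch.no_grad():
    print(x @ W)
```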

    1. @a_do_z Some would say *everything* is nothing but math :D

      But, jokes aside, yes: there is no ‘genie in the bottle’. It is a matter of modeling, via neural networks, the (massively multi-dimensional) function of the problem at hand.

  2. With AI, there’s a certain style of people who fall victim to, let’s say “fuzzy thinking”, when it comes to identifying causes and effects. It might also be called “magical thinking”.

    In general, it’s the confusion of an analog of something with the thing itself, as in saying, “if it walks like a duck and talks like a duck, it’s a duck”. That approximation is reliable only when one can determine exact likeness, not just approximate likeness; after all, it could be a decoy duck and a hunter in the bushes blowing on a duck call. Otherwise it becomes a fallacy of categorical fuzziness, false generalization, jumping to conclusions, etc.

    Where it becomes magical thinking is in the confusion of a model with the real thing. You can build an electronic circuit which, on an oscilloscope screen, behaves like a bouncing ball, and you get the game “Pong”. Nobody can argue, though, that there are actually any balls or rackets involved. It just looks like a game of tennis, or whatever else you interpret it as.

    When the model becomes complex enough, the ability or willingness of people to discriminate between it and the real thing diminishes to the point that a lot of people just start calling it the real thing. It is rather like when people begin to feel that a magic trick is real magic because they don’t know what “magic” actually is and they can’t see the magician’s hands doing the trick. Something mysterious is happening, so instead of looking deeper into the mystery, they declare special powers to be the cause of the effect.

    Mere coincidences start to look like miracles, and wrong models that return correct answers by dumb luck or special fitting are deemed correct models. I prayed for rain yesterday and today it rains, therefore prayer works and my mental model of the world that includes deities and spirits is correct. My chat program returns intelligible answers, therefore it is intelligent.

    The advanced school of thought around the subject is equally lazy. Those who agree that there is no such thing as “magic” reject correct explanations of the phenomenon in favor of the analog models of it, treating the model as the thing itself. In effect, they argue that since there is no definition of a thing, whatever appears to be it, is it. All magic is just tricks; all phenomena can be explained by your present model of reality and nothing outside of it. It’s a form of appeal to ignorance: a thing like “intelligence” is nothing but a computer program, because that’s the only working model we have for dealing with the concept.

    It’s the 19th-century arrogance of saying “All physics is solved; we just need to work out the details.” Then came quantum mechanics, which was discovered because the Newtonian models had gaping holes and couldn’t explain some things. The second school of thought here turns a blind eye to the “holes” in the model, ignoring its failures and instead arguing that the phenomena we’re seeing are mere illusions and artifacts, not real, so they don’t need further examination. It leads to the conclusion that anything beyond a computer program as a definition for intelligence must be an apparition, confusion, “magic”, and therefore not real, because you’ve already decided that you know all the relevant theory about the topic. It’s just a matter of working out the details and figuring out which computer programs are “intelligent”. Then it reduces back to the first kind of judgement, where getting it “close enough” is deemed a success.

    1. I’m not going to say that the modern “Eliza on a Cray Supercomputer” LLMs are sentient. By 1980s terms a modern computer is a Cray supercomputer and then some, FYI. It’s just modern tech applied to chatbots, for now.

      However, does one of our neurons have any idea what it is doing?
      If there’s a functional intelligence, does it matter whether it’s electrochemical communication between cells of a colony organism, independent creatures that act together, gears and cogwheels, or electrons and virtual “neurons” within the latter? Oh, yes, we are approaching the atomic-spin computer era (“quantum” sounds better), and that’s where there’s junk-science speculation that our consciousness lies: that our nerve cells might be able to “quantum teleport” information, which might make things like telepathy legit again…

      Myself, I’m wild for the modern AI tech, and we should support it. The “politics” needs to be for UBI (Universal Basic Income). Pay for it by cutting all the criminal tax breaks and deals the elites have. The empty malls are a product of unrented space being written off on taxes; combined with fake businesses and associations with other businesses, huge money is made through negative taxes. That’s why property-holding companies that pull in malls usually have thousands of properties. Too bad we don’t have a real media anymore, just entertainment of different flavors getting us arguing against each other over dumb issues.

  3. “all [LLMs] can do is take some text you provide as input and guess what the next word (or more accurately, the next token) is going to be”

    To me, this summarizes the article best.
