One of the most hilarious things you can do with an LLM-based chatbot is to ask it to do calculations. A well-written chatbot frontend can detect requests for arithmetic – like summing 1 and 1 – and pass them on to a dedicated calculator application, even if the model still cannot correctly count the ‘r’s in ‘strawberry’. This is where [Alvaro Videla] asks whether it is at all possible to perform arithmetic with the language model itself.
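To make the frontend trick concrete, here is a minimal sketch of how such routing might look. The names `route` and `ask_model` are hypothetical and not taken from any real chatbot; the idea is only that arithmetic is pulled out of the prompt and evaluated deterministically before the model ever sees it.

```python
import ast
import operator
import re

# Hypothetical frontend router; not the API of any real chatbot.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

# Candidate spans: digits, whitespace, dots, parentheses and + - * / only.
_ARITHMETIC = re.compile(r"[\d(][\d\s.+\-*/()]*")


def _evaluate(node):
    """Evaluate a parsed expression using only whitelisted operators."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_evaluate(node.operand)
    raise ValueError("unsupported expression")


def route(prompt, ask_model):
    """Answer arithmetic with the calculator, everything else with the model."""
    for match in _ARITHMETIC.finditer(prompt):
        expr = match.group().strip()
        try:
            tree = ast.parse(expr, mode="eval")
            # A lone number ("I have 3 apples") is not a calculation request.
            if isinstance(tree.body, ast.BinOp):
                return str(_evaluate(tree))
        except (SyntaxError, ValueError, ZeroDivisionError):
            continue
    return ask_model(prompt)
```

Calling `route("What is 1 + 1?", ask_model)` never reaches `ask_model`, while a question about the letters in ‘strawberry’ falls straight through to it, which is exactly why the calculator does not help there.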
Since an LLM at its core is nothing but a vector space of probabilities that a matrix-based inference process uses to create a probabilistic output of tokens, you would not expect a lot of deterministic behavior. How can you do arithmetic without grounding it in some kind of deterministic process?
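As a rough illustration of what a probabilistic output of tokens means in practice, the following sketch samples the next token from a softmax distribution. The scores are made up for the example and do not come from any real model; the point is only that a likely answer is not a guaranteed answer.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw token scores into a probability distribution."""
    scaled = [value / temperature for value in logits.values()]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return {token: e / total for token, e in zip(logits, exps)}


def sample(probs, rng):
    """Draw one token according to its probability."""
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]


# Made-up scores for the token that follows "1 + 1 = " (an assumption,
# not measured from a real model).
logits = {"2": 5.0, "3": 1.0, "11": 0.5}
probs = softmax(logits)

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample(probs, rng) for _ in range(10_000)]
wrong = sum(token != "2" for token in draws)
print(f"P('2') = {probs['2']:.3f}, wrong answers in 10,000 draws: {wrong}")
```

Always picking the highest-scoring token makes the output repeatable, but it does not make it correct: the result is only as good as the scores the model happened to learn.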
This is where [Alvaro]’s Rune project comes into play, which is ‘a mechanism-aware JIT compilation project for language-model arithmetic’. Although it is statistically impossible for an LLM to ever correctly perform any arbitrary series of arithmetic calculations, you can monitor the internal state of the model and intervene once the parameters of an arithmetic calculation have been identified. By putting the correct result back into the inference process and letting it continue, you do not need to rely on external tools.
Ultimately this attempt sort-of worked, but was deemed a failure. It would seem that a language model is the wrong tool after all for replacing the humble calculator.

Claude has a connector to Wolfram. Let each half play to their strengths.
Fascinating, but I would put this in the category of RAG and similar efforts: burning extra GPU cycles means we are brute-forcing solutions to get an LLM to behave in specific ways. The cost of electricity if this became a standard method of arithmetic computation would be unfathomable and preposterous.
The answer is, as always, not to use an LLM. It’s what Gary Marcus has been arguing for quite a while now, about the need for “neurosymbolic AI”, which sadly has seen next-to-zero investment during the era of “just throw more neurons at it and scale up”, i.e. the connectionist argument that neural networks can conquer any problem if you just have enough of them.
I had half expected to encounter a neurosymbolic system partway through reading this, where an LLM is used as a natural language frontend for a theorem solver, but even weirder this turned out to instead be an odd “hand of god” rootkit.
There’s the argument that “well, physical computers are probabilistic too, they’re just very unlikely to be wrong”, but the problem with that line of reasoning is that the only way you’re ever going to lower the probability of wrong answers in an LLM (or any neural network, or any statistical machine learning model, the same math applies either way) is to do a HELL of a lot of training. Even then you still may not have the expressiveness in the model required to encode the correct answers you want, may still get caught in a training plateau even if you do, and still can’t train in any practical sense on “all the numbers” (even if you can generate training examples automatically).
You could cheat and hand-craft a neural network which would always get it right by choosing the weights for the neurons one by one, but we call that logic gate design. It also wouldn’t be very efficient, given you’re using floats to emulate booleans. Also, doing it by hand without a logic synthesis tool is kinda painful. Not impossible, but pretty painful.
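For the curious, a minimal sketch of that cheat might look like the following, assuming simple threshold neurons with weights and biases picked by hand. The function names are invented for this illustration; wiring the gates into a ripple-carry adder gives a "network" that adds correctly every time, with floats standing in for booleans.

```python
def neuron(inputs, weights, bias):
    """A threshold neuron: outputs 1.0 when the weighted sum plus bias is positive."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 if total > 0 else 0.0


# Hand-picked weights and biases, no training involved.
def AND(a, b):
    return neuron([a, b], [1.0, 1.0], -1.5)


def OR(a, b):
    return neuron([a, b], [1.0, 1.0], -0.5)


def NOT(a):
    return neuron([a], [-1.0], 0.5)


def XOR(a, b):
    # A single threshold neuron cannot compute XOR, so stack layers:
    # (a OR b) AND NOT (a AND b)
    return AND(OR(a, b), NOT(AND(a, b)))


def full_adder(a, b, carry_in):
    """One-bit full adder built purely from the neurons above."""
    partial = XOR(a, b)
    total = XOR(partial, carry_in)
    carry_out = OR(AND(a, b), AND(partial, carry_in))
    return total, carry_out


def add(x, y, bits=8):
    """Ripple-carry addition of two non-negative integers that fit in `bits` bits."""
    carry = 0.0
    result = 0
    for i in range(bits):
        a = float((x >> i) & 1)
        b = float((y >> i) & 1)
        bit, carry = full_adder(a, b, carry)
        result |= int(bit) << i
    # The final carry becomes the extra top bit of the sum.
    return result | (int(carry) << bits)
```

Each one-bit full adder here costs a dozen or so float multiply-and-compare neurons to do what a few transistors manage directly, which is the inefficiency being pointed out.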