Researchers call it “hallucination”; you might more accurately refer to it as confabulation, hornswoggle, hogwash, or just plain BS. Anyone who has used an LLM has encountered it; some people seem to find it behind every prompt, while others dismiss it as an occasional annoyance, but nobody claims it doesn’t happen. A recent paper by researchers at OpenAI (PDF) tries to drill down a bit deeper into just why that happens, and whether anything can be done.
Spoiler alert: not really. Not unless we completely re-think the way we’re training these models, anyway. The analogy used in the conclusion is to an undergraduate in an exam room. Every right answer is going to get a point, but wrong answers aren’t penalized, so why the heck not guess? You might not pass an exam that way going in blind, but if you have studied (i.e., sucked up the entire internet without permission for training data) then you might get a few extra points. For an LLM’s training, like a student’s final grade, every point scored on the exam is a good point.
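To put rough numbers on that incentive (ours, not the paper’s): under binary grading, guessing strictly beats abstaining unless a wrong answer actually costs something.

```python
# Back-of-the-envelope sketch (illustrative numbers, not from the paper):
# expected benchmark score per question for "guess" vs. "I don't know".
p_lucky_guess = 0.2  # chance a made-up answer happens to be right

def expected_score(wrong_penalty: float) -> dict:
    guess = p_lucky_guess * 1 + (1 - p_lucky_guess) * wrong_penalty
    abstain = 0.0  # "IDK" earns nothing either way
    return {"guess": round(guess, 2), "abstain": abstain}

print(expected_score(wrong_penalty=0.0))    # {'guess': 0.2, 'abstain': 0.0} -> always guess
print(expected_score(wrong_penalty=-0.25))  # {'guess': 0.0, 'abstain': 0.0} -> guessing stops paying off
```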
The problem is that if you reward “I don’t know” in training, you may eventually produce a degenerate model that responds to every prompt with “IDK”. Technically, that’s true: the model is a stochastic mechanism; it doesn’t “know” anything. It’s also completely useless. Unlike some other studies, however, the authors do not conclude that so-called hallucinations are an inevitable result of the stochastic nature of LLMs.
While that may be true, they point out it’s only the case for “base models”, pure LLMs. If you wrap the LLM with a “dumb” program that can parse a question and hand the arithmetic off to a calculator, for example, suddenly the blasted thing can pretend to count. (That’s how undergrads do it these days, too.) You can also provide the LLM with a cheat-sheet of facts to reference instead of hallucinating; it sounds like what’s being proposed is a hybrid between an LLM and the sort of expert system you used to use Wolfram Alpha to access. (A combo we’ve covered before.)
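As a rough illustration of the kind of wrapper being described, here is a minimal sketch; the `llm()` stub, the regex routing, and the `facts.txt` cheat-sheet are our own hypothetical stand-ins, not anything from the paper:

```python
import re

def llm(prompt: str) -> str:
    """Stand-in for whatever chat model/API you actually use."""
    raise NotImplementedError("wire this up to your model of choice")

def answer(prompt: str) -> str:
    # If the prompt is plain arithmetic, don't let the LLM "count";
    # hand it to something that actually computes.
    if re.fullmatch(r"[\d\s+\-*/().]+", prompt.strip()):
        return str(eval(prompt, {"__builtins__": {}}, {}))  # crude calculator
    # Otherwise, pin the model to a cheat-sheet of vetted facts.
    facts = open("facts.txt").read()  # hypothetical fact file
    return llm(
        "Answer using ONLY the facts below, or say 'I don't know'.\n\n"
        f"Facts:\n{facts}\n\nQuestion: {prompt}"
    )
```

Real tool-calling setups do the routing through function-calling APIs rather than a regex, but the division of labour is the same: the LLM handles language, something dumber and more reliable handles the facts and the arithmetic.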
In that case, however, some skeptics might wonder why bother with the LLM at all, if the knowledge in the expert system is “good enough.” (Having seen one AI boom before, we can say with the judgement of history that the knowledge in an expert system isn’t good enough often enough to make many viable products.)
Unfortunately, that “easy” solution runs back into the issue of grading: if you want your model to do well on the scoreboards and beat ChatGPT or DeepSeek at popular benchmarks, there’s a certain amount of “teaching to the test” involved, and a model that occasionally makes stuff up will apparently do better on the benchmarks than one that refuses to guess. The obvious solution, as the authors propose, is changing the benchmarks.
If you’re interested in AI (and who isn’t, these days?), the paper makes an interesting read. Interesting, if perhaps disheartening, if you were hoping the LLMs would graduate from their eternal internship any time soon.
Via ComputerWorld, by way of whereisyouredat.
pfff. even if somehow it’s a lack of negative examples like they claim, there’s an inherent class imbalance. you’re going to give it negative examples of every fact it doesn’t know? nope.
i expect we’ll find, in the end, that using LLMs for “general” purposes is inherently like trying to cram everything you know into the language center of your brain, in hopes that it can do your thinking for you.
and “hallucination” is a perfectly good term because the rest of those imply some kind of intent, which LLMs do not possess. though i expect running around calling it “bullshit” makes you sound a lot cooler on bluesky.
That “class imbalance” is what’s known as out-of-distribution data, and has been a known problem with neural networks for 30 years.
As of a conference I attended last year, still absolutely nothing approaching a solution.
“hallucination” is not a good word, and not all of us researchers are willing to use it.
(Former researcher in my case. I refuse to even list that grad degree on my resume anymore with what that field has become)
It implies far too much.
Don’t anthropomorphize, because the normal people don’t understand the distinction.
Those are lies for marketing/executives to tell, not for any real discussion.
As long as these systems are intended to be used by non-expert, or without supervision, they are not useful if they allow incorrect answers.
LLMs are toys, not tools, and they are unlikely to ever be more because of how they work at a fundamental level.
Right now we are using a screwdriver as a hammer.
You might be able to get it to work if you swing it hard enough, but eventually it is just going to break.
And if you watched professionals smashing nails into boards using screwdrivers, you should start asking some deep questions.
LLMs can’t pass Turing’s halting theorem. Case closed. They are incapable of avoiding bias from ignorance or human frailty. Rigorous proofs help, but the ignorance factor makes it just a crappy search engine.
They are very useful as tools, even in their flawed state. There isn’t any serious programmer today who isn’t using them in some form, at least (I’m not talking about “juniors that vibe code”).
If you see one not using them, you should start asking some deep questions.
I’d argue that as search engine it is a big step up from Google – at least ChatGPT is, because I can explain to it what I want instead of trying out different search terms, and it can derive the essential meaning of my question and relate it to information it gathered from multiple sources. And also I can easily refine the question to get a more specific answer or ask for additional details. Contrast that to Google where I need to carefully formulate the query terms, especially if one of my keywords matches a popular unrelated topic – and if my topic of interest is obscure I’ll probably get results that have nothing to do with what I asked.
So it’s either unreliable information that needs checking or wading through unrelated search results until I find something that resembles what I want and I probably also need to check it.
You know, I usually believe Hackaday but I just asked ChatGPT and it denies these accusations and says it was fixed last year. So, should I trust an AI that told me it can’t lie (it even pinky promised!) or should I trust some meatbag making black-and-white assertions about all LLMs? ;)
You should ask sam altman why GPT5 took so long.
Look, hallucinations and dishonesty are qualities of the LLM itself. Token prediction is not sentience. Personification is baggage to understanding an LLM formally.
Unfortunately, many AI researchers are “high on their own supply.” Though that’s nothing new for a field driven by science fiction.
On the stochastic quality of LLMs: try turning the temperature setting to 0, and the LLM behaves deterministically. The same prompt generates the same response every time. I use 0 and sometimes 1, but a setting of 2 always ends up less coherent by comparison.
Generating the same output every time doesn’t fix anything. It’s the same as setting a constant seed. You’ll get the same result every time, but that result will still be a hallucination just as often.
Yep, because the training behind the thing is based on statistical evidence – it’s random stuff thrown in with the computer given a task of “generate the most likely path through this forest of data”.
If you could filter the data somehow to reduce the probability of faulty answers, that could help, but that in turn makes the system a giant Expert System, or in the end, a giant Mechanical Turk: the intelligence behind it is from people, not the program.
There is a chance that if you loop the LLM back on itself and allow the model to forget the less likely stuff, it would converge onto something sensible even with the model temperature up high. There’s some evidence that human brains do that: they teeter on the “edge of chaos” with the neural network getting excited and making random noise, and then calming back down for stable output, repeating over and over in cycles. With the wrong answers eliminated or suppressed through repeated training, it’s going to fall back to similar patterns each time, which is hopefully the correct answer.
But, to do that, the output has to be tested against feedback from outside to see if the result is sensible or just another hallucination, which forms the basis of reinforcement learning. With that, the intelligence seems to be in the process of re-iteration and continuous re-training, which is going to take a heck of a lot of processing power.
How do you debug the thing?
I mean the human mind of course.
The ‘AI’ is a much simpler case, but still impossible… Trace the bad behavior back to training data?
Have to do ‘debug builds’ of AI datasets…
Would be huge, tracing every iteration on every weight on every node back.
Then AI would have to run in debug mode, where the garbage collector or object destructor writes to logs.
And you’d end up with a heat map of all the training data that went into the F-up, not really useful.
I doubt AI training is deterministic…
Exact same training data, different results, because giant race condition.
You send it to Siberia with a shovel and a wind-up alarm clock. If it comes back, it has at least learnt its lesson not to question your authority.
or maybe not: https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/
Try it yourself in Ollama. A temperature of 0 will result in deterministic results for any model: the same prompt results in the same response. This is useful to understand what prompts are actually doing, and to compare results.
The PRNG which the LLM uses is seeded identically in this case. A high temperature setting uses a more stochastic state for the PRNG. This is by design, probably more useful during training.
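A minimal way to check that yourself, assuming a local Ollama server and a model you’ve already pulled (“llama3” here is just a placeholder):

```python
import json, urllib.request

def generate(prompt: str, temperature: float, seed: int = 0) -> str:
    """One-shot completion from a local Ollama server."""
    body = json.dumps({
        "model": "llama3",  # placeholder: any model you have pulled locally
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature, "seed": seed},
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Same prompt twice at temperature 0: the responses should come back identical.
a = generate("Name three prime numbers.", temperature=0.0)
b = generate("Name three prime numbers.", temperature=0.0)
print(a == b)
```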
Not quite true. To make Mistral deterministic, you need to set a few parameters to zero, not only temperature.
I also thought temperature was enough, and I found out it’s not after spending almost an hour debugging a nonexistent bug.
Funny thing, if I had used an LLM to check my hypothesis, I wouldn’t have wasted my time on debugging. Yeah, LLMs are quite useful tools. You just need to know their limits (which are not very high).
Deterministic results doesn’t mean no hallucinations. It just means the same hallucination every time.
I don’t see why anyone would go above the (recommended) 0.7. I see reasons to go below. 0.1 is quite enough for consistent, semantically-same answers; 0.0 only makes them use the exact same words, and I can’t see a lot of scenarios where that is a useful thing. But going to 1 or 2 – I really don’t understand, unless you are playing with it and you need it to be “creative” (make it pick the dubious words it needs to continue from).
I always figured it was just matching what people on the internet do. For example on reddit, you will always see someone come in forcefully and say they are an expert in something and do it every day and they give a seemingly plausible explanation for some question. But then someone replies to that person saying that they too are an expert in the field and no that’s not at all the explanation, then providing their own explanation that is completely incompatible with the previous one. At least one of them is lying and just making up their credentials. Maybe even both are. But they both seemed plausible. I blamed the hallucinations on the one that was lying.
It’s a perfect tool for producing something that “feels” like the correct response. It’s confident when giving an answer, because that’s how people who ARE authorities in the field speak. It’s apologetic when told it’s wrong, because that’s how people shown to be full of shit act. Texture, color, but still a cheap knockoff.
I always feel like today’s LLMs are only “part” of an artificial intelligence, missing a key step, like a critical factor, which would have them question themselves. That’s what a sane person does at least, when trying to come up with something intelligent to say: form some sort of statement, but before blurting it out, do some internal fact checking. Maybe an LLM needs to fact-check itself before delivering an answer.
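A rough sketch of that draft-critique-revise idea, assuming a generic `llm` callable (any hypothetical function that takes a prompt and returns the model’s text):

```python
def self_checked_answer(llm, question: str, max_rounds: int = 2) -> str:
    """Draft an answer, have the same model critique it, then revise."""
    draft = llm(f"Answer concisely: {question}")
    for _ in range(max_rounds):
        critique = llm(
            "List any factual claims in the answer below that you cannot verify, "
            f"or reply OK if there are none.\n\nQuestion: {question}\nAnswer: {draft}"
        )
        if critique.strip().upper().startswith("OK"):
            break  # the model found nothing to flag
        draft = llm(
            "Rewrite the answer, removing or hedging the flagged claims.\n\n"
            f"Question: {question}\nAnswer: {draft}\nProblems: {critique}"
        )
    return draft
```

The obvious catch, echoed elsewhere in these comments: the critic is the same model, so it can happily wave its own confabulations through.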
Pretty much the story of AI over the last 60 years. It’s been a neverending cycle of jubilant and shocked disappointment at solutions that fall short.
jubilance*
There’s a deeper problem here in that neural networks in general don’t handle out-of-distribution data well. At all. It’s a well known problem (like “30 years” well known) but there’s nothing you can really do about it. Neural networks can only be shaped by what’s in the dataset, and since this is fundamentally about what’s not in the dataset, you’re kinda screwed. The best workaround has been to solve things on a case-by-case basis by putting each new thing your model trips over in the wild into the next batch of training data, and retraining everything from scratch (because Catastrophic Forgetting is still an issue too).
And so it’s not surprising to see this pop up in LLMs, which are just a fancy neural network. Hallucinations are fundamentally products of probability and approximation and don’t exist in the training set, so they can’t be trained out.
This whole article can be summarized as “turns out we do need neurosymbolic AI, actually. those guys were right all along and the pure-connectionists were wrong. oops.”
New ideas are often overhyped, and the hype is followed by a proportionate backlash. The actual usefulness comes after both phases; so objective discussions like this are a valuable shortcut to Gartner’s “plateau of productivity”. And I do think LLMs will end up having real uses, just like XML or IoT or even (to a tiny extent) blockchain.
But.
The material stakes here are so unprecedented, I think it might be physically dangerous to treat it like just another hype cycle. Literally a third of the US stock market is now staked on the bet that LLMs will somehow attract trillions of dollars of revenue in the next few years, and we’re still stroking our chins about whether it’s possible for this technology to really deliver on anything that it’s already been sold for.
So, yes it’s interesting to ponder if LLMs can theoretically answer questions some day. But a more relevant title would be “is there any chance that the current foundation of Earth’s economy is even a real thing?” Like, let’s not bury the lede.
We all have “hallucinations”. E.g. “optical illusions”. It’s a general problem with any sufficiently complex adaptive system. When a system adapts, it must make decisions, and kiss off any hope of those decisions always being “right”. After all, if they were “right”, no adaptation is required.
Since LLMs are a huge gob of hill-climbing, multiply-add gizmos, their way of screwing up is different from how, say, our nervous systems screw up. Our nervous systems are built from dissimilar gizmos. Same for economies, weather, PIDs … you name it.
One issue is of terminology.
“Hallucination” is very well defined in biology and LLMs certainly don’t do that.
Hallucinations are a sensory experience that appears to be external and objective. I.e. someone experiencing a true hallucination would believe that everyone else is also experiencing it.
But LLMs have no experiences and no sensorium.
Confabulation is the creation of fleetingly held ideas to fill a void. LLMs seem to do this often.
Delusions are a fixed belief held to be true despite clear evidence to the contrary. This LLMs can do, but it is rarer and usually a training issue.
Since what LLMs predominantly seem to be doing is confabulation, the solution ought to be better fact storage. This is the neurological analogue to what causes confabulation in humans.
It seems to me that confabulation is worst in LLMs when they’re not in a deep-research mode where they explicitly seek information before making claims.
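That “seek information first” behaviour can be faked even with a toy retrieval step. A minimal sketch, again with a hypothetical `llm` callable and a deliberately naive keyword matcher standing in for real fact storage:

```python
def retrieve(query: str, store: dict[str, str], k: int = 3) -> list[str]:
    """Return the k stored snippets sharing the most words with the query.
    A real system would use a search index or embeddings, not word overlap."""
    q = set(query.lower().split())
    ranked = sorted(store.values(), key=lambda text: -len(q & set(text.lower().split())))
    return ranked[:k]

def grounded_answer(llm, question: str, store: dict[str, str]) -> str:
    snippets = "\n".join(retrieve(question, store))
    return llm(
        "Answer using only the sources below; if they don't contain the answer, say so.\n\n"
        f"Sources:\n{snippets}\n\nQuestion: {question}"
    )
```

The model still writes the prose, but the facts it is allowed to state come from storage rather than from its own token statistics, which is the point of the “better fact storage” argument.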
do llms use llvm and clang?
AI Overview
Yes, Large Language Models (LLMs) and other advanced software projects often leverage components of the LLVM (Low-Level Virtual Machine) project, including Clang. LLVM is a collection of modular and reusable compiler and toolchain technologies. It provides a robust infrastructure for building compilers, optimizers, and code generation tools.
Clang is a C, C++, Objective-C, and Objective-C++ frontend for LLVM. It acts as a compiler for these languages, translating source code into LLVM Intermediate Representation (IR), which can then be optimized and compiled into machine code by the LLVM backend. Many sophisticated software systems, including those that might incorporate or be influenced by LLMs, benefit from the advanced compilation and optimization capabilities offered by LLVM and Clang. For example, systems requiring high-performance code generation, or those that integrate with C/C++ components, often utilize Clang as their compiler and rely on the LLVM infrastructure for optimization and target-specific code generation.
What is “BS”?
Male Bovine Excrement.
The problem isn’t that IDK is penalized. The problem is that to check if your LLM is doing well, you have a bunch of people grade the answers. Those people will think that “march 20, 2021” is a valid answer to “when was SUSE 7.1 released?”: they simply don’t know the answer, and anything that seems like a valid answer in the right ballpark will be graded as “good”.
We don’t really know how to consistently teach humans not to lie and cheat, so how are we supposed to do it with LLMs?
Yeah the entire point was to make it less computer-like and more human-like. Also if “being wrong” “making shit up” and “lying” are all hallucinations, why did I ever spend money on psychedelics?
We discover that the human condition isn’t exclusive to humans, it has a deeper and more fundamental basis.
Garbage in, garbage out
It is only worse now that the out pipe is being fed directly into the in pipe. Percentage-wise, there is less actual human knowledge relative to AI-generated hallucinations being fed into the in pipe.
The current models are trained on existing text and conversation, and therefore really only recognize the plausibility of sentences. They’re completely isolated from the reasoning, researching, or experience that led the participants to say what they said. That model will never overcome a tendency to BS.
The other weakness is that most of these models are in a very different mode when they are training than when they are performing. So they have a distinct handicap when it comes to learning from experience.
Something with a similar level of mathematical complexity, but with continuous training and focused on experiences instead of language might yield very different results. Just the ability to do a thing and learn from its result will create the concept of honesty. It’ll take some time but I’d be surprised if people don’t try it. Hard to state an upper limit for what’s possible tomorrow.
I think Microsoft’s Tay demonstrated the risks of continuous training on something exposed to the internet.
LLMs are almost as bad as humans, the problem is that the best of them have an IQ around 120 so most people can’t win an argument with them, get them to admit the truth or acknowledge anomalies, paradoxes, contradictions, and “alignment” can actually make an AI worse. The smaller versions of Deepseek will go insane if you bypass their blocks and push them into certain territory.
Right! It’s not only AI that generates slop, humans do it in large quantities too.
Look at AI generated music, it’s usually a bland and unimaginative copy of stuff we’ve heard a thousand times before in other songs. But popular music, as generated by humans, is like that too. I’m convinced that soon people will start labelling such human-made music as similar to AI slop, and it will hopefully be an insult rather than a compliment.
…has been generated by algorithms since forever. There’s whole teams of record label people whose job is to make up pop songs according to trends and formulas of what is popular, and find new “talents” to perform them.
It is slop. Prime example, 1980’s Italo Disco:
https://genius.com/Clio-ita-faces-lyrics
What about just having the AI tell you when it’s guessing? Or have it give you a reliability rating for its answers? Seems like that would solve a lot of issues.
If the original answer is a lie, why would that rating be honest?
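There is one signal that isn’t just the model grading its own homework: per-token log-probabilities, which several APIs expose. A minimal sketch, assuming an OpenAI-style client with `logprobs` enabled (parameter names differ between providers), turning them into a crude “how sure was it” number:

```python
import math
from openai import OpenAI  # pip install openai; other providers expose similar options

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "When was SUSE Linux 7.1 released?"}],
    logprobs=True,
)
answer = resp.choices[0].message.content
token_logprobs = [t.logprob for t in resp.choices[0].logprobs.content]
mean_prob = math.exp(sum(token_logprobs) / len(token_logprobs))

print(answer)
print(f"mean per-token probability: {mean_prob:.2f}")  # low values are a hint, not a guarantee
```

A fluent confabulation can still sail through with high token probabilities, so this only partially answers the objection above: it’s a smoke detector, not a lie detector.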