People have been making mistakes — roughly the same ones — since forever, and we’ve spent about the same amount of time learning to detect and mitigate them. Artificial Intelligence (AI) systems make mistakes too, but [Bruce Schneier] and [Nathan E. Sanders] make the observation that, compared to humans, AI models make entirely different kinds of mistakes. We are perhaps less equipped to handle this unusual problem than we realize.
The basic idea is this: as humans we have tremendous experience making mistakes, and that has given us a pretty good idea of what our mistakes tend to look like and how to deal with them. Humans tend to make mistakes at the edges of our knowledge, our mistakes tend to clump around the same things, we make more of them when bored or tired, and so on. We have as a result developed controls and systems of checks and balances to help reduce the frequency and limit the harm of our mistakes. But these controls don’t carry over to AI systems, because AI mistakes are pretty strange.
The mistakes of AI models (particularly Large Language Models) happen seemingly randomly and aren’t limited to particular topics or areas of knowledge. Models may unpredictably appear to lack common sense. As [Bruce] puts it, “A model might be equally likely to make a mistake on a calculus question as it is to propose that cabbages eat goats.” A slight re-wording of a question might be all it takes for a model to suddenly be confidently and utterly wrong about something it just a moment ago seemed to grasp completely. And speaking of confidence, AI mistakes aren’t accompanied by uncertainty. Of course humans are no strangers to being confidently wrong, but as a whole the sort of mistakes AI systems make aren’t the same kinds of mistakes we’re used to.
There are different ideas on how to deal with this, some of which researchers are (ahem) confidently undertaking. But for best results, we’ll need to invent new ways as well. The essay also appeared in IEEE Spectrum and isn’t terribly long, so take a few minutes to check it out and get some food for thought.
And remember, if preventing mistakes at all costs is the goal, that problem is already solved: GOODY-2 is undeniably the world’s safest AI.
In my humble experience, LLMs are great for coding as long as what you require is strictly a very standard and vanilla implementation of something.
The moment you want anything even slightly complex, or something slightly different from the “standard” thing, it will fail to give you what you want.
Also it’s best for getting small functions or snippets written, as opposed to large files full of code. It starts hallucinating if you do that.
so it’s a super inefficient template generator?
seriously, just have the natural language parsing tokenize it into something that can filter a list of templates, and pay an actual programmer to generate them. Why any company would allow the liability issues AI-generated code creates baffles me.
In other words, they are suspiciously similar to the effect of looking up snippets of code from stack exchange or reddit and amateurishly splicing them together
You get instant feedback on how standard and vanilla things are, and you get better and better at reframing novel problems as collections of standard patterns.
Which might be bad for people doing more innovative stuff but I think it’s definitely improved my coding ability.
Which is to say “once you look under the paint layer, there is no reliable structure on which you can depend.” User beware.
I often model LLMs as containing a random number generator which occasionally causes it to output the worst possible thing at any given moment. Additionally, the worst possible thing at any moment will also be output if the hash of the input has a couple of trailing zeroes, meaning it doesn’t take very long for a malicious prompter to find a prompt injection that works.
So far nobody has found any way to meaningfully improve LLMs in a way that would allow this model to be substituted with a more preferable one. Not with larger models, not with improved training, not with feedback, not with any kind of handcrafted or hardcoded filter, outside system, or check, and not with ensembles of LLMs checking other LLMs either (see Q*, aka Strawberry).
Once you start modelling LLMs this way, you realize there are precious few non-adversarial (because prompt injection) and fault-tolerant (because catastrophic failure happens often and without warning) use cases left. Using LLMs as text adventure games is probably the only reliable one.
The real question is: was this story written by AI?
I have to pay good money to hallucinate should the opportunity present itself.
Maybe if people take advice from AI too often, it will be classified as a schedule I controlled substance.
I don’t, must be a skill issue
Not disputing this, but wanted to see some specific examples of AI mistakes. I figured many of them are hilarious! So I googled it… and they are!
Honestly, the kinds of mistakes that I’ve seen them make are more in line with a rush job done by amateurs. Things that are obviously wrong, but get skipped over because a rush job by amateurs isn’t what you expect from an AI.
I’ve found that to lower the error rates it becomes important to treat the model like an amateur at whatever the task is. Being detailed with the ask, such as noting any logical loopholes or likely difficulties up front, can go a long way toward lowering that error rate.
Likewise, amateurs need instructions a bit at a time. Give a new hire a 500-page manual and tell ’em to get to work and watch them fail. Give that new hire a single page with basic instructions for a given task, and your error rate drastically lowers.
This phenomenon has a name, no less: “cognitive load,” and the idea is usually put to work in eLearning circles to reduce the mental burden on whoever you’re trying to teach (or give instructions to).
These are all things I’ve tested myself, extensively, on my own local AI lab that I invested in when this hubbub all started. So from what I’m seeing, there’s a pattern here.
A lot of this was also for work. We released an AI assistant, and I needed to reduce the error rate. And judging by public opinion on the feature, it looks to be working too!
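For what it’s worth, the pattern described above is easy to sketch. Here is a minimal, hypothetical illustration of the “one small instruction at a time” approach; `ask_model()` is a stub standing in for whatever LLM API you actually use, and the rules, steps, and wording are invented for the example.

```python
# A minimal sketch of feeding the model one focused task at a time instead
# of a 500-page manual. ask_model() is a hypothetical stub; swap in a real
# chat-completion call for actual use.
def ask_model(instructions: str, task: str) -> str:
    # Stub: a real implementation would call an LLM API here.
    return f"[model response to: {task}]"

RULES = ("You are a support assistant. Answer only from the provided context. "
         "If you are unsure, say so.")

steps = [
    "Summarize the customer's problem in one sentence.",
    "List which product features the problem touches.",
    "Draft a reply in under 120 words, citing the relevant help article.",
]

context = ""
for step in steps:
    answer = ask_model(RULES, f"{step}\n\nWhat we know so far:{context}")
    context += f"\n- {step}\n  {answer}"
    print(answer)
```

Each call carries only the standing rules and the single task at hand, which is the low-cognitive-load framing described above.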
Have you read “Hitchhikers Guide to the Galaxy”? Do you remember when the computer Deep Thought was asked the answer to the question of “Life, the Universe and Everything”? It took 7.5 million years and came back with “42”. It seems like you’re describing a similar situation: if you ask the AI an ill-defined question it will come back with a stupid or useless answer, but if you ask it a simple, well-explained and thought-out question, you will usually get a useful response. Did I understand you right, or am I making an NI mistake?
That’s not how LLMs work. You can pose the most carefully crafted, detailed question, but if an actual human hasn’t actually thought about it, nothing remotely correct will be in the training set and the output will be gibberish.
“if an actual human hasn’t actually thought about it, nothing remotely correct will be in the training set and the output will be gibberish.”
… and someone has pointed out that the “good news” is that, because these things proliferate, the training set will contain more and more output from them. I was already worried about the poisoning of the global knowledge pool by fake facts; now I have given up all hope for the future of the human race.
“I was already worried about the poisoning of the global knowledge pool by fake facts”
Yeah, this was already a serious issue due to sensationalist science journalism: in upper-level physics I can’t tell you the amount of completely wrong things that are stupidly common on the Internet. This is just gonna make it a hundred times worse.
The biggest issue is that with theoretical stuff, the scientific method can’t fix it, because it’s all conceptual crap. So it doesn’t matter that it’s wrong and doesn’t make sense – because it’s all ungrounded theory no one notices.
“Give a new hire a 500 page manual and tell em’ to get to work and watch them fail.”
sounds like an effective hiring filter
Or rather a filter for seeing who needs better training to do their job, an academic can be the worst coder and a autodidact can be the best coder you have ever seen. The autodidact might just need a different approach when learning a new API or a new language.
Wow, AI really is an H1B replacement!
This is similar to my experience in the coding realm. AI is a useful tool, but you can ask it to do too much and it will valiantly and confidently give an incorrect implementation. I find it to be hugely valuable at reducing “scut” work like the autogeneration of function documentation, routine accessors, and the like. It saves a lot of typing, provided an actual software developer is paying attention to the output. I would not put something like Copilot in the hands of a junior developer. It takes some degree of experience to recognize and reject gibberish.
“It saves a lot of typing, provided an actual software developer is paying attention to the output.”
How is this useful, though?? Seriously, how long would it take to create a script/macro which just… does it for you perfectly without burning up half the Amazon?
We’ve been dealing with boilerplate for decades now: every generation so far has managed to create tools to help them that didn’t require more computing power than the actual thing being developed.
I don’t know why you would put Copilot in the hands of anyone, since you would be sending your code to Microsoft, for free.
Great post from Bruce Schneier. Most of the time there is, to me, something worrying and alien about AI text; erring differently than humans do is probably part of the explanation.
All the AI-generated material I have read so far is too shallow and evasive. It’s like explaining a cake by enumerating and explaining how every single ingredient is produced, but completely missing the recipe.
In the case of many SEO recipe sites, often literally.
Yes, I agree. Humans have evolved to detect very small nuances in humans. Current AI transmits weirdness on many levels.
Maybe first and foremost, we should stop using AI for what it is not made to do. Just like you mix a cake batter with a whisk and not a hammer, LLMs are made to deal with text, without any obligation to fully respect semantics.
ChatGPT, for example, is made to look as if a human were writing, and conveniently has access to a dataset that encompasses a large portion of the internet, but it doesn’t know right from wrong. Ask it some math, especially calculus, and it shows a lot of things that look as if they make sense; then check the results against Wolfram Alpha and it’s just gibberish. Ask it about academic publications and it will create fake papers, authors, and even DOIs for the fake papers, but it all feels legit.
LLMs are text tools, not fact tools; they will create good text with zero guarantee of having any real facts in it.
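To make the “check it against Wolfram Alpha” point concrete, here is a minimal sketch using sympy as a stand-in for Wolfram Alpha; the “LLM answer” is an invented example of a plausible-looking but wrong derivative, not real model output.

```python
# Verify an LLM-supplied calculus answer with a symbolic solver instead of
# trusting the fluent-sounding explanation. sympy stands in for Wolfram Alpha.
import sympy as sp

x = sp.symbols("x")
expr = sp.sin(x) * sp.exp(x)            # question: d/dx [ sin(x) * e^x ]
llm_answer = sp.cos(x) * sp.exp(x)      # hypothetical chatbot answer (looks right, isn't)

correct = sp.diff(expr, x)              # exp(x)*sin(x) + exp(x)*cos(x)
if sp.simplify(llm_answer - correct) == 0:
    print("LLM answer checks out")
else:
    print("LLM answer is wrong; correct derivative is:", correct)
```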
AI in general has none of that intelligence everyone gets so excited about. It is just statistics. The catch isn’t the dataset it uses, but rather what it is looking for inside the dataset.
ChatGPT looks at syntax, grammar, and some other things to make text that looks right.
A soda with artificial orange flavour still tastes like orange, no matter how crappy it is.
Artificial Intelligence is not even artificial, nevermind Intelligent.
Funny choice to make.
https://gpt.wolfram.com/
Tried to write a comment, but it went poof.
So! TLDR Version: Use techniques to control cognitive load and you’ll reduce the error rate.
Source: Built a major AI product with a major company; it’s now doing complex tasks with grace and acceptable error rates, making it a useful part of the product.
“Don’t make the thing that can’t really think the way you do think too hard or it’ll get all weird on you”
I fail to see the great insight in this observation. It’s a fallacy to see these things as human just because they mime human experience. But in human terms “there is nobody home,” and the hope of researchers that this is “something that’ll just shake out” given enough data… is probably just “toxic positivity” (a favorite term; it explains so much: teams BSing themselves into better outcomes).
But one thing is for sure, things will have to get worse before they get better. There is too much money and momentum behind this technology; you WILL PAY for it. I just hope your self-driving car won’t put me in a wheelchair, because my healthcare provider’s AI has become very good at not paying for it.
Oh, also AIs don’t make mistakes; they have perfect replay: just save the state (pseudo-random seed included) and run it again, same result.
The people that make, sell and operate AIs make mistakes. They just don’t want to be responsible for them.
We’ve known this forever: Garbage in, garbage out.
We’re discovering an interesting corollary:
When you’re building a statistical average of a large amount of input, you don’t get the peak ability of that input. You get a mediocre average idiot.
I’ve barely used ChatGPT but I’ve seen some pretty funny mistakes. I had it generate some code to initialize registers for the Pi Pico and it actually did pretty well, after a few bug fixes. But one issue was that a register had a value that seemed incorrect. So I asked it to show me the calculation. It responded with a series of nicely formatted LaTeX-style calculation steps, in such detail it would be considered patronizing if it were human. And one of the steps was something like 4000/50 = 200. So I responded “4000/50 is not 200”. And it hilariously goes “oh I’m sorry, that is wrong” and then it redid the calculation correctly. Lol.
I think what makes it weird is that it can make extremely dumb mistakes while simultaneously writing with impeccable grammar and formatting. Humans don’t usually work that way.
Hello, allow me to introduce myself… :D
An LLM gives you three types of answers:
One obvious, one true, and one wrong… and your job is to find which one is which.
I really think LLMs are the future of software development, but as they are not perfect, it’s a problem to find where they fail: they make better and better looking code, so it’s harder and harder to find where the bug is hidden.
Good for us, checking code isn’t as hard as checking whether an answer is really true: in software development the truth is whatever makes the tests pass, so you should pay attention to the tests. Even better, generate the tests in isolation from the code (if that is even possible). Then, once you have the tests you need, you can generate more and more iterations for “free” until all the tests pass.
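A minimal sketch of that tests-first loop, assuming the tests were written (or at least reviewed) independently of the code; `generate_candidate()` is a hypothetical stand-in for an LLM call, cycling through canned attempts so the example runs offline.

```python
# Tests define "truth"; candidate implementations are regenerated until they pass.
def test_basic(add):
    assert add(2, 3) == 5

def test_negative(add):
    assert add(-1, 1) == 0

TESTS = [test_basic, test_negative]

# Hypothetical stand-in for "ask the LLM for another implementation".
CANDIDATES = [
    "def add(a, b): return a - b",   # buggy first attempt
    "def add(a, b): return a + b",   # later attempt that should pass
]

def generate_candidate(attempt: int) -> str:
    return CANDIDATES[attempt % len(CANDIDATES)]

def passes(source: str) -> bool:
    namespace = {}
    exec(source, namespace)          # fine for a toy; don't exec untrusted code in production
    try:
        for test in TESTS:
            test(namespace["add"])
        return True
    except AssertionError:
        return False

for attempt in range(10):
    candidate = generate_candidate(attempt)
    if passes(candidate):
        print(f"attempt {attempt}: all tests pass\n{candidate}")
        break
    print(f"attempt {attempt}: tests failed, regenerating")
```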
A stopped clock is right twice a day, but you need another clock to know when. Why did you need the stopped clock again?
a clock running backwards at 43200x speed is right any second…
if you can ask a question so precise that a LLM gives a reliable answer, then a) you can already answer the question yourself and b) finding the right answer takes less effort than finding the right prompt for the LLM
What you think you know about AI was probably 100% correct 6 months ago, and it will probably be 100% wrong (or irrelevant) 6 months from now. This is one pattern I have observed in the chatter on X over recent years: a lot of people looking foolish for not considering the temporal context of any factoid. Not a new phenomenon, but one that is, due to the rapidly accelerating pace of change, becoming increasingly impactful on people’s reputations.
OpenAI has had a colossal trainwreck trying to deliver GPT-5. Everyone in the industry is stagnant, training near-identical models on near-identical datasets. It’s been literal years since the last big breakthrough. Nobody has any ideas at all about how to move the field forward. We have already passed peak LLM.
Your ignorance is showing; that is one irrelevance and lie after another. Claude.ai is the leader when it comes to coding and diagramming. Other AIs have different, distinct strengths and weaknesses, so it stands to reason that we can have progress by simply integrating all of those strengths into a single system, something the DeepSeek team actually did, in part.
Meanwhile I’m running AI here that can spit out 3000 A4 pages of well-structured and coherent text on any subject I want. The trick is to interleave LLM systems with hand-coded human intelligence into tools that are far more powerful than some glorified chatbot.
Honestly, this sounds like the same excuses cryptobros made in 2021.
It isn’t an “excuse,” it is an inconvenient fact that you don’t have the intellectual integrity to address directly. OMG, the readership of HAD has gone downhill of late…
“Models may unpredictably appear to lack common sense.”
No, in fact they actually do have zero sense of any kind. They are returning the likely next action according to their training combined with a small amount of randomness for the appearance of variety. They don’t have sense, intelligence, thoughts, or value systems.
“Next word predictor can’t demonstrate intent, planning, or any type of higher-level thinking. In fact maybe the first clumsy attempts at AI aren’t similar to how human minds work at all. More news at 7”
If we sat down for an hour-long dinner and you started talking for 10 minutes, then at the end of that dinner I pulled out a piece of paper that predicted every word you said after that 10 minutes, would that make you think differently about yourself? What can you infer about me?
Do I understand you? Do I have a mental model of you?
This comment adds nothing of value.
I see the “it just predicts the next word” argument everywhere, even in this thread, and I don’t understand why that is compelling, because in the hypothetical I mention you would be more than impressed, and you would question a lot of things about yourself.
And your comment adds what? That you are an unimaginative jerk?
The ability to predict the future has nothing to do with intelligence, because it assumes either magic premonition, which is nonsense, or a completely deterministic universe where nobody has any choice either way and “intelligence” as a concept is ultimately meaningless.
If you could predict every word I say for the next 10 minutes, I would simply have to conclude whatever I simply have to conclude, as defined by the laws of physics and circumstances since the beginning of all time and universe.
How can you say they are not related?
Clearly, if you worked with an engineer who predicted outcomes that were consistently false, then you would question his intelligence.
And sure, that doesn’t mean it is equal to intelligence, but that’s partly because you can’t rigorously define intelligence, so I would argue it is you who are retreating into this magical box.
Also, I find it bizarre, because I know you think deeply about physics and the natural world, that you conflate determinism and predictability; think of the quantum wave function.
Choice is another word loaded with magic. I’m talking about something you can measure. The only possible exception is creative intelligence, which is just one of many definitions and the most problematic, because it’s loaded with magic.
If I met an engineer who didn’t already know his predictions are consistently false, then I would question his intelligence.
That’s a red herring though. We know that LLMs don’t “predict the next word”, they give randomized outputs, where the probabilities are simply skewed by a statistical model of the training set. It’s a Markov Chain of sorts, where the prompt you give sets the direction and the next words are decided essentially by flipping a cleverly weighted coin.
Point being that your ability or inability to predict the speech says nothing of the intelligence of the speaker. If you turn down the random number generator, the LLM will indeed produce entirely predictable outputs, and if you turn it up it will generate increasingly unpredictable nonsense.
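For illustration, here is a toy version of that “cleverly weighted coin”: hard-coded scores standing in for a real model’s next-token logits, with temperature controlling how predictable the pick is. The vocabulary and numbers are invented.

```python
# Turn (made-up) next-token scores into probabilities and sample one,
# with temperature scaling how random the choice is.
import math
import random

def sample_next(logits: dict, temperature: float) -> str:
    if temperature == 0:                       # greedy: always the top-scoring token
        return max(logits, key=logits.get)
    weights = {tok: math.exp(score / temperature) for tok, score in logits.items()}
    total = sum(weights.values())
    probs = [w / total for w in weights.values()]
    return random.choices(list(weights), probs)[0]

# Hypothetical scores for the next word after "the cat sat on the ..."
logits = {"mat": 3.0, "sofa": 2.2, "moon": 0.1}
print(sample_next(logits, 0))      # deterministic
print(sample_next(logits, 0.7))    # mostly "mat", sometimes "sofa"
print(sample_next(logits, 5.0))    # nearly uniform, occasionally "moon"
```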
Dude I know you have a deep knowledge of many things but you are way out of your element here.
They are absolutely trained to predict the next word based on some probability distribution, and you can simply set the temperature to zero to get the most likely next word from that distribution. It’s strange you wouldn’t check this yourself before making that comment, since it’s so well known.
Why do you think they spit out a token (~word) at a time? This autoregressive method has been around in ML circles for a long time. We literally mask the next token, compute a loss, then propagate that back up through the weights, which tunes them to predict that masked next word.
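For the curious, the shift-by-one training objective described above looks roughly like this in PyTorch; the embedding-plus-linear “model” is a stand-in for a real transformer stack, and the token data is random.

```python
# Toy next-token training step: predict position i+1 from positions 0..i,
# compute cross-entropy against the true next tokens, and backpropagate.
import torch
import torch.nn as nn

vocab_size, seq_len, batch = 10, 6, 2
tokens = torch.randint(vocab_size, (batch, seq_len))   # fake "training text"

embed = nn.Embedding(vocab_size, 16)
head = nn.Linear(16, vocab_size)                        # stand-in for the transformer stack

logits = head(embed(tokens[:, :-1]))                    # predictions from all but the last token
targets = tokens[:, 1:]                                 # the true next token at each position

loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                         # gradients nudge the weights toward the next word
print(loss.item())
```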
Humans definitely need to be trained on how to use AI and interpret the answers. I have already seen far too many confidently incorrect opinions that turned out to be based either on AI hallucinations or on browbeating the AI until it gives the answer you want. At least some online writers have the decency to begin their comment with “I asked ChatGPT about…” so I know to safely disregard their message, this sort of low-confidence information is just noise.
I have built several AI systems that were significantly more accurate and precise than their human counterparts.
This was on specific tasks like analyzing documents, generating questions, creating chapters, summarizing transcripts, etc. But you are right that you have to force it to do the task properly and need to have good tests; otherwise you should have no confidence in the response. The issue is when you generalize and assume confidence without quality-control metrics.
“Models may unpredictably appear to lack common sense”
That’s because they don’t have any sense. They’re not reasoning at all. They’re ML pattern matching machines that people are dressing up to look like AI, when they aren’t – they’re just a possible component or stepping stone.