Large language models (LLMs) are wholly dependent on the quality of the input data with which they are trained. While suggestions that people eat rocks are funny to you and me, in the case of LLMs intended to help out medical professionals, any false claims or statements dripping out of such an LLM can have dire consequences, ranging from incorrect diagnoses to much worse. In a recent study published in Nature Medicine by [Daniel Alexander Alber] et al., the ease with which this kind of data poisoning can occur is demonstrated.
According to their findings, only 0.001% of training tokens have to be replaced with medical misinformation in order to create models that are likely to produce medically erroneous statements. Most concerning is that such a corrupted model isn't readily discovered using standard medical LLM benchmarks. There are filters for erroneous content, but these tend to be limited in scope due to the overhead. Post-training adjustments can be made, as can the addition of RAG, but none of this helps with the confident bull excrement that results from the corrupted training data.
The mitigation approach that the researchers developed cross-references LLM output against biomedical knowledge graphs, relegating the LLM mostly to generating natural language. In this approach LLM outputs are matched against the graphs, and any LLM 'facts' that cannot be verified are flagged as potential misinformation. In a test with 1,000 random passages, the approach detected issues with a claimed effectiveness of 91.9%.
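The core idea is easy to sketch: pull factual claims out of the LLM's output and trust only those that can be matched in a curated knowledge graph. The toy graph, triple format, and function below are illustrative assumptions, not the authors' actual pipeline, which involves proper biomedical entity and relation extraction:

```python
# Toy cross-referencing of LLM 'facts' against a knowledge graph.
# The graph contents and triple format are illustrative assumptions; a real
# system would extract claims from free text with a biomedical NLP step first.
Claim = tuple[str, str, str]  # (subject, relation, object)

KNOWLEDGE_GRAPH: set[Claim] = {
    ("metformin", "treats", "type 2 diabetes"),
    ("aspirin", "increases_risk_of", "gastrointestinal bleeding"),
}

def check_claims(claims: list[Claim]) -> dict[Claim, str]:
    """Anything absent from the curated graph gets flagged, not trusted."""
    return {
        c: "verified" if c in KNOWLEDGE_GRAPH else "potential misinformation"
        for c in claims
    }

# One relation that is in the graph, one fabricated one:
print(check_claims([
    ("metformin", "treats", "type 2 diabetes"),
    ("metformin", "treats", "influenza"),
]))
```

The point of the design is that the LLM is only trusted for phrasing; the facts have to survive a lookup in a source that humans curated.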
Naturally, this does not guarantee that misinformation does not make it past these knowledge graphs, and largely leaves the original problem with LLMs in place, namely that their outputs can never be fully trusted. This study also makes it abundantly clear how easy it is to corrupt an LLM via the input training data, as well as underlining the broader problem that AI is making mistakes that we don’t expect.
A really good reason why LLMs should not be used in any application where they could put lives in danger. Also, the preponderance of “confident bull excrement” AND the simultaneous preponderance of equally confident bull excrement saying the exact opposite in the politicosphere is probably creating schizophrenic AIs. Don’t trust anything without a face. Or with the wrong number of fingers.
Humans are worse; try discussing certain topics on HAD and see how fast things descend into innumerate madness driven by dogma and political indoctrination.
So true, certainly in the times we live in. It's the same with self-driving cars: they can drive a million km without any problem, but if they are involved in one accident the media are full of it and we should immediately remove that tech from the road. We are nowhere near using LLMs as autonomous, unsupervised doctors, but the output quality when fed with the right input is remarkable, also in medical situations. Where I live, the law dictates that patients should get human advice and decisions, so we only use AI to help us generate the text, and the doctor is always responsible for what is sent out to the patients.
Humans are also vulnerable to this.
But we can get useful work out of humans even if they are wrong about stuff.
They only behave when they fear the consequences, then they are often smart enough to say “I don’t know”, and not help at all.
Most humans are smart enough to deal with a really tiny portion of crap, though. Heck, even in a propaganda-only zone folks will know what the truth isn't most of the time, even if they never ever talk about it and don't have a way to know the full truth – the logical inconsistencies feed into an understanding of the reliability of the sources and of what the truth must be. Which, so far at least, is something LLMs just don't do for themselves.
Human doctors are absolutely fallible and they make mistakes, and on top of that they often have very unhealthy superiority complexes that make them extremely difficult to communicate with.
I personally think that a well-trained A.I. doctor will be much safer and more useful as a practicing doctor. I will elaborate my position: for one, there was a heart surgeon who performed over 62 completely unnecessary open-heart surgeries just to increase his overall profits. This doctor had hundreds of trusting patients, many of whose lives and health he literally risked just to make a lot of money.
Also, many doctors have been exposed as perverts. Basically, greed and lust have compromised many doctors and created an environment wherein only simpletons still blindly trust their doctor.
Oh and dare I forget the amazing trend of doctors leaving surgical instruments inside people’s bodies during surgeries.
I’m betting a robot doctor will be much much more likely to save a higher percentage of human lives, especially based upon what I have previously mentioned concerning the many and varied and dangerous flaws of human doctors.
“..a well trained A.I. doctor will be much safer..”
And there’s the rub. If your AI is trained with a large enough base of data, it’s too large to have been reviewed by people for reliability. The best you can do is review its diagnoses from the outcomes, which is what you get with human doctors.
The LLM's greatest contribution thus far is to sow mistrust around its outputs by hallucinating or miscorrelating information. Natural language is not merely textual; LLMs use a logic which is correct within its own black box but which has no useful meaning in actual reality. The reason computer code can be synthesized better by an LLM than a medical diagnosis can is that programming languages are far less expressive and far more specific in what they communicate.
“only 0.001% of training tokens have to be replaced”
good luck sneaking 100 Bibles' worth of poison into something like GPT-4 to have an effect…
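For anyone curious where the "100 Bibles" figure lands, here's a rough back-of-the-envelope – the training-corpus size and tokens-per-Bible numbers are assumptions, since GPT-4's training data isn't public:

```python
# Back-of-the-envelope only: GPT-4's real training-set size is not public,
# so both figures below are assumptions for illustration.
training_tokens = 13e12        # assume roughly 13 trillion training tokens
poison_fraction = 0.001 / 100  # the paper's 0.001% figure
tokens_per_bible = 1e6         # ~780k words is roughly a million tokens

poison_tokens = training_tokens * poison_fraction
print(f"poison tokens needed: {poison_tokens:,.0f}")                    # 130,000,000
print(f"roughly {poison_tokens / tokens_per_bible:.0f} Bibles' worth")  # ~130
```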
you think GPT4’s inputs have less poison than that?
You just use another AI to generate the poison. It could do that during a lunch break. Duh
According to a recent article on Ars, this process may be (unintentionally?) happening already thanks to less scrupulous scientists.
https://arstechnica.com/science/2025/01/bogus-research-is-undermining-good-science-slowing-lifesaving-research/
Patient: I want a second opinion!
DoctorGPT: I’m happy to change my original opinion, what would you prefer to be diagnosed with?
I feel like we need some level of splitting models into language and facts. Not just for accuracy, but for longer-term memory.
For example, a model running a DnD session would work worlds better if it had a rigid database that would provide more and more dense context. Like something that could handle keeping track of HP, spells, previous key story points, or even just character names, personalities and histories.
A rigid database it could access would make fact sourcing and accuracy easier to manage since entries could be added for new information or updated as needed by humans.
Even a more distinct but AI-based memory would help more than context windows currently do. Though I think a lot could be done by attaching something that gives some rigid context to prompts, like a bullet list of stats, and that is controlled independently based on events.
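As a rough illustration of that idea, here's a minimal sketch of a rigid, human-editable fact store that gets rendered into a bullet list and prepended to each prompt; the field names and the render format are made up for the example:

```python
# Minimal sketch of a rigid game-state store that the LLM only ever reads
# as rendered text. Field names and formatting are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Character:
    name: str
    hp: int
    spells: list[str] = field(default_factory=list)
    notes: list[str] = field(default_factory=list)  # key story points, history

class GameState:
    """Humans (or deterministic game logic) edit this; the model cannot."""
    def __init__(self) -> None:
        self.characters: dict[str, Character] = {}

    def add(self, c: Character) -> None:
        self.characters[c.name] = c

    def render_context(self) -> str:
        # Turn the rigid state into the dense bullet-list context for the prompt.
        lines = []
        for c in self.characters.values():
            lines.append(f"- {c.name}: HP {c.hp}, spells: {', '.join(c.spells) or 'none'}")
            lines.extend(f"  - {note}" for note in c.notes)
        return "\n".join(lines)

state = GameState()
state.add(Character("Tinker", hp=17, spells=["Mage Hand"],
                    notes=["owes the blacksmith 30 gold"]))
prompt = state.render_context() + "\n\nPlayer: I sneak into the forge at night."
print(prompt)  # the model only ever sees facts the humans put in the table
```

Because the store is updated outside the model, facts can be corrected or appended without retraining anything, and nothing gets silently forgotten when the context window rolls over.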
Sitting down and reading the entirety of Dissolving Illusions by Suzanne Humphries to a medical AI
I am wondering why it's even an issue if LLMs can get infected or hallucinate?
I thought it was already established that they are statistical text prediction engines, and not actually able to think or reason. If you input 1+1=3 enough times during training, it will consider that the truth. As long as there’s a disclaimer when using an LLM so that no one ever forgets its true nature, I see no problem
LLMs are a search engine.
Why is it remotely surprising that they will return bad information, when you include bad information in the original dataset?
This problem is EASILY solved.
(The problem of them being wrong. Not the moral problems of using them for anything at all.)
You vet 100% of the dataset.
Then you use the LLM like the search engine it is.
Why do people not understand this?
If we stopped treating all this “AI” BS like some magical fairy that knows things instead of the literal search engine it is, people would have way more realistic expectations.
But that wouldn’t drive the trillion dollar “AI” grift would it?
People might be a little less comfortable openly using other people’s work if they had to put LLM output into the same mental box as a Google search huh?
LLMs and Generative “AI” are theft.
They are search engines that make it convenient and/or palatable to launder the work of others without credit or payment.
We can’t even vet 100% of Wikipedia.
Let's see: a physician or physician's assistant reading my incomplete medical records for the first time, given 10-15 minutes to diagnose me, or an AI expert system with my entire medical history, working with data even from the latest studies and OUTLINING ITS THOUGHT PROCESS in coming up with a diagnosis, which is quickly reviewed by a physician or physician's assistant to eliminate ridiculous prescriptions like "eat rocks." Also, unlike humans, the AI could be programmed to NOT be ruled by financial incentives from Big Pharma and to IGNORE studies from same (see below).
I’ll take AI, thank you.
“It is simply no longer possible to believe much of the clinical research that is published, or to rely on the judgment of trusted physicians or authoritative medical guidelines. I take no pleasure in this conclusion, which I reached slowly and reluctantly over my two decades as editor of The New England Journal of Medicine.” – Marcia Angell (2009)
“The case against science is straightforward: much of the scientific literature, perhaps half, may simply be untrue. Afflicted by studies with small sample sizes, tiny effects, invalid exploratory analyses, and flagrant conflicts of interest, together with an obsession for pursuing fashionable trends of dubious importance, science has taken a turn towards darkness.” – Richard Horton, editor of The Lancet (2015)
“Journals have devolved into information laundering operations for the pharmaceutical industry.” – Richard Horton, editor of The Lancet (2004)
BTW, I looked at the Authors and Affiliations section of the study and see that nearly every one has what could be called an “employment conflict of interest” of AI taking their jobs.
Odds are that is exactly what it wouldn't be – training these LLMs and certifying them fit for medical use is going to be expensive, so just who do you think will be paying for it? If anything, it is more likely to give priority to the more profitable or still-IP-protected drugs by filtering the input data, etc.
Not sure I'd trust that either – databases get borked, you get initially misidentified as your brother after an accident involving both of you, etc. The human is likely to notice when the data doesn't match reality. For instance, at one point nearly everyone who ever had an update/edit in a database handled by the tech support company my Dad worked for ended up gender-bent. IIRC the software they were using didn't handle scroll-wheel navigation properly: when the user tried to scroll down to the bottom of the form, the first field that was auto-selected on opening an edit got rolled through its options instead, so everyone ended up with a female title, and there was no message before committing the update informing you of all the changes that would go through. (That took a long time to debug, as technically there is nothing wrong with the database, and the user interface works perfectly unless you used the scroll wheel at the wrong time.)
I've been playing with putting multiple AI systems into adversarial interactions to make them more intellectually disciplined, and it works well. Just like humans, they can all have a head full of BS, but so long as they don't all subscribe to the same BS you can get a sane consensus out of them. Not much different from managing humans in a high-pressure production environment, if you know what that is like…
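For anyone who wants to try the same thing, here's a bare-bones sketch of that adversarial loop. The `ask(model, prompt)` callable stands in for whatever chat API wrapper you already have, and the prompts and flow are my own illustration rather than any particular library:

```python
# Bare-bones sketch of an adversarial cross-examination loop between models.
# ask(model, prompt) -> str is assumed to wrap your own chat API of choice.
from typing import Callable

def adversarial_consensus(question: str, models: list[str],
                          ask: Callable[[str, str], str],
                          rounds: int = 2) -> dict[str, str]:
    answers = {m: ask(m, question) for m in models}
    for _ in range(rounds):
        for m in models:
            others = "\n\n".join(f"{k}: {v}" for k, v in answers.items() if k != m)
            # Each model gets to attack the others' answers, then revise its
            # own; disagreement is what keeps shared BS from surviving.
            answers[m] = ask(m,
                f"Question: {question}\n\nOther answers:\n{others}\n\n"
                "Point out any errors above, then give your revised answer.")
    return answers  # a human (or a final judge prompt) picks the consensus

# Dry run with a fake ask() just to show the plumbing:
fake = lambda model, prompt: f"[{model}] answer ({len(prompt)} chars of context)"
print(adversarial_consensus("Is 1+1=3?", ["model_a", "model_b"], fake))
```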