Graph showing accuracy vs model

Why You Shouldn’t Trade Walter Cronkite For An LLM

Has anyone noticed that news stories have gotten shorter and pithier over the past few decades, sometimes seeming like summaries of what you used to peruse? In spite of that, huge numbers of people are relying on large language model (LLM) “AI” tools to get their news in the form of summaries. According to a study by the BBC and European Broadcasting Union, 47% of people find news summaries helpful. Over a third of Britons say they trust LLM summaries, and they probably ought not to, according to the beeb and co.

It’s a problem we’ve discussed before: as OpenAI researchers themselves admit, hallucinations are unavoidable. This more recent BBC-led study took a microscope to LLM summaries in particular, to find out how often and how badly they were tainted by hallucination.

Not all of those errors were considered a big deal, but in 20% of cases (on average) there were “major issues”–though that’s more-or-less independent of which model was being used. If there’s good news here, it’s that those numbers are better than they were when the beeb last performed this exercise earlier in the year. The whole report is worth reading if you’re a toaster-lover interested in the state of the art. (Especially if you want to see if this human-produced summary works better than an LLM-derived one.) If you’re a luddite, by contrast, you can rest easy that your instincts not to trust clanks remains reasonable… for now.

Either way, for the moment, it might be best to restrict the LLM to game dialog, and leave the news to totally-trustworthy humans who never err.

Your LLM Won’t Stop Lying Any Time Soon

Researchers call it “hallucination”; you might more accurately refer to it as confabulation, hornswaggle, hogwash, or just plain BS. Anyone who has used an LLM has encountered it; some people seem to find it behind every prompt, while others dismiss it as an occasional annoyance, but nobody claims it doesn’t happen. A recent paper by researchers at OpenAI (PDF) tries to drill down a bit deeper into just why that happens, and if anything can be done.

Spoiler alert: not really. Not unless we completely re-think the way we’re training these models, anyway. The analogy used in the conclusion is to an undergraduate in an exam room. Every right answer is going to get a point, but wrong answers aren’t penalized– so why the heck not guess? You might not pass an exam that way going in blind, but if you have studied (i.e., sucked up the entire internet without permission for training data) then you might get a few extra points. For an LLM’s training, like a student’s final grade, every point scored on the exam is a good point. Continue reading “Your LLM Won’t Stop Lying Any Time Soon”

Teaching A Robot To Hallucinate

Training robots to execute tasks in the real world requires data — the more, the better. The problem is that creating these datasets takes a lot of time and effort, and methods don’t scale well. That’s where Robot Learning with Semantically Imagined Experience (ROSIE) comes in.

The basic concept is straightforward: enhance training data with hallucinated elements to change details, add variations, or introduce novel distractions. Studies show a robot additionally trained on this data performs tasks better than one without.

This robot is able to deposit an object into a metal sink it has never seen before, thanks to hallucinating a sink in place of an open drawer in its original training data.

Suppose one has a dataset consisting of a robot arm picking up a coke can and placing it into an orange lunchbox. That training data is used to teach the arm how to do the task. But in the real world, maybe there is distracting clutter on the countertop. Or, the lunchbox in the training data was empty, but the one on the counter right now already has a sandwich inside it. The further a real-world task differs from the training dataset, the less capable and accurate the robot becomes.

ROSIE aims to alleviate this problem by using image diffusion models (such as Imagen) to enhance the training data in targeted and direct ways. In one example, a robot has been trained to deposit an object into a drawer. ROSIE augments this training by inpainting the drawer in the training data, replacing it with a metal sink. A robot trained on both datasets competently performs the task of placing an object into a metal sink, despite the fact that a sink never actually appears in the original training data, nor has the robot ever seen this particular real-world sink. A robot without the benefit of ROSIE fails the task.

Here is a link to the team’s paper, and embedded below is a video demonstrating ROSIE both in concept and in action. This is also in a way a bit reminiscent of a plug-in we recently saw for Blender, which uses an AI image generator to texture entire 3D scenes with a simple text prompt.

Continue reading “Teaching A Robot To Hallucinate”