Training robots to execute tasks in the real world requires data, and the more, the better. The problem is that creating these datasets takes a lot of time and effort, and current collection methods don’t scale well. That’s where Robot Learning with Semantically Imagined Experience (ROSIE) comes in.
The basic concept is straightforward: enhance training data with hallucinated elements to change details, add variations, or introduce novel distractions. The team’s results show that a robot additionally trained on this augmented data performs tasks better than one trained without it.
Suppose one has a dataset consisting of a robot arm picking up a Coke can and placing it into an orange lunchbox. That training data is used to teach the arm how to do the task. But in the real world, maybe there is distracting clutter on the countertop. Or the lunchbox in the training data was empty, but the one on the counter right now already has a sandwich inside it. The more a real-world task differs from the training dataset, the less capable and accurate the robot becomes.
ROSIE aims to alleviate this problem by using image diffusion models (such as Imagen) to enhance the training data in targeted and direct ways. In one example, a robot has been trained to deposit an object into a drawer. ROSIE augments this training by inpainting the drawer in the training data, replacing it with a metal sink. A robot trained on both datasets competently performs the task of placing an object into a metal sink, despite the fact that a sink never actually appears in the original training data, nor has the robot ever seen this particular real-world sink. A robot without the benefit of ROSIE fails the task.
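To make the idea concrete, here is a minimal sketch of this kind of augmentation, not the team’s actual pipeline. It assumes each training frame comes with a mask marking the region to replace (the drawer, say) and the action the robot took. The names `Step`, `stub_inpaint`, `augment_step`, and `build_training_set` are all hypothetical, and `stub_inpaint` is only a placeholder for a real text-guided diffusion inpainting model. The key point it illustrates is that only the masked pixels change while the recorded actions are reused as-is.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[int]]   # toy grayscale image: rows of pixel values
Mask = List[List[bool]]   # True where the image should be repainted


@dataclass
class Step:
    image: Image
    mask: Mask             # region to replace, e.g. the drawer
    action: List[float]    # robot action recorded for this frame


def stub_inpaint(image: Image, mask: Mask, prompt: str) -> Image:
    """Hypothetical stand-in for a text-guided diffusion inpainting model.

    It returns an image filled with one value derived from the prompt length,
    so the sketch runs without any model. A real model would synthesize
    content matching the prompt instead.
    """
    fill = len(prompt) % 256
    return [[fill for _ in row] for row in image]


def augment_step(step: Step, prompt: str,
                 inpaint: Callable[[Image, Mask, str], Image] = stub_inpaint) -> Step:
    """Return a copy of `step` with the masked region replaced by generated pixels."""
    generated = inpaint(step.image, step.mask, prompt)
    # Composite: generated pixels inside the mask, original pixels elsewhere.
    new_image = [
        [g if m else p for p, g, m in zip(img_row, gen_row, mask_row)]
        for img_row, gen_row, mask_row in zip(step.image, generated, step.mask)
    ]
    # The action label is copied unchanged: the motion stays the same,
    # only what the camera sees is altered.
    return Step(image=new_image, mask=step.mask, action=list(step.action))


def build_training_set(episode: List[Step], prompts: List[str],
                       inpaint: Callable[[Image, Mask, str], Image] = stub_inpaint) -> List[Step]:
    """Combine the original steps with one augmented copy per prompt."""
    augmented = [augment_step(s, p, inpaint) for p in prompts for s in episode]
    return list(episode) + augmented
```

Calling `build_training_set(episode, ["metal sink"])` would return the original frames plus one edited copy of each, which is the "trained on both datasets" setup described above. Swapping `stub_inpaint` for a real inpainting model is where all the hard work lives.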
Here is a link to the team’s paper, and embedded below is a video demonstrating ROSIE both in concept and in action. It is also somewhat reminiscent of a plug-in we recently saw for Blender, which uses an AI image generator to texture entire 3D scenes from a simple text prompt.
9 thoughts on “Teaching A Robot To Hallucinate”
Sounds similar to taking a nap as a human being so the brain can digest new information during some heavy studying or learning new complex tasks. Only we call it dreaming instead of hallucinating…
Yeah, that was what I was thinking: this is definitely dreaming rather than hallucinating. Hallucinating would be if the robot was there physically and was having its camera feed altered with other objects added in, but this is just altering its training data when the robot isn’t physically powered on, so it is just dreaming.
The neural net is merely trained on simulated data AND real data, instead of just the real data set. It does not generate the imagery.
It still carries the problem that you have to train the model for each special case, where for example placing the item in a basket instead of a drawer or a sink requires you to start over and add images of baskets to the training set, which would eventually have to contain the entire world because the robot itself does not understand what it is seeing and cannot generalize.
Do Androids Dream of Electric Sheep? Wow that’s some insane stuff. REALLY STRANGE TIMES!
Wasn’t there an SF story where the robot drank? I’m thinking of Eando Binder, but a search says no.
But maybe I’m thinking of Bender on Futurama.
“Hallucinating” is the term of art used in neural network farming to mean “making up random crap”.
Recent developments in understanding human dreams suggest that dreams may be used by the brain as a means to improve its neural network. Sounds like this may be similar to using these hallucinations for robots...
Anyone else watch that animation and immediately think of Bojack Horseman?
A very similar technique is used by Nvidia for SLNN training: simulate multiple environments and render them at high fidelity, then train the visual model on those (e.g. simulating driving footage from multiple virtual ‘cameras’ to train self-driving car object recognition algorithms). The advantage of using fully synthetic training environments is that by definition you have an exact baseline for every single item in the training dataset (i.e. unlike a captured training set you do not have to have people go through and tag them), with the disadvantage being that utility of the generated training sets is related to how realistically you can render them – something Nvidia has been pushing hard for e.g. raytracing.