The last few weeks have seen a number of tech sites reporting on a robot which can find and point out Waldo in those “Where’s Waldo” books, designed and built by Redpepper, an ad agency. The robot arm is a UARM Metal, with a Raspberry Pi controlling the show.
A Logitech C525 webcam captures images, which are processed on the Pi with OpenCV and then sent to Google’s cloud-based AutoML Vision service. AutoML is trained on numerous images of Waldo, which it uses to attempt a pattern match. If a match is found, the coordinates are fed to pyuarm, and the UARM will literally point Waldo out.
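For the curious, the basic loop presumably looks something like the sketch below. To be clear, this is purely our own guess at how the pieces hang together, not Redpepper’s code: detect_waldo() and px_to_mm() are placeholders, and the pyuarm calls should be checked against that library’s documentation.

```python
import cv2
import pyuarm

def detect_waldo(image):
    """Placeholder for the cloud (AutoML Vision) lookup; returns pixel coordinates."""
    return 320, 240

def px_to_mm(x_px, y_px):
    """Placeholder calibration from camera pixels to the arm's workspace (mm)."""
    return 150.0 + 0.4 * x_px, -100.0 + 0.4 * y_px

cap = cv2.VideoCapture(0)              # the Logitech C525
ok, frame = cap.read()
cap.release()

x_px, y_px = detect_waldo(frame)       # cloud model says "Waldo is here"
x_mm, y_mm = px_to_mm(x_px, y_px)      # convert to arm coordinates

arm = pyuarm.UArm()                    # method names per the pyuarm docs
arm.set_position(x_mm, y_mm, 10, speed=150)   # hover the hand just above the page
```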
While this is a totally plausible project, we have to admit a few things caught our jaundiced eye. The Logitech C525 has a field of view (FOV) of 69°. While we don’t have dimensions for the UARM Metal, it looks like the camera is less than a foot in the air. Amazon states that “Where’s Waldo Deluxe Edition” is 10″ x 0.2″ x 12.5″. That means the open book will be roughly 10″ x 25″. The robot is going to have a hard time imaging a surface that large in a single shot. What’s more, the C525 is a 720p camera, so there isn’t a whole lot of pixel density to pattern match against. Finally, there’s the rubber hand the robot uses to point out Waldo. Wouldn’t that hand block at least some of the camera’s view to the left?
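If you want to check our skepticism with a quick back-of-the-envelope calculation, it goes roughly like this (assuming the 69° figure is the C525’s diagonal FOV and the lens sits about a foot above the page):

```python
import math

fov_deg = 69.0                  # C525 field of view, assumed to be the diagonal
height_in = 12.0                # assumed camera height above the page

diag_covered = 2 * height_in * math.tan(math.radians(fov_deg / 2))
print(f"diagonal covered from 12 in up: {diag_covered:.1f} in")   # about 16.5 in

spread_diag = math.hypot(10, 25)                                  # about 26.9 in
print(f"diagonal of a 10 x 25 in spread: {spread_diag:.1f} in")

# Even if it somehow fit, 1280 px across 25 in is only ~51 px per inch to match on.
print(f"pixel density across the spread: {1280 / 25:.0f} px/in")
```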
We’re not going to jump out and call this one fake just yet — it is entirely possible that the robot took a mosaic of images and used that to pattern match. Redpepper may have used a bit of movie magic to make the process more interesting. What do you think? Let us know down in the comments!
An ad agency got a post on Hackaday… the technicalities are the least of their concerns; their final objective is already accomplished!
Love the over-engineering. They could have used a Haar cascade or some similar workflow, no? Plus, I totally agree the camera is going to struggle to provide enough feature information.
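Something along these lines, for the detection half at least. The waldo_cascade.xml file is hypothetical; OpenCV only ships generic cascades (faces, eyes, bodies), so training one for Waldo would be the real work:

```python
import cv2

# Hypothetical cascade file; training it on Waldo crops is the hard part.
cascade = cv2.CascadeClassifier('waldo_cascade.xml')

page = cv2.imread('page.jpg')
gray = cv2.cvtColor(page, cv2.COLOR_BGR2GRAY)

hits = cascade.detectMultiScale(gray, scaleFactor=1.05, minNeighbors=5,
                                minSize=(24, 24))
for (x, y, w, h) in hits:
    cv2.rectangle(page, (x, y), (x + w, y + h), (0, 0, 255), 2)

cv2.imwrite('page_marked.jpg', page)
```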
Have you tried designing a haar cascade from scratch?
Seriously ..
I think you are totally right!
The hand totally blocks the camera’s view to the left, the resolution of the camera is way too low, and the field of view is too narrow.
It looks like they did have their neural network working, however; that part is fairly easy with Google’s AutoML Vision beta.
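For reference, a prediction call against an AutoML Vision beta model looked roughly like the snippet below, based on Google’s published beta client samples. The project and model IDs are placeholders, and the trained Waldo model is of course their own:

```python
from google.cloud import automl_v1beta1 as automl

PROJECT_ID = 'your-gcp-project'     # placeholder
MODEL_ID = 'ICN1234567890'          # placeholder

client = automl.PredictionServiceClient()
model_name = client.model_path(PROJECT_ID, 'us-central1', MODEL_ID)

with open('page_crop.jpg', 'rb') as f:          # placeholder image crop
    payload = {'image': {'image_bytes': f.read()}}

response = client.predict(model_name, payload, {'score_threshold': '0.8'})
for result in response.payload:
    print(result.display_name, result.classification.score)
```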
A few random thoughts:
(1) The resolution at 0:23 is quite bad. But the resolution at 0:27 (which is presumably a zoomed-in version of 0:23) should be enough for feature detection. Perhaps the view at 0:23 is a poorly compressed image (blame YouTube?).
(2) It seems weird for someone to make the effort of setting up the machine learning, buying the arm, etc. and then ruin the project through fraud.
(3) I could only find the 58-second video. The project should really have better documentation! Perhaps there’s something weird going on, like another camera that takes the photo at 0:23 before the arm moves above the page?
“(2) It seems weird for someone to make the effort of setting up the machine learning, buying the arm, etc. and then ruin the project through fraud.”
If it is faked (I don’t have an opinion either way on that front) I would expect the reasoning to be: “We thought we could do this. We spent all this time and money trying to do this. It just doesn’t quite work out. So we’ll add this little bit of code that gives it a hint….”
That reasoning seems very plausible! If it is faked (I’m also not sure), I would believe that it happened that way.
I would guess that it’s based on Tadej Magajna’s Faster R-CNN model trained for Wally-finding. You can install TensorFlow and google up his GitHub repo easily enough.
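If that’s the route, inference with a frozen graph exported from the TensorFlow Object Detection API (TF 1.x style) goes roughly like this; the graph and image paths are placeholders, and the tensor names are that API’s standard exports:

```python
import numpy as np
import tensorflow as tf
import cv2

# Load a frozen Faster R-CNN inference graph (path is a placeholder).
graph = tf.Graph()
with graph.as_default():
    graph_def = tf.compat.v1.GraphDef()
    with tf.io.gfile.GFile('frozen_inference_graph.pb', 'rb') as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name='')

image = cv2.cvtColor(cv2.imread('page.jpg'), cv2.COLOR_BGR2RGB)

with tf.compat.v1.Session(graph=graph) as sess:
    boxes, scores = sess.run(
        ['detection_boxes:0', 'detection_scores:0'],
        feed_dict={'image_tensor:0': np.expand_dims(image, 0)})

# Best hit, in normalized [ymin, xmin, ymax, xmax] page coordinates.
print(scores[0][0], boxes[0][0])
```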
You make a very valid point – half the challenge with ML computer vision systems has nothing to do with the neural network – it’s all in the image sensors, lenses, optics and lighting.
Only half? Garbage in is garbage out.
Ironically, “garbage in” is half of the old saying “garbage in, garbage out.”
Some. Covered in “Feature engineering for machine learning”, part of the current HB bundle.
Well it certainly seems artificial.. The intelligence part seems lacking though…
It might be a proof-of-concept level. All the elements are there, just maybe not working together 100%.
E.g. taking a high-resolution picture of the book’s pages by hand, running that through OpenCV by hand, uploading the tens to hundreds of faces to the Google algorithm semi-automatically, showing the result, and then just making a nice video of a robot arm pointing at some coordinates.
It’s enough to show that the technology works. The steps in between could be automated, but the additional effort isn’t really justified for something that doesn’t have any real purpose.
I think they add agency, giving the robot its ability to act within the environment? :)
(Sorry, hit report comment. Seriously, swap the reply and report buttons around!)
I like your version
Shouldn’t that be “… an ad agency”?
With these tools (plus “some” customization ;-), it should be possible.
But I would expect the Pi to divide the image (by moving the camera) into parts, to get good resolution on each part.
http://blog.dlib.net/2017/02/high-quality-face-recognition-with-deep.html
https://medium.com/@ageitgey/machine-learning-is-fun-part-4-modern-face-recognition-with-deep-learning-c3cffc121d78
The article misses what seems like an intentional joke – the term “waldo” is used to describe a manipulator arm.
I thought I was the only one…
I doubt it’s a total fraud. Being shown as a fraud would completely negate the positives from having such a viral video.
But… heavily edited to show only the most positive aspects? Most definitely.
As others have mentioned, the camera and its placement cannot achieve high enough resolution images of the book pages. The obvious solution is multiple pictures stitched together – some type of page scanning routine. Clearly that’s within the reach of the hardware described, though not shown or mentioned.
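A minimal version of that stitching step with stock OpenCV could look like this (the tile filenames are placeholders; something still has to sweep the camera across the spread first):

```python
import cv2

# Tiles captured while sweeping the camera across the open book (placeholder names).
tiles = [cv2.imread(name) for name in ('tile_0.jpg', 'tile_1.jpg', 'tile_2.jpg')]

stitcher = cv2.Stitcher_create()          # cv2.createStitcher() on OpenCV 3.x
status, panorama = stitcher.stitch(tiles)

if status == cv2.Stitcher_OK:
    cv2.imwrite('full_spread.jpg', panorama)
else:
    print('stitching failed, status:', status)
```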
Then there’s time. Much of this video is shown in “montage” which obviously breaks the linear timeline. I can make no assumptions or judgments as to how long any part of this process takes. It could easily be hours.
Calibration of the arm is another glossed-over aspect: a hobby-servo-driven arm with three-dimensional kinematics has to point to coordinates on a two-dimensional picture. That’s a solvable problem, but definitely non-trivial, and the video skips right past it.
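One common way to sidestep the full kinematics is a flat calibration: jog the arm tip to a few known spots on the page, note where each shows up in the camera image, and fit a perspective transform between the two. Every number below is made up for illustration:

```python
import numpy as np
import cv2

# Where four reference marks appear in the camera image (pixels)...
img_pts = np.float32([[100,  80], [1180,  90], [1170, 640], [110, 650]])
# ...and where the arm tip sits when touching each of them (arm coordinates, mm).
arm_pts = np.float32([[120, -140], [120, 140], [320, 135], [320, -135]])

H = cv2.getPerspectiveTransform(img_pts, arm_pts)

def pixel_to_arm(x_px, y_px):
    """Map a detection in camera pixels into the arm's workspace."""
    dst = cv2.perspectiveTransform(np.float32([[[x_px, y_px]]]), H)
    return float(dst[0, 0, 0]), float(dst[0, 0, 1])

print(pixel_to_arm(640, 360))   # roughly the middle of the frame
```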
What’s the success rate of the whole process? Image capture, processing, identification, locating and pointing – In the video it’s 100%, but how many failures did they edit out?
Not “fake”, but much like every kickstarter video I’ve ever seen – much effort has been spent in the editing room to make this look as good as possible.
Why do people keep assuming it performs the lookup on the entire page at once? No “stitching” required: take a shot, is it there? No? Move right, try again. Make it look good by memorizing the position and jumping straight there on command. That’s how humans do it, no?
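Roughly like this; both helper functions are stand-ins for the real capture and detection steps:

```python
def capture_tile(row, col):
    """Stand-in: move the camera over grid cell (row, col) and grab a frame."""
    return f"frame@{row},{col}"

def contains_waldo(frame):
    """Stand-in: run whatever detector is handy (AutoML, cascade, CNN...)."""
    return frame == "frame@1,3"

def scan_for_waldo(rows=3, cols=5):
    for row in range(rows):
        for col in range(cols):
            if contains_waldo(capture_tile(row, col)):
                return row, col      # remember it, then jump straight back on command
    return None

print(scan_for_waldo())              # -> (1, 3) with these stand-ins
```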
This Waldo game is rather poorly thought out. The characters are literally flat and there’s zero replayability. Once you’ve found Waldo on each page, that’s it.
You’re looking for the Amazon comments section. Don’t worry! It is a common mistake. Just exit this tab, go to amazon.com, search the Waldo game, and click on “Write a customer review”.
If it was real, they edited out it finding the stamp with Waldo’s face repeatedly.
Invite the inventor to a Hackaday conference (especially one near his home) to show and discuss the technology; if he declines, that may mean something fishy is up.