Robots Learning To Understand Their Surroundings

Today it is pretty easy to build a robot with an onboard camera and have fun manually driving through that first-person view. But builders with dreams of autonomy quickly learn there is a lot of work between camera installation and autonomously executing a “go to chair” command. Fortunately we can draw upon work such as View Parsing Network by [Bowen Pan, Jiankai Sun, et al]

When a camera image comes into a computer, it is merely a large array of numbers representing red, green, and blue color values and our robot has no idea what that image represents. Over the past years, computer vision researchers have found pretty good solutions for problems of image classification (“is there a chair?”) and segmentation (“which pixels correspond to the chair?”) While useful for building an online image search engine, this is not quite enough for robot navigation.

A robot needs to translate those pixel coordinates into real-world layout, and this is the problem View Parsing Network offers to solve. Detailed in Cross-view Semantic Segmentation for Sensing Surroundings (DOI 10.1109/LRA.2020.3004325) the system takes in multiple camera views looking all around the robot. Results of image segmentation are then synthesized into a 2D top-down segmented map of the robot’s surroundings. (“Where is the chair located?”)

The authors documented how to train a view parsing network in a virtual environment, and described the procedure to transfer a trained network to run on a physical robot. Today this process demands a significantly higher skill level than “download Arduino sketch” but we hope such modules will become more plug-and-play in the future for better and smarter robots.

[IROS 2020 Presentation video (duration 10:51) requires free registration, available until at least Nov. 25th 2020. One-minute summary embedded below.]

Continue reading “Robots Learning To Understand Their Surroundings”

Background Substitution, No Green Screen Required

All this working from home that people have been doing has a natural but unintended consequence: revealing your dirty little domestic secrets on a video conference. Face time can come at a high price if the only room you have available for work is the bedroom, with piles of dirty laundry or perhaps the incriminating contents of one’s nightstand on full display for your coworkers.

There has to be a tech fix for this problem, and many of the commercial video conferencing platforms support virtual backgrounds. But [Florian Echtler] would rather air his dirty laundry than go near Zoom, so he built a machine-learning background substitution app that works with just about any video conferencing platform. Awkwardly dubbed DeepBackSub — he’s working on a better name — the system does the hard work of finding the person in the frame with Tensorflow Lite. After identifying everything in the frame that’s a person, OpenCV replaces everything that’s not with whatever you choose, and the modified scene is piped over a virtual video device to the videoconferencing software. He’s tested on Firefox, Skype, and guvcview so far, all running on Linux. The resolution and framerates are limited, but such is the cost of keeping your secrets and establishing a firm boundary between work life and home life.

[Florian] has taken the need for a green screen out of what’s formally known as chroma key compositing, which [Tom Scott] did a great primer on a few years back. A physical green screen is the traditional way to do this, but we honestly think this technique is great and can’t wait to try it out with our Hackaday colleagues at the weekly videoconference.