Voice Assistants, love them, or hate them, are becoming more and more commonplace. One problem for voice assistants is the situation of multiple devices listening in the same place. When a command is given, which device should answer? Researchers at CMU’s Future Interfaces Group [Karan Ahuja], [Andy Kong], [Mayank Goel], and [Chris Harrison] have an answer; smart assistants should try to infer if the user is facing the device they want to talk to. They call it direction-of-voice or DoV.
Currently, smart assistants use a simple race to see who heard it first. The reasoning is that the device you are closest to will likely hear it first. However, in situations with echos or when you’re equidistant from multiple devices, the outcome can seem arbitrary to a user.
The implementation of DoV uses an Extra-Trees Classifier from the python sklearn toolkit. Several other machine learning algorithms were considered, but ultimately efficiency won out and Extra-Trees was selected. Another interesting facet of the research was determining what facing really means. The team had humans ‘listeners’ stand in for smart assistants. A ‘talker’ would speak the key phrase while the ‘listener’ determined if the talker was facing them or not. Based on their definition of facing, the system can determine if someone is facing the device with 90% accuracy that rises to 93% with per-room calibration.
Often, when we think of getting a computer to complete a task, we contemplate creating complex algorithms that take in the relevant inputs and produce the desired behaviour. For some tasks, like navigating a car down a road, the sheer multitude of input data and its relationship to the desired output is so complex that it becomes near-impossible to code a solution. In these cases, it can make more sense to create a neural network and train the computer to do the job, as one would a human. On a more basic level, [Gigante] did just that, teaching a neural network to play a basic driving game with a genetic algorithm.
The game consists of a basic top-down 2D driving game. The AI is given the distance to the edge of the track along five lines at different angles projected from the front of the vehicle. The AI also knows its speed and direction. Given these 7 numbers, it calculates the outputs for steering, braking and acceleration to drive the car.
To train the AI, [Gigante] started with 650 AIs, and picked the best performer, which just barely managed to navigate the first two corners. Marking this AI as the parent of the next generation, the AIs were iterated with random mutations. Each generation showed some improvement, with [Gigante] picking the best performers each time to parent the next generation. Within just four iterations, some of the cars are able to complete a full lap. With enough training, the cars are able to complete the course at great speed without hitting the walls at all.
It’s a great example of machine learning and the use of genetic algorithms to improve fitness over time. [Gigante] points out that there’s no need for a human in the loop either, if the software is coded to self-measure the fitness of each generation. We’ve seen similar techniques used to play Mario, too. Video after the break.
Over the past decade, we’ve seen great strides made in the area of AI and neural networks. When trained appropriately, they can be coaxed into generating impressive output, whether it be in text, images, or simply in classifying objects. There’s also much fun to be had in pushing them outside their prescribed operating region, as [Jon Warlick] attempted recently.
[Jon]’s work began using NVIDIA’s GauGAN tool. It’s capable of generating pseudo-photorealistic images of landscapes from segmentation maps, where different colors of a 2D image represent things such as trees, dirt, or mountains, or water. After spending much time toying with the software, [Jon] decided to see if it could be pressed into service to generate video instead.
The GauGAN tool is only capable of taking in a single segmentation map, and outputting a single image, so [Jon] had to get creative. Experiments were undertaken wherein a video was generated and exported as individual frames, with these frames fed to GauGAN as individual segmentation maps. The output frames from GauGAN were then reassembled into a video again.
The results are somewhat psychedelic, as one would expect. GauGAN’s single image workflow means there is only coincidental relevance between consecutive frames, creating a wild, shifting visage. While it’s not a technique we expect to see used for serious purposes anytime soon, it’s a great experiment at seeing how far the technology can be pushed. It’s not the first time we’ve seen such technology used to create full motion video, either. Video after the break.
There is a certain magic and uniqueness to hardware, particularly when it comes to audio. Tube amplifiers are well-known and well-loved by audio enthusiasts and musicians alike. However, that uniqueness also comes with the price of the fact that gear takes up space and cannot be configured outside the bounds of what it was designed to do. [keyth72] has decided to take it upon themselves to recreate the smooth sound of the Fenders Blues Jr. small tube guitar amp. But rather than using hardware or standard audio software, the magic of AI was thrown at it.
In some ways, recreating a transformation is exactly what AI is designed for. There’s a clear and recordable input with a similar output. In this case, [keyth72] recorded several guitar sessions with the guitar audio sent through the device they wanted to recreate. Using WaveNet, they created a model that applies the transform to input audio in real-time. The Gain and EQ knobs were handled outside the model itself to keep things simple. Instructions on how to train your own model are included on the GitHub page.
With social media and online services are now huge parts of daily life to the point that our entire world is being shaped by algorithms. Arcane in their workings, they are responsible for the content we see and the adverts we’re shown. Just as importantly, they decide what is hidden from view as well.
Important: Much of this post discusses the performance of a live website algorithm. Some of the links in this post may not perform as reported if viewed at a later date.
Recently, [Colin Madland] posted some screenshots of a Zoom meeting to Twitter, pointing out how Zoom’s background detection algorithm had improperly erased the head of a colleague with darker skin. In doing so, [Colin] noticed a strange effect — although the screenshot he submitted shows both of their faces, Twitter would always crop the image to show just his light-skinned face, no matter the image orientation. The Twitter community raced to explore the problem, and the fallout was swift.
For his project, [Ish] used an Arduino Nano 33 BLE Sense due to its built-in microphone, sizeable RAM for storing large chunks of data, and it’s BLE capabilities for later connecting with an app. He began his project by collecting background noise using Edge Impulse Studio’s data acquisition functionality. [Ish] really emphasized that Edge Impulse was really doing all the work for him. He really just needed to collect some test data and that was mostly it on his part. The work needed to run and test the Neural Network was taken care of by Edge Impulse. Sounds handy, if you don’t mind offloading your data to the cloud.
[Ish] ended up with an 86.3% accurate classifier which he thought was good enough for a first pass at things. To make his prototype a bit more “finished”, he added some status LEDs, providing some immediate visual feedback of his classifier and to notify the caregiver. Eventually, he wants to add some BLE support and push notifications, alerting him whenever his baby needs attention.
The main controller board of the radio was intact, so it was easy to use all the preexisting hardware to control the speaker and to trigger a few of the Pi’s GPIO using the buttons and switches on the radio’s front panel. To add some artificial intelligence, they used Google’s AIY Voice Kit, allowing them to tap into Google’s seemingly endless artificial intelligence platform. This could be a “tables have turned moment,” but we’re probably being a bit too hopeful.
Anyway, they used a pretty interesting piece of software called Dialogflow that creates a somewhat natural conversational interaction akin to a chatbox. Dialogflow processes speech to text, as you would expect, but can also interpret contextual speech and provide contextual responses. Pretty neat…but maybe also a little creepy. Who knows? The jury is still out.