Robots Can Finally Answer, Are You Talking To Me?

Voice Assistants, love them, or hate them, are becoming more and more commonplace. One problem for voice assistants is what happens when multiple devices are listening in the same place. When a command is given, which device should answer? Researchers at CMU’s Future Interfaces Group, [Karan Ahuja], [Andy Kong], [Mayank Goel], and [Chris Harrison], have an answer; smart assistants should try to infer whether the user is facing the device they want to talk to. They call it direction-of-voice, or DoV.

Currently, smart assistants use a simple race to see who heard it first. The reasoning is that the device you are closest to will likely hear it first. However, in situations with echoes or when you’re equidistant from multiple devices, the outcome can seem arbitrary to a user.
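To make the race concrete, here is a minimal sketch of first-to-hear arbitration. Everything in it is an assumption for illustration: the `Detection` record, the shared clock, and the `first_to_hear` helper are hypothetical and do not describe how any shipping assistant actually arbitrates.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    device: str      # hypothetical device name
    heard_at: float  # wake-word arrival time in seconds on an assumed shared clock


def first_to_hear(detections):
    """Return (winning device, margin in seconds over the runner-up)."""
    if not detections:
        raise ValueError("no device heard the wake word")
    ordered = sorted(detections, key=lambda d: d.heard_at)
    winner = ordered[0]
    # With a single device there is no runner-up, so the margin is infinite.
    margin = ordered[1].heard_at - winner.heard_at if len(ordered) > 1 else float("inf")
    return winner.device, margin


# A talker roughly equidistant from two devices: the winner is decided by
# a fraction of a millisecond, which is why the result can feel arbitrary.
device, margin = first_to_hear([
    Detection("kitchen", 0.0102),
    Detection("living_room", 0.0100),
])
print(device, margin)
```

When the margin is that small, an echo or a slightly different head position could flip the winner, which is the gap DoV is meant to fill.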

The implementation of DoV uses an Extra-Trees classifier from Python’s scikit-learn (sklearn) toolkit. Several other machine learning algorithms were considered, but ultimately efficiency won out and Extra-Trees was selected. Another interesting facet of the research was determining what facing really means. The team had human ‘listeners’ stand in for smart assistants. A ‘talker’ would speak the key phrase while the ‘listener’ determined whether the talker was facing them or not. Based on their definition of facing, the system can determine whether someone is facing the device with 90% accuracy, rising to 93% with per-room calibration.
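For a feel of what training an Extra-Trees classifier looks like in scikit-learn, here is a rough sketch on synthetic data. It is not the team’s pipeline: the two features (`high_freq_ratio` and `direct_to_reverb`) are invented placeholders, the numbers are made up, and the accuracy it prints says nothing about the published results.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples = 400

# Label: 1 = talker facing the device, 0 = facing away (synthetic).
facing = rng.integers(0, 2, size=n_samples)

# Two invented features whose means shift when the talker is facing.
# These are placeholders, not the features used in the DoV research.
high_freq_ratio = rng.normal(loc=0.4 + 0.3 * facing, scale=0.1)
direct_to_reverb = rng.normal(loc=0.5 + 0.3 * facing, scale=0.1)
X = np.column_stack([high_freq_ratio, direct_to_reverb])

# Hold out a quarter of the samples to estimate accuracy.
X_train, X_test, y_train, y_test = train_test_split(
    X, facing, test_size=0.25, random_state=0
)

clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

Per-room calibration could, in a setup like this, amount to adding a handful of labeled samples recorded in the target room before fitting, though that is a guess at the idea rather than a description of the team’s procedure.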

Their algorithm and the data they collected have been open-sourced on GitHub. Perhaps when you’re building your own voice assistant, you can incorporate DoV to improve wake-word accuracy.

Thanks [Karan] for sending this in!

25 thoughts on “Robots Can Finally Answer, Are You Talking To Me?”

  1. But... since the assistant has to recognize the spoken commands, wouldn’t it be easier and less error-prone to give said assistant a name and use it in commands directed to it? If one is watching what is cooking in a pan, for example, and needs to give some order to the (smart?) assistant, one may not be able to stop paying attention to the pan to search for and look at the voice assistant.

    If I have to look at the machine, it kinda defeats the reason / utility for having it listening.

    1. Yes, there should be a more fluid naming and command environment. Instead of “Hey Google, turn on TV”, it should be “Turn on TV”, “Turn on living room lights”, “Set house AC on”. False triggers would happen more, which just shows that our Star Trek/Jetsons home is not there yet. Speaking of false triggers: one time the TV was on and in a comedy show there was a sexual innuendo type joke about self pleasure. Suddenly my Lenovo/Google piped up and said, “Sorry I can’t help you with that…”

      1. You both seem to forget the most important bit – politeness.
        The device must not react until a “please” is heard.
        Otherwise, kids who grew up with these assistants will end up with bad manners and turn into cheeky brats.
        Seriously, this command style belongs in the military, not in a civilized home. Think about it. PLEASE.

        1. I agree with you, it would be important also, and could even help in the workings of the software. The command sequence starts with the name assigned to the device, “Jarvis”, then the command, “turn on the lights”, and then the end-of-command marker: “Please”.

          The software could have two points to improve the recognition of the commands.

          But giving unique names to the assistants is also important. The Jetsons called their robot Rosie, if memory serves well. And when you are working, say, on the underside of your car, you don’t get out from under it, look directly at your helper/son/friend/whatever and ask for a wrench. You would call them by name and ask for said wrench. The same tried and tested thing could work for the voice assistants, for the people that want to use them.

        1. Hi! That’s an interesting thought! From what I read years ago, some of the usual speakers are very limited. Their electronics constantly listen for a magic word, say “Alexa”, “Google”, “Computer” etc. If they think they heard it, they make a sound or blink a light and start transmitting audio data to the company’s server, which does further analysis or voice/speech recognition.

  2. DoV should be additional and not heavily weighted by the device. Do you physically turn and face each person when you talk to them? Not usually. Having to face each device to command it would be like Scotty picking up a mouse and saying “Hello computer”. I command my Google-controlled devices as I walk by them. If a device has to take additional time to decide whether I am facing it or another device, that would either delay the action or cause it not to act.

  3. I can’t stand cloud dependent, data mining, voice operated clappers.
    I think the Flintstones had it right. Semi-intelligent biomechanism helpers are the future. They work for their own self interest instead of some corporate entity. Just don’t piss them off with your own notions of superiority and self importance and the world will be a better place.
    As a plus, if you can live harmoniously with them, you’re more likely to be socially acceptable to your own species.

    1. Devices, plural please. I was pleased when WILL NPR station referred to one of those as a “listening device” while promoting that new trendy way to tune in! With MEMS serial mics in anything and everything it will be incumbent to privacy to constantly debug things with an ice pick and a scan. Some people have to open their phone and jab 3 or more mics. To talk they just use a headset plugged in. Please don’t name things. Storms, diseases, portals, etc. must not become humanly personal.

    2. And as is common in politics and corporate advertising, if you say it enough times… it must be true… so they drone on about some solution to a nonexistent problem… hoping to normalise owning these spyboxes aka slaves for their corporate bosses.

  4. Looks useful, I’m sure we’ll see this in future products.

    Also: while I love hackaday and read it all day every day, I need to know: is there an editor? The punctuation errors are egregious.
    “Voice Assistants, love them, or hate them, are becoming more and more commonplace”
    The comma after “love them” should not be there. Also the semicolon later in the paragraph should be a colon.
    HaD would benefit from more punctuation and grammar screening in its CMS.

    1. ” I need to know: is there an editor? The punctuation errors are egregious.”

      Welcome to Hackaday!

      Everyone who comments, is a viable punctuation and spelling proofer.
      It is also known as “audience participation”.

      1. I mean even “more and more commonplace” is problematic: why the second more?

        Other than the Wikipedia approach, has anybody ever seen community editing features that work?

  5. I imagine a scenario where all these [expletive deleted] start responding to each other.
    Tree branch hits side of house.
    Digital Photo Frame: Are you speaking to me?
    Window blinds (misunderstanding): Adjust 2 degrees?
    Refrigerator: 2 degrees warmer or cooler?
    Thermostat: Adjusting room temperature!
    Wash machine: Adding 2 cups of Cheer!
    Alexa: Ordering 2 kegs of beer!

    All this while the owner is away on vacation.

    1. Literally laughing out loud after reading this!!! I really hope James Veitch adds to his “Siri vs Alexa” video using your comment as the script foundation. So true.

      I just switched to a WiFi thermostat, but I’m really not a fan of most “smart” devices, as most just create laziness. I chose the thermostat I did because it can operate 100% WITHOUT WiFi. WiFi simply adds smartphone app control (well, and of course the theft of my usage metrics).

  6. What a threshold we’ve crossed. Yet, I have to qualify this technological milestone with the buzz-killing disclaimer that this algorithmic simulation of intentional directedness in no way captures the form of ontological nearness that human communication denotes; that is, a computer assessing ontic-distance as a barometer for our intentionality to communicate, say, via ‘attributive reference,’ has far more to do with conscious intention correlated always-already with a manner of acoustic response (an interplay of graphemetics and phonetics grounded in normative discursivity delivers over, or manifests simultaneously, not only the appropriate sound emission to the intended listener, but the protocol for hearing the acoustic blast is beset as well). An embodied praxis has assimilated itself cognitively, linguistically, behaviorally, and environmentally. If we really want to push through to an AI one day intended to simulate something like Human Intelligence rather than mere General Intelligence (which is still theoretical itself) we need more algorithms that are somehow open-ended enough to learn from the environment, e.g., similar to the robot that maps where an object is in a room by running into it, retaining the data, and recalibrating its course based on layers of prior information. This is a dialectic of receptivity and retention; the rub is to figure out the former part without having to delimit its parameters of possibility to a narrow margin for the sake of functionality that tapers off unforeseen scenarios of possible receptivity. The only alternate path I can imagine, conceptually, would be: if the parameters of the algorithm allowed for the assimilation of dialectical-environmental information capable of exceeding its initial encoding. This would be analogous to the machine acquiring enough information to inform itself of the necessity of rewriting its own code (which, by definition, I guess, is “AI”). Is this possible? What a paradox!
    An algorithm whose parameters allow for adjustments or the assimilation of information capable of reconfiguration… but is this possible to extend beyond mere parts of the encoding? Hermeneutically speaking, do we have an issue of part-whole when it comes to the self-reflexive possibility of a machine assimilating information capable of revamping, not only mere regulative parameters (e.g., ‘turn left vs. right’ if an obstruction is hit) but the base parameters constitutive of the whole program?
