Speech Recognition For Linux Gets A Little Closer

It has become commonplace to yell out commands to a little box and have it answer you. However, voice input for the desktop has never really gone mainstream. Progress has been particularly slow for Linux users, whose options are shockingly limited, even though decent speech support is baked into recent versions of Windows and into OS X from Yosemite onward.

There are four well-known open speech recognition engines: CMU Sphinx, Julius, Kaldi, and the recent release of Mozilla’s DeepSpeech (part of their Common Voice initiative). The trick for Linux users is successfully setting them up and using them in applications. [Michael Sheldon] aims to fix that — at least for DeepSpeech. He’s created an IBus plugin that lets DeepSpeech work with nearly any X application. He’s also provided PPAs that should make it easy to install for Ubuntu or related distributions.

You can see in the video below that it works, although [Michael] admits it is just a starting point. However, the great thing about Open Source is that, armed with a working setup, it should be easy for others to contribute and build on the work he’s started.

IBus is one of those pieces of Linux that you don’t think about very often. It abstracts input devices from programs, mainly to accommodate input methods that don’t lend themselves to an alphanumeric keyboard. Usually this is Japanese, Chinese, Korean, and other non-Latin languages. However, there’s no reason IBus can’t handle voice, too.

Oddly enough, the most common way you will see Linux computers handle speech input is to bundle it up and send it to someone like Google for recognition, despite there being plenty of horsepower to handle things locally. If you aren’t too picky about flexibility, even an Arduino can do it. With all the recent tools aimed at neural networks, the speech recognition algorithms aren’t as big a problem as finding a sufficiently broad training database and then integrating the recognized text with other applications. This IBus plugin takes care of that last problem.

32 thoughts on “Speech Recognition For Linux Gets A Little Closer”

    1. You’re right, they should use MacOS X which supports strong password validation.

      https://hackaday.com/2018/01/12/apple-passwords-they-all-just-work/

      I think you’ll find that historically X has had a pretty spotty security reputation, which is why X no longer listens on a network socket by default. There is also a move to replace it with something more modern in the form of Wayland, which is still a way off, but showing promise.

      At least with X, a malformed truetype font won’t crash your machine: https://www.exploit-db.com/exploits/38713/

      1. You can have X11 security via Xpra. Add firejail and you get extra protections (filesystem, memory, etc). What does that mean? That X11 can be secure, because it’s not a protocol requirement to give open access (otherwise X11 apps would not work inside Xpra) but a total “don’t care” by implementers, and Wayland doesn’t help with other security issues.

        Meanwhile, Wayland starts to see the first design errors. Look up “The Wayland Zombie Apocalypse is Near”. They also ditched network transparency, then had to add XWayland once users said it mattered. So we still have X11, and then some more.

        I see a trend there: reinvent and promise the sky, instead of doing that hard last 20% that will take the 80% of effort.

        PS: Also, X11 has had the SECURITY extension for a long time. Again, it seems nobody cares to use it properly or patch any issues left. Just better rewrite. Promise this time it will be right.

        1. True, and I’m one of those users that does use X11 networking … most of the time via an SSH tunnel.

          It has its uses.

          I think people forget that the X protocol we know today, has been around close to 30 years … long before multi-monitor set-ups were affordable, long before hotplug, long before hardware 3D acceleration was affordable, long before lots of things we take for granted today.

          The fact that such an old protocol has maintained a grip for so long, is testament to its design. Yes, things have been bolted on, some things are clunky, some things are egregious security holes. It sticks around because nothing better has come along to replace it, and for most people, it’s doing the job just fine. I think with time the security issues you mention will be addressed, as in this day and age, we really can’t afford to ignore them.

          The Internet is not a little village any more!

          Of late there has been a drive to strip some of the legacy cruft out of X11 to make it more lightweight. Wayland is an attempt at doing something from scratch based on the lessons from X11 and DirectFB. Will it succeed? No idea. It could be as bad as you suggest. I’m yet to actually dabble with it; X.org works well enough for my needs.

          I’ve seen areas where X11 falls down for others, and it’s a valid question as to whether it’s fixable or whether a new approach is needed.

          1. My pet peeve is how so many apps forget that some of us use X11 as was intended by the Great Bird of the Galaxy. Chrome is a big offender. Try opening Chrome over X11. First, unless you have a really good X server you choke on not having OpenGL. So you shut all that down and then you find out that if you have Chrome running on your main screen, any subsequent opens go to that same screen regardless of $DISPLAY. ARGGGGGGHHHHH.

            So then you use xpra or x2go to copy your whole screen over. What a pain.

          2. This isn’t X’s fault. This is FF and Chrome trying to be “clever”, IGNORING X, and saying “Oh I see you already have an instance running, obviously it would be more efficient to open this tab/window in that existing instance rather than starting a new one”. https://bugzilla.mozilla.org/show_bug.cgi?id=135137 is SIXTEEN YEARS OLD. In Firefox’s case you can work around the issue by using “firefox --ProfileManager” to create a 2nd profile for the 2nd $DISPLAY, then use lots of “firefox --profile”. You might even want to write a little wrapper to use $DISPLAY to automatically select the correct “firefox --profile”.

          3. Ohh I know it’s not X11’s fault at all. The fault lies squarely at the feet of Mozilla in that case… and at Google’s in the case of Chrome.

            It’s the sort of arrogance they demonstrate: you couldn’t *possibly* want to run two instances of the same browser from different computers, *on the same screen*!
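The wrapper suggested in this thread could look something like the following. This is only a minimal sketch under stated assumptions: the profile naming scheme (`display_0`, `display_host_10_0`, and so on) is hypothetical, each profile is assumed to have been created beforehand in the profile manager, and it selects the profile by name with Firefox’s `-P` option rather than by path with `--profile`.

```python
import os
import re
import sys


def profile_for_display(display):
    """Derive a profile name from an X display string such as ':0'.

    The 'display_' prefix and the naming scheme are hypothetical; any
    stable one-to-one mapping from $DISPLAY to a profile would do.
    """
    # Collapse every run of non-alphanumeric characters into one underscore.
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", display).strip("_")
    return "display_" + (cleaned or "default")


def firefox_command(display, extra_args=()):
    """Build the argument list that starts Firefox with the matching profile."""
    return ["firefox", "-P", profile_for_display(display), *extra_args]


def main():
    """Replace this process with Firefox, using the profile for $DISPLAY."""
    cmd = firefox_command(os.environ.get("DISPLAY", ""), sys.argv[1:])
    os.execvp(cmd[0], cmd)
```

Calling `main()` from a small script placed ahead of the real binary on `$PATH` would give each display its own browser instance, provided the named profiles exist.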

    2. And when you close that “security hole” you lose the ability to automate anything and everything that relies on a GUI. That’s the problem with being hyper-focused on security. The most secure computer is actually just a paper weight that doesn’t do anything!

      I would much rather have the following:

      – Applications can trigger events on other applications but only if they are running on the same instance of X. This allows for automation, voice control, etc… Malicious programs could take advantage of it but the solution is don’t run shit if you don’t know where it came from!

      – Preserve remote display functionality. Keep X as a server and keep IP as one of the ways to access it. Local applications could benefit from a second way to talk to X, probably some form of DMA. All applications, by compiling against a common shared library, should still have the ability to at least try to connect via IP. If it is too slow, it is too slow. Let the user decide when the lag is acceptable and when it is not.

      Adding encryption and authentication to the networking support would be a very good thing.

      1. Nutshell: I think there’s existing ways to do everything you just said:
        Applications / events already work as you described. You can’t trigger events on OTHER X instances unless you can connect to / authenticate against them. There’s also ways for apps to “bind focus” if (for example) you want your password manager to accept a password without other apps being able to sniff it, or to refuse simulated events that came from other apps instead of hardware.
        For “non-tcp”, see “unix domain sockets”, as already supported by X. I’m typing this into a browser that connected to an X “-nolisten tcp” via /tmp/.X11-unix/X0 instead of TCP. This seems to be the default in Ubuntu these days. It’s not “DMA” as such, but DMA IS used by extensions such as GLX and DRI once they have arranged how to do so over tcp or unix-domain sockets.
        Authentication already exists (xauth). Transport encryption and an extra layer of authentication is typically done via “ssh $host -X” – no point in reinventing a really great wheel.
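To make the two transports concrete, here is a rough sketch of how a display string maps to an endpoint. It is a simplification, not how Xlib or xcb actually parse `$DISPLAY`: the function name `x_endpoint` is made up for illustration, and it only relies on the usual conventions that local displays live under /tmp/.X11-unix/ and that TCP displays listen on port 6000 plus the display number.

```python
def x_endpoint(display):
    """Return ("unix", path) or ("tcp", (host, port)) for a display string.

    Simplified: handles the common "[host]:number[.screen]" form only.
    """
    host, _, rest = display.rpartition(":")
    # Drop the optional ".screen" suffix and keep the display number.
    number = int(rest.split(".")[0])
    if host in ("", "unix"):
        # Local display: a unix-domain socket, no TCP listener needed.
        return ("unix", "/tmp/.X11-unix/X%d" % number)
    # Remote or forwarded display: TCP port 6000 + display number.
    return ("tcp", (host, 6000 + number))
```

Under these conventions, a forwarded display such as `localhost:10.0` (the kind “ssh -X” typically sets) resolves to TCP port 6010 on the local end of the tunnel, while `:0` resolves to the socket file mentioned above.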

  1. 10 years ago, Jabberwocky, the Alice chatbot and TTS were put together on a Linux system. There used to be a video of that with a conversational Halloween skull on YT. It was crude, but ahead of the competition.

  2. Cool stuff! The big boys tend to do this in the cloud with big number crunching and lots of collected data – if Linux or other open source platform wanted to go that route, could something sufficiently secure be cooked up to do this distributed across all the users? Is there an example of this kind of thing other than the likes of folding@home?

      1. Ooh… have you seen the really poor quality of many closed captioning? Movies and stuff are usually OK but anything like news is generally very poor. On top of that, I notice a lot on TV shows there will be things that clearly were last minute cuts so there will be extra audio dialog or lines in the CC that are not in the audio track.

        1. Depends on how they caption.
          Movies and shows are often post production pop up captioned. News can be telecaster captioned, realtime captioned, or a combination. I used to work in the industry. Network news and tv shows are usually the most accurate. Still could be used as a useful training model.
          Use a limited vocabulary system to grade the show then use highly graded shows for your teaching model.
          If you could get a court reporter that uses audio sync while reporting to give you the transcripts and audio it could be a really big help.

    1. Once again I feel like Linux is being left in the dust. Unless they get some benevolent donation from a big player (e.g. Google gifts you TensorFlow), there is a growing gap between what you can get in Linux and what the pro software can run on a proprietary system.

      FOSS is really far less now about what the “little guy” can cobble together, because the “little guy” is capable of maybe pushing out a terrible text editor. Instead the userspace is just a pile of scraps tossed over from the major tech companies that somebody managed to get running.

      1. In some part, though not quite so pessimistically, this is what I was alluding to in my comment above. It wasn’t about training data, it was about run time infrastructure. Google etc. send audio to the cloud and get text back; offline voice recognition is something of the previous decades, like Dragon, and has its limitations. It would take some generous benefactor to run a server farm for this for free. There’s nothing stopping someone charging for services or finding a way to slap adverts on it, but would the Linux community swallow it? Could such a model work? So I was wondering if maybe there’s a way to distribute the load amongst users, but then that leads to questions of doing so securely, and whether anything like this has been done before to solve a different problem.

        1. I’ve been thinking this over and I guess my issue here is one of conflicting philosophies. On the one hand, we have ESR’s landmark article “The Cathedral and the Bazaar”, which seems to indicate that regular Joes can cobble something together that would eventually be able to compete with Microsoft. And a lot of people still subscribe to that view. “A million eyes make every bug trivial…” still gets quoted all the time.

          But the reality of it seems to be that the Bazaar doesn’t exist in that form. Instead you have periodic trucks marked “Google” or “Apache” or “Red Hat” or “Facebook” that roll up from the nearby Cathedral and dump some leftovers into the market square. They’ve made a careful business decision that says: “we made this, but we don’t really want to own this: our value comes from what we make using the tool, not the tool itself.” So of course they open-source MapReduce. Now they get free work done on it to improve it, so they can turn around and just run it to scrape personal data from a billion users, or make search results arrive slightly faster, or sell a new phone with a smarter digital assistant, or whatever.

          Meanwhile Linux champions run around going “Isn’t this great? We have great software now! Oracle dumped a whole JDK on our feet!” Yes, great, but software in and of itself is not valuable. Now Oracle gets free fixes to Java for eternity, and they can turn around and sell contracts for support based on everyone else’s hard work.

          I used to be a Linux zealot, back when their biggest competitor was Microsoft on the desktop. Now, I feel that there is no innovation left in Linux. The only work being done is trying to tie in all the parts being “donated” by big players, playing catch-up to everything else (“yay we have speech recognition finally.”), and endless re-invention of core stuff that was working fine (systemd, a million re-spins of Debian into Ubuntu into Mint into …)

          The bazaar is dead, if it ever did exist. It’s not a bustling hive of activity crafting novel works any more. It’s just a chop shop now.

  3. Great work and really needed. Now speech recognition actually works it is going to become the dominant form of interaction with computers. I would hate to see Linux left behind.
