Adding speech recognition to your embedded platform.

Last week, we posted a story about how to configure speech recognition at a beginner level. Several of the commenters expressed an interest in doing speech recognition for embedded devices. [Nickolay Shmyrev] volunteered to write some directions for those people. In this article, [Nickolay] will take you through the basics of setting up your embedded device with CMUSphinx, an open source toolkit for speech recognition. He gives programming examples in both C and Python. Though we are hosting this, we haven’t set it up and tried it, so please direct any questions you have at [Nickolay] in the comments.

Here we will look at how to implement speech recognition on your device using the
Pocketsphinx library from the CMUSphinx project (http://cmusphinx.sourceforge.net).

The advantages of using Pocketsphinx are:

  • Pocketsphinx is resource-efficient. It runs well on embedded platforms, though it’s not limited to them; you can also use Pocketsphinx on your desktop or server. Pocketsphinx supports fixed-point-only arithmetic, so it can run without an FPU. It is also optimized for some popular platforms: Blackfin, Maemo, iPhone.
  • Pocketsphinx supports many languages out of the box. It supports US English, Chinese, French, Russian, German, Dutch and more, without the need to train anything.
  • Pocketsphinx is completely free software.
  • Bindings are available for several programming languages.

So Pocketsphinx is really the best choice for your speech recognition library.

Before you start programming speech interfaces, there are several things you need to know:

  • Speech recognizers require you to specify the words they will understand (a so-called grammar); they will not understand anything outside the specified language.
  • Speech is by nature inaccurate; you need to make this a cornerstone of your speech interface design. The recognizer returns a confidence value along with the recognized text. Make sure you use this confidence value to reject unreliable results. If the recognizer is not confident, ask the user to repeat the input, ask for additional information, or confirm the user’s intentions.
  • It’s not the task of the speech recognition library to do sound input. Audio interfaces are often device-specific. You need to record audio in your application and put it in a specific format – PCM, mono, 8kHz, 16-bit. Double-check that. If you have MP3, convert it. If you have audio at 44.1kHz, downsample it.
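As a quick illustration of that last point, the format of a WAV recording can be verified up front with Python’s standard wave module before you hand it to the recognizer. This is a hypothetical helper, not part of Pocketsphinx:

```python
import wave

def check_wav_format(path, rate=8000, channels=1, sample_width=2):
    """Return a list of problems with the WAV file; an empty list means
    it is already PCM mono 8kHz 16-bit and ready for the decoder."""
    problems = []
    with wave.open(path, "rb") as w:
        if w.getframerate() != rate:
            problems.append("sample rate is %d Hz, expected %d Hz"
                            % (w.getframerate(), rate))
        if w.getnchannels() != channels:
            problems.append("expected mono audio, got %d channels"
                            % w.getnchannels())
        if w.getsampwidth() != sample_width:
            problems.append("expected 16-bit samples")
    return problems
```

Run it on your recording and fix anything it reports (e.g. resample with sox or ffmpeg) before decoding.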

Let’s start with a simple test. Once you have installed Pocketsphinx, just run pocketsphinx_continuous
without any arguments. Wait until

READY…

appears on the terminal, then say something. Pocketsphinx will record audio from your microphone and output recognition results.

000000001: hello (-11998485)

Did it fail to recognize “hello”? Don’t worry – for some people the recognition results seem to come down to pure luck. Consider yourself lucky if it worked.

Now let’s learn how to specify the grammar, the language that Pocketsphinx will recognize.
This is done with grammar files written in the JSGF format (http://java.sun.com/products/java-media/speech/forDevelopers/JSGF/).

This is a rather simple human-readable text format; it’s probably best to start with an example:

#JSGF V1.0;
grammar goforward;
public <move> = go <direction> <distance> [meter | meters];
<direction> = forward | backward;
<distance>= (one | two | three | four | five | six | seven | eight | nine | ten | twenty)+;
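To build intuition for what this grammar accepts, the same language can be approximated by a regular expression over whitespace-separated words. This is a throwaway Python illustration, not something Pocketsphinx needs:

```python
import re

# Word class from the <distance> rule of the goforward grammar above.
NUMBER = r"(?:one|two|three|four|five|six|seven|eight|nine|ten|twenty)"

# public <move> = go <direction> <distance> [meter | meters];
MOVE = re.compile(
    r"^go (?:forward|backward) "          # <direction>
    + NUMBER + r"(?: " + NUMBER + r")*"   # <distance>: one or more number words
    + r"(?: meters?)?$"                   # optional [meter | meters]
)

print(bool(MOVE.match("go forward two meters")))  # in the language
print(bool(MOVE.match("pizza with pepperoni")))   # not in the language
```

Anything the expression rejects is something the recognizer will never output with this grammar loaded.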

As you can see, the format can specify alternatives, repetitions and optional parts. Basically, JSGF describes
a finite-state automaton for the recognizer. The more restrictive your grammar is, the
better the recognition accuracy will be. But don’t forget to include fillers
and false starts in a real grammar. A user will not say to the device

“Pizza with pepperoni”

They will say instead

“I want, let me think… three pizzas with pepperoni no… with onions”

And your grammar should cover that. Once you’ve created your grammar, store it as
grammar.jsgf. Also, record an audio file at 8kHz mono and name it “myrecording.wav”.

Now, let’s do some programming. To demonstrate how a speech recognition application is created, let’s first try Pocketsphinx with Python. The Python API is really simple; the example is just six lines of code. To recognize speech you need to accomplish three steps, and here they are:

#!/usr/bin/python

# Step 1: initialization
import pocketsphinx as ps
decoder = ps.Decoder(jsgf='/path/to/your/jsgf/grammar.jsgf', samprate='8000')
# Step 2: open the audio file and decode it
fh = open("myrecording.wav", "rb")
nsamp = decoder.decode_raw(fh)
# Step 3: get the result
hyp, uttid, score = decoder.get_hyp()
print "Got result %s %d" % (hyp, score)
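The score that get_hyp returns is a raw log-scale value, so a practical application would gate results against a threshold before acting on them, as discussed above. A minimal sketch of that logic follows; the threshold value is hypothetical and must be tuned against real recordings for your grammar and acoustic conditions:

```python
REJECT_THRESHOLD = -10000  # hypothetical value; tune per application

def handle_result(hyp, score, threshold=REJECT_THRESHOLD):
    """Accept the hypothesis only when the decoder's score clears the
    threshold; otherwise return None so the caller can re-prompt the user."""
    if hyp is None or score < threshold:
        return None  # unreliable result: ask the user to repeat
    return hyp
```

For example, handle_result("go forward two meters", -3000) passes the text through, while a very low score makes the function return None so your interface can ask again or request confirmation.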

Now, let’s do the same in C. It’s not really different from the Python version, just more suitable
for your device.

#include <stdio.h>
#include <pocketsphinx.h>

int
main(int argc, char *argv[])
{
    ps_decoder_t *ps;
    cmd_ln_t *config;
    FILE *fh;
    char const *hyp, *uttid;
    int16 buf[512];
    size_t nsamp;
    int rv;
    int32 score;

    /* Step 1: initialize the configuration and the decoder */
    config = cmd_ln_init(NULL, ps_args(), TRUE,
                         "-samprate", "8000",
                         "-jsgf", "test.jsgf",
                         NULL);
    ps = ps_init(config);

    /* Step 2: open the audio file and feed it into the decoder */
    fh = fopen("myrecording.wav", "rb");
    if (fh == NULL) {
        perror("myrecording.wav");
        return 1;
    }
    rv = ps_start_utt(ps, "goforward");
    while ((nsamp = fread(buf, 2, 512, fh)) > 0)
        rv = ps_process_raw(ps, buf, nsamp, FALSE, FALSE);
    rv = ps_end_utt(ps);

    /* Step 3: get the result and print it */
    hyp = ps_get_hyp(ps, &score, &uttid);
    if (hyp == NULL)
        return 1;
    printf("Recognized: %s with prob %d\n", hyp, ps_get_prob(ps, NULL));

    /* Clean up */
    fclose(fh);
    ps_free(ps);
    return 0;
}

On Linux, compile the demo with a simple command line:

gcc demo.c -o demo `pkg-config --cflags --libs pocketsphinx`

Note that --cflags and --libs each start with two ASCII dashes; if you copy the command from a web page, make sure they haven’t been silently replaced by a single long dash, or the linker will fail with undefined references.

and run

./demo

If it works, it’s ready to be included in your device. Read more about the Pocketsphinx functions
in the API guide:

http://cmusphinx.sourceforge.net/api/pocketsphinx/

Once you are done with the basic examples, it’s time to build your application using Pocketsphinx.
Free your mind when you design it; don’t just focus on simple commands like “turn on lights”. Modern applications include intelligent logic analysis, continuous dictation support and much more.
Be reasonable, design your interface and grammars thoughtfully, think about the user, and your speech
application will be successful.

Still don’t believe it will work? Check the video at the top of the post (http://www.youtube.com/watch?v=OEUeJb6Pwt4) demonstrating Pocketsphinx running on a Nokia N800. For more details on Pocketsphinx, the CMUSphinx project, and speech recognition, visit http://cmusphinx.sourceforge.net


Comments

  1. nebulous says:

    The word in the title should be ‘speech’. Just thought I’d mention it. Looks like good info, will read later (after Holland wins the cup)

  2. mostlymac says:

    Hate to be that guy… but speech is misspelled in the article title.

    As for the article itself, I’m stunned. That’s an amazing piece of software they’ve got going. I’d love to see somebody develop a third-party app for the iPhone that doesn’t “play songs by Beck” when I’m trying to “dial home”.

    I remember when Dragon NaturallySpeaking came out for Windows 95 ages ago. It had terrible accuracy, but with clear articulation and some training (on both ends), it would spit out a decent output. It’s amazing to see how far technology has improved. Now I’m just waiting to step on an elevator and say “Ten Forward”…

  3. Hackaaaaaaaaaaaa says:

    mostlymac .. that requires a very general grammar and is very hard to train.

    It is best if you train on a small grammar likes letters, numbers, and directions.

  4. nave.notnilc says:

    nice post, sphinx is some neat stuff; now I just need to find something to do with it :/

  5. turn.self.off says:

    nice to see the nokia N800 still getting some screen time :)

  6. Mattj says:

    Yeah, it was ahead of it’s time.

  7. normaldotcom says:

    Pocketsphinx is pretty awesome, I’m working on integrating it with my Asterisk install (maybe with some voice-controlled zork).

  8. nsh says:

    > I’m working on integrating it with my Asterisk
    > install (maybe with some voice-controlled zork).

    Hello normaldotcom

    For asterisk integration, please check
    http://scribblej.com/svn/

  9. Taylor Cox says:

    So we could write code say in C code and be able to control our windows or linux desktop or laptop by voice?

  10. Casey O'Donnell says:

    hey i got two n800s except one has a broken screen :( :( :(. they are pretty neat i get a week and a half on battery with ebook reading.

  11. nsh says:

    > So we could write code say in C code and be able to control our windows or linux desktop or laptop by voice?

    Absolutely

  12. strider_mt2k says:

    That’s happening pretty fast for that tablet.
    Nice.

  13. Gottabethatguy says:

    Would it be possible to use this with one of the more powerful microcontrollers? I only need to be able to recognize at most 10 words and I can easily cut that back to 4 words without losing the intended functionality of what I’m trying to develop.

  14. nsh says:

    Gottabethatguy, what kind of microcontroller are you talking about, what are specifications?

    The requirements for HMM-based recognition are still high, but it’s possible to find more lightweight solutions for your case.

  15. Sree Ram says:

    Great ! got me started , but what about decoding for live audio from mic ? any small hint would do :)
    thks

  16. Calin says:

    I’m thinking to do this by using coils from defective hard disks headers. can this be possible?

  17. steve says:

    I have a robot and I want to use Pocketsphinx so I can talk to the robot thing like…where is this room and it will tell me where it is or move foward and it should move forward. Right now I have install pockectsphinx.07 and sphinxbase and when I run using ubuntu 10.04LTS: pocketsphinx_continuous -lm 1998.lm -dict .dict 1998.dic it say READY then listening the when I say something like Good morning it write back Goodmorning….But how do I go from here…how do I use pocketsphinx to allow me to just talk and have what I just said be recorded and send to my robot to move…PLEASE HELP w78steve@gmail.com

  18. leandromattioli says:

    Hi!

    I’m trying to run your examples in Python and C, both give me the following error:

    ERROR: “acmod.c”, line 88: Must specify -mdef or -hmm

    Do you know what’s triggering this problem?

    Thanks in advance.

    • as says:

      have you solve this issue?
      ERROR: “acmod.c”, line 88: Must specify -mdef or -hmm

      my command is
      /usr/local/bin/pocketsphinx_continuous -infile “/var/spool/asterisk/voicemail/default/1111/INBOX/msg0007.wav” -hmm /var/lib/asterisk/communicator -samprate 8000 2

  19. hex says:

    Could not get pocket sphinx to even do remotely relevant speech recognition. All the “matching” text was useless gibberish.

  20. Diego09310 says:

    I installed sphinxbase and pocketsphinx doing ./configure, make and sudo make install. When I run pocketsphinx_continuous it works, but when I try to compile the example, I get: “demo.c:1:26: fatal error: pocketsphinx.h: No such file or directory
    compilation terminated.”
    How can I tell gcc where is pocketsphinx?

    Thank you!

    • Diego09310 says:

      I solved the problem by adding the paths to the .h files:
      gcc `pkg-config pocketsphinx –cflags –libs` -I/home/pi/Instalaciones/voice-recognition/pocketsphinx-0.8/include -I/home/pi/Instalaciones/voice-recognition/sphinxbase-0.8/include/ demo.c -o demo.o
      Now I get a stranger output:
      /tmp/ccXiWv5C.o: In function `main’:
      demo.c:(.text+0x18): undefined reference to `ps_args'
      demo.c:(.text+0x50): undefined reference to `cmd_ln_init'
      demo.c:(.text+0x5c): undefined reference to `ps_init'
      demo.c:(.text+0x84): undefined reference to `ps_start_utt'
      demo.c:(.text+0xd8): undefined reference to `ps_process_raw'
      demo.c:(.text+0xf8): undefined reference to `ps_end_utt'
      demo.c:(.text+0x118): undefined reference to `ps_get_hyp'
      demo.c:(.text+0x140): undefined reference to `ps_get_prob'
      demo.c:(.text+0x164): undefined reference to `ps_free'
      collect2: ld returned 1 exit status

      Does anybody how to solve this?

      Thanks!

  21. Diego09310 says:

    Solved! Just in case somebody gets this error:
    When I copied the comand to compile, the double dash (–) was replaced by a longer dash (em dash? —).

    • fito_segrera says:

      Hi Diego09310, I’m having the same problem you had:

      demo.c:(.text+0x18): undefined reference to `ps_args'
      demo.c:(.text+0x50): undefined reference to `cmd_ln_init'
      demo.c:(.text+0x5c): undefined reference to `ps_init'
      demo.c:(.text+0x84): undefined reference to `ps_start_utt'
      demo.c:(.text+0xd8): undefined reference to `ps_process_raw'
      demo.c:(.text+0xf8): undefined reference to `ps_end_utt'
      demo.c:(.text+0x118): undefined reference to `ps_get_hyp'
      demo.c:(.text+0x140): undefined reference to `ps_get_prob'
      demo.c:(.text+0x164): undefined reference to `ps_free'

      What do you mean by “When I copied the comand to compile, the double dash (–) was replaced by a longer dash (em dash? —).”??

      • Diego09310 says:

        Hi fito_segrera, I think I didn’t explain well (by reading my comment again).

        In the command “gcc `pkg-config pocketsphinx –cflags –libs` demo.c -o demo” you can see that the dash before cflags and libs is longer than the dash between pkg and config or -o. This is because it’s meant to be two dashes “- -” (I introduced an space between them so they appear as two dashes in the comment).

        If you don’t understand me (I’m not being as clear as I’d like to), I suggest you look at the pkg-config example in the wikipedia: http://en.wikipedia.org/wiki/Pkg-config
