Here’s A Plain C/C++ Implementation Of AI Speech Recognition, So Get Hackin’

November 27, 2022

[Georgi Gerganov] recently shared a great resource for running high-quality AI-driven speech recognition in a plain C/C++ implementation on a variety of platforms. The automatic speech recognition (ASR) model is fully implemented using only two source files and requires no dependencies. As a result, the high-quality speech recognition doesn’t involve calling remote APIs, and can run locally on different devices in a fairly straightforward manner. The image above shows it running locally on an iPhone 13, but it can do more than that.

Implementing a robust speech transcription that runs locally on a variety of devices is much easier with [Georgi]’s port of OpenAI’s *Whisper*.

[Georgi]’s work is a port of OpenAI’s Whisper model, a remarkably robust piece of software that does a truly impressive job of turning human speech into text. Whisper is easy to set up and play with, but this port makes it easier to get the system working in other ways. Having such a lightweight implementation of the model means it can be more easily integrated into a variety of different platforms and projects.

The usual way that OpenAI’s Whisper works is to feed it an audio file, and it spits out a transcription. But [Georgi] shows off something else that might start giving hackers ideas: a simple real-time audio input example.

By using a tool to stream audio and feed it to the system every half-second, one can obtain pretty good (sort of) real-time results! This of course isn’t an ideal method, but the robustness and accuracy of Whisper are such that the results look pretty great nevertheless.
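To make the half-second idea concrete, here is a minimal C++ sketch of the buffering side of such a loop. It is not [Georgi]’s code and does not use the project’s API: `StreamingWindow`, `Transcriber`, and the 2-second context window are hypothetical names and values picked for illustration, and the 16 kHz sample rate in the usage note is likewise an assumption. The transcription call is a stand-in callback, so any speech-to-text function could be plugged in.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a speech-to-text call (NOT the project's real API).
using Transcriber = std::function<std::string(const std::vector<float>&)>;

// Keeps the most recent `window_ms` of audio and invokes the transcriber
// once for every `step_ms` of newly received samples.
class StreamingWindow {
public:
    StreamingWindow(std::size_t sample_rate, std::size_t step_ms,
                    std::size_t window_ms, Transcriber transcribe)
        : step_samples_(sample_rate * step_ms / 1000),
          window_samples_(sample_rate * window_ms / 1000),
          transcribe_(std::move(transcribe)) {
        assert(step_samples_ > 0);
        assert(window_samples_ >= step_samples_);
    }

    // Feed freshly captured samples; returns one result per completed step.
    std::vector<std::string> push(const std::vector<float>& samples) {
        std::vector<std::string> results;
        for (float s : samples) {
            window_.push_back(s);
            if (window_.size() > window_samples_) {
                window_.pop_front();  // drop audio older than the window
            }
            if (++pending_ == step_samples_) {
                pending_ = 0;
                // Copy the current window into contiguous storage for the callback.
                std::vector<float> snapshot(window_.begin(), window_.end());
                results.push_back(transcribe_(snapshot));
            }
        }
        return results;
    }

private:
    std::size_t step_samples_;
    std::size_t window_samples_;
    std::size_t pending_ = 0;
    std::deque<float> window_;
    Transcriber transcribe_;
};
```

In use, one would construct something like `StreamingWindow w(16000, 500, 2000, my_transcribe);` and call `w.push(chunk)` from the audio capture loop, where `my_transcribe` is whatever function wraps the actual model. Re-transcribing an overlapping window on every step is wasteful, which fits the article’s point that this isn’t an ideal method, but it keeps the logic simple.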

You can watch a quick demo of that in the video just under the page break. If it gives you some ideas, head over to the project’s GitHub repository and get hackin’!

19 thoughts on “Here’s A Plain C/C++ Implementation Of AI Speech Recognition, So Get Hackin’”

RW ver 0.0.3 says:

November 27, 2022 at 5:45 pm

Local ! Yisssss…. now to dig out my IBM Home Director lol

1. The Commenter Formerly Known As Ren says:
  
  November 27, 2022 at 6:47 pm
  
  Yeah, mine’s been buried for a decade or so…
  
PtaQ PLYTP says:

November 27, 2022 at 6:03 pm

I remember when some folks made an AI that learned the voices of the Polish voice actors from the game Gothic II. Then they used this AI to dub very vulgar adult movies. The original voice actors got so angry that they sued for damages and even got those poor nerds imprisoned for improper use of AI.

Charlie says:

November 27, 2022 at 6:11 pm

That’s a pretty impressive transcription – at least as good as what I would do as a human, and a damn sight faster!

Eifel says:

November 27, 2022 at 7:44 pm

That’s so awesome. I’ve always wanted a voice controlled computer assistant, but I will never, ever install some closed source, monetized, “we send everything you say to our servers” wiretap.

Maybe this technology has finally disseminated enough that it can be done without a PhD in artificial intelligence.

1. Michael Viejo-Robles says:
  
  November 28, 2022 at 2:06 pm
  
  Check your pocket.
  
  1. Nick says:
    
    March 8, 2023 at 7:39 pm
    
    I assume you’re talking about a smartphone. If it’s “in your pocket” as you say, then it’s unlikely to be able to hear very clearly, especially with its constraints on bandwidth, CPU and ultimately power. I get your point, but a smartphone is very different from Alexa or Google Home. Not to mention it’s still possible to walk around without a smartphone, or even a phone at all!
    
𐂀 𐂅 says:

November 27, 2022 at 9:25 pm

Works like a charm on Debian 11 running on an AMD Ryzen 7 5700G. I tested all of the examples and it was perfect, not a single issue. Many thanks to all those involved in the project.

1. nes says:
  
  November 30, 2022 at 2:30 pm
  
  Wish I could say the same for the AVX2 version running on an 8th-gen Core i7. Consistently 10x slower than the Apple ARM metrics given in the README. Real-time dictation = forget it! Still a nice piece of work. Maybe it’s a sign I should switch to an M2.
  
  1. 𐂀 𐂅 says:
    
    December 1, 2022 at 2:43 pm
    
    Try building it under Clear Linux? That is Intel optimised. I forgot to mention I am using the 6.1.0-0-amd64 Linux kernel. Also play with the -t option; I found 8 threads was optimal even though I have 16 cores.
    
tetsuoii says:

November 27, 2022 at 11:30 pm

Simple C libs and progs are the right way to make software. I’ll check this out.

Martin says:

November 28, 2022 at 3:18 am

Ingenious work ! Thank you. Great toy for further playing with.

Olivier says:

November 28, 2022 at 3:50 am

That’s awesome! Thank you [Georgi] (and OpenAI) for the work & HaD for sharing!!

Andrew says:

November 28, 2022 at 10:26 am

Ok, how long till it gets wrapped up as a Python library :D

Steve Takach (@stakach) says:

November 28, 2022 at 4:23 pm

Would be cool to compile this targeting WASM and then do real-time subtitles on videos in the browser. Complete edge processing.

a says:

November 29, 2022 at 4:34 am

Yeah, please, a plugin for VLC or whatever would be perfect for people with hearing problems so they can enjoy their old family videos.
My mom could use this so much.

1. Sandro says:
  
  November 30, 2022 at 6:05 pm
  
  Great idea, but I don’t know about anyone else, but my family videos have a lot of people talking over each other. ;-)
  
  1. fid says:
    
    November 30, 2022 at 6:56 pm
    
    My family home videos are super-8 films on 7″ reels.
    
Samarth says:

September 22, 2023 at 4:24 am

How did you accomplish this? Do you have the code for this? I really need the code to understand it, because I have a minor project coming up that is based on the same model you have shown here. Please help me if possible.
