Text-to-Speech Model Can Do Music, Background Noises, And Sound Effects

July 24, 2023

Bark is a universal text-to-audio model that can not only create realistic speech but also incorporate music, background noises, and sound effects. It can even include non-speech sounds like laughter, sighs, and throat clearing. But although it can deliver such complex results, it’s important to understand some of its peculiarities.

The model takes a prompt and generates the resulting sound from scratch. Results might sometimes be unexpected.

Bark is not a conventional text-to-speech program, and how it works has a lot more in common with large language model AI chatbots. This means that results can deviate from expectations, and outputs aren’t necessarily going to be studio-quality speech. As the project’s README points out, “(generated outputs can) be anything from perfect speech to multiple people arguing at a baseball game recorded with bad microphones.” That being said, there is some support for voice presets as a way to help guide the model with some consistency.

Bark was designed by a company called Suno for research purposes and is available under the MIT License. It can be installed and run locally, and has some demos available as well as an online implementation.

The ability to install and run Bark locally is promising territory for incorporating it into projects. And should you be more interested in speech-to-text instead, don’t forget about this plain C/C++ implementation of AI-powered speech recognition.

8 thoughts on “Text-to-Speech Model Can Do Music, Background Noises, And Sound Effects”

Robert Chadwick says:

July 24, 2023 at 11:55 am

Perfect for robotic telemarketers that sound more and more like real people.

1. TG says:
  
  July 24, 2023 at 12:21 pm
  
  People out there still answer their phone?
  
  1. The Commenter Formerly Known As Ren says:
    
    July 24, 2023 at 1:41 pm
    
    (Chuckle!)
    
The Commenter Formerly Known As Ren says:

July 24, 2023 at 1:43 pm

Give it some ChatGPT poetry, set it to sing, and you might have a number one on the Billboard charts.

Elliot Williams says:

July 25, 2023 at 3:16 am

Most of the output of these voice models has a high-frequency hash over the top of it. Does anyone know what’s causing that?

1. Dan says:
  
  July 25, 2023 at 4:29 am
  
  Authenticity verification of the original speaker it was sampled from maybe?
  
Daniel says:

July 25, 2023 at 5:00 am

:D I asked Bark to read a phrase where the order of the words and numbers differs in spoken German from the one in written German (“Es ist 1:43 Uhr” has to be read as “Es ist 1 Uhr 43”). It managed to use the correct word order but said 46 instead of 43.

GenTooMan says:

July 25, 2023 at 9:53 pm

Do read the commentary: when they say 12G for GPU memory, it’s not a joke. Further, there is a 10.5G download involved with the speech models. It requires > 16G of RAM on your machine, or an OOM error will occur and the application will be nuked. You may also need to be sure nothing much is running before starting.
My guess is you need 32G of main RAM and 4G or more GPU memory if your GPU supports CUDA. Fairly put, CUDA is proprietary and not an open standard, so that means you have to have an NVIDIA card for GPU processing support. No mention of the open standard OpenCL was made.
