Text-to-Speech Model Can Do Music, Background Noises, And Sound Effects

Bark is a universal text-to-audio model that not only creates realistic speech but can also incorporate music, background noises, and sound effects. It can even include non-speech sounds like laughter, sighs, and throat clearing. But although it can deliver such complex results, it has some peculiarities that are important to understand.

The model takes a prompt and generates the resulting sound from scratch. Results might sometimes be unexpected.
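To give a sense of what such a prompt can look like, here is a minimal sketch of a prompt-building helper. The bracketed cues and the ♪ convention for sung lyrics are taken from the project's README, where they are described as hints the model may or may not follow. The helper names (`with_cue`, `as_song`) and the dictionary are hypothetical and not part of Bark.

```python
# Hypothetical helpers for composing Bark text prompts; not part of Bark.
# The cue strings follow the Bark README. The model treats them as hints,
# so the generated audio may not always honor them.
NON_SPEECH_CUES = {
    "laughter": "[laughter]",
    "sigh": "[sighs]",
    "throat": "[clears throat]",
    "music": "[music]",
}


def with_cue(text: str, cue: str) -> str:
    """Return the text followed by the inline marker for a non-speech sound."""
    return f"{text} {NON_SPEECH_CUES[cue]}"


def as_song(lyrics: str) -> str:
    """Wrap lyrics in musical notes, which nudges the model toward singing."""
    return f"♪ {lyrics} ♪"
```

Calling `with_cue("Well, that was unexpected.", "laughter")` yields a plain string that can be passed to the model as its prompt; nothing here touches Bark itself.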

Bark is not a conventional text-to-speech program, and how it works has a lot more in common with large language model AI chatbots. This means that results can deviate from expectations, and outputs aren’t necessarily going to be studio-quality speech. As the project’s README points out, “(generated outputs can) be anything from perfect speech to multiple people arguing at a baseball game recorded with bad microphones.” That being said, there is some support for voice presets as a way to help guide the model with some consistency.
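As a rough illustration of guiding the model with a voice preset, the sketch below wraps the calls shown in the project's README (`preload_models`, `generate_audio` with its `history_prompt` argument, and `SAMPLE_RATE`). It assumes the `bark` package and SciPy are installed. The `speak` and `preset_name` helpers are hypothetical, and the `v2/<language>_speaker_<n>` naming pattern is an assumption based on the presets listed in the README.

```python
from typing import Optional


def preset_name(language: str, index: int) -> str:
    """Build a preset identifier such as "v2/en_speaker_6" (assumed pattern)."""
    return f"v2/{language}_speaker_{index}"


def speak(text: str, out_path: str, voice_preset: Optional[str] = None) -> str:
    """Generate audio for `text` with Bark and save it as a WAV file."""
    # Imported here so this file still loads on machines without Bark installed.
    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    # Downloads (on first use) and loads the models; this can take a while.
    preload_models()

    # history_prompt=None lets the model pick a voice on its own;
    # a preset string nudges it toward a consistent speaker.
    audio_array = generate_audio(text, history_prompt=voice_preset)

    write_wav(out_path, SAMPLE_RATE, audio_array)
    return out_path
```

A call such as `speak("Hello there. [laughs]", "hello.wav", preset_name("en", 6))` would write the result to `hello.wav`. Even with a preset, outputs can vary from run to run, so it is worth generating a few takes.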

Bark was designed by a company called Suno for research purposes and is available under the MIT License. It can be installed and run locally, and has some demos available as well as an online implementation.

The ability to install and run Bark locally is promising territory for incorporating it into projects. And should you be more interested in speech-to-text instead, don’t forget about this plain C/C++ implementation of AI-powered speech recognition.

8 thoughts on “Text-to-Speech Model Can Do Music, Background Noises, And Sound Effects”

  1. :D I asked Bark to read a phrase where the order of the words and numbers differs in spoken German from the one in written German (“Es ist 1:43 Uhr” has to be read as “Es ist 1 Uhr 43”). It managed to use the correct word order but said 46 instead of 43.

  2. Do read the commentary: when they say 12G of GPU memory, it’s not a joke. Further, there is a 10.5G download involved with the speech models. It requires more than 16G of RAM on your machine, or an OOM error will occur and the application will be killed. You may also need to make sure nothing much else is running before starting.
    My guess is you need 32G of main RAM and 4G or more of GPU memory if your GPU supports CUDA. Put fairly, CUDA is proprietary and not an open standard, which means you have to have an NVIDIA card for GPU processing support. No mention was made of the open standard OpenCL.
