Text-to-Speech Model Can Do Music, Background Noises, And Sound Effects

Bark is a universal text-to-audio model that can not only create realistic speech, it can incorporate music, background noises, and sound effects. It can even include non-speech sounds like laughter, sighs, throat clearings, and similar elements. But despite the fact that it can deliver such complex results, it’s important to understand some of the peculiarities.

The model takes a prompt and generates the resulting sound from scratch. Results might sometimes be unexpected.

Bark is not a conventional text-to-speech program, and how it works has a lot more in common with large language model AI chatbots. This means that results can deviate from expectations, and outputs aren’t necessarily going to be studio-quality speech. As the project’s README points out, “(generated outputs can) be anything from perfect speech to multiple people arguing at a baseball game recorded with bad microphones.” That being said, there is some support for voice presets as a way to help guide the model with some consistency.

Bark was designed by a company called Suno for research purposes and is available under the MIT License. It can be installed and run locally, and has some demos available as well as an online implementation.

The ability to install and run Bark locally is promising territory for incorporating it into projects. And should you be more interested in speech-to-text instead, don’t forget about this plain C/C++ implementaion of AI-powered speech recognition.

Bridging A Gap Between LLMs And Programming With TypeChat

By now, large language models (LLMs) like OpenAI’s ChatGPT are old news. While not perfect, they can assist with all kinds of tasks like creating efficient Excel spreadsheets, writing cover letters, asking for music references, and putting together functional computer programs in a variety of languages. One thing these LLMs don’t do yet though is integrate well with existing app interfaces. However, that’s where the TypeChat library comes in, bridging the gap between LLMs and programming.

TypeChat is an experimental MIT-licensed library from Microsoft which sits in between a user and a LLM and formats responses from the AI that are type-safe so that they can easily be plugged back in to the original interface. It does this by generating JSON responses based on user input, making it easier to take the user input directly, run it through the LLM, and then use the output directly in another piece of code. It can be used for things like prototyping prompts, validating responses, and handling errors. It’s also not limited to a single LLM and can be fairly easily modified to work with many of the existing models.

The software is still in its infancy but does hope to make it somewhat easier to work between user inputs within existing pieces of software and LLMs which have quickly become all the rage in the computer science world. We expect to see plenty more tools like this become available as more people take up using these new tools, which have plenty of applications beyond just writing code.

ChatGPT V. The Legal System: Why Trusting ChatGPT Gets You Sanctioned

Recently, an amusing anecdote made the news headlines pertaining to the use of ChatGPT by a lawyer. This all started when a Mr. Mata sued the airline where years prior he claims a metal serving cart struck his knee. When the airline filed a motion to dismiss the case on the basis of the statute of limitations, the plaintiff’s lawyer filed a submission in which he argued that the statute of limitations did not apply here due to circumstances established in prior cases, which he cited in the submission.

Unfortunately for the plaintiff’s lawyer, the defendant’s counsel pointed out that none of these cases could be found, leading to the judge requesting the plaintiff’s counsel to submit copies of these purported cases. Although  the plaintiff’s counsel complied with this request, the response from the judge (full court order PDF) was a curt and rather irate response, pointing out that none of the cited cases were real, and that the purported case texts were bogus.

The defense that the plaintiff’s counsel appears to lean on is that ChatGPT ‘assisted’ in researching these submissions, and had assured the lawyer – Mr. Schwartz – that all of these cases were real. The lawyers trusted ChatGPT enough to allow it to write an affidavit that they submitted to the court. With Mr. Schwartz likely to be sanctioned for this performance, it should also be noted that this is hardly the first time that ChatGPT and kin have been involved in such mishaps.

Continue reading “ChatGPT V. The Legal System: Why Trusting ChatGPT Gets You Sanctioned”

Leaked Internal Google Document Claims Open Source AI Will Outcompete Google And OpenAI

In the world of large language models (LLM), the focus has for the longest time been on proprietary technologies from companies such as OpenAI (GPT-3 & 4, ChatGPT, etc.) as well as increasingly everyone from Google to Meta and Microsoft. What’s remained underexposed in this whole discussion about which LLM will do more things better are the efforts by hobbyists, unaffiliated researchers and everyone else you may find in Open Source LLM projects. According to a leaked document from a researcher at Google (anonymous, but apparently verified), Google is very worried that Open Source LLMs will wipe the floor with both Google’s and OpenAI’s efforts.

According to the document, after the open source community got their hands on the leaked LLaMA foundation model, motivated and highly knowledgeable individuals set to work to take a fairly basic model to new levels where it could begin to compete with the offerings by OpenAI and Google. Major innovations are the scaling issues, allowing these LLMs to work on far less powerful systems (like a laptop or even smartphone).

An important factor here is Low-Rank adaptation (LoRa), which massively cuts down the effort and resources required to train a model. Ultimately, as this document phrases it, Google and in extension OpenAI do not have a ‘secret sauce’ that makes their approaches better than anything the wider community can come up with. Noted is also that essentially Meta has won out here by having their LLM leak, as it has meant that the OSS community has been improving on the Meta foundations, allowing Meta to benefit from those improvements in their products.

The dire prediction is thus that in the end the proprietary LLMs by Google, OpenAI and others will cease to be relevant, as the open source community will have steamrolled them into fine, digital dust. Whether this will indeed work out this way remains to be seen, but things are not looking up for proprietary LLMs.

(Thanks to [Mike Szczys] for the tip)

ChatGPT Makes A 3D Model: The Secret Ingredient? Much Patience

ChatGPT is an AI large language model (LLM) which specializes in conversation. While using it, [Gil Meiri] discovered that one way to create models in FreeCAD is with Python scripting, and ChatGPT could be encouraged to create a 3D model of a plane in FreeCAD by expressing the model as a script. The result is just a basic plane shape, and it certainly took a lot of guidance on [Gil]’s part to make it happen, but it’s not bad for a tool that can’t see what it is doing.

The first step was getting ChatGPT to create code for a 10 mm cube, and plug that in FreeCAD to see the results. After that basic workflow was shown to work, [Gil] asked it to create a simple airplane shape. The resulting code had objects for wing, fuselage, and tail, but that’s about all that could be said because the result was almost — but not quite — completely unlike a plane. Not an encouraging start, but at least the basic building blocks were there. Continue reading “ChatGPT Makes A 3D Model: The Secret Ingredient? Much Patience”

Chatting With Local AI Moves Directly In-Browser, Thanks To Web LLM

Large Language Models (LLM) are at the heart of natural-language AI tools like ChatGPT, and Web LLM shows it is now possible to run an LLM directly in a browser. Just to be clear, this is not a browser front end talking via API to some server-side application. This is a client-side LLM running entirely in the browser.

The ability to run an LLM (natural language AI) directly in-browser means more ways to implement local AI while enjoying GPU acceleration via WebGPU.

Running an AI system like an LLM locally usually leverages the computational abilities of a graphics card (GPU) to accelerate performance. This is true when running an image-generating AI system like Stable Diffusion, and it’s also true when implementing a local copy of an LLM like Vicuna (which happens to be the model implemented by Web LLM.) The thing that made Web LLM possible is WebGPU, whose release we covered just last month.

WebGPU provides a way for an in-browser application to talk to a local GPU directly, and it sure didn’t take long for someone to get the idea of using that to get a local LLM to run entirely within the browser, complete with GPU acceleration. This approach isn’t just limited to language models, either. The same method has been applied to successfully create Web Stable Diffusion as well.

It’s a fascinating (and fast) development that opens up new possibilities and, hopefully, gives people some new ideas. Check out Web LLM’s GitHub repository for a closer look, as well as access to an online demo.

Peering Down Into Talking Ant Hill

Watching an anthill brings an air of fascination. Thousands of ants are moving about and communicating with other ants as they work towards a goal as a collective whole. For us humans, we project a complex inner world for each of these tiny creatures to drive the narrative. But what if we could peer down into a miniature world and the ants spoke English? (PDF whitepaper)

Researchers at the University of Stanford and Google Research have released a paper about simulating human behavior using multiple Large Language Models (LMM). The simulation has a few dozen agents that can move across the small town, do errands, and communicate with each other. Each agent has a short description to help provide context to the LLM. In addition, they have memories of objects, other agents, and observations that they can retrieve, which allows them to create a plan for their day. The memory is a time-stamped text stream that the agent reflects on, deciding what is important. Additionally, the LLM can replan and figure out what it wants to do.

The question is, does the simulation seem life-like? One fascinating example is the paper’s authors created one agent (Isabella) intending to have a Valentine’s Day party. No other information is included. But several agents arrive at the character’s house later in the day to party. Isabella invited friends, and those agents asked some people.

A demo using recorded data from an earlier demo is web-accessible. However, it doesn’t showcase the powers that a user can exert on the world when running live. Thoughts and suggestions can be issued to an agent to steer their actions. However, you can pause the simulation to view the conversations between agents. Overall, it is incredible how life-like the simulation can be. The language of the conversation is quite formal, and running the simulation burns significant amounts of computing power. Perhaps there can be a subconscious where certain behaviors or observations can be coded in the agent instead of querying the LLM for every little thing (which sort of sounds like what people do).

There’s been an exciting trend of combining LLMs with a form of backing store, like combining Wolfram Alpha with chatGPT. Thanks [Abe] for sending this one in!