You might have heard about LLaMa or maybe you haven’t. Either way, what’s the big deal? It’s just some AI thing. In a nutshell, LLaMa is important because it allows you to run large language models (LLMs) like GPT-3 on commodity hardware. In many ways, this is a bit like Stable Diffusion, which similarly allowed normal folks to run image generation models on their own hardware with access to the underlying source code. We’ve discussed why Stable Diffusion matters and even talked about how it works.
LLaMa is a collection of transformer language models from Facebook/Meta research, ranging from 7 billion to 65 billion parameters and trained on publicly available datasets. Their research paper showed that the 13B version outperformed GPT-3 in most benchmarks, and LLaMa-65B is right up there with the best of them. LLaMa was unique in that inference could be run on a single GPU, thanks to some optimizations made to the transformer itself and the model being about 10x smaller. While Meta recommended that users have at least 10 GB of VRAM to run inference on the larger models, that’s a huge step from the 80 GB A100 cards that often run these models.
While this was an important step forward for the research community, it became a huge one for the hacker community when [Georgi Gerganov] rolled in. He released llama.cpp on GitHub, which runs inference on a LLaMa model with 4-bit quantization. His code was focused on running LLaMa-7B on your MacBook, but we’ve seen versions running on smartphones and Raspberry Pis. There’s even a version written in Rust! A rough rule of thumb is that anything with more than 4 GB of RAM can run LLaMa. Model weights are available through Meta with some rather strict terms, but they’ve been leaked online and can be found even in a pull request on the GitHub repo itself.
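To make the 4-bit idea concrete, here is a minimal sketch of block-wise quantization along with a back-of-the-envelope memory estimate. It is an illustration only, not llama.cpp's actual format: the block size, the symmetric [-7, 7] code range, and the function names are all assumptions chosen for readability. Real formats also store a scale per block, so actual files come out somewhat larger than the raw estimate.

```python
# Illustrative sketch only: NOT llama.cpp's real quantization format.
# BLOCK_SIZE and all function names are assumptions made for this example.

BLOCK_SIZE = 32  # assumed number of weights that share one scale factor


def quantize_block(weights):
    """Map floats to integer codes in [-7, 7] plus one float scale per block."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Each code fits in 4 bits; clamp guards against float rounding at the edge.
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from the 4-bit codes."""
    return [scale * c for c in codes]


def weight_storage_gb(n_params, bits_per_weight):
    """Raw weight storage in GB (10**9 bytes), ignoring per-block scales."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    scale, codes = quantize_block([0.7, -0.35, 0.0, 0.1])
    print(scale, codes, dequantize_block(scale, codes))
    print(weight_storage_gb(7e9, 4))   # 3.5
    print(weight_storage_gb(7e9, 16))  # 14.0
```

Seven billion weights at 4 bits each come to about 3.5 GB before overhead, versus 14 GB at 16 bits, which is roughly where the 4 GB rule of thumb comes from.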
Aside from occasionally funny and quirky projects, how does having a local GPT-3-like chatbot impact us? The simple fact is that it is accessible to hackers. Not only can you run it, but the code is available, and the models are trained on publicly available data, so you could train your own, though it took 21 days on 2048 A100 GPUs (roughly a million GPU-hours). It’s also useful enough to provide reasonable output. Stanford even released a version called Alpaca, which is LLaMa-7B fine-tuned for instruction following, elevating it from a simple chatbot to a bot able to follow instructions. There is even a guide on how to replicate Alpaca yourself for less than $100 of cloud computing.
Of course, like most current LLMs, LLaMa suffers from the same problems of hallucination, bias, and stereotypes. When asked to generate code, it can try to request endpoints that don’t exist. When asked what the capital of Tanzania is, it will reply Dar es Salaam instead of Dodoma. Researchers haven’t solved the problem of trying to secure a black box, as it is still far too easy to get the model to do something its creators tried hard to prevent.
While it is incredible to think that just a few weeks ago it would have been ridiculous to imagine running a GPT-3 level model on your personal laptop, this ultimately raises the question: what will we do with this? The easy answer is sophisticated spam. Long term, there are concerns that large language models could replace programmers and writers, though that is still up for debate. For writing or tweaking small programs, it is already quite good, as [Simon Willison] demonstrated by asking it to generate some AppleScript. Still, being able to spit out an accurate answer to a question does not a human make. What do you do with the raw sort of bias-confused amorphous intelligence that is ChatGPT and other LLMs now running locally?
Rather than connecting to an API, the Raspberry Pi inside of this old typewriter can run it entirely locally with no internet connection required. Because the model is smaller, it becomes much easier to fine-tune for your use case. By taking a bunch of dialog from a TV show (let’s say the Simpsons) you could fine-tune the model to respond like a character from the show. Looking further into the future, there is an excellent paper called ReAct that tries to put something like an internal dialog into ChatGPT by asking it to output questions, thoughts, and actions. A good example might be this:
Question: How much bigger is the land in Seattle, Washington versus the water?
Thought: I need to use Wikipedia to look up the square footage of the city area and the water
Action: search_wikipedia: Seattle, WA
• City 142.07 sq mi (367.97 km²)
• Land 83.99 sq mi (217.54 km²)
• Water 58.08 sq mi (150.43 km²)
• Metro 8,186 sq mi (21,202 km²)
Thought: The city is 142.07 square miles and the water is 58.08 square miles, I should calculate a ratio.
Action: calculate: 142.07 / 58.08
Answer: The land is 2.4x the size of the water or 83.99 square miles bigger
You can see how this forms a loop where complex actions can be broken down to be performed by simplified helpers, like searching Wikipedia, calling APIs, controlling smart appliances, or actuating motors. Google has been experimenting with the concept in their PaLM-SayCan system, which uses an LLM (PaLM) to break a request down into smaller tasks.
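The thought/action loop can be sketched in a few lines. This is only a rough illustration, not the ReAct paper's implementation: the "model" is a scripted stand-in for an LLM, and search_wikipedia and calculate are hypothetical helpers that return canned data rather than calling any real API. The scripted model divides the land area by the water area, since the city figure is the sum of the two.

```python
# Illustrative sketch of a ReAct-style loop. The model is a scripted stand-in
# and both tools are hypothetical helpers, not real APIs.

def search_wikipedia(query):
    """Hypothetical lookup tool; returns canned Seattle figures in sq mi."""
    return {"land_sq_mi": 83.99, "water_sq_mi": 58.08}


def calculate(expression):
    """Hypothetical calculator tool supporting only 'a / b' and 'a - b'."""
    left, op, right = expression.split()
    a, b = float(left), float(right)
    return a / b if op == "/" else a - b


TOOLS = {"search_wikipedia": search_wikipedia, "calculate": calculate}


def scripted_model(history):
    """Stand-in for an LLM: choose the next step from the observations so far."""
    observations = [entry[1] for entry in history if entry[0] == "observation"]
    if len(observations) == 0:
        return ("action", "search_wikipedia", "Seattle, WA")
    if len(observations) == 1:
        area = observations[0]
        expression = f"{area['land_sq_mi']} / {area['water_sq_mi']}"
        return ("action", "calculate", expression)
    return ("answer", f"The land is about {observations[1]:.2f}x the size of the water")


def react_loop(question, model, tools, max_steps=5):
    """Alternate model steps and tool calls until the model gives an answer."""
    history = [("question", question)]
    for _ in range(max_steps):
        step = model(history)
        if step[0] == "answer":
            return step[1]
        _, tool_name, tool_input = step
        # Run the requested tool and feed its result back as an observation.
        history.append(("observation", tools[tool_name](tool_input)))
    return None  # gave up without an answer


if __name__ == "__main__":
    question = "How much bigger is the land in Seattle, Washington versus the water?"
    print(react_loop(question, scripted_model, TOOLS))
```

Swapping the scripted model for a real LLM call and the canned helpers for real lookups leaves the loop itself unchanged, which is the appeal of the pattern.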
We can see LLaMa powering NPCs in video games, optimizing blog titles, and controlling robots. So understandably, we’re quite curious to see what you all do with it. One thing is for sure, though. Putting this in the hands of creative hackers is going to be fun.
26 thoughts on “Why LLaMa Is A Big Deal”
Winamp, it really whips the llama’s ass!
(sorry but with this title image I couldn’t resist.)
Every time I see the word llama I hear that sound clip and then it’s 1999 all over again.
Audio players without suggestive algorithms and ads? BLASPHEMY
I never thought I’d be old enough to say this… I miss the “old days” when llamas’ asses were whipped and 192bit VBR was “luxury.”
“Researchers haven’t solved the problem of trying to secure a black box, as it is still far too easy to get the model to do something its creators tried hard to prevent.”
Maybe we can look inside the black box.
Er, WRONG ANSWER!
Wikipedia says (quoted from the article, not checked against the source):
• City 142.07 sq mi (367.97 km²)
• Land 83.99 sq mi (217.54 km²)
• Water 58.08 sq mi (150.43 km²)
City (142.07 sq mi) is the sum of Land (83.99 sq mi) and Water (58.08 sq mi), so the answer should be that the Land is 1.45× the size of Water or larger by 25.91 sq mi.
This illustrates one of the insidious dangers of this kind of “AI”: it’ll give you answers that are clear, authoritative, and backed up with references and citations.
And it’s all wrong.
It’s just making up references and citations that sound right; that fit the pattern.
Oh, wait… it used the city area instead of the land, which is _larger_.
Maybe it tried to interpret the question and the meaning of “land”…?
The issue is that this software doesn’t *interpret* at all, it just probabilistically transforms. So if the training data used a word in different senses, it’s basically a crap shoot.
Maybe they used Bard on accident. xD
The Seattle land/water answer is wrong. The answer should be land (83.99 sq mi) divided by water (58.08) or ~1.4, not total (83.99 + 58.08 = 142.07) divided by water.
You can see how this forms a loop where complex actions can be broken down INTO INCORRECT STEPS
It uses the city area which is strangely larger than the land and even says so in its reasoning.
Maybe it should use the carpenter’s rule, “Measure twice, cut once”, or in this case two sources.
Great article. Inference for Pi and Smartphones is still quite slow. I couldn’t find any rates for what a laptop can do.
Unfortunately training is still prohibitively expensive.
Once training cost comes down to reasonable prices, independent devs can experiment, too.
Sounds like they should run optimizations BEFORE using weeks of supercomputer power to train their models.
‘LLaMa is a transformer language model from Facebook/Meta research, which is a collection of large models from 7 billion to 65 billion parameters trained on publicly available datasets’
This is kind of worrying to me. Just what is this “publicly available data”? Could it be all that Facebook/Twitter etc. chatter that we all know is all so full of actual facts and un-doctored images, 100% accurate, written by experts in their fields and fully verified against reputable sources?
I quote (loudly) the simple statement I learned in my early computing days, GIGO…Garbage In, Garbage Out.
This is worrying! Facebook and so on are full of garbage, so if you train “AI” with this what will you get??
Tay. Ask Microsoft about that.
“there is an excellent paper called ReAct that tries to put something like an internal dialog into chat GPT by asking it to output questions, thoughts, and actions.” Do you have a link for ReAct? A Google search doesn’t turn up anything. Thanks! -JJB
Training needs 80 GB A100 cards
Prediction runs on a single GPU
Maybe learn how machine learning works before making dumb comments.
I reckon the hallucination problem will be solved pretty quickly.
Not enough models have an output for “I don’t know”, I don’t think that was popularised until recently. And being able to determine the context is important too; the ability to hallucinate is incredibly useful for generating fiction, poetry etc. But in the context of a question requiring a factual answer it obviously does not help at all.
It’s “what is the capital of New Zealand” and “if unicorns existed, what would the rough electrical conductivity of their horns be” (reference: most horn coverings are keratin, electrical conductivity of keratin) versus “how many unicorns are alive in the world today?” (answer is realistically 0, hallucination may mean >0).
But determining the context is really hard – I think when we look at these language models being “stupid” we forget just how confident many humans are in answering questions they know nothing about, inventing facts, parroting falsities. “If New Zealand existed, why would it have three islands?” to which chatgpt actually replies to point out that: NZ exists, that it technically has more than three islands, but that the north/south/Rakiura could be considered the main 3. Not too shabby!
How much energy does such a model require for a single… some single unitary operation?