LLMs (Large Language Models) for local use are usually distributed as a set of weights in a multi-gigabyte file. These cannot be directly used on their own, which generally makes them harder to distribute and run compared to other software. A given model can also have undergone changes and tweaks, leading to different results if different versions are used.
To help with that, Mozilla’s innovation group have released llamafile, an open source method of turning a set of weights into a single binary that runs on six different OSes (macOS, Windows, Linux, FreeBSD, OpenBSD, and NetBSD) without needing to be installed. This makes it dramatically easier to distribute and run LLMs, and it ensures that a particular version of an LLM remains consistent and reproducible, forever.
This wouldn’t be possible without the work of [Justine Tunney], creator of Cosmopolitan, a build-once-run-anywhere framework. The other main part is llama.cpp, and we’ve covered why it is such a big deal when it comes to running self-hosted LLMs.
There are some sample binaries available using the Mistral-7B, WizardCoder-Python-13B, and LLaVA 1.5 LLMs. Just keep in mind that if you’re on a Windows platform, only the LLaVA 1.5 will run, because it’s the only one that squeaks under the 4 GB limit on executable files that Windows has. If you run into issues, check out the gotchas list for troubleshooting tips.
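The Windows restriction boils down to a size comparison. As a rough sketch, a check before shipping or downloading a llamafile might look like the following. The helper names are hypothetical, and treating the "4 GB" figure as 4 GiB (2**32 bytes) is an assumption, since the exact threshold is not spelled out here.

```python
import os

# Assumption: the "4 GB" executable limit is taken as 4 GiB (2**32 bytes).
WINDOWS_EXE_LIMIT_BYTES = 4 * 1024**3


def fits_windows_limit(size_bytes: int) -> bool:
    """Return True if a file of this size is under the assumed limit."""
    return size_bytes < WINDOWS_EXE_LIMIT_BYTES


def file_fits_windows_limit(path: str) -> bool:
    """Apply the same check to a file on disk (hypothetical helper)."""
    return fits_windows_limit(os.path.getsize(path))
```

Anything that fails the check would need a different approach on Windows, such as keeping the weights in a separate file.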
Justine Tunney is truly a treasure. Everything she does is fascinating.
My ZOTAC 3060 died. I shall have to suffer without my LLMs, until they RMA the card and replace it.
Thanks for letting us know. Now we can die peacefully.
Ollama works without GPU.
I found Ollama to be really easy to use, but slow on a crappy old O-Della desktop.
It’s just a wrapper around llama.cpp. Why not use llama.cpp directly instead?
We’re waiting on the SerenityOS version.
Just what we needed, you’re a treasure. Keep up the good work 👍
A very good project.
I’ve been trying to keep up with the AI revolution, and my system is littered with dozens of packages and frameworks needed to run some of the AI systems. Python is a nice language, but the library and language version inconsistencies are so bad that you now have to install a package (conda) that sets up the specific language environment versions for any AI thing you want to run.
This seems to be a problem in the scientific community, where results are calculated using a certain version of Python and libraries, and then 5 years later no one can reproduce the results (of a published paper) because older binaries are no longer available.
It’s also a problem with basic Linux: if you happen to be using a major distro such as Mint, the newest version of a library might not be in the current release, so you may need to wait 6 months in order to run something natively… or you could try adding the repo for that one thing you need, do the install, and hope it doesn’t crash your system.
Which is why people are distributing flatpaks for programs now, which are programs with statically compiled libraries so that everything you need is in one really huge executable. I like the convenience of not having to spend hours installing new things just to run an application, but boy those apps are huge!
Anyway, this looks like a very good project that will reduce some of the friction of using AI.
We did statically linked kernels and programs back in the day, specifically for this reason. We’d get consistent results and reliability that way. Bigger, for sure, but we knew it was going to “behave” in prod.
Dynamic linking was a mistake
Imagine considering mint a major distro. Lol.
That 4GiB limit could have been circumvented if the data were appended to the executable (after the end of the PE file, as specified in the PE headers). The PE loader will simply ignore the additional data/size.
Worst case, extract the small exe on the fly, and let it read from the large archive file, without extracting the data.
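The append-after-the-executable idea can be illustrated with a small sketch. This is not a PE parser and it does not show that Windows would accept such a file. It only shows the general pattern of tacking a payload onto a stub and finding it again through a footer. The `OVLY` marker and the footer layout are made up for illustration.

```python
import struct

MAGIC = b"OVLY"  # hypothetical marker, not part of any real file format
FOOTER = struct.Struct("<4sQ")  # marker + 64-bit little-endian payload length


def append_payload(stub: bytes, payload: bytes) -> bytes:
    """Return stub + payload + footer recording the payload length."""
    return stub + payload + FOOTER.pack(MAGIC, len(payload))


def extract_payload(blob: bytes) -> bytes:
    """Locate the payload by reading the footer at the end of the blob."""
    if len(blob) < FOOTER.size:
        raise ValueError("blob too small to contain a footer")
    magic, length = FOOTER.unpack(blob[-FOOTER.size:])
    if magic != MAGIC:
        raise ValueError("no appended payload found")
    end = len(blob) - FOOTER.size
    start = end - length
    if start < 0:
        raise ValueError("corrupt footer")
    return blob[start:end]
```

A program built this way would open its own file at run time and read the payload from the tail, so the loader only ever has to deal with the small stub at the front.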
It would be interesting to make an LLM as a binary blob and then have the majority of the data as a separate set of files, rather than lumped into a single executable.
I wonder how much of a lobotomy you can give an LLM, with fixing the file header, before it begins to stumble for the majority of output cases.
Note to self: the README contains “Using llamafile with external weights.”
I’d not thought of that (set the end of the executable part in the header); good idea.
Now I’m curious if you can do “memory mapped files” (Windows) for the archive. I was having odd problems with a game, and when I went down the rabbit hole I found they had used that feature with their .pak files. (Unrelated to the problem.)
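For anyone unfamiliar with memory-mapped files, here is a minimal sketch using Python's standard `mmap` module, which works on Windows as well as Unix-like systems. The file contents and helper names are invented for the demo, and it says nothing about how llamafile itself loads weights.

```python
import mmap
import os
import tempfile


def read_slice_mmap(path: str, offset: int, length: int) -> bytes:
    """Read `length` bytes at `offset` through a read-only memory map."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the OS pages data in on demand.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]


def demo() -> bytes:
    """Write a small stand-in archive and read part of it back."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"header--weights--trailer")
        path = tmp.name
    try:
        # Skip the 8-byte "header--" prefix and read the 7-byte payload.
        return read_slice_mmap(path, 8, 7)
    finally:
        os.remove(path)
```

The appeal for large archives is that only the pages actually touched get read from disk, so a multi-gigabyte file does not have to be copied into memory up front.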
I’m now wondering if someone has ever created a program that can cut up executables that are larger than 4GB [or mask their size from Windows] and still allow them to run normally.
How long before a cyber criminal folds them into apps and scripts on the local PC and turns those executables into their peons though?
My cheap phone with a Dimensity 700-something CPU and 4 GB of RAM should be able to run some models locally. I’ll update if it actually works.
I hope to one day have a locally running LLM so that I can use it with vim, to do LaTeX formatting and writing documents. So far I’ve seen some things that can use OpenAI, but the API isn’t working without paying a ton of money.