Mozilla Lets Folks Turn AI LLMs Into Single-File Executables

December 2, 2023

LLMs (Large Language Models) for local use are usually distributed as a set of weights in a multi-gigabyte file. These cannot be directly used on their own, which generally makes them harder to distribute and run compared to other software. A given model can also have undergone changes and tweaks, leading to different results if different versions are used.

To help with that, Mozilla’s innovation group have released llamafile, an open source method of turning a set of weights into a single binary that runs on six different OSes (macOS, Windows, Linux, FreeBSD, OpenBSD, and NetBSD) without needing to be installed. This makes it dramatically easier to distribute and run LLMs, as well as ensuring that a particular version of an LLM remains consistent and reproducible, forever.

This wouldn’t be possible without the work of [Justine Tunney], creator of Cosmopolitan, a build-once-run-anywhere framework. The other main part is llama.cpp, and we’ve covered why it is such a big deal when it comes to running self-hosted LLMs.

There are some sample binaries available using the Mistral-7B, WizardCoder-Python-13B, and LLaVA 1.5 LLMs. Just keep in mind that if you’re on a Windows platform, only the LLaVA 1.5 will run, because it’s the only one that squeaks under the 4 GB limit on executable files that Windows has. If you run into issues, check out the gotchas list for troubleshooting tips.
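As a rough illustration of that size constraint, here is a minimal Python sketch that checks whether a file would squeak under the limit. It assumes the limit is exactly 4 GiB (2**32 bytes) and that a file must be strictly smaller than that; the helper names and the file name in the usage note are hypothetical and not part of llamafile.

```python
import os

# Assumption: the Windows executable limit is treated as 4 GiB (2**32 bytes).
WINDOWS_EXE_LIMIT = 4 * 1024**3


def fits_windows_exe_limit(size_bytes: int, limit: int = WINDOWS_EXE_LIMIT) -> bool:
    """Return True if a file of this size stays strictly under the limit."""
    return size_bytes < limit


def file_fits_windows_exe_limit(path: str) -> bool:
    """Look up the size of the file at `path` and apply the same check."""
    return fits_windows_exe_limit(os.path.getsize(path))
```

Calling `file_fits_windows_exe_limit("model.llamafile")` (a placeholder file name) on a downloaded binary would tell you whether it is worth trying on Windows at all.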

19 thoughts on “Mozilla Lets Folks Turn AI LLMs Into Single-File Executables”

John says:

December 2, 2023 at 9:47 pm

Justine Tunney is truly a treasure. Everything she does is fascinating.

shinsukke says:

December 3, 2023 at 1:19 am

My ZOTAC 3060 died. I shall have to suffer without my LLMs, until they RMA the card and replace it.

1. solipso says:
  
  December 3, 2023 at 3:46 am
  
  Thanks for letting us know. Now we can die peacefully.
  
2. Miroslav says:
  
  December 3, 2023 at 5:34 am
  
  Ollama works without GPU.
  
  1. Mohsen Shabanian says:
    
    December 3, 2023 at 4:35 pm
    
    I found Ollama to be really easy to use, but slow on a crappy old O-Della desktop.
    
  2. combinatorylogic says:
    
    December 4, 2023 at 2:17 am
    
    It’s just a wrapper around llama.cpp. Why not use llama.cpp directly instead?
    
rtyu5r6u5ew says:

December 3, 2023 at 2:31 am

We’re waiting on the Serenity OS version.

Emmanuel says:

December 3, 2023 at 5:32 am

Just what we needed, you’re a treasure. Keep up the good work 👍

PWalsh says:

December 3, 2023 at 7:26 am

A very good project.

I’ve been trying to keep up with the AI revolution, and my system is littered with dozens of packages and frameworks needed to run some of the AI systems. Python is a nice language, but the library and language version inconsistencies are so bad that you now have to install a package (conda) that sets up the specific language environment versions for any AI thing you want to run.

This seems to be a problem in the scientific community, where results are calculated using a certain version of Python and libraries, and then 5 years later no one can reproduce the results (of a published paper) because older binaries are no longer available.

It’s also a problem with basic Linux: if you happen to be using a major distro such as Mint, the newest versions of any library might not be in the current distro, and you may need to wait 6 months in order to run something natively… or you could try adding the repo for that one thing you need, do the install, and hope it doesn’t crash your system.

Which is why people are distributing flatpaks for programs now, which are programs with statically compiled libraries so that everything you need is in one really huge executable. I like the convenience of not having to spend hours installing new things just to run an application, but boy those apps are huge!

Anyway, this looks like a very good project that will reduce some of the friction of using AI.

1. John says:
  
  December 4, 2023 at 6:51 am
  
  We did statically linked kernels and programs back in the day, specifically. We’d get consistent results and reliability that way. Bigger, for sure, but we knew it was going to “behave” in prod.
  
  1. a says:
    
    December 4, 2023 at 4:43 pm
    
    Dynamic linking was a mistake
    
2. Celeste Weingartner says:
  
  December 4, 2023 at 3:28 pm
  
  Imagine considering Mint a major distro. Lol.
  
None says:

December 3, 2023 at 8:32 am

That 4 GiB limit could have been circumvented if the data were appended to the executable (after the end of the PE file, as specified in the PE headers). Then the PE loader will simply ignore the additional data/size.

Worst case, extract the small exe on the fly, and let it read from the large archive file, without extracting the data.
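The append idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 8-byte length footer is a made-up convention for finding the payload again, not llamafile's actual packing and not anything defined by the PE format. The stub bytes are placeholders rather than a real executable, and the sketch does not test how the Windows loader treats the trailing data.

```python
import os
import struct
import tempfile

# Hypothetical footer layout (not from llamafile or the PE spec):
# the last 8 bytes hold the payload length as a little-endian uint64.
FOOTER = struct.Struct("<Q")


def append_payload(stub: bytes, payload: bytes, out_path: str) -> None:
    """Write the stub, then the payload, then the length footer."""
    with open(out_path, "wb") as dst:
        dst.write(stub)
        dst.write(payload)
        dst.write(FOOTER.pack(len(payload)))


def read_payload(path: str) -> bytes:
    """Find the payload by walking back from the end of the file."""
    with open(path, "rb") as f:
        f.seek(-FOOTER.size, os.SEEK_END)
        (length,) = FOOTER.unpack(f.read(FOOTER.size))
        f.seek(-(FOOTER.size + length), os.SEEK_END)
        return f.read(length)


def round_trip_demo() -> bytes:
    """Pack placeholder bytes into a temp file and read them back."""
    stub = b"MZ" + b"\x00" * 62  # stand-in bytes, not a valid PE image
    payload = b"pretend model weights"
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "bundle.bin")
        append_payload(stub, payload, path)
        return read_payload(path)
```

Because the reader seeks from the end of the file, the small program at the front never needs to know how large the appended data is, which is the property the suggestion relies on.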

1. CRJEEA says:
  
  December 3, 2023 at 9:18 am
  
  It would be interesting to make an LLM as a binary blob and then have the majority of the data as a separate set of files, rather than lumped into a single executable.
  I wonder how much of a lobotomy you can give an LLM, with fixing the file header, before it begins to stumble for the majority of output cases.
  
  1. CRJEEA says:
    
    December 3, 2023 at 9:26 am
    
    Note to self, readme contains, “Using llamafile with external weights.”.
    
2. John says:
  
  December 4, 2023 at 6:57 am
  
  I’d not thought of that (set the end of the executable part in the header); good idea.
  
  Now I’m curious if you can do “memory mapped files” (Windows) for the archive. I was having odd problems with a game, and when I went down the rabbit hole I found they had used that feature with their .pak files. (Unrelated to the problem.)
  
CRJEEA says:

December 3, 2023 at 9:13 am

I’m now wondering if someone has ever created a program that can cut up executables that are larger than 4GB [or mask their size from Windows] and still allow them to run normally.

FatalKeystroke says:

December 3, 2023 at 2:10 pm

How long before a cyber criminal holds them into apps and scripts on the local PC and turns those executables into their peons though?

PFnove says:

April 2, 2024 at 2:17 pm

my cheap phone with a dimensity 700 something cpu and 4gb of ram should be able to run some models locally, i’ll update if it actually works


Hackaday
