Running AI Locally Without Spending All Day On Setup

There are many AI models out there that you can play with from companies like OpenAI, Google, and a host of others. But when you use them, you get the experience they want, and you run it on their computer. There are a variety of reasons you might not like this. You may not want your data or ideas sent through someone else’s computer. Maybe you want to tune and tweak in ways they aren’t going to let you.

There are many more or less open models, but setting up to run them can be quite a chore and — unless you are very patient — require a substantial-sized video card to use as a vector processor. There’s very little help for the last problem. You can farm out processing, but then you might as well use a hosted chatbot. But there are some very easy ways to load and run many AI models on Windows, Linux, or a Mac. One of the easiest we’ve found is Msty. The program is free for personal use and claims to be private, although if you are really paranoid, you’ll want to verify that yourself.

What is Msty?

Talkin’ about Hackaday!

Msty is a desktop application that lets you do several things. First, it can let you chat with an AI engine either locally or remotely. It knows about many popular options and can take your keys for paid services. For local options, it can download, install, and run the engines of your choice.

For services or engines that it doesn’t know about, you can do your own setup, which ranges from easy to moderately difficult, depending on what you are trying to do.

Of course, if you have a local model or even most remote ones, you can use Python or some basic interface (e.g., with ollama; there are plenty of examples). However, Msty lets you have a much richer experience. You can attach files, for example. You can export the results and look back at previous chats. If you don’t want them remembered, you can chat in “vapor” mode or delete them later.

Each chat lives in a folder, which can have helpful prompts to kick off the chat. So, a folder might say, “You are an 8th grade math teacher…” or whatever other instructions you want to load before engaging in chat.

MultiChat

What two models think about 555s

One of the most interesting features is the ability to chat to multiple chatbots simultaneously. Sure, if it were just switching between them, that would be little more than a gimmick. However, you can sync the chats so that each chatbot answers the same prompt, and you can easily see the differences in speed and their reply.

For example, I asked both Google Gemini 2.0 and Llama 3.2 how a 555 timer works, and you can see the answers were quite different.

RAGs

The “knowledge stack” feature lets you easily grab up your own data to use as the chat source (that is RAG or Retrivial Augmented Generation) for use with certain engines. You can add files, folders, Obsidian vaults, or YouTube transcripts.

Chatting about the podcast

For example, I built a Knowlege Stack named “Hackaday Podcast 291” using the YouTube link. I could then open a chat with Google’s Gemini 2.0 beta (remotely hosted) and chat with the podcast. For example:

You: Who are the hosts?

gemini-2.0-flash-exp: Elliot Williams and Al Williams are the hosts.

You: What kind of microscope was discussed?

gemini-2.0-flash-exp: The text discusses a probe tip etcher that is used to make tips for a type of microscope that can image at the atomic level.

It would be easy to, for example, load up a bunch of PDF data sheets for a processor and, maybe, your design documents to enable discussing a particular project.

You can also save prompts in a library, analyze result metrics, refine prompts and results, and a host of other features. The prompt library has quite a few already available, too, ranging from an acountant to a yogi, if you don’t want to define your own.

New Models

The chat features are great, and having a single interface for a host of backends is nice. However, the best feature is how the program will download, install, run, and shut down local models.

Selecting a new local model will download and install it for use.

To get started, press the Local AI Model button towards the bottom of the left-hand toolbar. That will give you several choices. Be mindful that many of these are quite large, and some of them require lots of GPU memory.

I started on a machine that had an NVidia 2060 card that had 6GB of memory. Granted, some of that is running the display. But most of it was available. Some of the smaller models would work for a bit, but eventually, I’d get some strange error. That was a good enough excuse to trade up to a 12GB 3060 card, and that seems to be enough for everything I’ve tried so far. Granted, some of the larger models are a little slow, but tolerably so.

There are more options if you press the black button at the top, or you can import GGUF models from places like huggingface. If you’ve already loaded models for something like ollama, you can point Msty at them. You can also point to a local server if you prefer.

The version I tested didn’t know about the Google 2.0 model. However, when adding any of the Google models, it was easy enough to add the (free) API key and the model ID (models/gemini-2.0-flash-exp) for the new model.

Wrap Up

You can spend a lot of time finding and comparing different AI models. It helps to have a list, although you can wait until you’ve burned through the ones Msty already knows about..

Is this the only way to run your own AI model? No, of course not. But it may well be the easiest way we’ve seen. We’d wish for it to be open source, but at least it is free to use for personal projects. What’s your favorite way to run AI? And, yes, we know the answer for some people is “don’t run AI!” That’s an acceptable answer, too.

16 thoughts on “Running AI Locally Without Spending All Day On Setup

  1. The corporate stewardship of AI is whats untrustworthy. Someone building AI at home, training on handwriting or thermostat data, sounds great! A giant multibillion dollar industry trying to convince us that high-performance scalable parallel-computation cases for GPU farms is worth subscribing and investing in? No thanks, every major tech trend ends up as a bloated defeatured ad platform, how will OpenAI be any different?

  2. I did the same but with text-generation-webui, which is similar in intent to automatic1111, in that it’s open source and aims to be the standard for running open AI models locally. It’s also easy to setup and run, just answer a few questions to get it up and running after it’s downloaded dependencies etc. Downloading models from hugging face is made easy, just copy/paste the username and repository and it downloads it without any issues.

    I use a 12GB 4070ti and models of around 7-14B parameters seem to work best. I use codestral 22b model at Q3 but that’s a stretch. There are several versions of quantization that other users have made that will lower the memory usage. Normally all LLMs are 16 bit floats, which means each of the x billion parameters needs 2 bytes of storage. ie an 8B fp16 LLM consumes 16GB of data. Quants of 8 reduces it to about half that size and Q4 another half of that. For 12GB the sweet spot is around 8B parameters and Q8, leaving some room for the context, which is all the ‘tokens’ that the chat contains, which all the text you and the bot types. Each token can be one to about 6 characters long, usually 2-3.
    I usually keep the n_ctx (token context) around 5000-7000. You can play with this number if you run out of vram memory. You notice a rapid slowdown once your gfx memory runs out. It starts to truncate older messages if context size is larger than n_ctx and it’ll forget what happened before and start repeating answers a lot after a while.

    Using tensorcores option speeds up processing by about a factor of 2 for nvidia cards that have them. flash_attention reduces the amount of memory used so the chat stays coherent most of the time.

    Llama 3.2 8B Q8 is great for role playing, codestral great for programming.

    But the great thing about a local LLM is being able to write your own characters and have fun chatting with them. Many of the limitations of commercial LLMs are then removed. For example, I’ve made a character that pretends to be the writer JRR Tolkien, and he answers questions about stuff that’s not in the books, non canon of course, but a lot of fun. I went on a treasure hunt adventure with Merry and Pippin at some point, looking for the buried treasure of Bilbo.
    Kunoichi model is also nice for role playing. Mistral and Qwen seem a little too formal to be able to roleplay properly.
    Note that there are 3 modes of chats: chat, chat-instruct and instruct. Instruct is as you expect, you can ask it to do stuff and it’ll do it without asserting a personality into it. This mode is also the most secured and limited. Chat-instruct follows your written character sheet to the letter and just ‘chat’ mode is more loosely based on your given character sheet, giving it more creative freedom but also ends up being more out of control.

    I haven’t been successful training a lora though, it breaks the model most of the time or just produce gibberish. And it consumes a lot of memory, you’ll need about twice the memory compared to just inference (chatting with the LLM). That means you can only train with smaller models like 3b parameters.

    You can enable it to act as an openai server so that you can connect to it from other apps like vscode or phpstorm with CodeGPT plugin with the custom openai option. It’s far from perfect though but it doesn’t send your code out there, where you have no control over it. You do need a lot of system memory though, I’ve found with an LLM of 10GB you need 32GB for the LLM alone as it caches the entire LLM in system RAM (seems to be in fp16 format so doubles Q8 models in size in ram, thought there are options to reduce cache sizes and it’ll load more from disk, slowing things down considerably), and when you run other apps you need 16GB more. I ended up with 64GB total system ram and everything runs stable while at only 32GB of system ram docker, firefox and wsl started to act weird or crashed.

    1. Still theft.

      Until we have a set of training data that is provided with FULL consent to be used for that model, we have nothing but theft.

      These generative AICOUGHfancy search engines do not create or “generate” anything. They provide search results on someone else’s data, in a format that lets the user handwaive the theft.

      1. I mean.. technically everything you create is derived from the “theft” of learning materials. I think the only logical thing to do is admit we’re in new territory here.

  3. Nice, but not FOSS and you will find it consuming huge amounts of disk space. It is also build on top of a fork of Ollama, as you will discover when you go looking for why you suddenly have 10GB less space.

  4. If I had a software that let me know what people (and smart people at that) around the world were asking of multiple AIs, and these people all agreed to share it with me according to my software’s TOS, I think I’d be a step or two ahead of collecting ideas than even the owners of Internet Browser companies are able to collect.
    Just saying.

  5. I made an AI automation system. My family and I use it all the time. I run the client all over the house and it controls lights, thermostats, media on my computer, etc. I use whisper then vector stores to find terminal commands then inference to call command line actions. I use llama or whatever else you want to put in the model.py file

    I call it twin because it was designed to be a twin of my mind so instead of talking to the AI it’s presumes it is your mind and attempts to solve discomforts. So I say man I am cold and it knows to turn up the temp instead of just talking to it. That was the idea atleast.

    Check it out here in my github: physiii/twin

  6. I have made a python library for XMPP server integration for an LLM model + stable diffusion model.

    The result is a chatbot system, which randomly decides it’s own name, personality, profile picture (SD model) and texts me on my phone (XMPP client) as a normal human friend would. It even randomly messages me about random topics (you just need to set the temperature high enough to get better randomness)

    I currently have 4 different chatbots who text me all the time. It’s fun talking to them

Leave a Reply

Please be kind and respectful to help make the comments section excellent. (Comment Policy)

This site uses Akismet to reduce spam. Learn how your comment data is processed.