Running Local LLMs on macOS
How to install and run ChatGPT-style LLMs locally on macOS
The easiest way to install and run ChatGPT-style LLMs locally and offline on macOS is with either llama.cpp or Ollama (which is essentially a wrapper around llama.cpp).
llama.cpp is one of those open source libraries that quietly powers most of the more user-facing applications (much like ffmpeg does if you do anything with audio or video).
It loads and runs LLM model files and allows you to use them via:
- An OpenAI API compatible web server.
- The CLI, directly from Terminal.
- A variety of supported UIs.
It has a lot of advanced features (quantisation, partial GPU offloading, grammar-constrained output, and so on) and supports all sorts of options.
However, you still have to get the LLM models yourself, make sure they are in the correct format (llama.cpp's GGUF format), and use the right chat template (which, if the model is already a GGUF file, should be included in the model file).
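If you want to try llama.cpp on its own, there's a Homebrew formula for it, and a minimal session looks something like the sketch below (model.gguf is just a placeholder for whatever GGUF model file you've downloaded):
# install llama.cpp via Homebrew
brew install llama.cpp
# chat with a local GGUF model directly in the terminal
llama-cli -m ./model.gguf -p "Why is the sky blue?"
# or start the OpenAI API compatible web server on port 8080
llama-server -m ./model.gguf --port 8080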
Ollama
When Ollama runs a model, it basically just starts llama.cpp under the hood with some default parameters.
However, Ollama doesn't just run models.
It also includes a library of pre-quantised models (shrunk so they load faster and fit in less VRAM) available for easy download.
And a Homebrew-style CLI interface for downloading and managing the models.
Installation
Download both the GUI tool from https://ollama.com/download and the CLI tool from Homebrew:
brew install ollama
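You can check the CLI installed correctly, and (optionally) have Homebrew run the server as a background service; the GUI app also runs the server while it's open, so you only need one of the two:
ollama --version
brew services start ollama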
Usage
- Find a model you want to run.
The best current (2024-08) small models are:
- Llama 3.1 8B (Meta)
- Mistral NeMo 12B (Mistral)
- Gemma 2 9B (Google)
Llama 3.1 is considered the best currently available open source model, but the 8B version struggles with tool usage, and the 70B is too big for a lot of computers.
Mistral NeMo 12B is slightly bigger, but handles tool usage much better.
Gemma 2 doesn't support tool usage but is a very capable general model.
- Download the model
ollama pull mistral-nemo:12b
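Once it's downloaded, ollama show should print details about the model (parameter count, quantisation, context length and so on), which is a handy sanity check:
ollama show mistral-nemo:12b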
Run the Model
To run the LLM directly from the command line:
ollama run mistral-nemo:12b
You'll now have an interactive CLI prompt you can use to talk to the LLM.
Like a terminal version of ChatGPT.
However, you'll notice the LLMs have mostly been trained to answer in Markdown, so a UI that can render it is usually more useful.
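The run command also works non-interactively, which is handy for one-off prompts or for piping a file in (notes.md here is just an example file name):
ollama run mistral-nemo:12b "Why is the sky blue?"
cat notes.md | ollama run mistral-nemo:12b "Summarise this:"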
Start the web server
ollama serve
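By default the server listens on localhost port 11434. If you want it reachable from other machines on your network, the bind address can (carefully) be changed with the OLLAMA_HOST environment variable, something like:
OLLAMA_HOST=0.0.0.0:11434 ollama serve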
Interact via web API
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [
    { "role": "user", "content": "why is the sky blue?" }
  ]
}'
As long as you have the requested model downloaded, Ollama will load it and answer the request.
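Ollama also exposes an OpenAI-compatible endpoint, so most tools and libraries that speak the OpenAI API can be pointed at it instead. Roughly:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-nemo:12b",
    "messages": [
      { "role": "user", "content": "why is the sky blue?" }
    ]
  }'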
Useful Notes
As the web API can load models on demand, it can be useful to see which models are currently loaded in RAM (by default a model is kept loaded for 5 minutes after the last API call; this can be changed):
ollama ps
NAME        ID            SIZE   PROCESSOR  UNTIL
llama3:70b  bcfb190ca3a7  42 GB  100% GPU   4 minutes from now
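If the 5 minute unload is too aggressive (or not aggressive enough), the keep_alive behaviour can be changed per request via the API, or globally with an environment variable before starting the server, e.g.:
# keep this model loaded indefinitely (use 0 to unload immediately)
curl http://localhost:11434/api/generate -d '{ "model": "mistral-nemo:12b", "keep_alive": -1 }'
# or set a default for the whole server
OLLAMA_KEEP_ALIVE=30m ollama serve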
List downloaded models
ollama list
NAME                                ID            SIZE    MODIFIED
mistral-nemo:latest                 4b300b8c6a97  7.1 GB  7 days ago
llama3.1:8b                         62757c860e01  4.7 GB  7 days ago
phi3:medium-128k                    3aeb385c7040  7.9 GB  4 weeks ago
gemma2:latest                       ff02c3702f32  5.4 GB  4 weeks ago
llama3:8b                           365c0bd3c000  4.7 GB  4 weeks ago
phi3:mini-128k                      d184c916657e  2.2 GB  4 weeks ago
phi3:3.8-mini-128k-instruct-q5_K_M  5a696b4e6899  2.8 GB  2 months ago
phi3:latest                         a2c89ceaed85  2.3 GB  2 months ago
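Models add up on disk quickly, so it's worth removing ones you no longer use:
ollama rm phi3:mini-128k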
RAM Limitations
To run an LLM, the model needs to fit in your VRAM, the GPU's RAM. On macOS, the M-series chips having unified memory shared between the CPU and GPU is great for this! On PC systems you are limited to, for example, the 16 GB of VRAM on an RTX 4080, or the 24 GB of VRAM on an RTX 3090 or RTX 4090.
However, Macs can have almost as much VRAM as they have normal RAM.
The actual limit is that the system only gives about 2/3 of it to the GPU at most, and needs some for the rest of the system, so you don't actually get to use all of it.
You can tweak the limit: sudo sysctl iogpu.wired_limit_mb=<mb>
(source), but you need to be careful doing so.
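To sanity check what you're working with, you can read the total RAM and (on Apple Silicon with recent macOS versions) the current GPU wired limit; a limit of 0 generally means the default split is in use:
# total physical RAM in bytes
sysctl -n hw.memsize
# current GPU wired memory limit in MB
sysctl iogpu.wired_limit_mb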
All this means that when picking a model to download and run, you can't just pick the biggest and best model.
You have to pick one that will fit inside your VRAM.
Models take up roughly their download size plus a little extra (for the prompt context and processing).
So Llama 3.1 8B takes up about 16 GB for the base 16-bit floating point model.
However, if the model is quantised down to storing all those weights as 4-bit integers instead, it is only 4.7 GB.
The model gets a bit dumber doing this, but not by that much most of the time.
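As a rough back-of-the-envelope check before downloading (the numbers here are approximate):
# rule of thumb: parameter count x bytes per weight, plus some headroom
# 8B parameters at 2 bytes (FP16) vs roughly 0.5 bytes (4-bit):
echo "FP16: $(( 8 * 2 )) GB, 4-bit: about $(( 8 / 2 )) GB plus overhead"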
If you pick a model which is too big, llama.cpp will stop trying to load it entirely onto the GPU and start using CPU RAM as well, doing some of the processing on the CPU and some on the GPU.
This is MUCH slower and should be avoided if you can.
Also, as M-series Macs share the memory between the two anyway, this isn't as effective, and you might still get a crash if you try to load a model that is too big.
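If you suspect this is happening, ollama ps is the quickest check: the PROCESSOR column should read 100% GPU when the model fits entirely, and shows a CPU/GPU split when it doesn't.
ollama ps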