Running Local LLMs on macOS
How to install and run ChatGPT-style LLMs locally on macOS
The easiest way to install and run ChatGPT-style LLMs locally and offline on macOS is with either llama.cpp or Ollama (which is essentially a wrapper around llama.cpp).
llama.cpp is one of those open source libraries that quietly powers most of the more user-facing applications (much like ffmpeg does if you do anything with audio or video).
It loads and runs LLM model files and allows you to use them via:
- An OpenAI API compatible web server.
- The CLI, directly from Terminal.
- A variety of supported UIs.
It has a lot of advanced features (quantisation, partial GPU offloading, grammar-constrained output, and so on) and supports all sorts of options.
However, you still have to get the LLM models yourself, make sure they are in the correct format (llama.cpp's GGUF format), and use the right chat template (which, if the model is already a GGUF file, should be included in the model file).
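If you want to try llama.cpp on its own, there's a Homebrew formula for it, and a minimal session looks something like the sketch below (model.gguf is just a placeholder for whatever GGUF model file you've downloaded):
# install llama.cpp via Homebrew
brew install llama.cpp
# chat with a local GGUF model directly in the terminal
llama-cli -m ./model.gguf -p "Why is the sky blue?"
# or start the OpenAI API compatible web server on port 8080
llama-server -m ./model.gguf --port 8080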
Ollama
When Ollama runs a model, it basically just starts llama.cpp under the hood with some default parameters.
However, Ollama doesn't just run models.
It also includes a library of pre-quantised models (shrunk so they load faster and fit in less VRAM) available for easy download.
And a Homebrew-style CLI interface for downloading and managing the models.
Installation
Download both the GUI tool from https://ollama.com/download and the CLI tool from Homebrew:
brew install ollama
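You can check the CLI installed correctly, and (optionally) have Homebrew run the server as a background service; the GUI app also runs the server while it's open, so you only need one of the two:
ollama --version
brew services start ollama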
Usage
- Find a model you want to run.
The best current (2024-08) small models are:
- Llama 3.1 8B (Meta)
- Mistral NeMo 12B (Mistral)
- Gemma 2 9B (Google)
Llama 3.1 is considered the best currently available open source model, but the 8B version struggles with tool usage, and the 70B is too big for a lot of computers.
Mistral NeMo 12B is slightly bigger, but handles tool usage much better.
Gemma 2 doesn't support tool usage but is a very capable general model.
- Download the model
ollama pull mistral-nemo:12b
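Once it's downloaded, ollama show should print details about the model (parameter count, quantisation, context length and so on), which is a handy sanity check:
ollama show mistral-nemo:12b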
Run the Model
To run the LLM directly from the command line:
ollama run mistral-nemo:12b
You'll now have an interactive CLI prompt you can use to talk to the LLM.
Like a terminal version of ChatGPT.
However, you'll notice the LLMs have mostly been trained to answer in Markdown, so a UI that can render it is usually more useful.
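The run command also works non-interactively, which is handy for one-off prompts or for piping a file in (notes.md here is just an example file name):
ollama run mistral-nemo:12b "Why is the sky blue?"
cat notes.md | ollama run mistral-nemo:12b "Summarise this:"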
Start the web server
ollama serve
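By default the server listens on localhost port 11434. If you want it reachable from other machines on your network, the bind address can (carefully) be changed with the OLLAMA_HOST environment variable, something like:
OLLAMA_HOST=0.0.0.0:11434 ollama serve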
Interact via web API
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [
    { "role": "user", "content": "why is the sky blue?" }
  ]
}'
As long as you have the requested model downloaded, Ollama will load it and answer the request.
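Ollama also exposes an OpenAI-compatible endpoint, so most tools and libraries that speak the OpenAI API can be pointed at it instead. Roughly:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-nemo:12b",
    "messages": [
      { "role": "user", "content": "why is the sky blue?" }
    ]
  }'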
Useful Notes
As the web API can load models on demand, it can be useful to see which models are currently loaded in RAM (by default a model is kept loaded for 5 minutes after the last API call; this can be changed):
ollama ps
NAME        ID            SIZE   PROCESSOR  UNTIL
llama3:70b  bcfb190ca3a7  42 GB  100% GPU   4 minutes from now
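If the 5 minute unload is too aggressive (or not aggressive enough), the keep_alive behaviour can be changed per request via the API, or globally with an environment variable before starting the server, e.g.:
# keep this model loaded indefinitely (use 0 to unload immediately)
curl http://localhost:11434/api/generate -d '{ "model": "mistral-nemo:12b", "keep_alive": -1 }'
# or set a default for the whole server
OLLAMA_KEEP_ALIVE=30m ollama serve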
List downloaded models
ollama list
NAME                                ID            SIZE    MODIFIED
mistral-nemo:latest                 4b300b8c6a97  7.1 GB  7 days ago
llama3.1:8b                         62757c860e01  4.7 GB  7 days ago
phi3:medium-128k                    3aeb385c7040  7.9 GB  4 weeks ago
gemma2:latest                       ff02c3702f32  5.4 GB  4 weeks ago
llama3:8b                           365c0bd3c000  4.7 GB  4 weeks ago
phi3:mini-128k                      d184c916657e  2.2 GB  4 weeks ago
phi3:3.8-mini-128k-instruct-q5_K_M  5a696b4e6899  2.8 GB  2 months ago
phi3:latest                         a2c89ceaed85  2.3 GB  2 months ago
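Models add up on disk quickly, so it's worth removing ones you no longer use:
ollama rm phi3:mini-128k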
RAM Limitations
To run an LLM, the model needs to fit in your VRAM, the GPU's RAM. On macOS, the M-series chips having unified memory shared between the CPU and GPU is great for this! On PC systems you are limited to, for example, the 16 GB of VRAM on an RTX 4080, or the 24 GB of VRAM on an RTX 3090 or RTX 4090.
However, Macs can have almost as much VRAM as they have normal RAM.
The actual limit is that the system only gives about 2/3 of it to the GPU at most, and needs some for the rest of the system, so you don't actually get to use all of it.
You can tweak the limit: sudo sysctl iogpu.wired_limit_mb=<mb>
(source), but you need to be careful doing so.
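To sanity check what you're working with, you can read the total RAM and (on Apple Silicon with recent macOS versions) the current GPU wired limit; a limit of 0 generally means the default split is in use:
# total physical RAM in bytes
sysctl -n hw.memsize
# current GPU wired memory limit in MB
sysctl iogpu.wired_limit_mb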
All this means that when picking a model to download and run, you can't just pick the biggest and best model.
You have to pick one that will fit inside your VRAM.
Models take up roughly their download size plus a little extra (for the prompt context and processing).
So Llama 3.1 8B takes up about 16 GB for the base 16-bit floating point model.
However, if the model is quantised down to storing all those weights as 4-bit integers instead, it is only 4.7 GB.
The model gets a bit dumber doing this, but not by that much most of the time.
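As a rough back-of-the-envelope check before downloading (the numbers here are approximate):
# rule of thumb: parameter count x bytes per weight, plus some headroom
# 8B parameters at 2 bytes (FP16) vs roughly 0.5 bytes (4-bit):
echo "FP16: $(( 8 * 2 )) GB, 4-bit: about $(( 8 / 2 )) GB plus overhead"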
If you pick a model which is too big, llama.cpp will stop trying to load it entirely onto the GPU and start using CPU RAM as well, doing some of the processing on the CPU and some on the GPU.
This is MUCH slower and should be avoided if you can.
Also, as M-series Macs share the memory between the two anyway, this isn't as effective, and you might still get a crash if you try to load a model that is too big.
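If you suspect this is happening, ollama ps is the quickest check: the PROCESSOR column should read 100% GPU when the model fits entirely, and shows a CPU/GPU split when it doesn't.
ollama ps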