GenAI: Optimizing Local Large Language Models Performance

Local large language models (LLMs) are becoming more capable and accessible than ever, but to truly unlock their power, you need to know how to optimize them. Whether you're running AI on an M-series MacBook or a 2019 Intel machine, there are tricks to get better speed, quality, and control.
In the final session of our GenAI series (after covering Local Dev, RAG, and Agents), we explored the real-world performance tuning of local LLMs. Here's what we covered:
1. Running LLMs on Your Laptop
Apple Silicon vs Intel Macs
Feature | Apple Silicon (M1/M2/M3/M4) | Intel Mac (e.g. 2019 MBP) |
---|---|---|
CPU & GPU | Unified on one chip (SoC) | Separate chips for CPU and GPU |
Memory | Shared Unified Memory Architecture | RAM for CPU, VRAM for GPU |
Model performance | Fast, efficient, great for LLMs | Slower, GPU mostly unused |

Note: M-series chips shine for local AI workloads thanks to unified memory. Intel Macs can still work well but rely entirely on CPUs and may struggle with larger models.
What Can You Run?
Models are measured by parameters, typically in billions (e.g., 3B, 7B, 13B). More parameters = more "intelligence", but also more memory use.
Use quantization (more below!) to fit larger models, even on machines with 8–16 GB RAM.
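To get a feel for what fits on your machine, here is a rough rule-of-thumb calculation as a small Python sketch. The figures are approximations: real model files add some overhead, and you also need headroom for the context window.

```python
# Rough rule of thumb: weight memory ~= parameter count x bytes per weight.
# Real model files add overhead, and inference needs extra RAM for the
# context (KV cache), so treat these numbers as lower bounds.

def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * (bits_per_weight / 8)  # billions of params * bytes each = GB

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"7B model at {label}: ~{weight_memory_gb(7, bits):.1f} GB")

# 7B model at FP16: ~14.0 GB
# 7B model at INT8: ~7.0 GB
# 7B model at INT4: ~3.5 GB
```

So a 7B model that would not fit in FP16 on an 8 GB machine becomes quite comfortable once quantized to 4 bits.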
2. Quantization: Making LLMs Lighter
What is Quantization?
Quantization reduces the precision of the numbers in a model, trading a little accuracy for huge gains in size and speed.
Format | Description | Typical Use |
---|---|---|
FP32 | Full precision | Training |
FP16 | Half precision | Deployment |
INT8 | Quantized | Light inference (some loss) |
INT4 | Heavily quantized | Fast local inference |
INT2–3 | Experimental | Ultra-lightweight, niche |
LLaMA2-70B goes from 138 GB (FP16) to 42 GB (INT4), a huge saving!
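To make that concrete, here is a toy Python sketch of simple symmetric INT8 quantization: one scale factor maps floats to 8-bit integers and back, and the small round-trip error is the accuracy you trade away. Real quantizers (including the GGUF K-quants described below) work per block of weights and are more sophisticated; this only illustrates the core idea.

```python
import numpy as np

# A handful of "weights" in full precision.
weights = np.array([0.03, -0.41, 0.27, 0.89, -0.66], dtype=np.float32)

# Symmetric INT8: one scale maps the float range onto [-127, 127].
scale = np.abs(weights).max() / 127
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize: this is roughly what the model "sees" at inference time.
restored = q.astype(np.float32) * scale

print("original :", weights)
print("int8     :", q)
print("restored :", restored)
print("max error:", np.abs(weights - restored).max())  # the accuracy you gave up
```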
What is GGUF?
GGUF stands for GPT-Generated Unified Format.
It's a modern, standardized file format designed for running quantized large language models (LLMs) locally.
Why GGUF?
GGUF makes models easy to load and run in tools like llama.cpp, Ollama, or LM Studio.
Feature | Benefit |
---|---|
🧳 Compact | Stores quantized models in a smaller, efficient format |
🧠 Self-contained | Includes weights, tokenizer, metadata, and vocab in one file |
⚡ Fast | Optimized for quick loading and inference |
🔧 Flexible | Supports various quantization formats like Q4_0, Q4_K_M, and Q8_0 |
Tools that Support GGUF
- llama.cpp - Core backend for CPU/GPU inference
- Ollama - CLI + API for local models
- LM Studio - Desktop GUI for Mac/Windows/Linux
- text-generation-webui - Powerful browser-based frontend
Example GGUF Model Filename
mistral-7b-instruct-v0.1.Q4_K_M.gguf
Part | Description |
---|---|
mistral-7b-instruct | Model type and size |
v0.1 | Model version |
Q4_K_M | Quantization type (4-bit, medium group) |
.gguf | File format extension |
Think of a .gguf file as an AI-in-a-box: it includes everything the model needs to run locally, compressed and ready to go.
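If you have downloaded a GGUF file yourself (for example from a model hub), you can register it with Ollama by writing a minimal Modelfile and running ollama create. Here is a small Python sketch; the filename and model name are examples only, so point them at whatever you actually downloaded.

```python
import pathlib
import subprocess

# Point this at a GGUF file you actually have (example name only).
gguf_path = pathlib.Path("mistral-7b-instruct-v0.1.Q4_K_M.gguf")

# A Modelfile can be as small as a single FROM line pointing at the file.
pathlib.Path("Modelfile").write_text(f"FROM ./{gguf_path.name}\n")

# Equivalent to running `ollama create mistral-local -f Modelfile` in a shell.
# Afterwards, `ollama run mistral-local` works like any other local model.
subprocess.run(["ollama", "create", "mistral-local", "-f", "Modelfile"], check=True)
```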
K-Quant Types (for GGUF Models)
When downloading quantized models (especially in GGUF format), you'll often see suffixes like Q4_K_S, Q5_K_M, or Q6_K_L.
These aren't just random labels: they define how the quantization is applied, and they directly affect model quality vs. performance.
The K in these names refers to the quantization group size and method, which influences:
- how much precision is preserved;
- how fast the model can run;
- how much RAM is required.

Suffix | Description |
---|---|
K_S | Small group quantization. Fastest and most memory-efficient. It compresses more aggressively, which means it's ideal when you're on a tight memory budget or running on weaker hardware, but output quality might noticeably degrade on complex tasks. |
K_M | Medium group quantization. A great balance of speed, memory use, and output quality. It's the "safe middle ground" for most users running 4–7B models locally. If you're unsure where to start, this is a solid default. |
K_L | Large group quantization. More conservative compression that keeps more precision, giving higher-quality outputs. Slower and uses more RAM, but closer to non-quantized behavior. Ideal for tasks requiring accurate and nuanced responses. |

The difference isn't just in speed; it's also in how much subtle detail the model retains in its responses.
"Think of K_S, K_M, and K_L as image compression presets: Small is low-res JPEG, Medium is standard HD, Large is almost RAW quality."
What is Quantization used for?
Quantization is a trade-off between size, speed, and accuracy. For some perspective, here's a rough guide on how quantization is used in the LLM world:
Task | Quantization |
---|---|
Model training | FP32 (full precision is part of why training is so expensive) |
For deployment | Usually FP16 (for speed) |
Local models | Usually INT4 (for speed and small size, with acceptable accuracy) |
Why Not Just Use a Smaller Model?
- Models under 3B can run easily, but often lack reasoning or language nuance.
- Quantization gives you the best of both worlds: keep a 7B+ model's brain but shrink the size.
"It's like watching a 4K movie compressed to 1080p: smaller, still looks good."
Quantized Models In Action (Example)
In this example, we consider qwen2.5, a 14B model with quantized versions available in Ollama. We will focus on different quantization levels of this 14B model.
Let's have a look at how the different versions deal with the following prompt:
Explain recursion to a 10-year-old in one paragraph.
To run the models, execute one of the following commands:
ollama run qwen2.5:14b-instruct-q4_0
ollama run qwen2.5:14b-instruct-q8_0
ollama run qwen2.5:14b-instruct-q2_K
ollama run qwen2.5:14b-instruct-fp16
For example, let’s start with Q4_0:
ollama run qwen2.5:14b-instruct-q4_0
Now that the model is loaded, we can run the prompt:
"Explain recursion to a 10-year-old in one paragraph."
Pay attention to the response time and the memory usage. Compare it with the other models.
You may have noticed there is not much difference in quality between the Q4_0 and Q8_0 models, but the Q2_K model is much faster and smaller. Perfect for showing the quality/speed trade-off in action, and how to adjust for your needs. This does not necessarily mean that the behavior is the same for other prompts or tasks. You have to try this for yourself and see what works best for you on your machine.
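If you want numbers instead of a gut feeling, Ollama also exposes a local REST API (by default on port 11434) that reports token counts and timings. Below is a small Python sketch that runs the same prompt against several quantization levels; it assumes the requests package is installed, the Ollama server is running, and the listed tags have already been pulled.

```python
import time
import requests

PROMPT = "Explain recursion to a 10-year-old in one paragraph."
TAGS = [
    "qwen2.5:14b-instruct-q2_K",
    "qwen2.5:14b-instruct-q4_0",
    "qwen2.5:14b-instruct-q8_0",
]

for tag in TAGS:
    start = time.time()
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": tag, "prompt": PROMPT, "stream": False},
        timeout=600,
    ).json()
    elapsed = time.time() - start
    # eval_count / eval_duration come from Ollama's response; duration is in nanoseconds.
    # Note: the first call per model also includes the time to load it into memory.
    tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    print(f"{tag}: {elapsed:.1f}s total, ~{tps:.1f} tokens/s")
```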
3. Tuning Parameters in Ollama
Using Ollama? You can change some parameter settings of the local models based on your preference, for example to get a more deterministic response or a more creative one.
Let us consider the llama3 model, which we can run with the following command:
ollama run llama3
You can set these parameters after the model is loaded:
/set parameter <parameter> <value>
So, for example, to set temperature to 1.0:
/set parameter temperature 1.0
And then we set top_p to 0.9:
/set parameter top_p 0.9
In the example above, we set temperature to 1.0 for a more creative response, and top_p to 0.9 to keep the token selection reasonably focused.
The temperature parameter adds randomness. The lower the value, the more focused and deterministic the model response; the higher the value, the more creative and varied the response.
The top_p parameter picks from the smallest possible set of words whose cumulative probability adds up to p. It controls diversity: higher values mean more diverse and creative responses, and lower values make responses more focused.
Here are some more common parameters you can tune:
Param | What it Does | Typical Values |
---|---|---|
num_ctx | Context size (how much it remembers) | 2048–4096 |
top_k | Limits top options for output | 40–100 |
top_p | Controls diversity | 0.8–0.95 |
temperature | Controls creativity | 0.6–0.8 (chat), 0.3–0.6 (code) |
repeat_penalty | Avoids repeating phrases | 1.1–1.3 |
threads | Number of CPU threads (config only) | Match to physical cores |
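The /set commands apply to the current interactive session. If you call the model from code instead, the same parameters can be passed per request through Ollama's local REST API via the options field. A minimal Python sketch, assuming the requests package and a running Ollama server:

```python
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Write a haiku about recursion.",
        "stream": False,
        "options": {
            "temperature": 0.7,     # lower = more focused, higher = more creative
            "top_p": 0.9,           # nucleus sampling threshold
            "top_k": 60,            # limit candidate tokens
            "num_ctx": 4096,        # context window size in tokens
            "repeat_penalty": 1.15, # discourage repeated phrases
        },
    },
    timeout=300,
)
print(response.json()["response"])
```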
If you want to learn more about these parameters, the Ollama documentation has more details.
Final Thoughts
Running LLMs locally is no longer science fiction: it's practical, efficient, and private. With just a bit of tuning and the right model format, your laptop becomes an AI powerhouse.