Quantisation is how heavily a language model's weights are compressed to fit in RAM. A lower bit depth means a smaller file, less RAM needed, and faster inference on limited hardware, at the cost of slightly reduced output quality. A higher bit depth preserves more of the original model's quality but requires more RAM and runs slower. For most home setups running on a NAS or mini-PC, Q4_K_M is the right balance.
In short: Use Q4_K_M as your default. It fits in 6-8GB RAM for a 7B model, runs at acceptable speed on CPU hardware, and the quality difference versus the full-precision model is not noticeable in everyday use. Use Q8_0 if you have plenty of RAM and want maximum quality without a GPU. Avoid Q2 and Q3. The quality degradation is significant.
What Is Quantisation?
When an AI model is trained, its weights (the billions of numerical values that encode what the model knows) are stored at full 32-bit or 16-bit precision. A 7B parameter model at 16-bit (FP16) takes approximately 14GB of storage and RAM. Most home hardware does not have 14GB free for a single model.
Quantisation reduces the bit depth of those weights. Instead of storing each value as a 16-bit float, you store it as a 4-bit or 8-bit integer (or a mix). The model becomes smaller and faster to load, but with some loss of precision. The question is how much quality you give up at each quantisation level. And the answer is: less than you probably expect, at the levels most people actually use.
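The arithmetic behind those sizes is straightforward: weight count times bits per weight. A rough sketch (the function name is illustrative, and the ~4.5-bit figure for Q4_K_M is an effective average across weight blocks, not an exact bit width; this estimates the weight file only, before runtime and context overhead):

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight-file size: parameters x bits per weight.

    This is the on-disk size of the weights alone; actual RAM use is
    higher once the runtime and context window are added.
    """
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# A 7B model at full FP16 vs Q4_K_M (~4.5 effective bits per weight):
fp16_gb = model_size_gb(7, 16)   # ~14 GB, the FP16 figure quoted above
q4_gb = model_size_gb(7, 4.5)    # ~3.9 GB of weights
```

The same formula explains why a 13B model at Q4_K_M is still smaller than a 7B model at FP16.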
Quantisation Formats: What Each One Means
Ollama uses GGUF format quantisation, which is what you will encounter when downloading models from Ollama's library or Hugging Face. The naming convention describes the bit depth and the method used:
GGUF Quantisation Levels: RAM Requirements for 7B and 13B Models

| Format | Bits per Weight | 7B Model RAM | 13B Model RAM | Quality vs FP16 |
|---|---|---|---|---|
| Q2_K | ~2.6 bits | ~3.1 GB | ~5.5 GB | Poor. Noticeable degradation |
| Q3_K_M | ~3.9 bits | ~3.9 GB | ~7.0 GB | Below acceptable for most tasks |
| Q4_K_M | ~4.5 bits | ~5.7 GB | ~8.4 GB | Very good. Recommended default |
| Q5_K_M | ~5.5 bits | ~6.7 GB | ~10.0 GB | Excellent. Near-indistinguishable |
| Q6_K | ~6.6 bits | ~7.7 GB | ~11.7 GB | Near-lossless for most tasks |
| Q8_0 | ~8.5 bits | ~9.6 GB | ~14.6 GB | Effectively lossless |
| FP16 (full) | 16 bits | ~14 GB | ~26 GB | Reference quality. Needs GPU RAM |
The K suffix marks llama.cpp's "k-quant" method, which quantises weights in blocks with per-block scaling so that the most important weight values are preserved more precisely. K_M is the medium variant, a good balance of compression and quality; K_S is small (more compressed) and K_L is large (less compressed). In practice, K_M is the format you want unless you have a specific reason to choose otherwise.
Quality Difference in Practice
The quality gap between quantisation levels is real but not uniform across all tasks. Here is a practical guide to where the differences matter:
Quantisation Quality: Task-by-Task Impact

| Task | Q4_K_M | Q5_K_M / Q6_K | Q8_0 |
|---|---|---|---|
| General conversation | Excellent. No difference | Excellent | Excellent |
| Code generation | Good. Occasional minor errors | Very good | Excellent |
| Long document summarisation | Good. Some detail loss | Very good | Excellent |
| Mathematical reasoning | Acceptable. More errors | Better | Best CPU option |
| Creative writing | Very good | Excellent | Excellent |
| Instruction following | Very good | Excellent | Excellent |
The tasks where quantisation shows up most are those requiring precise mathematical reasoning and complex multi-step logic. For everyday chat, summarisation, and creative tasks, the difference between Q4_K_M and Q8_0 is not noticeable in typical use.
RAM Requirements by Hardware Tier
The RAM figure for running a model is the model file size plus overhead for the context window. Ollama typically needs an additional 1-2GB for the runtime itself. These are practical figures for running Ollama on a NAS or mini-PC:
RAM Required: Model Size + Quantisation

| Hardware RAM | Largest Model at Q4_K_M | Largest Model at Q8_0 |
|---|---|---|
| 4GB | 3B (just fits) | Not usable for 7B+ |
| 6GB | 7B (fits, 1-2GB headroom) | 3B only |
| 8GB | 7B (comfortable) | 3B-4B |
| 16GB | 13B (comfortable) | 7B |
| 32GB | 34B (fits) | 13B (comfortable) |
| 64GB | 70B (fits) | 34B |
Context window also uses RAM. The figures above assume a moderate context window (2K-4K tokens). If you increase the context window to 8K or 32K tokens in Ollama, RAM usage increases significantly. A 7B model at Q4_K_M with a 32K context window needs 12-14GB of RAM, not 6GB. Keep context window at the default unless your use case requires long conversations.
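That scaling can be sketched as a simple budget: weights plus KV cache plus runtime overhead. The per-1K-token KV-cache cost below is an assumed round number fitted to the figures in this section, not a measured constant; the real cost depends on the model's layer count, head dimensions, and KV-cache precision:

```python
def ram_budget_gb(model_file_gb: float, context_tokens: int,
                  kv_gb_per_1k: float = 0.25, runtime_gb: float = 1.5) -> float:
    """Rough RAM budget: weights + KV cache + runtime overhead.

    kv_gb_per_1k is an assumed average KV-cache cost per 1,000 tokens
    of context; it varies by model architecture and cache settings.
    """
    kv_cache_gb = context_tokens / 1000 * kv_gb_per_1k
    return model_file_gb + kv_cache_gb + runtime_gb

# A ~4GB 7B Q4_K_M weight file at the default ~4K context vs 32K:
default_ctx = ram_budget_gb(4.0, 4096)   # roughly 6.5 GB
long_ctx = ram_budget_gb(4.0, 32768)     # roughly 13.7 GB
```

Under these assumptions, raising the context window from 4K to 32K roughly doubles the total RAM footprint, which is why the default context is the safe choice on 8GB hardware.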
Speed Difference Between Quantisation Levels
More aggressive quantisation (fewer bits) also means faster inference, because there is less data to process per token. The difference is measurable but not dramatic on modern hardware:
Inference Speed by Quantisation: Llama 3 8B on AMD Ryzen V1500B (CPU-only)

| Format | Tokens/second | Relative to Q4_K_M |
|---|---|---|
| Q4_K_M | 10-14 tok/s | Baseline |
| Q5_K_M | 8-11 tok/s | ~20% slower |
| Q6_K | 7-10 tok/s | ~30% slower |
| Q8_0 | 5-8 tok/s | ~45% slower |
On a NAS with a Ryzen V1500B (QNAP TS-473A, TS-873A), Q8_0 on a 7B-class model produces 5-8 tok/s, below the comfortable threshold for chat. Q4_K_M at 10-14 tok/s is noticeably more usable. On a faster mini-PC (Beelink EQR6 at 20-28 tok/s on Q4_K_M), the Q8_0 slowdown to 12-16 tok/s is still comfortable.
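Those throughput figures translate directly into how long you wait for an answer. A quick sketch (the 250-token reply length is an illustrative assumption, and prompt-processing time is ignored):

```python
def reply_seconds(reply_tokens: int, tokens_per_second: float) -> float:
    """Time to generate a reply, ignoring prompt-processing time."""
    return reply_tokens / tokens_per_second

# A 250-token chat reply at the Ryzen V1500B rates quoted above:
q4_wait = reply_seconds(250, 12)   # ~21 s at Q4_K_M
q8_wait = reply_seconds(250, 6)    # ~42 s at Q8_0
```

Halving the token rate doubles the wait, which is why the Q4/Q8 gap feels much larger in practice than a benchmark table suggests.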
Which Format to Choose for Your Setup
A few common scenarios:
- NAS with 8GB RAM (DS925+, TS-473A): Use Q4_K_M for 7B models. Q8_0 will not leave enough headroom for the OS and other containers. Load one model at a time.
- NAS with 16GB RAM: Q4_K_M for 13B models, or Q8_0 for 7B. Choose based on whether you prefer quality or model size.
- Mini-PC with 32GB DDR5: Run Q5_K_M or Q6_K for 7B models with plenty of headroom, or Q4_K_M for 13B models. The speed is comfortable at all formats on this hardware tier.
- Mini-PC with 64GB + GPU VRAM: Use FP16 on GPU where it fits in VRAM. Fall back to Q4_K_M for models too large for VRAM.
- Photo AI (Immich, Synology Photos): These tools handle their own model quantisation. You do not select this manually. The application does it internally.
How to Select a Quantisation in Ollama
When you run `ollama pull llama3`, Ollama downloads the Q4_K_M version by default. To specify a different quantisation, pull a tag that names it explicitly:

```shell
ollama pull llama3:8b-instruct-q8_0
```
Or in your Modelfile:

```
FROM llama3:8b-instruct-q4_K_M
```
You can see available quantisation options for any model by visiting the model's page at ollama.com and checking the tags list. Not all models have all quantisation levels available. The maintainer publishes the variants they consider most useful.
For most home setups running Ollama on a NAS, sticking with the Ollama default (Q4_K_M) is the right call. The exception is if you have 16GB+ RAM, are doing technically demanding tasks like code generation or maths, and can accept a 20-30% speed reduction in exchange for better output quality. See the mini-PC vs NAS for local AI guide for hardware context, or the best NAS for AI in Australia for current model recommendations.
Australian Buyers: What You Need to Know
Quantisation is a software choice, not a hardware purchase. However, the RAM ceiling of your hardware determines which quantisation levels are viable. Australian buyers planning an AI NAS upgrade should note:
- Synology DS925+ and DS425+ both support RAM upgrades. The DS925+ takes up to 32GB, the DS425+ up to 16GB. Purchase additional RAM from Mwave, Scorptec, or PLE to get the right SO-DIMM spec.
- QNAP TS-473A ships with 8GB ECC DDR4 and supports 64GB. Upgrading to 16-32GB significantly expands your quantisation options for larger models.
- Mini-PCs from Beelink and Minisforum typically ship with 32GB DDR5. More than enough for Q4_K_M on 13B models. Check the listing carefully as some base configurations ship with 16GB.
Related reading: our NAS buyer's guide.
What does Q4_K_M mean in Ollama?
Q4_K_M is a GGUF quantisation format. Q4 means each model weight is stored at approximately 4-bit precision instead of the original 16-bit. K marks llama.cpp's k-quant method, which quantises weights in blocks so the most important values are preserved more accurately. M is the medium variant of this approach, balancing compression ratio against quality. It is the recommended default for most home hardware because it fits 7B models in 6-8GB RAM at comfortable inference speeds.
Is Q8 noticeably better than Q4 for everyday use?
For general conversation, summarisation, and creative writing: no. Most users cannot distinguish Q4_K_M and Q8_0 output in everyday tasks. The difference is more noticeable for maths problems, complex multi-step reasoning, and precise code generation. If those tasks matter to you and your hardware has enough RAM, Q8_0 or Q6_K is worth trying. Otherwise, Q4_K_M is sufficient.
How much RAM do I need to run a 7B model?
At Q4_K_M quantisation, a 7B model requires approximately 5.5-6.5GB of RAM for the model itself, plus 1-2GB overhead for Ollama and the OS. A total of 8GB RAM is the practical minimum, with 16GB recommended for comfortable operation and room for other tasks. At Q8_0, the same 7B model needs 9-10GB. A 16GB system is required.
What is the difference between Q4_K_M and Q4_K_S?
Both use 4-bit k-quant quantisation. K_S (small) is more aggressively compressed: slightly smaller file size and faster inference, but slightly more quality loss. K_M (medium) balances compression and quality better. K_L (large) is the least compressed 4-bit variant with the best quality. For most home users, K_M is the right choice. It is the Ollama default and the most widely tested variant.
Can I run Q4 on a NAS with 4GB RAM?
A 7B model at Q4_K_M requires 6-8GB of RAM in practice, so 4GB is not enough. You could run a 3B model at Q4_K_M (approximately 2.5GB) on a 4GB NAS, leaving just enough headroom for the OS. In practice, a 4GB NAS is below the useful threshold for text LLMs. It can still run lightweight AI tasks like Immich facial recognition and photo classification, which use smaller specialised models that do not go through Ollama.