Quantisation is how heavily a language model's weights are compressed to fit in RAM. A lower bit depth means a smaller file, less RAM needed, and faster inference on limited hardware, at the cost of slightly reduced output quality. A higher bit depth preserves more of the original model's quality but requires more RAM and runs slower. For most home setups running on a NAS or mini-PC, Q4_K_M is the right balance.
In short: Use Q4_K_M as your default. It fits in 6-8GB RAM for a 7B model, runs at acceptable speed on CPU hardware, and the quality difference versus the full-precision model is not noticeable in everyday use. Use Q8_0 if you have plenty of RAM and want maximum quality without a GPU. Avoid Q2 and Q3. The quality degradation is significant.
What Is Quantisation?
When an AI model is trained, its weights (the billions of numerical values that encode what the model knows) are stored at full 32-bit or 16-bit precision. A 7B parameter model at 16-bit (FP16) takes approximately 14GB of storage and RAM. Most home hardware does not have 14GB free for a single model.
Quantisation reduces the bit depth of those weights. Instead of storing each value as a 16-bit float, you store it as a 4-bit or 8-bit integer (or a mix). The model becomes smaller and faster to load, but with some loss of precision. The question is how much quality you give up at each quantisation level. And the answer is: less than you probably expect, at the levels most people actually use.
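The arithmetic behind those sizes is straightforward: weight count times bits per weight. A rough sketch (the function name is illustrative, and the ~4.5-bit figure for Q4_K_M is an effective average across weight blocks, not an exact bit width; this estimates the weight file only, before runtime and context overhead):

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight-file size: parameters x bits per weight.

    This is the on-disk size of the weights alone; actual RAM use is
    higher once the runtime and context window are added.
    """
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# A 7B model at full FP16 vs Q4_K_M (~4.5 effective bits per weight):
fp16_gb = model_size_gb(7, 16)   # ~14 GB, the FP16 figure quoted above
q4_gb = model_size_gb(7, 4.5)    # ~3.9 GB of weights
```

The same formula explains why a 13B model at Q4_K_M is still smaller than a 7B model at FP16.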
Quantisation Formats: What Each One Means
Ollama uses GGUF format quantisation, which is what you will encounter when downloading models from Ollama's library or Hugging Face. The naming convention describes the bit depth and the method used:
GGUF Quantisation Levels: RAM Requirements for 7B and 13B Models

| Format | Bits per Weight | 7B Model RAM | 13B Model RAM | Quality vs FP16 |
|---|---|---|---|---|
| Q2_K | ~2.6 bits | ~3.1 GB | ~5.5 GB | Poor. Noticeable degradation |
| Q3_K_M | ~3.9 bits | ~3.9 GB | ~7.0 GB | Below acceptable for most tasks |
| Q4_K_M | ~4.5 bits | ~5.7 GB | ~8.4 GB | Very good. Recommended default |
| Q5_K_M | ~5.5 bits | ~6.7 GB | ~10.0 GB | Excellent. Near-indistinguishable |
| Q6_K | ~6.6 bits | ~7.7 GB | ~11.7 GB | Near-lossless for most tasks |
| Q8_0 | ~8.5 bits | ~9.6 GB | ~14.6 GB | Effectively lossless |
| FP16 (full) | 16 bits | ~14 GB | ~26 GB | Reference quality. Needs GPU RAM |
The K suffix marks llama.cpp's "k-quant" method, which quantises weights in blocks with per-block scaling so that the most important weight values are preserved more precisely. K_M is the medium variant, a good balance of compression and quality; K_S is small (more compressed) and K_L is large (less compressed). In practice, K_M is the format you want unless you have a specific reason to choose otherwise.
Quality Difference in Practice
The quality gap between quantisation levels is real but not uniform across all tasks. Here is a practical guide to where the differences matter:
Quantisation Quality: Task-by-Task Impact

| Task | Q4_K_M | Q5_K_M / Q6_K | Q8_0 |
|---|---|---|---|
| General conversation | Excellent. No difference | Excellent | Excellent |
| Code generation | Good. Occasional minor errors | Very good | Excellent |
| Long document summarisation | Good. Some detail loss | Very good | Excellent |
| Mathematical reasoning | Acceptable. More errors | Better | Best CPU option |
| Creative writing | Very good | Excellent | Excellent |
| Instruction following | Very good | Excellent | Excellent |
The tasks where quantisation shows up most are those requiring precise mathematical reasoning and complex multi-step logic. For everyday chat, summarisation, and creative tasks, the difference between Q4_K_M and Q8_0 is not noticeable in typical use.
RAM Requirements by Hardware Tier
The RAM figure for running a model is the model file size plus overhead for the context window. Ollama typically needs an additional 1-2GB for the runtime itself. These are practical figures for running Ollama on a NAS or mini-PC:
RAM Required: Model Size + Quantisation

| Hardware RAM | Largest Model at Q4_K_M | Largest Model at Q8_0 |
|---|---|---|
| 4GB | 3B (just fits) | Not usable for 7B+ |
| 6GB | 7B (fits, 1-2GB headroom) | 3B only |
| 8GB | 7B (comfortable) | 3B-4B |
| 16GB | 13B (comfortable) | 7B |
| 32GB | 34B (fits) | 13B (comfortable) |
| 64GB | 70B (fits) | 34B |
Context window also uses RAM. The figures above assume a moderate context window (2K-4K tokens). If you increase the context window to 8K or 32K tokens in Ollama, RAM usage increases significantly. A 7B model at Q4_K_M with a 32K context window needs 12-14GB of RAM, not 6GB. Keep context window at the default unless your use case requires long conversations.
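That scaling can be sketched as a simple budget: weights plus KV cache plus runtime overhead. The per-1K-token KV-cache cost below is an assumed round number fitted to the figures in this section, not a measured constant; the real cost depends on the model's layer count, head dimensions, and KV-cache precision:

```python
def ram_budget_gb(model_file_gb: float, context_tokens: int,
                  kv_gb_per_1k: float = 0.25, runtime_gb: float = 1.5) -> float:
    """Rough RAM budget: weights + KV cache + runtime overhead.

    kv_gb_per_1k is an assumed average KV-cache cost per 1,000 tokens
    of context; it varies by model architecture and cache settings.
    """
    kv_cache_gb = context_tokens / 1000 * kv_gb_per_1k
    return model_file_gb + kv_cache_gb + runtime_gb

# A ~4GB 7B Q4_K_M weight file at the default ~4K context vs 32K:
default_ctx = ram_budget_gb(4.0, 4096)   # roughly 6.5 GB
long_ctx = ram_budget_gb(4.0, 32768)     # roughly 13.7 GB
```

Under these assumptions, raising the context window from 4K to 32K roughly doubles the total RAM footprint, which is why the default context is the safe choice on 8GB hardware.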
Speed Difference Between Quantisation Levels
More aggressive quantisation (fewer bits) also means faster inference, because there is less data to process per token. The difference is measurable but not dramatic on modern hardware:
Inference Speed by Quantisation: Llama 3 8B on AMD Ryzen V1500B (CPU-only)

| Format | Tokens/second | Relative to Q4_K_M |
|---|---|---|
| Q4_K_M | 10-14 tok/s | Baseline |
| Q5_K_M | 8-11 tok/s | ~20% slower |
| Q6_K | 7-10 tok/s | ~30% slower |
| Q8_0 | 5-8 tok/s | ~45% slower |
On a NAS with a Ryzen V1500B (QNAP TS-473A, TS-873A), Q8_0 on a 7B-class model produces 5-8 tok/s, below the comfortable threshold for chat. Q4_K_M at 10-14 tok/s is noticeably more usable. On a faster mini-PC (Beelink EQR6 at 20-28 tok/s on Q4_K_M), the Q8_0 slowdown to 12-16 tok/s is still comfortable.
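Those throughput figures translate directly into how long you wait for an answer. A quick sketch (the 250-token reply length is an illustrative assumption, and prompt-processing time is ignored):

```python
def reply_seconds(reply_tokens: int, tokens_per_second: float) -> float:
    """Time to generate a reply, ignoring prompt-processing time."""
    return reply_tokens / tokens_per_second

# A 250-token chat reply at the Ryzen V1500B rates quoted above:
q4_wait = reply_seconds(250, 12)   # ~21 s at Q4_K_M
q8_wait = reply_seconds(250, 6)    # ~42 s at Q8_0
```

Halving the token rate doubles the wait, which is why the Q4/Q8 gap feels much larger in practice than a benchmark table suggests.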
Which Format to Choose for Your Setup
A few common scenarios:
- NAS with 8GB RAM (DS925+, TS-473A): Use Q4_K_M for 7B models. Q8_0 will not leave enough headroom for the OS and other containers. Load one model at a time.
- NAS with 16GB RAM: Q4_K_M for 13B models, or Q8_0 for 7B. Choose based on whether you prefer quality or model size.
- Mini-PC with 32GB DDR5: Run Q5_K_M or Q6_K for 7B models with plenty of headroom, or Q4_K_M for 13B models. The speed is comfortable at all formats on this hardware tier.
- Mini-PC with 64GB + GPU VRAM: Use FP16 on GPU where it fits in VRAM. Fall back to Q4_K_M for models too large for VRAM.
- Photo AI (Immich, Synology Photos): These tools handle their own model quantisation. You do not select this manually. The application does it internally.
How to Select a Quantisation in Ollama
When you run `ollama pull llama3`, Ollama downloads the Q4_K_M version by default. To specify a different quantisation, pull a tag that names it explicitly:

```shell
ollama pull llama3:8b-instruct-q8_0
```
Or in your Modelfile:

```
FROM llama3:8b-instruct-q4_K_M
```
You can see available quantisation options for any model by visiting the model's page at ollama.com and checking the tags list. Not all models have all quantisation levels available. The maintainer publishes the variants they consider most useful.
For most home setups running Ollama on a NAS, sticking with the Ollama default (Q4_K_M) is the right call. The exception is if you have 16GB+ RAM, are doing technically demanding tasks like code generation or maths, and can accept a 20-30% speed reduction in exchange for better output quality. See the mini-PC vs NAS for local AI guide for hardware context, or the best NAS for AI in Australia for current model recommendations.
Australian Buyers: What You Need to Know
Quantisation is a software choice, not a hardware purchase. However, the RAM ceiling of your hardware determines which quantisation levels are viable. Australian buyers planning an AI NAS upgrade should note:
- Synology DS925+ and DS425+ both support RAM upgrades. The DS925+ takes up to 32GB, the DS425+ up to 16GB. Purchase additional RAM from Mwave, Scorptec, or PLE to get the right SO-DIMM spec.
- QNAP TS-473A ships with 8GB ECC DDR4 and supports 64GB. Upgrading to 16-32GB significantly expands your quantisation options for larger models.
- Mini-PCs from Beelink and Minisforum typically ship with 32GB DDR5. More than enough for Q4_K_M on 13B models. Check the listing carefully as some base configurations ship with 16GB.
Related reading: our NAS buyer's guide.
What does Q4_K_M mean in Ollama?
Q4_K_M is a GGUF quantisation format. Q4 means each model weight is stored at approximately 4-bit precision instead of the original 16-bit. K marks llama.cpp's k-quant method, which quantises weights in blocks so the most important values are preserved more accurately. M is the medium variant of this approach, balancing compression ratio against quality. It is the recommended default for most home hardware because it fits 7B models in 6-8GB RAM at comfortable inference speeds.
Is Q8 noticeably better than Q4 for everyday use?
For general conversation, summarisation, and creative writing: no. Most users cannot distinguish Q4_K_M and Q8_0 output in everyday tasks. The difference is more noticeable for maths problems, complex multi-step reasoning, and precise code generation. If those tasks matter to you and your hardware has enough RAM, Q8_0 or Q6_K is worth trying. Otherwise, Q4_K_M is sufficient.
How much RAM do I need to run a 7B model?
At Q4_K_M quantisation, a 7B model requires approximately 5.5-6.5GB of RAM for the model itself, plus 1-2GB overhead for Ollama and the OS. A total of 8GB RAM is the practical minimum, with 16GB recommended for comfortable operation and room for other tasks. At Q8_0, the same 7B model needs 9-10GB. A 16GB system is required.
What is the difference between Q4_K_M and Q4_K_S?
Both use 4-bit k-quant quantisation. K_S (small) is more aggressively compressed: slightly smaller file size and faster inference, but slightly more quality loss. K_M (medium) balances compression and quality better. K_L (large) is the least compressed 4-bit variant with the best quality. For most home users, K_M is the right choice. It is the Ollama default and the most widely tested variant.
Can I run Q4 on a NAS with 4GB RAM?
A 7B model at Q4_K_M requires 6-8GB of RAM in practice, so 4GB is not enough. You could run a 3B model at Q4_K_M (approximately 2.5GB) on a 4GB NAS, leaving just enough headroom for the OS. In practice, a 4GB NAS is below the useful threshold for text LLMs. It can still run lightweight AI tasks like Immich facial recognition and photo classification, which use smaller specialised models that do not go through Ollama.