What LLMs Actually Run on 16GB, 32GB and 64GB RAM

A plain-English guide to which local LLM models fit in 16GB, 32GB, and 64GB RAM, what quality trade-offs to expect at each tier, and why NAS RAM constraints matter more than most guides admit.

The amount of RAM you have is the single biggest constraint on which local AI models you can run. A 7B parameter model at Q4 quantisation fits in 16GB with room to spare. A 70B model at Q4 needs around 40GB, requiring 64GB RAM to run comfortably. Understanding which models fit at each memory tier stops you from downloading a model that will either refuse to load or crawl at unusable speeds.

In short: 16GB handles 7B models well. 32GB handles 13B-14B models comfortably, and 7B at higher quality. 64GB opens up 33B-70B models at practical speeds. Beyond 64GB, you are in enterprise or multi-GPU territory. NAS devices typically cap at 16-32GB RAM, which defines their AI ceiling.

How RAM Requirements Are Calculated

The working formula is straightforward: RAM needed (GB) = (parameter count in billions) x (bits per weight) / 8 x 1.1. The 1.1 multiplier accounts for the key-value cache and runtime overhead.

A 7B model at Q4 (4 bits per weight): 7 x 4 / 8 x 1.1 = ~3.9GB for the model weights. Add 2-4GB for the context cache and OS, and you need roughly 6-8GB free RAM to run it. A 13B model at Q4 needs approximately 8-10GB. A 70B model at Q4 needs roughly 38-42GB.
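The working formula above can be sketched as a small Python helper (the function name is illustrative, not from any library):

```python
def model_ram_gb(params_billion: float, bits_per_weight: float,
                 overhead: float = 1.1) -> float:
    """Estimate RAM for model weights alone, using the working formula:
    params (billions) * bits per weight / 8 * 1.1 runtime overhead."""
    return params_billion * bits_per_weight / 8 * overhead

# 7B at Q4 -> ~3.9GB of weights; add 2-4GB for context cache and OS on top
print(f"7B @ Q4:  {model_ram_gb(7, 4):.1f} GB")
print(f"70B @ Q4: {model_ram_gb(70, 4):.1f} GB")
```

Remember this estimates weights only; the 2-4GB context and OS allowance still comes on top.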

This is why quantisation level matters so much. The same 7B model at Q8 (8 bits) doubles the RAM requirement versus Q4. More bits per weight means better output quality but higher memory cost.

16GB RAM: What Actually Works

16GB is the baseline for most practical local AI setups in 2026. This is the RAM ceiling on many NAS devices and a common configuration for mini-PCs and older laptops running Ollama.

16GB RAM: What Runs and How Well

| Model | Quantisation | Approx RAM used | Quality vs GPT-4 |
|---|---|---|---|
| Llama 3.1 8B | Q4_K_M | ~5.5GB | Good for factual Q&A, coding basics |
| Mistral 7B | Q4_K_M | ~4.8GB | Strong reasoning, fast |
| Gemma 2 9B | Q4_K_M | ~6.2GB | Solid general-purpose |
| Phi-3 Mini 3.8B | Q4_K_M | ~2.5GB | Fast, good for simple tasks |
| Llama 3.1 8B | Q8_0 | ~9.0GB | Better quality, needs 12GB free |
| Llama 3 70B | Q2_K | ~28GB | Does NOT fit in 16GB |

At 16GB, a Q4_K_M 7B-9B model leaves enough headroom for the OS, Ollama's server process, and a browser or two. This is the configuration running on most QNAP and Synology NAS devices that attempt local AI, and it works adequately for document summarisation, Q&A on local files, and basic code completion.

The ceiling at 16GB is quality. A 7B model at Q4, regardless of family, will not match GPT-4o or Claude Sonnet on complex reasoning, nuanced writing, or multi-step analysis. That gap is not a configuration problem; it is a fundamental parameter count constraint.

What 16GB cannot do: run a 13B model at usable quality, run multiple models simultaneously, or maintain long context windows (32K+ tokens) without degrading performance.

32GB RAM: The Practical Sweet Spot

32GB opens up the 13B-14B model tier, which represents a meaningful quality jump over 7B. Models like Qwen 2.5 14B and Mistral Nemo 12B run comfortably at Q4-Q6 quantisation on 32GB systems.

More importantly, 32GB lets you run a 7B model at Q8 (near-lossless quality) while still having RAM headroom for context and concurrent processes. This is a better use of 32GB than forcing a 13B model at Q3, which introduces quantisation artefacts that degrade output consistency.

32GB RAM: Practical Configurations

| Model | Quantisation | Approx RAM used | Notes |
|---|---|---|---|
| Llama 3.1 8B | Q8_0 | ~9.0GB | Best-quality 7B, plenty of headroom |
| 13B-14B class model | Q4_K_M | ~9.0GB | Good quality bump over 7B |
| Qwen 2.5 14B | Q6_K | ~12.5GB | Strong coding and Chinese language |
| Mistral Nemo 12B | Q4_K_M | ~8.0GB | 128K context window, efficient |
| DeepSeek Coder 33B | Q3_K_M | ~16GB | Squeezes into 32GB, quality compromise |
| Llama 3 70B | Q2_K | ~28GB | Technically fits but very poor quality |

The classic mistake at 32GB is trying to run a 70B model with extreme quantisation (Q2). Q2 quantisation degrades output quality significantly, often producing responses worse than a well-configured 13B at Q4. The numbers fit but the results disappoint. If 70B quality is the target, 64GB is the right starting point.

On a NAS with 32GB RAM, the CPU becomes the bottleneck before RAM does. A Synology NAS with 32GB RAM running a 13B Q4 model will generate tokens noticeably slower than a mini-PC with the same RAM and an Intel Core i5/i7, because the low-power CPUs in most NAS units (Celeron J4125, Ryzen Embedded R1600 and similar) cannot match desktop CPU inference throughput.

64GB RAM: 70B Models Become Practical

64GB is the threshold where running a 70B model at Q4 becomes practical. Llama 3.1 70B at Q4_K_M uses approximately 40GB, leaving 24GB for OS, context, and runtime. This produces quality notably closer to GPT-4o than a 7B or 13B model, particularly on complex reasoning and long-form generation tasks.

64GB also enables multi-model setups: running two 7B models simultaneously for specialised agent pipelines (one for planning, one for execution), or loading a 13B model alongside a smaller specialised model for coding or summarisation.

64GB RAM: What Opens Up

| Model | Quantisation | Approx RAM used | Notes |
|---|---|---|---|
| Llama 3.1 70B | Q4_K_M | ~40GB | Practical 70B quality, recommended config |
| Llama 3.1 70B | Q6_K | ~57GB | Near-lossless quality, tight headroom at 64GB |
| Qwen 2.5 72B | Q4_K_M | ~42GB | Strong multilingual + coding performance |
| Mixtral 8x7B MoE | Q4_K_M | ~28GB | MoE: fast inference for similar quality |
| 2x Llama 3.1 8B | Q4_K_M | ~12GB total | Multi-agent, each model isolated |

At 64GB, CPU inference speed becomes the limiting factor more acutely. A 70B model at Q4 generating tokens on a CPU alone will produce roughly 0.5-2 tokens per second depending on CPU generation and core count. This is usable for document processing and batch tasks, but is noticeably slow for interactive chat. Adding a GPU changes this equation entirely, but that requires hardware capable of GPU expansion.
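CPU-only generation is memory-bandwidth bound: every generated token streams the full weight set from RAM, so a rough upper bound on speed is bandwidth divided by model size. A sketch, with assumed bandwidth figures (actual numbers depend on your DDR generation and channel count):

```python
def cpu_tokens_per_sec(mem_bandwidth_gbps: float, model_size_gb: float) -> float:
    """Rough upper bound: each generated token reads all weights once from RAM."""
    return mem_bandwidth_gbps / model_size_gb

# Assumed bandwidths: dual-channel DDR4-3200 ~50GB/s, dual-channel DDR5-5600 ~80GB/s
for label, bw in [("DDR4-3200 dual-channel", 50), ("DDR5-5600 dual-channel", 80)]:
    rate = cpu_tokens_per_sec(bw, 40)  # 40GB = Llama 3.1 70B at Q4_K_M
    print(f"{label}: ~{rate:.1f} tok/s for a 40GB model")
```

The estimates of roughly 1.3-2.0 tok/s are upper bounds; real-world throughput with prompt processing and thread contention lands in the 0.5-2 tok/s range quoted above.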

NAS RAM Limits: What This Means for AI

Most consumer NAS devices ship with 4-8GB RAM and support a maximum of 16-32GB. The QNAP TS-464 supports up to 16GB. The Synology DS925+ supports up to 32GB. UGREEN's DXP6800 Pro maxes at 32GB. QNAP's workstation TVS-H series supports 64GB+.

This means the practical AI ceiling for most NAS devices is the 7B model tier at 16GB, or the 13B tier if the NAS supports 32GB and you have fully populated the RAM slots. That is adequate for local document Q&A, summarisation, and simple automation. It is not adequate for quality comparable to current frontier models.

The implication: for users whose primary goal is private local AI inference at higher quality, a dedicated mini-PC with more RAM and a faster CPU will outperform a NAS running Ollama even with equivalent RAM installed. The NAS wins when storage capacity and AI inference are both required from a single appliance.

Common Mistakes to Avoid

Mistake 1: Downloading a 70B model onto a 16GB device. Ollama will attempt to load it using swap memory, which is orders of magnitude slower than RAM. The result is a model that appears to work but generates one token every 30-60 seconds. Always check model size against available RAM before downloading.

Mistake 2: Assuming Q2 quantisation of a large model beats a well-quantised small model. A 70B model at Q2 often produces worse results than a 13B at Q4 due to severe information loss at extreme quantisation. Match the model to your RAM tier rather than forcing an oversized model in.

Mistake 3: Ignoring context window RAM overhead. A 7B model at Q4 uses ~5GB for weights, but a 32K token context window adds another 2-4GB. Long context sessions will cause swapping on systems with limited headroom even if the model base fits.
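The context overhead can be estimated from the model's attention geometry. A sketch using Llama 3.1 8B's published configuration (32 layers, 8 KV heads under grouped-query attention, head dimension 128) with an fp16 cache; note that runtimes such as Ollama may quantise the KV cache, which shrinks this:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: one key and one value vector per layer, per token."""
    return (2 * n_layers * n_kv_heads * head_dim
            * context_tokens * bytes_per_elem) / 1e9

# Llama 3.1 8B at a 32K context: ~4.3GB on top of the ~5GB of Q4 weights
print(f"{kv_cache_gb(32, 8, 128, 32768):.1f} GB")
```

This is why a 32K-token session can push a comfortably-fitting model into swap on a 16GB machine.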

Australian Context: RAM Upgrade Costs

NAS RAM upgrade costs in Australia are reasonable for most models. DDR4 SO-DIMMs (common in QNAP and Synology 2-4 bay NAS) are available at Scorptec, PLE, and Mwave for $50-120 per 16GB stick. Expanding a DS925+ or TS-464 to 32GB costs roughly $100-200 in RAM.

Note that some Synology models use proprietary memory configurations or have limitations on third-party RAM. The DS925+ uses standard SO-DIMMs. Earlier Synology models like the DS923+ had some compatibility constraints. Check the vendor's memory compatibility list, or use Crucial's compatibility tool for your model before purchasing.

For a mini-PC targeting the 64GB tier (Beelink SER8 Pro, Minisforum HX90G), RAM is soldered on many units. Confirm the configuration at purchase; upgrading after the fact may not be possible.

Related reading: our NAS buyer's guide, our NAS vs cloud storage comparison, and our NAS explainer.

Use our free AI Hardware Requirements Calculator to size the hardware you need to run AI locally.

Can I run Llama 3 70B on 32GB RAM?

Technically yes at Q2_K quantisation, which uses approximately 28GB. However, Q2 quantisation degrades model quality significantly, often making the 70B model perform worse than a properly quantised 13B model. If 70B quality is the goal, 64GB is the correct RAM target. If you have 32GB, use a 13B model at Q4-Q6 instead.

Does GPU VRAM work the same as system RAM for local AI?

GPU VRAM and system RAM serve different roles. When a model is loaded onto a GPU, it uses VRAM. When it runs on CPU only (as on most NAS), it uses system RAM. A GPU with 8GB VRAM can run a 7B Q4 model fully on-GPU, which is much faster than the same model in system RAM. Hybrid setups (model split between VRAM and RAM) are slower than full-GPU but faster than full-CPU. See our GPU expansion guide for NAS-specific GPU details.

What is the best model to run on a NAS with 16GB RAM?

Llama 3.1 8B at Q4_K_M or Q6_K is the recommended starting point. It fits comfortably in 16GB, is well-tested with Ollama, and performs well on general tasks. Gemma 2 9B and Mistral 7B are strong alternatives. Avoid models above 9B parameters on 16GB unless you are running at Q3 or lower, which hurts quality.

How do I check how much RAM a model will use before downloading it?

On the Ollama model library page, each model variant lists the quantisation level and size. A rough rule: model size in GB is approximately equal to the RAM it will use, plus 2-3GB for context. The Ollama CLI command ollama show [model]:tag lists the model size after download. For pre-download estimates, multiply billions of parameters by quantisation bits, divide by 8, and add 20%.
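Putting the rule together, a pre-download check might look like the following sketch (the 20% margin and the context/OS allowances are the rough estimates from this article, not exact figures):

```python
def fits_in_ram(params_billion: float, quant_bits: float, total_ram_gb: float,
                context_gb: float = 3.0, system_gb: float = 2.0) -> bool:
    """Rough pre-download check: weights (+20% margin) + context + OS vs total RAM."""
    weights_gb = params_billion * quant_bits / 8 * 1.2
    return weights_gb + context_gb + system_gb <= total_ram_gb

print(fits_in_ram(7, 4, 16))   # 7B Q4 on a 16GB machine -> True
print(fits_in_ram(70, 4, 32))  # 70B Q4 on a 32GB machine -> False
```

A False result here is the signal to pick a smaller model or a lower quantisation before downloading, not after.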

Does ECC RAM matter for local AI inference?

ECC RAM matters for data integrity in storage and server workloads where a bit flip could corrupt a database or file. For AI inference, a bit flip causes an incorrect token, not data loss. ECC is not required for local LLM inference, though some QNAP workstation NAS models support it. Standard non-ECC DDR4 or DDR5 SO-DIMMs are fine for Ollama/LLM use.

Can I run local AI on a NAS with only 8GB RAM?

Yes, but the options are constrained. Phi-3 Mini (3.8B) and other sub-4B models run well on 8GB. Llama 3.2 3B is a capable general-purpose option. Standard 7B models at Q4 technically fit but leave very little headroom, and performance degrades noticeably with concurrent NAS operations. 8GB is workable for experimentation, but 16GB is the practical minimum for reliable daily use.

Want to know whether your specific NAS hardware can handle local AI workloads? The AI NAS hardware requirements guide covers CPU, RAM, NPU, and storage considerations in detail.
