For most home users running Ollama, two or three models cover all practical use cases: a general-purpose 7B or 8B model, a coding-specialist model, and optionally a small fast model for quick tasks on constrained hardware. The Ollama model library lists hundreds of models, including dozens of variants for the same base model at different quantisation levels. The choice that matters most is the base model family, not the specific variant. Llama 3.1 8B handles general conversation well. DeepSeek Coder V2 16B is the strongest coding model that fits on a machine with around 16GB available for inference. Phi-3 Mini runs usably fast on hardware with as little as 4GB available for inference.
In short: Start with ollama pull llama3.1:8b for general use. Add ollama pull qwen2.5-coder:7b for coding. Use ollama pull phi3:mini if you are on constrained hardware with less than 8GB available for inference. Everything else is an upgrade path once you know what you actually need.
How Ollama Model Names Work
Ollama model names follow the pattern family:size-variant-quantisation, where the variant and quantisation parts are optional. For example, llama3.1:8b-instruct-q4_K_M refers to the Llama 3.1 family, the 8-billion-parameter instruct variant, at Q4_K_M quantisation. When you pull a model without specifying a quantisation tag (for example, ollama pull llama3.1:8b), Ollama downloads the default variant, which is typically Q4_K_M or an equivalent. That default is a good starting point for most hardware.
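The naming pattern can be illustrated with a small parser. This is a minimal sketch, not part of Ollama: parse_model_tag and ModelTag are hypothetical names, and the rule that a quantisation label is the last dash-separated part and starts with "q" or "fp" is an assumption that holds for the examples in this guide but not necessarily for every tag in the library.

```python
from typing import NamedTuple, Optional


class ModelTag(NamedTuple):
    family: str
    size: Optional[str]
    variant: Optional[str]
    quantisation: Optional[str]


def parse_model_tag(name: str) -> ModelTag:
    """Split a name such as 'llama3.1:8b-instruct-q4_K_M' into its parts."""
    family, _, tag = name.partition(":")
    if not tag:
        # No tag at all, e.g. 'mistral'.
        return ModelTag(family, None, None, None)

    parts = tag.split("-")
    size = parts[0]  # first part of the tag, e.g. '8b' or 'mini'

    quantisation = None
    # Assumption: a quantisation label, when present, comes last
    # and starts with 'q' (q4_K_M, q8_0) or 'fp' (fp16).
    if len(parts) > 1 and parts[-1].lower().startswith(("q", "fp")):
        quantisation = parts.pop()

    # Whatever sits between size and quantisation is the variant.
    variant = "-".join(parts[1:]) or None
    return ModelTag(family, size, variant, quantisation)
```

Calling parse_model_tag("llama3.1:8b") leaves the variant and quantisation empty, which corresponds to letting Ollama pick its default.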
The instruct suffix indicates a model fine-tuned to follow instructions and respond to conversational prompts, as opposed to a base model trained only for text completion. For almost all practical use cases, you want the instruct variant. Ollama's default for most models is the instruct version, so pulling without a specific tag usually gives you the right thing. See the full guide on LLM quantisation levels for a detailed breakdown of what Q4, Q6, and Q8 mean for model quality and RAM requirements.
General-Purpose: Llama 3.1 8B
Meta's Llama 3.1 8B instruct model is the default starting point for home AI users and the most widely used open model in the Ollama ecosystem. It handles general conversation, summarisation, question-answering, writing assistance, and basic reasoning well. At Q4_K_M quantisation, it requires approximately 4.7GB of RAM, which fits in systems with 8GB or more total RAM with adequate headroom.
Pull command: ollama pull llama3.1:8b
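The RAM figure can be sanity-checked with rough arithmetic: parameter count times bits per weight, divided by eight. The sketch below assumes an effective 4.5 to 5 bits per weight for Q4-class quantisation and a 1GB allowance for context and runtime overhead. Both numbers are illustrative assumptions, not values published by Ollama, and the function names are hypothetical.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         available_gb: float, overhead_gb: float = 1.0) -> bool:
    """True if weights plus an assumed overhead fit in the available RAM."""
    needed = estimate_weights_gb(params_billion, bits_per_weight) + overhead_gb
    return needed <= available_gb


# An 8B model at an assumed 4.5 to 5 bits per weight lands between
# 4.5GB and 5GB, which brackets the ~4.7GB figure quoted above.
low = estimate_weights_gb(8, 4.5)
high = estimate_weights_gb(8, 5.0)
```

The same arithmetic for a 70B model at 4.5 bits per weight gives roughly 39GB, in line with the 35 to 40GB range quoted for that size.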
Llama 3.1 supports a 128,000-token context window in its full form, though Ollama's default context length is 2,048 tokens unless configured otherwise. If you need to process long documents, raise the context length with the num_ctx parameter, either in a Modelfile or per request through the API, keeping in mind that a larger context window uses more RAM. The Llama 3.1 family also includes a 70B parameter version for hardware with sufficient RAM (35 to 40GB available), which offers substantially better reasoning and instruction-following capability at the cost of significantly slower inference.
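One way to raise the context length for a single request is the num_ctx option of Ollama's HTTP API. The sketch below assumes Ollama is listening on its default local address, http://localhost:11434, and uses 8,192 tokens purely as an example value. The helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default; change if yours differs


def build_generate_request(model: str, prompt: str, num_ctx: int) -> urllib.request.Request:
    """Build a POST to /api/generate that overrides the context length."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # ask for one JSON object, not a stream
        "options": {"num_ctx": num_ctx},  # context window in tokens
    }
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, num_ctx: int = 8192) -> str:
    """Send the request and return the generated text."""
    request = build_generate_request(model, prompt, num_ctx)
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read())["response"]
```

For example, generate("llama3.1:8b", long_document_text) would run with an 8,192-token window, provided the machine has the extra RAM that the larger context needs.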
Coding: Qwen2.5-Coder 7B and DeepSeek Coder V2
General-purpose models handle basic code tasks adequately, but models trained specifically on code perform noticeably better for debugging, code generation, and code explanation. Two models stand out at the 7B to 16B range that fits in typical home hardware.
Qwen2.5-Coder 7B from Alibaba is the strongest 7B-class coding model available at this size. It fits in the same RAM envelope as Llama 3.1 8B and handles code completion, debugging, and code explanation across multiple programming languages including Python, JavaScript, Go, Rust, and others. It is the first choice for coding assistance on hardware with 8 to 16GB available for inference.
Pull command: ollama pull qwen2.5-coder:7b
DeepSeek Coder V2 16B requires approximately 10GB of RAM at Q4 quantisation and delivers a meaningful step up in code quality and reasoning ability over 7B models. If your hardware has 16GB or more available for inference (32GB total RAM recommended), DeepSeek Coder V2 is the coding model to use for complex tasks.
Pull command: ollama pull deepseek-coder-v2:16b
Low-RAM Hardware: Phi-3 Mini and Gemma 2 2B
For hardware with limited RAM available for inference (NAS devices, mini-PCs with 8GB total, or systems with heavy background processes), models below 4B parameters provide useful responses at minimal RAM cost.
Microsoft Phi-3 Mini (3.8B) was designed specifically for efficient inference on constrained hardware. Despite its small size, it handles question-answering, summarisation, and basic reasoning well. At Q4 quantisation it requires approximately 2.3GB of RAM, leaving headroom even on a NAS with 8GB total RAM sharing with the OS and running services.
Pull command: ollama pull phi3:mini
Google Gemma 2 2B is even smaller at approximately 1.6GB at Q4 quantisation and generates tokens quickly enough for interactive use even on slower hardware. It is more limited in reasoning capability than Phi-3 Mini but may be the only viable option on hardware with less than 4GB available for inference.
Pull command: ollama pull gemma2:2b
Multilingual and Multi-Task: Qwen2 7B
Llama 3.1 and most Western models are trained predominantly on English text. For multilingual use cases including Chinese, Japanese, Korean, Spanish, French, German, and others, the Qwen2 family from Alibaba is the strongest open-source option at each parameter tier. Qwen2 7B handles multilingual conversation, translation, and instruction-following significantly better than Llama-family models at the same size.
Pull command: ollama pull qwen2:7b
If you are running inference primarily in English, Llama 3.1 8B will outperform Qwen2 7B on most English-language benchmarks. If you need strong multilingual capability or work primarily in a non-English language, Qwen2 7B is the better default choice.
Writing and Creative Work: Mistral 7B
Mistral 7B is a well-balanced general-purpose model that many users find produces more fluid and natural-feeling prose than the Llama family at the same parameter count. For creative writing, blog post drafting, email writing, and similar tasks where the quality of the generated text matters more than technical accuracy, Mistral 7B Instruct is worth comparing against Llama 3.1 8B directly to see which output style you prefer.
Pull command: ollama pull mistral:7b
At this point, model preference for writing tasks is genuinely subjective. Try both and compare the output on your actual use cases. The RAM requirement is nearly identical between Mistral 7B and Llama 3.1 8B at Q4, so there is no hardware cost to keeping both available.
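A side-by-side comparison is easy to script. The following is a rough sketch that sends the same prompt to each model through Ollama's HTTP API. It assumes both models are already pulled and that Ollama is reachable at its default local address. compare_models and ollama_generate are hypothetical helper names.

```python
import json
import urllib.request
from typing import Callable, Dict, Iterable

OLLAMA_URL = "http://localhost:11434"  # assumed default local address


def ollama_generate(model: str, prompt: str) -> str:
    """Ask one model for a non-streamed completion and return the text."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    request = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read())["response"]


def compare_models(prompt: str, models: Iterable[str],
                   generate: Callable[[str, str], str] = ollama_generate) -> Dict[str, str]:
    """Run the same prompt against each model and collect the outputs."""
    return {model: generate(model, prompt) for model in models}


def print_comparison(results: Dict[str, str]) -> None:
    """Print each model's answer under a simple header."""
    for model, text in results.items():
        print(f"=== {model} ===")
        print(text)
        print()
```

Running print_comparison(compare_models("Draft a short thank-you email.", ["llama3.1:8b", "mistral:7b"])) prints both drafts one after the other so the styles can be judged directly.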
Model Comparison by Use Case and Hardware
Ollama Model Selection by Use Case and RAM
| Model | RAM required (Q4) | Primary strength | Pull command |
|---|---|---|---|
| Gemma 2 2B | ~1.6GB | Minimum viable, very fast | ollama pull gemma2:2b |
| Phi-3 Mini (3.8B) | ~2.3GB | Low-RAM systems, reasoning | ollama pull phi3:mini |
| Mistral 7B Instruct | ~4.1GB | Writing, fluid prose | ollama pull mistral:7b |
| Llama 3.1 8B Instruct | ~4.7GB | General purpose, default choice | ollama pull llama3.1:8b |
| Qwen2 7B Instruct | ~4.5GB | Multilingual, Chinese/Japanese/Korean | ollama pull qwen2:7b |
| Qwen2.5-Coder 7B | ~4.5GB | Code generation and debugging | ollama pull qwen2.5-coder:7b |
| DeepSeek Coder V2 16B | ~10GB | Best coding under 32B | ollama pull deepseek-coder-v2:16b |
| Llama 3.1 70B Instruct | ~40GB | Frontier-class open model | ollama pull llama3.1:70b |
Which Hardware Can Run Which Models
| Hardware | What it can run |
|---|---|
| NAS with 8GB RAM (4 to 6GB available) | Gemma 2 2B or Phi-3 Mini only. 7B models may load but will be too slow for practical use. |
| Mini-PC or NAS with 16GB RAM (10 to 12GB available) | All 7B and 8B models comfortably. DeepSeek Coder V2 16B at the limit. No 13B+ models. |
| 32GB RAM (25 to 28GB available) | All 7B and 13B models comfortably. DeepSeek Coder V2 16B with headroom. 30B models at aggressive quantisation. |
| 64GB RAM (55 to 60GB available) | All models up to 70B at Q4 quantisation. Llama 3.1 70B runs, slowly on CPU-only hardware. |
| 8GB VRAM GPU | 7B models fully in VRAM at Q4 (fast). 13B models at the limit. Much faster than CPU-only inference. |
| 16GB VRAM GPU | 13B models fully in VRAM. 34B models at aggressive quantisation. Significant speed advantage over CPU. |
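The two tables can be combined into a quick lookup: given the RAM available for inference, list the models whose Q4 footprint fits. This is an illustrative sketch using the approximate RAM figures from this guide. The 1GB headroom default is an assumption, and a model that fits in memory may still be too slow to be practical on a given machine.

```python
# (pull name, approximate RAM at Q4 in GB), taken from the table in this guide
MODELS = [
    ("gemma2:2b", 1.6),
    ("phi3:mini", 2.3),
    ("mistral:7b", 4.1),
    ("llama3.1:8b", 4.7),
    ("qwen2:7b", 4.5),
    ("qwen2.5-coder:7b", 4.5),
    ("deepseek-coder-v2:16b", 10.0),
    ("llama3.1:70b", 40.0),
]


def models_that_fit(available_gb: float, headroom_gb: float = 1.0) -> list:
    """Return pull names whose RAM need plus headroom fits in available_gb.

    headroom_gb is an assumed allowance for context and runtime overhead.
    """
    return [name for name, ram_gb in MODELS if ram_gb + headroom_gb <= available_gb]


# With 4GB available, only the two sub-4B models qualify.
small_box = models_that_fit(4)
```

With 12GB available the list includes everything up to DeepSeek Coder V2 16B, which matches the "at the limit" note for 16GB machines.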
Managing Models in Ollama
Ollama stores downloaded models in a local model directory. On Linux and macOS, the default location is ~/.ollama/models/. On Windows, it is the .ollama\models folder inside your user profile directory. Model files are large: a 7B Q4 model is 4 to 5GB, a 13B model is 8 to 9GB. Keeping five or six models takes 25 to 40GB of disk space. If storage is tight, remove models you are not using with ollama rm modelname and re-pull them when needed.
To see all models currently downloaded: ollama list
To run a model directly from the terminal: ollama run llama3.1:8b
To see which models are currently loaded in memory: ollama ps
Ollama keeps recently used models loaded in memory for a short period (five minutes by default) to avoid reload overhead. On low-RAM systems, only one model can be loaded at a time. On systems with sufficient RAM, Ollama can keep multiple models loaded simultaneously, switching between them without a reload delay.
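To check how much disk space your downloaded models use without adding up the ollama list output by hand, you can read the same information from Ollama's HTTP API. The sketch below assumes the default local address and relies on the /api/tags endpoint, which returns each local model with its size in bytes. The helper names are hypothetical.

```python
import json
import urllib.request


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of locally stored models from a running Ollama instance."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        return json.loads(resp.read())


def total_size_gb(tags_response: dict) -> float:
    """Sum the 'size' field (bytes) of every model and convert to GB."""
    total_bytes = sum(m["size"] for m in tags_response.get("models", []))
    return total_bytes / 1e9


def largest_first(tags_response: dict) -> list:
    """Return (name, size in GB) pairs, biggest model first."""
    models = tags_response.get("models", [])
    ranked = sorted(models, key=lambda m: m["size"], reverse=True)
    return [(m["name"], m["size"] / 1e9) for m in ranked]
```

Calling largest_first(fetch_tags()) gives a quick view of which models are the best candidates for ollama rm when storage is tight.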
Which Ollama model is best for beginners?
Start with ollama pull llama3.1:8b. It is the most widely used model in the Ollama ecosystem, handles general conversation well, runs on any hardware with 8GB or more of RAM, and has extensive documentation and community support. If you find it too slow on your hardware, switch to phi3:mini for a smaller and faster alternative. If you need coding assistance specifically, add qwen2.5-coder:7b as a second model.
How do I choose between Llama 3.1 and Mistral for writing tasks?
Download both and try them on your actual writing tasks. Pull Llama 3.1 with ollama pull llama3.1:8b and Mistral with ollama pull mistral:7b. Give both the same prompt and compare the output. Some users prefer Mistral's output style for prose and creative writing. Others find Llama 3.1 more accurate for factual tasks. The hardware cost of keeping both is nearly identical, approximately 4 to 5GB each, so there is no reason not to test both.
Can I run more than one Ollama model at the same time?
Yes. Ollama keeps recently used models in memory and switches between them when requests arrive. How many models can coexist in memory simultaneously depends on your RAM. On 16GB systems, you can typically keep one 7B model loaded. On 32GB systems, two 7B models can coexist. Ollama manages the loading and unloading automatically. If you switch between models frequently, you will see a brief reload delay (5 to 30 seconds) when a model has been evicted from memory. You can also reduce the model keep-alive timeout if you want Ollama to free memory more aggressively between requests.
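The keep-alive behaviour can also be set per request through the keep_alive field of Ollama's HTTP API. The sketch below assumes the default local address. keep_alive_request and send are hypothetical helper names, and the example durations are arbitrary.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default local address


def keep_alive_request(model: str, keep_alive) -> urllib.request.Request:
    """Build a request that sets how long `model` stays loaded after use.

    keep_alive may be a duration string such as "30s" or "10m", or a number
    of seconds. A value of 0 asks Ollama to unload the model straight away.
    """
    payload = {"model": model, "keep_alive": keep_alive, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send a prepared request and return the decoded JSON reply."""
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read())
```

For example, send(keep_alive_request("llama3.1:8b", 0)) frees the memory immediately, while a value such as "30s" shortens the timeout for that model. A server-wide default can be set with the OLLAMA_KEEP_ALIVE environment variable instead.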
What is the difference between a base model and an instruct model in Ollama?
A base model is trained on large amounts of text to predict the next token. It generates text continuations, not responses to questions. An instruct model is the same base model further trained with instruction-following data, which teaches it to respond helpfully to prompts rather than just continuing text. For all practical conversational use (chat interfaces, question-answering, document summarisation), you want the instruct variant. Ollama defaults to instruct models for most model pulls. Base models are used primarily for further fine-tuning, not for interactive use.
How often should I update my Ollama models?
Model families update periodically with improved versions. ollama pull modelname checks for an updated version of the same model and downloads it if available. For active model families like Llama and Qwen, new versions appear every few months with meaningful capability improvements. Checking for updates once a month with a pull command for your regularly used models is sufficient. Ollama does not auto-update models. You need to run the pull command manually or script it. The update replaces the old version in the models directory, recovering the disk space from the previous version.
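Scripting the monthly check can be as simple as looping over your regular models and calling ollama pull for each. The sketch below assumes the ollama command is on your PATH. MY_MODELS is an example list, and update_all is a hypothetical helper, not something Ollama ships.

```python
import subprocess
from typing import Callable, Iterable, List

# Example list; replace with the models you actually use.
MY_MODELS = ["llama3.1:8b", "qwen2.5-coder:7b", "phi3:mini"]


def pull_command(model: str) -> List[str]:
    """Return the argument list for 'ollama pull <model>'."""
    return ["ollama", "pull", model]


def update_all(models: Iterable[str], run: Callable = subprocess.run) -> List[str]:
    """Pull each model in turn and return the names that failed.

    `run` defaults to subprocess.run and can be swapped out for testing.
    """
    failed = []
    for model in models:
        result = run(pull_command(model))
        if result.returncode != 0:
            failed.append(model)
    return failed
```

Calling update_all(MY_MODELS) from a scheduled job such as cron once a month covers the routine described above. An empty return value means every pull succeeded.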
Need a chat interface to use these models? Open WebUI runs in Docker on a NAS or mini-PC and serves every device on your network from a single installation.