Open WebUI is a self-hosted, ChatGPT-style interface for running large language models locally. It connects to Ollama (a local LLM runtime) and provides a polished chat interface for models like Llama 3, Mistral, Phi-3, and others. Running it on a NAS means your AI assistant is always available on your local network, requires no API key or subscription, and keeps all data local. This guide covers deploying the Open WebUI + Ollama Docker stack on a NAS, model selection for NAS hardware, and the GPU passthrough configuration that significantly improves inference speed on capable hardware.
In short: Deploy Ollama and Open WebUI as Docker containers on your NAS, pull a model (start with Llama 3.2 3B or Phi-3 Mini for resource-constrained NAS hardware), and access the chat interface at port 3000. LLM inference on CPU is slow. Expect 2-8 tokens/second on a Celeron NAS. GPU passthrough on QNAP PCIe models or dedicated GPU hardware dramatically improves speed.
Hardware Reality: What to Expect from NAS LLMs
NAS hardware is not designed for LLM inference. Setting realistic expectations:
- Intel Celeron N5095 (TS-464, DS423+): CPU-only inference at 2-5 tokens/second for a 3B parameter model. Usable for non-time-critical tasks. Not suitable for interactive conversation with large (7B+) models
- AMD Ryzen R1600 (DS923+, TS-473A): Slightly faster at ~5-8 tokens/second for 3B models. The integrated GPU can assist with some models
- QNAP with PCIe GPU (TS-473A + NVIDIA GPU): Adding a dedicated GPU via PCIe (NVIDIA RTX 3060/4060 in a compatible half-height format) enables GPU inference. 40-80+ tokens/second for 7B models. This is the correct hardware approach for real-time LLM use on NAS hardware
For casual, non-real-time use (asking questions and waiting 30-60 seconds for a full response), CPU-only inference on a Celeron NAS is functional. For interactive use, a GPU or dedicated inference hardware is needed.
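To translate tokens/second into wall-clock wait time, a rough back-of-envelope helps. The figures below are assumptions drawn from the ranges above, not benchmarks of any specific unit:

```shell
# Rough response-time estimate for CPU-only inference on a Celeron NAS.
# TOKENS and RATE are illustrative assumptions, not measured values.
TOKENS=400        # a typical full paragraph-style answer
RATE=4            # tokens/second, mid-range for a 3B model on a Celeron
echo "$((TOKENS / RATE)) seconds"   # ~100 seconds for a complete answer
```

The same 400-token answer at 40 tokens/second (GPU inference) arrives in about 10 seconds, which is why the GPU route matters for interactive use.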
Step 1: Deploy Ollama and Open WebUI
Create a Docker Compose file at /volume1/docker/openwebui/docker-compose.yml (Synology) or /share/docker/openwebui/docker-compose.yml (QNAP):
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ./ollama:/root/.ollama
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    depends_on:
      - ollama
    ports:
      - 3000:8080
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - ./open-webui-data:/app/backend/data
    restart: unless-stopped

Deploy with docker compose up -d. The first startup downloads the Open WebUI image (~1.5GB) and starts both services. Access Open WebUI at http://[NAS-IP]:3000 and create an admin account on first access.
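A few quick sanity checks after deployment confirm both containers came up correctly. These commands assume the container name ollama from the Compose file above:

```shell
# Verify the stack is healthy after `docker compose up -d`.
docker compose ps                     # both services should show "running"
docker exec ollama ollama --version   # confirms the Ollama binary responds
docker exec ollama ollama list        # lists pulled models (empty on first run)
```

If `ollama list` returns without error, Open WebUI at port 3000 should be able to reach the Ollama backend.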
Step 2: Pull a Model
After Open WebUI loads, pull a model from Ollama's library. Model selection depends on your NAS RAM:
- 4GB RAM available for Ollama: Use Phi-3 Mini (3.8B parameters, ~2.3GB) or Llama 3.2 3B (~2GB). These are the smallest capable models
- 8GB RAM available: Use Llama 3.2 3B or Mistral 7B (~4.1GB). Mistral 7B is significantly more capable than 3B models
- 16GB RAM available: Use Llama 3.1 8B (~4.7GB) or Mistral 7B. More comfortable headroom
To pull a model in Open WebUI: Admin Settings → Models → pull a model from the Ollama library. Enter the model name (e.g. llama3.2:3b) and click Pull. The model downloads from Ollama's registry; sizes range from ~2GB for small models to 40GB+ for 70B-class models. The first pull may take 20-60 minutes depending on model size and internet speed.
Alternatively, pull from the Ollama container CLI: docker exec -it ollama ollama pull llama3.2:3b
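Once the pull completes, a one-off prompt from the CLI confirms the model loads and answers (the model name matches the pull example above; any pulled model works):

```shell
# Quick smoke test: run a single prompt against the pulled model.
docker exec -it ollama ollama run llama3.2:3b "Say hello in five words."
# Add --verbose to print timing stats, including eval rate in tokens/second:
docker exec -it ollama ollama run llama3.2:3b --verbose "Say hello in five words."
```

The eval rate reported by --verbose is the practical way to check whether your NAS lands in the 2-5 or 5-8 tokens/second range described earlier.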
Step 3: GPU Passthrough (QNAP PCIe Models)
QNAP NAS models with PCIe slots (TS-473A, TS-673A) can host a GPU card for hardware-accelerated inference. NVIDIA consumer GPUs (RTX 3060/4060 in low-profile form factor) work with Ollama's CUDA backend.
To enable GPU passthrough in the Compose file, modify the Ollama service:
ollama:
  image: ollama/ollama:latest
  container_name: ollama
  volumes:
    - ./ollama:/root/.ollama
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]
  restart: unless-stopped

This requires the NVIDIA Container Toolkit installed on the NAS host. On QNAP, this is available as a QTS package for supported GPU models. After configuration, verify GPU usage: docker exec -it ollama ollama ps. Running models should show GPU allocation.
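Two checks confirm the passthrough actually worked. In recent Ollama versions, ollama ps includes a processor column that shows whether a loaded model is on GPU or CPU:

```shell
# Confirm the container can see the card (requires NVIDIA Container Toolkit
# on the host; the container name matches the Compose file above).
docker exec ollama nvidia-smi   # should list the RTX card and driver version
# With a model loaded, check where it is running:
docker exec ollama ollama ps    # processor column should report GPU, not CPU
```

If nvidia-smi fails inside the container but works on the host, the Container Toolkit is not wired into Docker's runtime configuration.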
🇦🇺 Australian Users: Hardware Notes
Recommended hardware configurations for local LLM on NAS in Australia (March 2026):
- QNAP TS-473A (~$1,269 AUD) + NVIDIA RTX 3060 12GB: Best self-hosted LLM NAS platform in the current AU lineup. AMD Ryzen CPU, PCIe slot for the GPU, 8GB RAM expandable. The RTX 3060 12GB handles 7B-13B models at 40-80 tokens/second. Total cost ~$1,700-1,800 AUD
- Intel Celeron NAS (TS-464, DS423+), CPU only: Usable for 3B models at 2-5 tokens/second. Acceptable for summarisation tasks and non-interactive queries; not suitable for real-time conversation with capable models
If you want local LLM inference as a primary use case rather than an add-on, a dedicated mini-PC with integrated GPU (Intel Core Ultra or AMD Ryzen with strong integrated graphics) or a PC with a used NVIDIA card provides better price/performance than a NAS with GPU card.
See the best NAS for local LLM guide for a complete hardware comparison across AI workloads.
Related reading: our NAS buyer's guide and our NAS explainer.
Use our free NAS Sizing Wizard to get a personalised NAS recommendation.
Can I use Open WebUI with the OpenAI API instead of local models?
Yes. Open WebUI supports connecting to the OpenAI API as a backend alongside or instead of Ollama. Add your OpenAI API key under Admin Settings → Connections → OpenAI API. This lets you use GPT-4o, GPT-4 Turbo, and other OpenAI models through the same interface as local models. Useful if you want a unified chat interface for both local (private, free) and cloud (capable, paid) models depending on the task.
What is the difference between Open WebUI and ChatGPT?
Open WebUI is a self-hosted interface running models on your own hardware. ChatGPT uses OpenAI's cloud-hosted GPT models. The key differences: Open WebUI is private (data never leaves your network), free to run (no API costs once hardware is paid for), but limited by your hardware's inference speed. ChatGPT (and GPT-4) is significantly more capable than the open models available for local inference today, and responds in real-time. Local LLMs are best for private, offline, or cost-sensitive use cases; ChatGPT/Claude are better for capability-demanding tasks.
How much storage do LLM models take?
Model sizes: Phi-3 Mini (3.8B) ~2.3GB, Llama 3.2 3B ~2GB, Mistral 7B ~4.1GB, Llama 3.1 8B ~4.7GB, Llama 3.1 70B ~40GB. Models are stored in the Ollama volume mount on your NAS. For a selection of 3-4 models (one small, one medium), budget 10-15GB of NAS storage. Larger models (30B, 70B) require 20-40GB of storage and 24GB+ of RAM to load, which puts them beyond typical NAS hardware.
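A quick budget check for a small model library, using the approximate sizes quoted above:

```shell
# Storage budget for three small-to-medium models (approximate sizes in MB,
# taken from the figures above).
PHI3=2300; LLAMA3B=2000; MISTRAL=4100
echo "$(( (PHI3 + LLAMA3B + MISTRAL) / 1000 )) GB total"   # ~8 GB
# On the NAS itself, check actual usage of the Ollama volume, e.g.:
#   du -sh /volume1/docker/openwebui/ollama
```

Actual on-disk usage runs slightly higher because Ollama also stores model manifests and any partially downloaded layers.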
Is Ollama only for NAS?
No. Ollama runs on any Linux, macOS, or Windows machine. The NAS deployment is convenient because the NAS is always on and accessible on the local network. You can query your local LLM from any device in your home without leaving a PC running. But for best performance, running Ollama on a PC or Mac with a GPU is more capable than NAS hardware. Many homelab users run Ollama on their primary PC for performance and use the NAS for everything-always-on services like Nextcloud, Immich, and Home Assistant.
Can Open WebUI be accessed remotely?
Yes. Configure HTTPS via NGINX Proxy Manager or a Cloudflare Tunnel. Same approach as other self-hosted NAS services. Once accessible via HTTPS, you can query your local LLM from anywhere. Note that remote access routes your queries through your internet connection (sending text queries out, receiving responses in). For private documents you want to keep off-internet entirely, restrict to local network access only via VPN.
Curious which NAS hardware handles local AI inference and what to expect from each model? The best NAS for local LLM guide covers hardware requirements, model selection, and GPU options.
Best NAS for Local LLM →