How to Run Qwen 3 Locally: Complete Guide (2026)
Step-by-step instructions for running Qwen 3 models on your own hardware
Qwen 3 is Alibaba's latest open-source large language model family and one of the strongest open-weight models available in 2026. It comes in multiple sizes, supports both dense and mixture-of-experts (MoE) architectures, and performs competitively with proprietary models like GPT-4o and Claude Sonnet on many benchmarks.
The best part: you can run it entirely on your own hardware with no API costs, no rate limits, and complete data privacy. This guide shows you how, step by step.
Qwen 3 Model Lineup
Qwen 3 comes in several sizes to fit different hardware:
| Model | Parameters | Active Params | Architecture | Min VRAM | Best For |
|---|---|---|---|---|---|
| Qwen3-0.6B | 0.6B | 0.6B | Dense | 2 GB | Edge devices, mobile |
| Qwen3-1.7B | 1.7B | 1.7B | Dense | 4 GB | Lightweight tasks |
| Qwen3-4B | 4B | 4B | Dense | 6 GB | Balanced performance |
| Qwen3-8B | 8B | 8B | Dense | 8 GB | General use |
| Qwen3-14B | 14B | 14B | Dense | 12 GB | Strong reasoning |
| Qwen3-32B | 32B | 32B | Dense | 24 GB | Near-frontier quality |
| Qwen3-30B-A3B | 30B | 3B | MoE | 6 GB | Fast, efficient |
| Qwen3-235B-A22B | 235B | 22B | MoE | 48 GB+ | Frontier-class |
The MoE (Mixture of Experts) models are particularly interesting. Qwen3-30B-A3B has 30 billion total parameters but only activates 3 billion per token, making it fast and memory-efficient while maintaining high quality.
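A quick way to see why: memory footprint is driven by total parameters, while per-token compute is driven by active parameters. Here is a rough back-of-the-envelope sketch (the ~4.5 bits per weight figure for Q4_K_M-style quantization is an approximation, not an exact spec):
# Rough estimate: weight memory scales with total params, per-token compute with active params
def approx_weight_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for name, total_b, active_b in [("Qwen3-8B", 8, 8), ("Qwen3-30B-A3B", 30, 3)]:
    print(f"{name}: ~{approx_weight_gb(total_b):.1f} GB of weights, "
          f"~{active_b}B parameters active per token")
So the MoE model needs more memory than the dense 8B to hold its weights, but each token only touches about as much compute as a 3B dense model, which is where the speed comes from.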
Method 1: Ollama (Easiest)
Ollama is the simplest way to run LLMs locally. It handles model downloading, quantization, and serving with a single command.
Install Ollama
# macOS / Linux
curl -fsSL https://ollama.ai/install.sh | sh
# macOS (Homebrew)
brew install ollama
# Windows: download from ollama.ai
Download and Run Qwen 3
# Pull and run Qwen 3 8B (recommended starting point)
ollama run qwen3:8b
# Other sizes
ollama run qwen3:0.6b # Very small, fast
ollama run qwen3:1.7b # Lightweight
ollama run qwen3:4b # Good balance
ollama run qwen3:14b # Strong reasoning
ollama run qwen3:32b # High quality (needs 24GB+ VRAM)
ollama run qwen3:30b-a3b # MoE - fast with good quality
# Specific quantizations
ollama run qwen3:8b-q4_K_M # 4-bit quantized (smaller, faster)
ollama run qwen3:8b-q8_0 # 8-bit quantized (better quality)
ollama run qwen3:8b-fp16 # Full precision (best quality, most VRAM)
Once the model is downloaded, you will see an interactive prompt where you can start chatting.
Use as an API
Ollama runs a local API server on port 11434:
# Start the server (runs automatically on install)
ollama serve
# Chat completion (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3:8b",
"messages": [
{"role": "user", "content": "Write a Python function to merge two sorted lists"}
]
}'
Use with Python
# Using the OpenAI Python library (Ollama is OpenAI-compatible)
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # any string works
)

response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Explain the difference between async and threading in Python"},
    ],
)
print(response.choices[0].message.content)
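For interactive use, streaming is worth enabling so tokens print as they arrive. A minimal sketch against the same endpoint:
# Stream tokens from Ollama's OpenAI-compatible endpoint as they are generated
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

stream = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Write a haiku about local LLMs"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()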
Connect to Code Editors
Ollama integrates with AI code editors:
Cursor:
- Go to Settings > Models.
- Add OpenAI-compatible model.
- Set the base URL to http://localhost:11434/v1.
- Set the model name to qwen3:8b.
Continue.dev (VS Code):
// ~/.continue/config.json
{
"models": [
{
"title": "Qwen 3 8B (Local)",
"provider": "ollama",
"model": "qwen3:8b"
}
]
}
Claude Code:
# Experimental: Claude Code speaks the Anthropic Messages API, while Ollama's /v1
# endpoint is OpenAI-compatible, so you typically need a translating proxy (for
# example, LiteLLM) in front of Ollama rather than pointing Claude Code at it directly.
export ANTHROPIC_BASE_URL="http://localhost:4000"   # your proxy's URL, not Ollama's
export ANTHROPIC_MODEL="qwen3:8b"
Method 2: llama.cpp (Maximum Performance)
For maximum control and performance, use llama.cpp directly. It supports CPU, CUDA, Metal, and Vulkan acceleration.
Install llama.cpp
# Clone and build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Build with CUDA (NVIDIA GPUs)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# Build with Metal (Apple Silicon)
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j
# Build CPU-only
cmake -B build
cmake --build build --config Release -j
Download GGUF Models
GGUF is the optimized model format for llama.cpp. Download from Hugging Face:
# Install huggingface-cli
pip install huggingface_hub
# Download Qwen 3 8B in Q4_K_M quantization
huggingface-cli download Qwen/Qwen3-8B-GGUF \
qwen3-8b-q4_k_m.gguf \
--local-dir ./models/
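The same download can be scripted from Python with huggingface_hub; a small sketch (the exact filename and capitalization vary by repo, so check the repo's file listing first):
# Sketch: download a GGUF file programmatically; verify the filename against the repo's file list
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="Qwen/Qwen3-8B-GGUF",
    filename="qwen3-8b-q4_k_m.gguf",  # assumed name; match it to the actual listing
    local_dir="./models",
)
print("Saved to:", path)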
Run the Model
# Interactive chat
./build/bin/llama-cli \
-m models/qwen3-8b-q4_k_m.gguf \
-ngl 99 \
--chat-template chatml \
-c 8192 \
-cnv
# Start an API server
./build/bin/llama-server \
-m models/qwen3-8b-q4_k_m.gguf \
-ngl 99 \
-c 8192 \
--host 0.0.0.0 \
--port 8080
| Flag | Description |
|---|---|
| -m | Path to the GGUF model file |
| -ngl 99 | Offload all layers to the GPU |
| -c 8192 | Context length (adjust based on RAM/VRAM) |
| -cnv | Enable conversation mode |
| --chat-template chatml | Use the ChatML template (Qwen's format) |
| -t 8 | Number of CPU threads |
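llama-server also exposes an OpenAI-compatible /v1/chat/completions endpoint, so the same client pattern from the Ollama section works here; a minimal sketch pointed at port 8080:
# Minimal sketch: call llama-server's OpenAI-compatible endpoint on port 8080
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")  # any string works

response = client.chat.completions.create(
    model="qwen3-8b",  # largely informational; llama-server serves the model it was started with
    messages=[{"role": "user", "content": "Explain what -ngl 99 does in one sentence"}],
)
print(response.choices[0].message.content)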
Quantization Comparison
| Quantization | Size (8B model) | Quality | Speed | VRAM |
|---|---|---|---|---|
| Q2_K | ~3 GB | Low | Fastest | Least |
| Q4_K_M | ~5 GB | Good | Fast | Low |
| Q5_K_M | ~6 GB | Very Good | Medium | Medium |
| Q6_K | ~7 GB | Excellent | Medium | Medium |
| Q8_0 | ~9 GB | Near-Perfect | Slower | More |
| FP16 | ~16 GB | Perfect | Slowest | Most |
Recommendation: Q4_K_M is the best balance of quality and performance for most users. Use Q6_K or Q8_0 if you have the VRAM to spare.
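As a rule of thumb from the table above, pick the largest quantization whose file size fits your VRAM with some headroom for the KV cache. A rough helper, using the approximate 8B sizes listed above:
# Rough helper: choose a Qwen3-8B quantization from available VRAM.
# Sizes are approximate; leave headroom for the KV cache and runtime overhead.
QUANT_SIZES_GB = [("Q2_K", 3), ("Q4_K_M", 5), ("Q5_K_M", 6), ("Q6_K", 7), ("Q8_0", 9), ("FP16", 16)]

def pick_quant(vram_gb: float, headroom_gb: float = 2.0) -> str:
    usable = vram_gb - headroom_gb
    best = "Q2_K"
    for name, size_gb in QUANT_SIZES_GB:
        if size_gb <= usable:
            best = name
    return best

for vram in (8, 12, 16, 24):
    print(f"{vram} GB VRAM -> {pick_quant(vram)}")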
Method 3: vLLM (Production Serving)
For high-throughput production serving with batching and paged attention, use vLLM:
# Install vLLM
pip install vllm
# Serve Qwen 3 8B
vllm serve Qwen/Qwen3-8B \
--dtype auto \
--max-model-len 8192 \
--gpu-memory-utilization 0.9
# Serve the MoE model
vllm serve Qwen/Qwen3-30B-A3B \
--dtype auto \
--max-model-len 8192 \
--trust-remote-code
vLLM provides an OpenAI-compatible API on port 8000:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "Hello!"}],
)
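For offline batch jobs you can skip the HTTP server entirely and call vLLM's Python API directly; a minimal sketch (sampling values here are illustrative):
# Minimal sketch: offline batched generation with vLLM's Python API
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-8B", max_model_len=8192)
params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

prompts = [
    "Write a one-line docstring for a function that merges two sorted lists.",
    "Explain paged attention in two sentences.",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text.strip())
Note that generate() takes raw prompts without the chat template; newer vLLM versions also provide an LLM.chat() helper for chat-formatted messages.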
Hardware Requirements
NVIDIA GPUs
| GPU | VRAM | Best Qwen 3 Model |
|---|---|---|
| RTX 3060 | 12 GB | 8B (Q4) or 30B-A3B (Q4) |
| RTX 3090 | 24 GB | 14B (Q8) or 32B (Q4) |
| RTX 4070 Ti | 12 GB | 8B (Q4) or 30B-A3B (Q4) |
| RTX 4080 | 16 GB | 14B (Q4) or 8B (Q8) |
| RTX 4090 | 24 GB | 32B (Q4) or 14B (FP16) |
| RTX 5090 | 32 GB | 32B (Q6) |
| A100 | 80 GB | 235B-A22B (Q4) |
Apple Silicon
| Mac | RAM | Best Qwen 3 Model |
|---|---|---|
| M1/M2 (8 GB) | 8 GB | 4B (Q4) or 0.6B |
| M1/M2 (16 GB) | 16 GB | 8B (Q4) or 30B-A3B (Q4) |
| M1/M2 Pro (32 GB) | 32 GB | 14B (Q6) or 32B (Q4) |
| M1/M2 Max (64 GB) | 64 GB | 32B (Q8) |
| M1/M2 Ultra (128 GB) | 128 GB | 235B-A22B (Q4) |
| M3/M4 series | Same as above | Same, slightly faster |
Apple Silicon uses unified memory, so all system RAM is available for the model. This makes Macs with large RAM surprisingly capable for running LLMs.
Performance Optimization Tips
1. Use the Right Context Length
Longer context uses more memory and slows inference. Set the context length to what you actually need:
# Inside the Ollama REPL, set the context window to what the task needs
ollama run qwen3:8b
>>> /set parameter num_ctx 4096     # simple Q&A: short context is fine
>>> /set parameter num_ctx 16384    # code analysis: needs more context
>>> /set parameter num_ctx 32768    # long documents: maximum context
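When calling Ollama from code, the context window can also be set per request through the native /api/chat endpoint's options field; a minimal sketch:
# Minimal sketch: set the context window per request via Ollama's native API
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen3:8b",
        "messages": [{"role": "user", "content": "Summarize the trade-offs of a larger context window."}],
        "options": {"num_ctx": 16384},  # context window for this request only
        "stream": False,
    },
)
print(resp.json()["message"]["content"])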
2. Enable Flash Attention
Flash attention reduces memory usage and speeds up inference:
# Ollama: controlled by an environment variable (recent versions may enable it by default)
OLLAMA_FLASH_ATTENTION=1 ollama serve
# llama.cpp: add the -fa flag
./build/bin/llama-server -m model.gguf -ngl 99 -fa
3. Use KV Cache Quantization
Reduces memory usage for long contexts:
# llama.cpp: quantize the KV cache
./build/bin/llama-server \
-m model.gguf \
-ngl 99 \
--cache-type-k q4_0 \
--cache-type-v q4_0
4. Try the MoE Model First
If you are unsure about your hardware, start with Qwen3-30B-A3B. Only about 3B parameters are active per token, so it generates roughly as fast as a small dense model even with partial CPU offload, while its quality is closer to the 14B-32B dense models. Keep in mind that all 30B parameters still need to fit in combined RAM and VRAM (roughly 18-19 GB at Q4):
ollama run qwen3:30b-a3b
Qwen 3 Thinking Mode
Qwen 3 supports a "thinking" mode similar to OpenAI's o1 model, where it reasons step-by-step before answering:
# Qwen 3 thinks by default in Ollama; give it room with a larger context window
ollama run qwen3:8b
>>> /set parameter num_ctx 8192
>>> What is the probability of rolling at least one 6 in four rolls of a fair die?
# Append /no_think to a message to skip the reasoning step for that turn
>>> Just give me the final probability /no_think
To toggle thinking mode programmatically, pass the switch through as a chat-template argument (this is how vLLM's OpenAI-compatible server exposes it; Ollama's native API uses its own think option instead):
response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[
        {
            "role": "user",
            "content": "Solve this optimization problem..."
        }
    ],
    extra_body={
        "chat_template_kwargs": {"enable_thinking": True}
    }
)
Thinking mode produces better results for math, logic, and complex reasoning tasks but uses more tokens and takes longer.
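When thinking is enabled, Qwen 3 typically wraps its reasoning in <think>...</think> tags at the start of the reply (some servers instead return it in a separate reasoning field). A small sketch that separates the reasoning from the visible answer:
# Sketch: split Qwen 3's <think>...</think> reasoning from the final answer
import re

def split_thinking(text: str) -> tuple[str, str]:
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return reasoning, answer

raw = "<think>P(no 6 in 4 rolls) = (5/6)^4, so P(at least one 6) = 1 - (5/6)^4</think>About 51.8%."
reasoning, answer = split_thinking(raw)
print(answer)  # -> About 51.8%.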
Frequently Asked Questions
Which Qwen 3 model should I start with? Qwen3-8B (Q4_K_M quantization) for most users. If you have less than 8 GB of VRAM, try Qwen3-30B-A3B: it activates only 3B parameters per token, so it stays fast even when part of the model is offloaded to system RAM.
How does Qwen 3 compare to Llama 3? Qwen 3 is competitive with or outperforms Meta's Llama 3.3 70B in many benchmarks, especially in multilingual tasks, coding, and math. The MoE variants offer better quality-per-FLOP.
Can I fine-tune Qwen 3 locally? Yes. Use tools like Unsloth, Axolotl, or LLaMA-Factory for LoRA fine-tuning. An 8B model can be fine-tuned on a single GPU with 16 GB VRAM using QLoRA.
Is Qwen 3 censored? Qwen 3 has safety alignment but is less restrictive than most commercial models. Because the weights are open, the community can create uncensored variants, though these come with ethical considerations.
Does Qwen 3 support function calling / tool use? Yes. Qwen 3 supports structured tool use in the same format as OpenAI's function calling, and it works through both Ollama's and vLLM's OpenAI-compatible endpoints (see the sketch after this FAQ).
Can I use Qwen 3 commercially? Yes. Qwen 3 is released under the Apache 2.0 license, which permits commercial use, modification, and redistribution; the main obligations are preserving the license text and attribution notices.
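Here is that tool-use sketch, pointed at the Ollama endpoint from earlier; get_weather is a hypothetical tool used only for illustration, and you still have to execute it yourself and send the result back in a follow-up message:
# Sketch: function calling against Ollama's OpenAI-compatible endpoint.
# "get_weather" is a made-up tool used only for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "What's the weather in Hangzhou?"}],
    tools=tools,
)

# If the model decided to call the tool, the call appears here instead of plain text
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)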
Wrapping Up
Running Qwen 3 locally gives you a frontier-class AI model with zero ongoing costs and complete data privacy. The combination of Ollama's simplicity, the MoE variants' efficiency, and the model's strong performance across coding, math, and general tasks makes Qwen 3 one of the best open-source models to run locally in 2026.
Start with ollama run qwen3:8b, experiment with the MoE variant if you want better quality-per-VRAM, and scale up to larger models as your hardware allows.
If your projects need AI-generated images, videos, or avatars alongside local LLM capabilities, try Hypereal AI free -- no credit card required. It handles the media generation that local LLMs cannot do on consumer hardware.