How to Run Qwen 3 Locally: Complete Guide (2026)
Step-by-step instructions for running Qwen 3 models on your own hardware
Qwen 3 is Alibaba's latest open-source large language model family and one of the strongest open-weight models available in 2026. It comes in multiple sizes, supports both dense and mixture-of-experts (MoE) architectures, and performs competitively with proprietary models like GPT-4o and Claude Sonnet on many benchmarks.
The best part: you can run it entirely on your own hardware with no API costs, no rate limits, and complete data privacy. This guide shows you how, step by step.
Qwen 3 Model Lineup
Qwen 3 comes in several sizes to fit different hardware:
| Model | Parameters | Active Params | Architecture | Min VRAM | Best For |
|---|---|---|---|---|---|
| Qwen3-0.6B | 0.6B | 0.6B | Dense | 2 GB | Edge devices, mobile |
| Qwen3-1.7B | 1.7B | 1.7B | Dense | 4 GB | Lightweight tasks |
| Qwen3-4B | 4B | 4B | Dense | 6 GB | Balanced performance |
| Qwen3-8B | 8B | 8B | Dense | 8 GB | General use |
| Qwen3-14B | 14B | 14B | Dense | 12 GB | Strong reasoning |
| Qwen3-32B | 32B | 32B | Dense | 24 GB | Near-frontier quality |
| Qwen3-30B-A3B | 30B | 3B | MoE | 6 GB | Fast, efficient |
| Qwen3-235B-A22B | 235B | 22B | MoE | 48 GB+ | Frontier-class |
The MoE (Mixture of Experts) models are particularly interesting. Qwen3-30B-A3B has 30 billion total parameters but only activates 3 billion per token, making it fast and memory-efficient while maintaining high quality.
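A quick way to see why: memory footprint is driven by total parameters, while per-token compute is driven by active parameters. Here is a rough back-of-the-envelope sketch (the ~4.5 bits per weight figure for Q4_K_M-style quantization is an approximation, not an exact spec):
# Rough estimate: weight memory scales with total params, per-token compute with active params
def approx_weight_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for name, total_b, active_b in [("Qwen3-8B", 8, 8), ("Qwen3-30B-A3B", 30, 3)]:
    print(f"{name}: ~{approx_weight_gb(total_b):.1f} GB of weights, "
          f"~{active_b}B parameters active per token")
So the MoE model needs more memory than the dense 8B to hold its weights, but each token only touches about as much compute as a 3B dense model, which is where the speed comes from.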
Method 1: Ollama (Easiest)
Ollama is the simplest way to run LLMs locally. It handles model downloading, quantization, and serving with a single command.
Install Ollama
# macOS / Linux
curl -fsSL https://ollama.ai/install.sh | sh
# macOS (Homebrew)
brew install ollama
# Windows: download from ollama.ai
Download and Run Qwen 3
# Pull and run Qwen 3 8B (recommended starting point)
ollama run qwen3:8b
# Other sizes
ollama run qwen3:0.6b # Very small, fast
ollama run qwen3:1.7b # Lightweight
ollama run qwen3:4b # Good balance
ollama run qwen3:14b # Strong reasoning
ollama run qwen3:32b # High quality (needs 24GB+ VRAM)
ollama run qwen3:30b-a3b # MoE - fast with good quality
# Specific quantizations
ollama run qwen3:8b-q4_K_M # 4-bit quantized (smaller, faster)
ollama run qwen3:8b-q8_0 # 8-bit quantized (better quality)
ollama run qwen3:8b-fp16 # Full precision (best quality, most VRAM)
Once the model is downloaded, you will see an interactive prompt where you can start chatting.
Use as an API
Ollama runs a local API server on port 11434:
# Start the server (runs automatically on install)
ollama serve
# Chat completion (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3:8b",
"messages": [
{"role": "user", "content": "Write a Python function to merge two sorted lists"}
]
}'
Use with Python
# Using the OpenAI Python library (Ollama is OpenAI-compatible)
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # any string works
)

response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Explain the difference between async and threading in Python"},
    ],
)
print(response.choices[0].message.content)
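For interactive use, streaming is worth enabling so tokens print as they arrive. A minimal sketch against the same endpoint:
# Stream tokens from Ollama's OpenAI-compatible endpoint as they are generated
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

stream = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Write a haiku about local LLMs"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()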
Connect to Code Editors
Ollama integrates with AI code editors:
Cursor:
- Go to Settings > Models.
- Add OpenAI-compatible model.
- Set the base URL to http://localhost:11434/v1.
- Set the model name to qwen3:8b.
Continue.dev (VS Code):
// ~/.continue/config.json
{
"models": [
{
"title": "Qwen 3 8B (Local)",
"provider": "ollama",
"model": "qwen3:8b"
}
]
}
Claude Code:
# Experimental: Claude Code speaks the Anthropic Messages API, while Ollama's /v1
# endpoint is OpenAI-compatible, so you typically need a translating proxy (for
# example, LiteLLM) in front of Ollama rather than pointing Claude Code at it directly.
export ANTHROPIC_BASE_URL="http://localhost:4000"   # your proxy's URL, not Ollama's
export ANTHROPIC_MODEL="qwen3:8b"
Method 2: llama.cpp (Maximum Performance)
For maximum control and performance, use llama.cpp directly. It supports CPU, CUDA, Metal, and Vulkan acceleration.
Install llama.cpp
# Clone and build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Build with CUDA (NVIDIA GPUs)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# Build with Metal (Apple Silicon)
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j
# Build CPU-only
cmake -B build
cmake --build build --config Release -j
Download GGUF Models
GGUF is the optimized model format for llama.cpp. Download from Hugging Face:
# Install huggingface-cli
pip install huggingface_hub
# Download Qwen 3 8B in Q4_K_M quantization
huggingface-cli download Qwen/Qwen3-8B-GGUF \
qwen3-8b-q4_k_m.gguf \
--local-dir ./models/
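The same download can be scripted from Python with huggingface_hub; a small sketch (the exact filename and capitalization vary by repo, so check the repo's file listing first):
# Sketch: download a GGUF file programmatically; verify the filename against the repo's file list
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="Qwen/Qwen3-8B-GGUF",
    filename="qwen3-8b-q4_k_m.gguf",  # assumed name; match it to the actual listing
    local_dir="./models",
)
print("Saved to:", path)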
Run the Model
# Interactive chat
./build/bin/llama-cli \
-m models/qwen3-8b-q4_k_m.gguf \
-ngl 99 \
--chat-template chatml \
-c 8192 \
-cnv
# Start an API server
./build/bin/llama-server \
-m models/qwen3-8b-q4_k_m.gguf \
-ngl 99 \
-c 8192 \
--host 0.0.0.0 \
--port 8080
| Flag | Description |
|---|---|
| -m | Path to the GGUF model file |
| -ngl 99 | Offload all layers to the GPU |
| -c 8192 | Context length (adjust based on RAM/VRAM) |
| -cnv | Enable conversation mode |
| --chat-template chatml | Use the ChatML template (Qwen's format) |
| -t 8 | Number of CPU threads |
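llama-server also exposes an OpenAI-compatible /v1/chat/completions endpoint, so the same client pattern from the Ollama section works here; a minimal sketch pointed at port 8080:
# Minimal sketch: call llama-server's OpenAI-compatible endpoint on port 8080
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")  # any string works

response = client.chat.completions.create(
    model="qwen3-8b",  # largely informational; llama-server serves the model it was started with
    messages=[{"role": "user", "content": "Explain what -ngl 99 does in one sentence"}],
)
print(response.choices[0].message.content)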
Quantization Comparison
| Quantization | Size (8B model) | Quality | Speed | VRAM |
|---|---|---|---|---|
| Q2_K | ~3 GB | Low | Fastest | Least |
| Q4_K_M | ~5 GB | Good | Fast | Low |
| Q5_K_M | ~6 GB | Very Good | Medium | Medium |
| Q6_K | ~7 GB | Excellent | Medium | Medium |
| Q8_0 | ~9 GB | Near-Perfect | Slower | More |
| FP16 | ~16 GB | Perfect | Slowest | Most |
Recommendation: Q4_K_M is the best balance of quality and performance for most users. Use Q6_K or Q8_0 if you have the VRAM to spare.
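As a rule of thumb from the table above, pick the largest quantization whose file size fits your VRAM with some headroom for the KV cache. A rough helper, using the approximate 8B sizes listed above:
# Rough helper: choose a Qwen3-8B quantization from available VRAM.
# Sizes are approximate; leave headroom for the KV cache and runtime overhead.
QUANT_SIZES_GB = [("Q2_K", 3), ("Q4_K_M", 5), ("Q5_K_M", 6), ("Q6_K", 7), ("Q8_0", 9), ("FP16", 16)]

def pick_quant(vram_gb: float, headroom_gb: float = 2.0) -> str:
    usable = vram_gb - headroom_gb
    best = "Q2_K"
    for name, size_gb in QUANT_SIZES_GB:
        if size_gb <= usable:
            best = name
    return best

for vram in (8, 12, 16, 24):
    print(f"{vram} GB VRAM -> {pick_quant(vram)}")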
Method 3: vLLM (Production Serving)
For high-throughput production serving with batching and paged attention, use vLLM:
# Install vLLM
pip install vllm
# Serve Qwen 3 8B
vllm serve Qwen/Qwen3-8B \
--dtype auto \
--max-model-len 8192 \
--gpu-memory-utilization 0.9
# Serve the MoE model
vllm serve Qwen/Qwen3-30B-A3B \
--dtype auto \
--max-model-len 8192 \
--trust-remote-code
vLLM provides an OpenAI-compatible API on port 8000:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "Hello!"}],
)
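For offline batch jobs you can skip the HTTP server entirely and call vLLM's Python API directly; a minimal sketch (sampling values here are illustrative):
# Minimal sketch: offline batched generation with vLLM's Python API
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-8B", max_model_len=8192)
params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

prompts = [
    "Write a one-line docstring for a function that merges two sorted lists.",
    "Explain paged attention in two sentences.",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text.strip())
Note that generate() takes raw prompts without the chat template; newer vLLM versions also provide an LLM.chat() helper for chat-formatted messages.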
Hardware Requirements
NVIDIA GPUs
| GPU | VRAM | Best Qwen 3 Model |
|---|---|---|
| RTX 3060 | 12 GB | 8B (Q4) or 30B-A3B (Q4) |
| RTX 3090 | 24 GB | 14B (Q8) or 32B (Q4) |
| RTX 4070 Ti | 12 GB | 8B (Q4) or 30B-A3B (Q4) |
| RTX 4080 | 16 GB | 14B (Q4) or 8B (Q8) |
| RTX 4090 | 24 GB | 32B (Q4) or 14B (FP16) |
| RTX 5090 | 32 GB | 32B (Q6) |
| A100 | 80 GB | 235B-A22B (Q4) |
Apple Silicon
| Mac | RAM | Best Qwen 3 Model |
|---|---|---|
| M1/M2 (8 GB) | 8 GB | 4B (Q4) or 0.6B |
| M1/M2 (16 GB) | 16 GB | 8B (Q4) or 30B-A3B (Q4) |
| M1/M2 Pro (32 GB) | 32 GB | 14B (Q6) or 32B (Q4) |
| M1/M2 Max (64 GB) | 64 GB | 32B (Q8) |
| M1/M2 Ultra (128 GB) | 128 GB | 235B-A22B (Q4) |
| M3/M4 series | Same as above | Same, slightly faster |
Apple Silicon uses unified memory, so all system RAM is available for the model. This makes Macs with large RAM surprisingly capable for running LLMs.
Performance Optimization Tips
1. Use the Right Context Length
Longer context uses more memory and slows inference. Set the context length to what you actually need:
# Inside the Ollama REPL, set the context window to what the task needs
ollama run qwen3:8b
>>> /set parameter num_ctx 4096     # simple Q&A: short context is fine
>>> /set parameter num_ctx 16384    # code analysis: needs more context
>>> /set parameter num_ctx 32768    # long documents: maximum context
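When calling Ollama from code, the context window can also be set per request through the native /api/chat endpoint's options field; a minimal sketch:
# Minimal sketch: set the context window per request via Ollama's native API
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen3:8b",
        "messages": [{"role": "user", "content": "Summarize the trade-offs of a larger context window."}],
        "options": {"num_ctx": 16384},  # context window for this request only
        "stream": False,
    },
)
print(resp.json()["message"]["content"])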
2. Enable Flash Attention
Flash attention reduces memory usage and speeds up inference:
# Ollama: controlled by an environment variable (recent versions may enable it by default)
OLLAMA_FLASH_ATTENTION=1 ollama serve
# llama.cpp: add the -fa flag
./build/bin/llama-server -m model.gguf -ngl 99 -fa
3. Use KV Cache Quantization
Reduces memory usage for long contexts:
# llama.cpp: quantize the KV cache
./build/bin/llama-server \
-m model.gguf \
-ngl 99 \
--cache-type-k q4_0 \
--cache-type-v q4_0
4. Try the MoE Model First
If you are unsure about your hardware, start with Qwen3-30B-A3B. Only about 3B parameters are active per token, so it generates roughly as fast as a small dense model even with partial CPU offload, while its quality is closer to the 14B-32B dense models. Keep in mind that all 30B parameters still need to fit in combined RAM and VRAM (roughly 18-19 GB at Q4):
ollama run qwen3:30b-a3b
Qwen 3 Thinking Mode
Qwen 3 supports a "thinking" mode similar to OpenAI's o1 model, where it reasons step-by-step before answering:
# Qwen 3 thinks by default in Ollama; give it room with a larger context window
ollama run qwen3:8b
>>> /set parameter num_ctx 8192
>>> What is the probability of rolling at least one 6 in four rolls of a fair die?
# Append /no_think to a message to skip the reasoning step for that turn
>>> Just give me the final probability /no_think
To toggle thinking mode programmatically, pass the switch through as a chat-template argument (this is how vLLM's OpenAI-compatible server exposes it; Ollama's native API uses its own think option instead):
response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[
        {
            "role": "user",
            "content": "Solve this optimization problem..."
        }
    ],
    extra_body={
        "chat_template_kwargs": {"enable_thinking": True}
    }
)
Thinking mode produces better results for math, logic, and complex reasoning tasks but uses more tokens and takes longer.
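When thinking is enabled, Qwen 3 typically wraps its reasoning in <think>...</think> tags at the start of the reply (some servers instead return it in a separate reasoning field). A small sketch that separates the reasoning from the visible answer:
# Sketch: split Qwen 3's <think>...</think> reasoning from the final answer
import re

def split_thinking(text: str) -> tuple[str, str]:
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return reasoning, answer

raw = "<think>P(no 6 in 4 rolls) = (5/6)^4, so P(at least one 6) = 1 - (5/6)^4</think>About 51.8%."
reasoning, answer = split_thinking(raw)
print(answer)  # -> About 51.8%.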
Frequently Asked Questions
Which Qwen 3 model should I start with? Qwen3-8B (Q4_K_M quantization) for most users. If you have less than 8 GB of VRAM, try Qwen3-30B-A3B: it activates only 3B parameters per token, so it stays fast even when part of the model is offloaded to system RAM.
How does Qwen 3 compare to Llama 3? Qwen 3 is competitive with or outperforms Meta's Llama 3.3 70B in many benchmarks, especially in multilingual tasks, coding, and math. The MoE variants offer better quality-per-FLOP.
Can I fine-tune Qwen 3 locally? Yes. Use tools like Unsloth, Axolotl, or LLaMA-Factory for LoRA fine-tuning. An 8B model can be fine-tuned on a single GPU with 16 GB VRAM using QLoRA.
Is Qwen 3 censored? Qwen 3 has safety alignment but is less restrictive than most commercial models. Because the weights are open, the community can create uncensored variants, though these come with ethical considerations.
Does Qwen 3 support function calling / tool use? Yes. Qwen 3 supports structured tool use in the same format as OpenAI's function calling, and it works through both Ollama's and vLLM's OpenAI-compatible endpoints (see the sketch after this FAQ).
Can I use Qwen 3 commercially? Yes. Qwen 3 is released under the Apache 2.0 license, which permits commercial use, modification, and redistribution; the main obligations are preserving the license text and attribution notices.
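Here is that tool-use sketch, pointed at the Ollama endpoint from earlier; get_weather is a hypothetical tool used only for illustration, and you still have to execute it yourself and send the result back in a follow-up message:
# Sketch: function calling against Ollama's OpenAI-compatible endpoint.
# "get_weather" is a made-up tool used only for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "What's the weather in Hangzhou?"}],
    tools=tools,
)

# If the model decided to call the tool, the call appears here instead of plain text
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)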
Wrapping Up
Running Qwen 3 locally gives you a frontier-class AI model with zero ongoing costs and complete data privacy. The combination of Ollama's simplicity, the MoE variants' efficiency, and the model's strong performance across coding, math, and general tasks makes Qwen 3 one of the best open-source models to run locally in 2026.
Start with ollama run qwen3:8b, experiment with the MoE variant if you want better quality-per-VRAM, and scale up to larger models as your hardware allows.
If your projects need AI-generated images, videos, or avatars alongside local LLM capabilities, try Hypereal AI free -- no credit card required. It handles the media generation that local LLMs cannot do on consumer hardware.