How to Download and Use Ollama: Step-by-Step (2026)
Run powerful AI models locally on your own machine
Ollama is the easiest way to run large language models locally on your own computer. Instead of paying for API calls or relying on cloud services, Ollama lets you download and run models like Llama 4, Qwen 3, DeepSeek, Gemma, and Phi directly on your machine with a single command.
This guide covers everything from installation to running your first model, managing multiple models, using the API, and optimizing performance.
Why Run Models Locally?
| Benefit | Description |
|---|---|
| Privacy | Your data never leaves your machine |
| No API costs | Unlimited usage after download |
| Offline access | Works without internet |
| No rate limits | No throttling or quotas |
| Customization | Run fine-tuned and custom models |
| Speed | No network latency for local inference |
The trade-off is that you need a computer with enough RAM and (ideally) a GPU. But modern quantized models run surprisingly well on consumer hardware.
Hardware Requirements
| Model Size | RAM Needed | GPU VRAM | Example Models |
|---|---|---|---|
| 1-3B | 4GB | 2GB+ | Phi-4 Mini, Gemma 3 1B |
| 7-8B | 8GB | 6GB+ | Llama 3.1 8B, Qwen 3 8B |
| 14B | 16GB | 10GB+ | Qwen 3 14B, Gemma 3 12B |
| 32-34B | 32GB | 24GB+ | Qwen 3 32B, DeepSeek Coder 33B |
| 70B | 48GB+ | 48GB+ | Llama 3.1 70B |
Ollama can run on CPU only (slower) or use GPU acceleration with NVIDIA, AMD, or Apple Silicon GPUs. Apple Silicon Macs with unified memory are particularly good for running larger models.
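As a rough rule of thumb, a quantized model's memory footprint is its parameter count times the bits per weight, divided by 8, plus overhead for the KV cache and runtime. The sketch below is only an estimate (the 1.2x overhead factor is an assumption, and real usage grows with context size), but it explains the numbers in the table above.
# Back-of-the-envelope memory estimate for a quantized model.
# The 1.2x overhead factor is a rough assumption for KV cache and runtime.
def estimate_memory_gb(params_billion, bits_per_weight=4, overhead=1.2):
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

print(f"{estimate_memory_gb(8):.1f} GB")   # ~4.8 GB for an 8B model at 4-bit
print(f"{estimate_memory_gb(70):.1f} GB")  # ~42 GB for a 70B model at 4-bit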
Step 1: Download and Install Ollama
macOS
# Option 1: Download from the website
# Go to https://ollama.com/download and download the macOS app
# Option 2: Install via Homebrew
brew install ollama
The macOS app installs Ollama as a menu bar application that runs the server in the background.
Windows
- Go to ollama.com/download.
- Download the Windows installer.
- Run the installer and follow the prompts.
- Ollama runs as a system service after installation.
Linux
# One-line install script
curl -fsSL https://ollama.ai/install.sh | sh
# Or install manually
# Download the binary for your architecture from GitHub releases
Verify Installation
ollama --version
# Output: ollama version 0.6.x
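If the version check passes but API calls fail later, confirm the background server is actually running. The root endpoint on port 11434 responds with a short status string; a minimal Python check:
# Check that the Ollama server is reachable on its default port.
from urllib.request import urlopen

with urlopen("http://localhost:11434/") as resp:
    print(resp.read().decode())  # Expected: "Ollama is running"
On macOS the menu bar app starts the server automatically; on Linux it runs as a systemd service, and you can also start it manually with ollama serve.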
Step 2: Download Your First Model
Ollama's model library has hundreds of models. Start by pulling a model:
# Download Llama 3.1 8B (4.7GB)
ollama pull llama3.1
# Download Qwen 3 8B (4.9GB)
ollama pull qwen3
# Download a smaller model for testing (1.6GB)
ollama pull phi4-mini
The download happens once. After that, the model loads from your local storage.
Step 3: Chat with a Model
Start an interactive chat session:
ollama run llama3.1
This opens a REPL where you can type messages:
>>> What is the capital of France?
The capital of France is Paris. It is the largest city in France and serves as
the country's political, economic, and cultural center.
>>> Write a Python function to reverse a string
Here's a simple Python function to reverse a string:
def reverse_string(s):
    return s[::-1]

# Example usage
print(reverse_string("hello"))  # Output: "olleh"
>>> /bye
Use /bye to exit the chat.
Step 4: Use the REST API
Ollama runs a local API server at http://localhost:11434. This is useful for building applications:
Chat Completion
curl http://localhost:11434/api/chat \
-d '{
"model": "llama3.1",
"messages": [
{"role": "user", "content": "Explain Docker in 3 sentences."}
]
}'
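Here is the same call as a minimal Python sketch, using only the standard library. Setting "stream": false returns a single JSON object instead of newline-delimited chunks; the reply text lives under message.content:
import json
from urllib.request import Request, urlopen

# Native /api/chat call with streaming disabled.
payload = {
    "model": "llama3.1",
    "messages": [
        {"role": "user", "content": "Explain Docker in 3 sentences."}
    ],
    "stream": False,
}
req = Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urlopen(req) as resp:
    print(json.loads(resp.read())["message"]["content"])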
OpenAI-Compatible Endpoint
Ollama also exposes an OpenAI-compatible endpoint, so you can use it with any OpenAI SDK:
import openai
client = openai.OpenAI(
api_key="ollama", # Any value works
base_url="http://localhost:11434/v1"
)
response = client.chat.completions.create(
model="llama3.1",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "How do I center a div in CSS?"}
],
temperature=0.7
)
print(response.choices[0].message.content)
Streaming Responses
stream = client.chat.completions.create(
model="llama3.1",
messages=[
{"role": "user", "content": "Write a haiku about programming."}
],
stream=True
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="", flush=True)
Step 5: Manage Models
List Downloaded Models
ollama list
# Output:
# NAME SIZE MODIFIED
# llama3.1:latest 4.7 GB 2 hours ago
# qwen3:latest 4.9 GB 1 hour ago
# phi4-mini:latest 1.6 GB 30 minutes ago
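If you are scripting around Ollama, the same listing is available from the local API: GET /api/tags returns the models on disk, with sizes in bytes. A minimal sketch:
import json
from urllib.request import urlopen

# List locally downloaded models via the API.
with urlopen("http://localhost:11434/api/tags") as resp:
    data = json.loads(resp.read())

for model in data["models"]:
    print(f'{model["name"]:24} {model["size"] / 1e9:.1f} GB')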
Remove a Model
ollama rm phi4-mini
Pull a Specific Size Variant
Many models come in multiple sizes:
# 4-bit quantization (smaller download, faster, slightly less accurate)
ollama pull llama3.1:8b-q4_0
# 8-bit quantization (larger download, slower, more accurate)
ollama pull llama3.1:8b-q8_0
# Specific parameter count
ollama pull qwen3:14b
ollama pull qwen3:32b
Check Model Info
ollama show llama3.1
# Shows model details: parameters, quantization, template, license, etc.
Step 6: Create Custom Models with Modelfile
A Modelfile lets you customize a model's behavior:
# Modelfile
FROM llama3.1
# Set a custom system prompt
SYSTEM """You are a senior software engineer. You write clean, well-documented
code with proper error handling. Always explain your reasoning before showing code."""
# Adjust parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
Build and run your custom model:
# Create the model
ollama create my-coder -f Modelfile
# Run it
ollama run my-coder
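The custom model is addressable by name like any other, so it plugs straight into the OpenAI-compatible endpoint from Step 4:
import openai

client = openai.OpenAI(api_key="ollama", base_url="http://localhost:11434/v1")

# "my-coder" is the name passed to `ollama create` above.
response = client.chat.completions.create(
    model="my-coder",
    messages=[{"role": "user", "content": "Write a retry decorator with exponential backoff."}],
)
print(response.choices[0].message.content)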
Step 7: Use Ollama with Popular Tools
Ollama integrates with many AI tools:
With Cursor
In Cursor settings, add Ollama as a custom model provider:
Base URL: http://localhost:11434/v1
API Key: ollama
Model: llama3.1
With Continue.dev (VS Code)
// ~/.continue/config.json
{
"models": [
{
"title": "Ollama - Llama 3.1",
"provider": "ollama",
"model": "llama3.1"
}
],
"tabAutocompleteModel": {
"title": "Ollama - Qwen Coder",
"provider": "ollama",
"model": "qwen2.5-coder:7b"
}
}
With Open WebUI (ChatGPT-like Interface)
docker run -d \
-p 3000:8080 \
-v open-webui:/app/backend/data \
--add-host=host.docker.internal:host-gateway \
--name open-webui \
ghcr.io/open-webui/open-webui:main
Open http://localhost:3000 for a ChatGPT-like web interface connected to your local Ollama models.
Recommended Models for 2026
| Model | Size | Best For | Command |
|---|---|---|---|
| Llama 3.1 8B | 4.7GB | General purpose | ollama pull llama3.1 |
| Qwen 3 8B | 4.9GB | Coding + reasoning | ollama pull qwen3 |
| DeepSeek Coder V2 | 8.9GB | Code generation | ollama pull deepseek-coder-v2 |
| Gemma 3 12B | 8.1GB | Instruction following | ollama pull gemma3:12b |
| Phi-4 Mini | 1.6GB | Low-resource machines | ollama pull phi4-mini |
| Mistral Nemo | 7.1GB | Multilingual | ollama pull mistral-nemo |
| Qwen 2.5 Coder 7B | 4.7GB | Code autocomplete | ollama pull qwen2.5-coder:7b |
| Llama 3.1 70B | 40GB | Maximum quality | ollama pull llama3.1:70b |
Performance Tips
Use GPU acceleration. Ollama automatically detects NVIDIA (CUDA), AMD (ROCm), and Apple Silicon GPUs. Verify with:
ollama ps  # Shows which models are loaded and whether they use GPU
Adjust context size. Larger context windows use more memory. Set num_ctx in your Modelfile or API call to match your needs.
Keep models loaded. Ollama keeps the most recently used model in memory. Avoid switching between models frequently.
Use quantized models. Q4 quantizations offer the best balance of speed and quality for most use cases.
Close other GPU-intensive apps. Video editors, games, and other AI tools compete for GPU memory.
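The context-size and keep-loaded tips can also be applied per request through the native API: the options field overrides inference parameters such as num_ctx, and keep_alive controls how long the model stays in memory after the request finishes. A minimal sketch:
import json
from urllib.request import Request, urlopen

payload = {
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "Summarize this guide in one sentence."}],
    "stream": False,
    "options": {"num_ctx": 8192, "temperature": 0.3},  # per-request overrides
    "keep_alive": "30m",  # keep the model loaded for 30 minutes of idle time
}
req = Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urlopen(req) as resp:
    print(json.loads(resp.read())["message"]["content"])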
Frequently Asked Questions
Is Ollama free? Yes, Ollama is completely free and open source (MIT license). You only need a computer capable of running the models.
Can I use Ollama offline? Yes. Once you download a model, everything runs locally with no internet required.
What GPU do I need? For 7-8B models, any GPU with 6GB+ VRAM works. Apple Silicon Macs work particularly well due to unified memory. You can also run on CPU only (slower).
How does Ollama compare to LM Studio? Both run local models. Ollama is CLI-first with a REST API, making it better for developers and integrations. LM Studio has a graphical interface, making it better for non-technical users.
Can I run multiple models simultaneously? Yes, if you have enough memory. Ollama loads models on demand and can keep multiple models in memory.
Does Ollama support vision models?
Yes. Models like llava and llama3.2-vision support image inputs.
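For example, a minimal sketch sending a local image to llava through the native API (assumes ollama pull llava has been run and photo.jpg exists); images are passed as base64 strings alongside the prompt:
import base64
import json
from urllib.request import Request, urlopen

# Encode a local image and attach it to the chat message.
with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "llava",
    "messages": [
        {"role": "user", "content": "What is in this picture?", "images": [image_b64]}
    ],
    "stream": False,
}
req = Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urlopen(req) as resp:
    print(json.loads(resp.read())["message"]["content"])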
Wrapping Up
Ollama makes running local AI models as simple as a single command. Whether you want complete privacy, zero API costs, or offline access, it is the best tool for local LLM inference in 2026. Start with a 7-8B model, explore the API for building applications, and scale up to larger models as your hardware allows.
If you are building applications that need AI-generated media like images, video, or talking avatars, try Hypereal AI free -- 35 credits, no credit card required. Combine local LLMs for text intelligence with Hypereal's API for visual content generation.