How to Use Ollama: Complete Beginner's Guide (2026)
Run powerful LLMs locally on your own machine
Ollama has become the de facto standard for running large language models locally. If you want to use AI models on your own hardware -- without sending data to cloud APIs, paying per-token fees, or dealing with rate limits -- Ollama is the tool you need. It simplifies the process of downloading, managing, and running open-source LLMs down to a few terminal commands.
This guide covers everything from installation to advanced usage, including model management, API integration, customization, and performance optimization.
What Is Ollama?
Ollama is an open-source tool that makes it easy to run large language models locally on macOS, Linux, and Windows. It handles model downloading, quantization, GPU acceleration, and provides a simple API that is compatible with the OpenAI API format -- meaning you can swap it into most existing AI applications with minimal code changes.
Think of it as "Docker for LLMs": you pull a model, run it, and interact with it through a clean command-line interface or HTTP API.
System Requirements
Before installing, make sure your system meets the minimum requirements:
| Component | Minimum | Recommended |
|---|---|---|
| RAM | 8 GB | 16+ GB |
| Storage | 10 GB free | 50+ GB (models are large) |
| GPU (optional) | Any NVIDIA GPU with 4+ GB VRAM | NVIDIA RTX 3060+ (12 GB VRAM) or Apple Silicon |
| OS | macOS 12+, Ubuntu 20.04+, Windows 10+ | Latest stable OS version |
Ollama runs on CPU if you do not have a GPU, but inference will be significantly slower.
Step 1: Install Ollama
macOS
# Option 1: Download from the website
# Visit https://ollama.com and download the macOS installer
# Option 2: Using Homebrew
brew install ollama
Linux
curl -fsSL https://ollama.com/install.sh | sh
Windows
Download the installer from ollama.com and run it. On Windows, Ollama runs in the background as a system tray application and starts the local server automatically.
Verify Installation
ollama --version
# Prints the installed version, e.g. ollama version 0.5.x
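Beyond the CLI check, you can confirm that the background server itself is reachable from code. A minimal Python sketch, assuming the default port 11434 and the /api/version endpoint:
import requests

# Ask the local Ollama server which version it is running (default port 11434)
resp = requests.get("http://localhost:11434/api/version", timeout=5)
resp.raise_for_status()
print("Ollama server version:", resp.json()["version"])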
Step 2: Pull and Run Your First Model
Ollama uses a Docker-like pull/run workflow:
# Pull a model (downloads it to your machine)
ollama pull llama3.2
# Run the model interactively
ollama run llama3.2
This drops you into an interactive chat session. Type your message and press Enter to get a response. Type /bye to exit.
Recommended Starter Models
Here is a comparison of popular models and their resource requirements:
| Model | Parameters | RAM Required | VRAM Required | Best For |
|---|---|---|---|---|
| llama3.2:3b | 3B | 4 GB | 3 GB | Quick tasks, low-resource machines |
| llama3.1:8b | 8B | 8 GB | 6 GB | General purpose, good balance |
| llama3.1:70b | 70B | 48 GB | 40 GB | Complex reasoning, high-end hardware |
| mistral | 7B | 8 GB | 5 GB | Fast, good at following instructions |
| gemma2:9b | 9B | 8 GB | 6 GB | Google's open model, strong reasoning |
| codellama | 7B | 8 GB | 5 GB | Code generation and analysis |
| deepseek-coder-v2 | 16B | 12 GB | 10 GB | Advanced coding tasks |
| phi3:mini | 3.8B | 4 GB | 3 GB | Surprisingly capable for its size |
| qwen2.5:7b | 7B | 8 GB | 5 GB | Multilingual, strong coding |
To pull any of these:
ollama pull mistral
ollama pull codellama
ollama pull gemma2:9b
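Pulls can also be triggered programmatically through the local API, which is handy for provisioning scripts. A minimal sketch using the /api/pull endpoint, which streams one JSON status object per line (recent releases accept a model field; older ones use name):
import json
import requests

# Stream download progress for a model from the local Ollama server
with requests.post(
    "http://localhost:11434/api/pull",
    json={"model": "mistral"},
    stream=True,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            status = json.loads(line)
            print(status.get("status"))  # e.g. "pulling manifest", ..., "success"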
Step 3: Model Management
List Downloaded Models
ollama list
Output:
NAME                ID              SIZE      MODIFIED
llama3.2:latest     a80c4f17acd5    2.0 GB    2 minutes ago
mistral:latest      2ae6f6dd7a3d    4.1 GB    5 minutes ago
codellama:latest    8fdf8f752f6e    3.8 GB    10 minutes ago
Remove a Model
ollama rm codellama
Show Model Details
ollama show llama3.2
Copy/Rename a Model
ollama cp llama3.2 my-custom-llama
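The same management information is available over the local HTTP API (covered in Step 4), which is useful when a script needs to check whether a model is already installed before pulling it. A minimal sketch using the /api/tags endpoint:
import requests

# GET /api/tags is the API equivalent of the ollama list command
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
for model in resp.json().get("models", []):
    size_gb = model["size"] / 1e9  # size is reported in bytes
    print(f"{model['name']}  {size_gb:.1f} GB")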
Step 4: Use the Ollama API
Ollama runs an HTTP server on localhost:11434 by default. It exposes its own native endpoints (/api/generate, /api/chat) plus an OpenAI-compatible endpoint under /v1, making integration with existing tooling straightforward.
Basic API Call
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain the difference between REST and GraphQL in 3 sentences.",
  "stream": false
}'
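With "stream": false the server returns a single JSON object once generation finishes. If you set "stream": true (the default), the response instead arrives as one JSON object per line while tokens are produced. A minimal Python sketch of consuming that stream, reusing the same model and prompt as above:
import json
import requests

# Request a streamed completion and print tokens as they arrive
with requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Explain the difference between REST and GraphQL in 3 sentences.",
        "stream": True,
    },
    stream=True,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            print()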
Chat API (Multi-turn)
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to validate an email address."}
  ],
  "stream": false
}'
Using with Python
import requests
response = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3.2",
    "prompt": "Write a bash script to backup a PostgreSQL database.",
    "stream": False
})
print(response.json()["response"])
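The generate endpoint is stateless, so for a real conversation you use /api/chat and keep the message history yourself, appending each user turn and each assistant reply. A minimal sketch (the model name and prompts are just examples):
import requests

messages = [{"role": "system", "content": "You are a helpful coding assistant."}]

def chat(user_input: str) -> str:
    # Keep the running history so the model has context for follow-up questions
    messages.append({"role": "user", "content": user_input})
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={"model": "llama3.2", "messages": messages, "stream": False},
    )
    resp.raise_for_status()
    reply = resp.json()["message"]["content"]
    messages.append({"role": "assistant", "content": reply})
    return reply

print(chat("Write a Python function to validate an email address."))
print(chat("Now add a unit test for it."))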
Using with the OpenAI Python SDK
Since Ollama's API is OpenAI-compatible, you can use the official OpenAI SDK:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Any string works
)
response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a senior Python developer."},
        {"role": "user", "content": "Write a thread-safe singleton pattern in Python."}
    ]
)
print(response.choices[0].message.content)
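Streaming also works through the OpenAI SDK: pass stream=True and iterate over the returned chunks. A short, self-contained sketch:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Summarize Python's GIL in two sentences."}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries a small delta of the assistant's reply
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()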
Step 5: Create Custom Models with Modelfiles
Ollama lets you create custom model configurations using Modelfiles (similar to Dockerfiles):
# Save as Modelfile
FROM llama3.2
# Set the system prompt
SYSTEM """
You are a senior full-stack developer specializing in TypeScript, React, and Node.js.
Always provide production-ready code with error handling and TypeScript types.
When asked about architecture decisions, explain the trade-offs.
"""
# Adjust parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
Build and run your custom model:
ollama create my-dev-assistant -f Modelfile
ollama run my-dev-assistant
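Once built, the custom model behaves like any other: it appears in ollama list and can be called through the API by name. A quick sanity check in Python, using the my-dev-assistant name from above:
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "my-dev-assistant",
        "prompt": "Should I use REST or tRPC for an internal Node.js service?",
        "stream": False,
    },
)
resp.raise_for_status()
print(resp.json()["response"])  # should answer in the persona set by the SYSTEM prompt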
Step 6: GPU Acceleration
NVIDIA GPUs
Ollama automatically detects NVIDIA GPUs if you have the CUDA drivers installed:
# Check if GPU is being used
ollama ps
Apple Silicon (M1/M2/M3/M4)
Ollama uses Metal acceleration on Apple Silicon automatically. No additional configuration is needed. Apple Silicon Macs with unified memory are particularly well-suited for running LLMs because the GPU can access the full system RAM.
Splitting Models Across GPU and CPU
For models that are too large for your GPU's VRAM, Ollama automatically offloads as many layers as fit onto the GPU and runs the remainder on the CPU. To control the split yourself, set the num_gpu option (the number of layers to offload to the GPU), either per API request or with PARAMETER num_gpu in a Modelfile, as in the sketch below.
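Here is a minimal sketch of passing num_gpu through the options field of the native API (the value 20 is only an illustration; the right layer count depends on your VRAM):
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:70b",
        "prompt": "Outline a migration plan from REST to gRPC.",
        "stream": False,
        # num_gpu = number of layers to offload to the GPU; the rest run on CPU
        "options": {"num_gpu": 20},
    },
)
resp.raise_for_status()
print(resp.json()["response"])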
Performance Tips
1. Use Quantized Models
Quantized models use less memory and run faster with minimal quality loss:
# Q4 quantization (good balance of speed and quality)
ollama pull llama3.1:8b-instruct-q4_K_M
# Q8 quantization (higher quality, more memory)
ollama pull llama3.1:8b-instruct-q8_0
2. Increase Context Window
# Inside an interactive ollama run session
/set parameter num_ctx 16384
You can also set num_ctx per request through the API's options field, or bake it into a Modelfile with PARAMETER num_ctx as shown in Step 5.
3. Keep Models Loaded
By default, Ollama unloads models after 5 minutes of inactivity. Change this:
# Keep model loaded indefinitely
OLLAMA_KEEP_ALIVE=-1 ollama serve
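The keep-alive window can also be set per request: the native API accepts a keep_alive field that tells the server how long to keep that model loaded after responding. A minimal sketch:
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Give me three names for a CLI tool that tails logs.",
        "stream": False,
        "keep_alive": "30m",  # keep the model in memory for 30 minutes (a negative value means indefinitely)
    },
)
resp.raise_for_status()
print(resp.json()["response"])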
4. Run Multiple Models
Ollama can serve multiple models simultaneously if you have enough RAM:
# In separate terminals
ollama run llama3.2 # General tasks
ollama run codellama # Coding tasks
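Because every request names its model, a single script can also route different tasks to different local models on the same server. A small sketch, assuming both models above are already pulled:
import requests

def ask(model: str, prompt: str) -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
    )
    resp.raise_for_status()
    return resp.json()["response"]

# Route general questions and coding questions to different local models
print(ask("llama3.2", "Summarize what a reverse proxy does."))
print(ask("codellama", "Write a Python function that reverses a linked list."))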
Common Issues and Fixes
| Problem | Solution |
|---|---|
| "model not found" | Run ollama pull model-name first |
| Slow inference on GPU | Update GPU drivers; check ollama ps for GPU usage |
| Out of memory | Use a smaller model or quantized variant |
| Port 11434 already in use | Another Ollama server is already running; reuse it, or stop it first (quit the menu bar/tray app, or systemctl stop ollama on Linux) |
| Model downloading slowly | Check internet connection; Ollama CDN may be congested |
Conclusion
Ollama makes running LLMs locally as simple as pulling and running a Docker container. Whether you need privacy, want to avoid API costs, or just want to experiment with open-source models, Ollama is the most straightforward way to get started in 2026.
For projects that need both local AI inference and high-quality media generation, consider pairing Ollama with Hypereal AI. Use Ollama for private, cost-free text generation and Hypereal AI's affordable API for generating images, videos, AI avatars, and voice content -- giving you a complete AI toolkit without breaking the bank.