How to Use Ollama: Complete Beginner's Guide (2026)
Run powerful LLMs locally on your own machine
Ollama has become the de facto standard for running large language models locally. If you want to use AI models on your own hardware -- without sending data to cloud APIs, paying per-token fees, or dealing with rate limits -- Ollama is the tool you need. It simplifies the process of downloading, managing, and running open-source LLMs down to a few terminal commands.
This guide covers everything from installation to advanced usage, including model management, API integration, customization, and performance optimization.
What Is Ollama?
Ollama is an open-source tool that makes it easy to run large language models locally on macOS, Linux, and Windows. It handles model downloading, quantization, GPU acceleration, and provides a simple API that is compatible with the OpenAI API format -- meaning you can swap it into most existing AI applications with minimal code changes.
Think of it as "Docker for LLMs": you pull a model, run it, and interact with it through a clean command-line interface or HTTP API.
System Requirements
Before installing, make sure your system meets the minimum requirements:
| Component | Minimum | Recommended |
|---|---|---|
| RAM | 8 GB | 16+ GB |
| Storage | 10 GB free | 50+ GB (models are large) |
| GPU (optional) | Any NVIDIA GPU with 4+ GB VRAM | NVIDIA RTX 3060+ (12 GB VRAM) or Apple Silicon |
| OS | macOS 12+, Ubuntu 20.04+, Windows 10+ | Latest stable OS version |
Ollama runs on CPU if you do not have a GPU, but inference will be significantly slower.
Step 1: Install Ollama
macOS
# Option 1: Download from the website
# Visit https://ollama.com and download the macOS installer
# Option 2: Using Homebrew
brew install ollama
Linux
curl -fsSL https://ollama.com/install.sh | sh
Windows
Download the installer from ollama.com and run it. On Windows, Ollama runs in the background as a system tray application and starts the local server automatically.
Verify Installation
ollama --version
# Prints the installed version, e.g. ollama version 0.5.x
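Beyond the CLI check, you can confirm that the background server itself is reachable from code. A minimal Python sketch, assuming the default port 11434 and the /api/version endpoint:
import requests

# Ask the local Ollama server which version it is running (default port 11434)
resp = requests.get("http://localhost:11434/api/version", timeout=5)
resp.raise_for_status()
print("Ollama server version:", resp.json()["version"])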
Step 2: Pull and Run Your First Model
Ollama uses a Docker-like pull/run workflow:
# Pull a model (downloads it to your machine)
ollama pull llama3.2
# Run the model interactively
ollama run llama3.2
This drops you into an interactive chat session. Type your message and press Enter to get a response. Type /bye to exit.
Recommended Starter Models
Here is a comparison of popular models and their resource requirements:
| Model | Parameters | RAM Required | VRAM Required | Best For |
|---|---|---|---|---|
| llama3.2:3b | 3B | 4 GB | 3 GB | Quick tasks, low-resource machines |
| llama3.1:8b | 8B | 8 GB | 6 GB | General purpose, good balance |
| llama3.1:70b | 70B | 48 GB | 40 GB | Complex reasoning, high-end hardware |
| mistral | 7B | 8 GB | 5 GB | Fast, good at following instructions |
| gemma2:9b | 9B | 8 GB | 6 GB | Google's open model, strong reasoning |
| codellama | 7B | 8 GB | 5 GB | Code generation and analysis |
| deepseek-coder-v2 | 16B | 12 GB | 10 GB | Advanced coding tasks |
| phi3:mini | 3.8B | 4 GB | 3 GB | Surprisingly capable for its size |
| qwen2.5:7b | 7B | 8 GB | 5 GB | Multilingual, strong coding |
To pull any of these:
ollama pull mistral
ollama pull codellama
ollama pull gemma2:9b
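Pulls can also be triggered programmatically through the local API, which is handy for provisioning scripts. A minimal sketch using the /api/pull endpoint, which streams one JSON status object per line (recent releases accept a model field; older ones use name):
import json
import requests

# Stream download progress for a model from the local Ollama server
with requests.post(
    "http://localhost:11434/api/pull",
    json={"model": "mistral"},
    stream=True,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            status = json.loads(line)
            print(status.get("status"))  # e.g. "pulling manifest", ..., "success"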
Step 3: Model Management
List Downloaded Models
ollama list
Output:
NAME                ID              SIZE      MODIFIED
llama3.2:latest     a80c4f17acd5    2.0 GB    2 minutes ago
mistral:latest      2ae6f6dd7a3d    4.1 GB    5 minutes ago
codellama:latest    8fdf8f752f6e    3.8 GB    10 minutes ago
Remove a Model
ollama rm codellama
Show Model Details
ollama show llama3.2
Copy/Rename a Model
ollama cp llama3.2 my-custom-llama
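The same management information is available over the local HTTP API (covered in Step 4), which is useful when a script needs to check whether a model is already installed before pulling it. A minimal sketch using the /api/tags endpoint:
import requests

# GET /api/tags is the API equivalent of the ollama list command
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
for model in resp.json().get("models", []):
    size_gb = model["size"] / 1e9  # size is reported in bytes
    print(f"{model['name']}  {size_gb:.1f} GB")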
Step 4: Use the Ollama API
Ollama runs an HTTP server on localhost:11434 by default. It exposes its own native endpoints (/api/generate, /api/chat) plus an OpenAI-compatible endpoint under /v1, making integration with existing tooling straightforward.
Basic API Call
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain the difference between REST and GraphQL in 3 sentences.",
  "stream": false
}'
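With "stream": false the server returns a single JSON object once generation finishes. If you set "stream": true (the default), the response instead arrives as one JSON object per line while tokens are produced. A minimal Python sketch of consuming that stream, reusing the same model and prompt as above:
import json
import requests

# Request a streamed completion and print tokens as they arrive
with requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Explain the difference between REST and GraphQL in 3 sentences.",
        "stream": True,
    },
    stream=True,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            print()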
Chat API (Multi-turn)
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to validate an email address."}
  ],
  "stream": false
}'
Using with Python
import requests
response = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3.2",
    "prompt": "Write a bash script to backup a PostgreSQL database.",
    "stream": False
})
print(response.json()["response"])
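The generate endpoint is stateless, so for a real conversation you use /api/chat and keep the message history yourself, appending each user turn and each assistant reply. A minimal sketch (the model name and prompts are just examples):
import requests

messages = [{"role": "system", "content": "You are a helpful coding assistant."}]

def chat(user_input: str) -> str:
    # Keep the running history so the model has context for follow-up questions
    messages.append({"role": "user", "content": user_input})
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={"model": "llama3.2", "messages": messages, "stream": False},
    )
    resp.raise_for_status()
    reply = resp.json()["message"]["content"]
    messages.append({"role": "assistant", "content": reply})
    return reply

print(chat("Write a Python function to validate an email address."))
print(chat("Now add a unit test for it."))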
Using with the OpenAI Python SDK
Since Ollama's API is OpenAI-compatible, you can use the official OpenAI SDK:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Any string works
)
response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a senior Python developer."},
        {"role": "user", "content": "Write a thread-safe singleton pattern in Python."}
    ]
)
print(response.choices[0].message.content)
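Streaming also works through the OpenAI SDK: pass stream=True and iterate over the returned chunks. A short, self-contained sketch:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Summarize Python's GIL in two sentences."}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries a small delta of the assistant's reply
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()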
Step 5: Create Custom Models with Modelfiles
Ollama lets you create custom model configurations using Modelfiles (similar to Dockerfiles):
# Save as Modelfile
FROM llama3.2
# Set the system prompt
SYSTEM """
You are a senior full-stack developer specializing in TypeScript, React, and Node.js.
Always provide production-ready code with error handling and TypeScript types.
When asked about architecture decisions, explain the trade-offs.
"""
# Adjust parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
Build and run your custom model:
ollama create my-dev-assistant -f Modelfile
ollama run my-dev-assistant
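Once built, the custom model behaves like any other: it appears in ollama list and can be called through the API by name. A quick sanity check in Python, using the my-dev-assistant name from above:
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "my-dev-assistant",
        "prompt": "Should I use REST or tRPC for an internal Node.js service?",
        "stream": False,
    },
)
resp.raise_for_status()
print(resp.json()["response"])  # should answer in the persona set by the SYSTEM prompt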
Step 6: GPU Acceleration
NVIDIA GPUs
Ollama automatically detects NVIDIA GPUs if you have the CUDA drivers installed:
# Check if GPU is being used
ollama ps
Apple Silicon (M1/M2/M3/M4)
Ollama uses Metal acceleration on Apple Silicon automatically. No additional configuration is needed. Apple Silicon Macs with unified memory are particularly well-suited for running LLMs because the GPU can access the full system RAM.
Splitting Models Across GPU and CPU
For models that are too large for your GPU's VRAM, Ollama automatically offloads as many layers as fit onto the GPU and runs the remainder on the CPU. To control the split yourself, set the num_gpu option (the number of layers to offload to the GPU), either per API request or with PARAMETER num_gpu in a Modelfile, as in the sketch below.
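Here is a minimal sketch of passing num_gpu through the options field of the native API (the value 20 is only an illustration; the right layer count depends on your VRAM):
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:70b",
        "prompt": "Outline a migration plan from REST to gRPC.",
        "stream": False,
        # num_gpu = number of layers to offload to the GPU; the rest run on CPU
        "options": {"num_gpu": 20},
    },
)
resp.raise_for_status()
print(resp.json()["response"])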
Performance Tips
1. Use Quantized Models
Quantized models use less memory and run faster with minimal quality loss:
# Q4 quantization (good balance of speed and quality)
ollama pull llama3.1:8b-instruct-q4_K_M
# Q8 quantization (higher quality, more memory)
ollama pull llama3.1:8b-instruct-q8_0
2. Increase Context Window
# Inside an interactive ollama run session
/set parameter num_ctx 16384
You can also set num_ctx per request through the API's options field, or bake it into a Modelfile with PARAMETER num_ctx as shown in Step 5.
3. Keep Models Loaded
By default, Ollama unloads models after 5 minutes of inactivity. Change this:
# Keep model loaded indefinitely
OLLAMA_KEEP_ALIVE=-1 ollama serve
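The keep-alive window can also be set per request: the native API accepts a keep_alive field that tells the server how long to keep that model loaded after responding. A minimal sketch:
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Give me three names for a CLI tool that tails logs.",
        "stream": False,
        "keep_alive": "30m",  # keep the model in memory for 30 minutes (a negative value means indefinitely)
    },
)
resp.raise_for_status()
print(resp.json()["response"])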
4. Run Multiple Models
Ollama can serve multiple models simultaneously if you have enough RAM:
# In separate terminals
ollama run llama3.2 # General tasks
ollama run codellama # Coding tasks
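Because every request names its model, a single script can also route different tasks to different local models on the same server. A small sketch, assuming both models above are already pulled:
import requests

def ask(model: str, prompt: str) -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
    )
    resp.raise_for_status()
    return resp.json()["response"]

# Route general questions and coding questions to different local models
print(ask("llama3.2", "Summarize what a reverse proxy does."))
print(ask("codellama", "Write a Python function that reverses a linked list."))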
Common Issues and Fixes
| Problem | Solution |
|---|---|
| "model not found" | Run ollama pull model-name first |
| Slow inference on GPU | Update GPU drivers; check ollama ps for GPU usage |
| Out of memory | Use a smaller model or quantized variant |
| Port 11434 already in use | Another Ollama server is already running; reuse it, or stop it first (quit the menu bar/tray app, or systemctl stop ollama on Linux) |
| Model downloading slowly | Check internet connection; Ollama CDN may be congested |
Conclusion
Ollama makes running LLMs locally as simple as pulling and running a Docker container. Whether you need privacy, want to avoid API costs, or just want to experiment with open-source models, Ollama is the most straightforward way to get started in 2026.
For projects that need both local AI inference and high-quality media generation, consider pairing Ollama with Hypereal AI. Use Ollama for private, cost-free text generation and Hypereal AI's affordable API for generating images, videos, AI avatars, and voice content -- giving you a complete AI toolkit without breaking the bank.