How to Download and Use Ollama: Step-by-Step (2026)
Run powerful AI models locally on your own machine
Ollama is the easiest way to run large language models locally on your own computer. Instead of paying for API calls or relying on cloud services, Ollama lets you download and run models like Llama 4, Qwen 3, DeepSeek, Gemma, and Phi directly on your machine with a single command.
This guide covers everything from installation to running your first model, managing multiple models, using the API, and optimizing performance.
Why Run Models Locally?
| Benefit | Description |
|---|---|
| Privacy | Your data never leaves your machine |
| No API costs | Unlimited usage after download |
| Offline access | Works without internet |
| No rate limits | No throttling or quotas |
| Customization | Run fine-tuned and custom models |
| Speed | No network latency for local inference |
The trade-off is that you need a computer with enough RAM and (ideally) a GPU. But modern quantized models run surprisingly well on consumer hardware.
Hardware Requirements
| Model Size | RAM Needed | GPU VRAM | Example Models |
|---|---|---|---|
| 1-3B | 4GB | 2GB+ | Phi-4 Mini, Gemma 3 1B |
| 7-8B | 8GB | 6GB+ | Llama 3.1 8B, Qwen 3 8B |
| 14B | 16GB | 10GB+ | Qwen 3 14B, Gemma 3 12B |
| 32-34B | 32GB | 24GB+ | Qwen 3 32B, DeepSeek Coder 33B |
| 70B | 48GB+ | 48GB+ | Llama 3.1 70B |
Ollama can run on CPU only (slower) or use GPU acceleration with NVIDIA, AMD, or Apple Silicon GPUs. Apple Silicon Macs with unified memory are particularly good for running larger models.
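As a rough rule of thumb, a quantized model's memory footprint is its parameter count times the bits per weight, divided by 8, plus overhead for the KV cache and runtime. The sketch below is only an estimate (the 1.2x overhead factor is an assumption, and real usage grows with context size), but it explains the numbers in the table above.
# Back-of-the-envelope memory estimate for a quantized model.
# The 1.2x overhead factor is a rough assumption for KV cache and runtime.
def estimate_memory_gb(params_billion, bits_per_weight=4, overhead=1.2):
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

print(f"{estimate_memory_gb(8):.1f} GB")   # ~4.8 GB for an 8B model at 4-bit
print(f"{estimate_memory_gb(70):.1f} GB")  # ~42 GB for a 70B model at 4-bit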
Step 1: Download and Install Ollama
macOS
# Option 1: Download from the website
# Go to https://ollama.com/download and download the macOS app
# Option 2: Install via Homebrew
brew install ollama
The macOS app installs Ollama as a menu bar application that runs the server in the background.
Windows
- Go to ollama.com/download.
- Download the Windows installer.
- Run the installer and follow the prompts.
- Ollama runs as a system service after installation.
Linux
# One-line install script
curl -fsSL https://ollama.ai/install.sh | sh
# Or install manually
# Download the binary for your architecture from GitHub releases
Verify Installation
ollama --version
# Output: ollama version 0.6.x
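If the version check passes but API calls fail later, confirm the background server is actually running. The root endpoint on port 11434 responds with a short status string; a minimal Python check:
# Check that the Ollama server is reachable on its default port.
from urllib.request import urlopen

with urlopen("http://localhost:11434/") as resp:
    print(resp.read().decode())  # Expected: "Ollama is running"
On macOS the menu bar app starts the server automatically; on Linux it runs as a systemd service, and you can also start it manually with ollama serve.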
Step 2: Download Your First Model
Ollama's model library has hundreds of models. Start by pulling a model:
# Download Llama 3.1 8B (4.7GB)
ollama pull llama3.1
# Download Qwen 3 8B (4.9GB)
ollama pull qwen3
# Download a smaller model for testing (1.6GB)
ollama pull phi4-mini
The download happens once. After that, the model loads from your local storage.
Step 3: Chat with a Model
Start an interactive chat session:
ollama run llama3.1
This opens a REPL where you can type messages:
>>> What is the capital of France?
The capital of France is Paris. It is the largest city in France and serves as
the country's political, economic, and cultural center.
>>> Write a Python function to reverse a string
Here's a simple Python function to reverse a string:
def reverse_string(s):
    return s[::-1]

# Example usage
print(reverse_string("hello"))  # Output: "olleh"
>>> /bye
Use /bye to exit the chat.
Step 4: Use the REST API
Ollama runs a local API server at http://localhost:11434. This is useful for building applications:
Chat Completion
curl http://localhost:11434/api/chat \
-d '{
"model": "llama3.1",
"messages": [
{"role": "user", "content": "Explain Docker in 3 sentences."}
]
}'
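Here is the same call as a minimal Python sketch, using only the standard library. Setting "stream": false returns a single JSON object instead of newline-delimited chunks; the reply text lives under message.content:
import json
from urllib.request import Request, urlopen

# Native /api/chat call with streaming disabled.
payload = {
    "model": "llama3.1",
    "messages": [
        {"role": "user", "content": "Explain Docker in 3 sentences."}
    ],
    "stream": False,
}
req = Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urlopen(req) as resp:
    print(json.loads(resp.read())["message"]["content"])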
OpenAI-Compatible Endpoint
Ollama also exposes an OpenAI-compatible endpoint, so you can use it with any OpenAI SDK:
import openai
client = openai.OpenAI(
api_key="ollama", # Any value works
base_url="http://localhost:11434/v1"
)
response = client.chat.completions.create(
model="llama3.1",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "How do I center a div in CSS?"}
],
temperature=0.7
)
print(response.choices[0].message.content)
Streaming Responses
stream = client.chat.completions.create(
model="llama3.1",
messages=[
{"role": "user", "content": "Write a haiku about programming."}
],
stream=True
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="", flush=True)
Step 5: Manage Models
List Downloaded Models
ollama list
# Output:
# NAME SIZE MODIFIED
# llama3.1:latest 4.7 GB 2 hours ago
# qwen3:latest 4.9 GB 1 hour ago
# phi4-mini:latest 1.6 GB 30 minutes ago
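If you are scripting around Ollama, the same listing is available from the local API: GET /api/tags returns the models on disk, with sizes in bytes. A minimal sketch:
import json
from urllib.request import urlopen

# List locally downloaded models via the API.
with urlopen("http://localhost:11434/api/tags") as resp:
    data = json.loads(resp.read())

for model in data["models"]:
    print(f'{model["name"]:24} {model["size"] / 1e9:.1f} GB')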
Remove a Model
ollama rm phi4-mini
Pull a Specific Size Variant
Many models come in multiple sizes:
# 4-bit quantization (smaller download, faster, slightly less accurate)
ollama pull llama3.1:8b-q4_0
# 8-bit quantization (larger download, slower, more accurate)
ollama pull llama3.1:8b-q8_0
# Specific parameter count
ollama pull qwen3:14b
ollama pull qwen3:32b
Check Model Info
ollama show llama3.1
# Shows model details: parameters, quantization, template, license, etc.
Step 6: Create Custom Models with Modelfile
A Modelfile lets you customize a model's behavior:
# Modelfile
FROM llama3.1
# Set a custom system prompt
SYSTEM """You are a senior software engineer. You write clean, well-documented
code with proper error handling. Always explain your reasoning before showing code."""
# Adjust parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
Build and run your custom model:
# Create the model
ollama create my-coder -f Modelfile
# Run it
ollama run my-coder
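The custom model is addressable by name like any other, so it plugs straight into the OpenAI-compatible endpoint from Step 4:
import openai

client = openai.OpenAI(api_key="ollama", base_url="http://localhost:11434/v1")

# "my-coder" is the name passed to `ollama create` above.
response = client.chat.completions.create(
    model="my-coder",
    messages=[{"role": "user", "content": "Write a retry decorator with exponential backoff."}],
)
print(response.choices[0].message.content)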
Step 7: Use Ollama with Popular Tools
Ollama integrates with many AI tools:
With Cursor
In Cursor settings, add Ollama as a custom model provider:
Base URL: http://localhost:11434/v1
API Key: ollama
Model: llama3.1
With Continue.dev (VS Code)
// ~/.continue/config.json
{
"models": [
{
"title": "Ollama - Llama 3.1",
"provider": "ollama",
"model": "llama3.1"
}
],
"tabAutocompleteModel": {
"title": "Ollama - Qwen Coder",
"provider": "ollama",
"model": "qwen2.5-coder:7b"
}
}
With Open WebUI (ChatGPT-like Interface)
docker run -d \
-p 3000:8080 \
-v open-webui:/app/backend/data \
--add-host=host.docker.internal:host-gateway \
--name open-webui \
ghcr.io/open-webui/open-webui:main
Open http://localhost:3000 for a ChatGPT-like web interface connected to your local Ollama models.
Recommended Models for 2026
| Model | Size | Best For | Command |
|---|---|---|---|
| Llama 3.1 8B | 4.7GB | General purpose | ollama pull llama3.1 |
| Qwen 3 8B | 4.9GB | Coding + reasoning | ollama pull qwen3 |
| DeepSeek Coder V2 | 8.9GB | Code generation | ollama pull deepseek-coder-v2 |
| Gemma 3 12B | 8.1GB | Instruction following | ollama pull gemma3:12b |
| Phi-4 Mini | 1.6GB | Low-resource machines | ollama pull phi4-mini |
| Mistral Nemo | 7.1GB | Multilingual | ollama pull mistral-nemo |
| Qwen 2.5 Coder 7B | 4.7GB | Code autocomplete | ollama pull qwen2.5-coder:7b |
| Llama 3.1 70B | 40GB | Maximum quality | ollama pull llama3.1:70b |
Performance Tips
Use GPU acceleration. Ollama automatically detects NVIDIA (CUDA), AMD (ROCm), and Apple Silicon GPUs. Verify with:
ollama ps  # Shows which models are loaded and whether they use GPU
Adjust context size. Larger context windows use more memory. Set num_ctx in your Modelfile or API call to match your needs.
Keep models loaded. Ollama keeps the most recently used model in memory. Avoid switching between models frequently.
Use quantized models. Q4 quantizations offer the best balance of speed and quality for most use cases.
Close other GPU-intensive apps. Video editors, games, and other AI tools compete for GPU memory.
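The context-size and keep-loaded tips can also be applied per request through the native API: the options field overrides inference parameters such as num_ctx, and keep_alive controls how long the model stays in memory after the request finishes. A minimal sketch:
import json
from urllib.request import Request, urlopen

payload = {
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "Summarize this guide in one sentence."}],
    "stream": False,
    "options": {"num_ctx": 8192, "temperature": 0.3},  # per-request overrides
    "keep_alive": "30m",  # keep the model loaded for 30 minutes of idle time
}
req = Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urlopen(req) as resp:
    print(json.loads(resp.read())["message"]["content"])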
Frequently Asked Questions
Is Ollama free? Yes, Ollama is completely free and open source (MIT license). You only need a computer capable of running the models.
Can I use Ollama offline? Yes. Once you download a model, everything runs locally with no internet required.
What GPU do I need? For 7-8B models, any GPU with 6GB+ VRAM works. Apple Silicon Macs work particularly well due to unified memory. You can also run on CPU only (slower).
How does Ollama compare to LM Studio? Both run local models. Ollama is CLI-first with a REST API, making it better for developers and integrations. LM Studio has a graphical interface, making it better for non-technical users.
Can I run multiple models simultaneously? Yes, if you have enough memory. Ollama loads models on demand and can keep multiple models in memory.
Does Ollama support vision models?
Yes. Models like llava and llama3.2-vision support image inputs.
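For example, a minimal sketch sending a local image to llava through the native API (assumes ollama pull llava has been run and photo.jpg exists); images are passed as base64 strings alongside the prompt:
import base64
import json
from urllib.request import Request, urlopen

# Encode a local image and attach it to the chat message.
with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "llava",
    "messages": [
        {"role": "user", "content": "What is in this picture?", "images": [image_b64]}
    ],
    "stream": False,
}
req = Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urlopen(req) as resp:
    print(json.loads(resp.read())["message"]["content"])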
Wrapping Up
Ollama makes running local AI models as simple as a single command. Whether you want complete privacy, zero API costs, or offline access, it is the best tool for local LLM inference in 2026. Start with a 7-8B model, explore the API for building applications, and scale up to larger models as your hardware allows.
If you are building applications that need AI-generated media like images, video, or talking avatars, try Hypereal AI free -- 35 credits, no credit card required. Combine local LLMs for text intelligence with Hypereal's API for visual content generation.