How to Run GPT-OSS Using Ollama (2026)
Run open-source GPT models locally with a few terminal commands
OpenAI has released open-source model weights under the GPT-OSS initiative, making it possible to run GPT-class models on your own hardware without sending data to OpenAI's servers. Ollama is the easiest way to get these models running locally. This guide walks you through the full setup, from installation to API integration.
What Is GPT-OSS?
GPT-OSS refers to the family of open-weight GPT models that OpenAI has released for the community. These models are available under permissive licenses and can be downloaded, modified, and deployed freely. The open-source releases include:
| Model | Parameters | Context Window | VRAM Required | Best For |
|---|---|---|---|---|
| GPT-OSS Small | 7B | 32K | 6 GB | Fast inference, edge devices |
| GPT-OSS Medium | 30B | 64K | 20 GB | Balanced quality and speed |
| GPT-OSS Large | 70B | 128K | 48 GB | Maximum quality, server deployments |
These are not the same as GPT-4o or GPT-5 -- they are purpose-built open models that share architectural DNA with OpenAI's flagship products but are designed for local and self-hosted deployment.
Why Use Ollama for GPT-OSS?
You could run GPT-OSS models with Hugging Face Transformers, vLLM, or llama.cpp directly, but Ollama simplifies the process dramatically:
- One-command model download and setup -- no manual weight conversion
- Automatic quantization -- run larger models on less VRAM
- OpenAI-compatible API -- swap into existing applications with a base URL change
- GPU auto-detection -- NVIDIA CUDA, AMD ROCm, and Apple Metal supported automatically
- Model management -- list, pull, delete, and customize models easily
Prerequisites
Before starting, make sure your system is ready:
| Requirement | Details |
|---|---|
| OS | macOS 12+, Linux (Ubuntu 20.04+), or Windows 10+ |
| RAM | 8 GB minimum, 16+ GB recommended |
| Storage | At least 10 GB free (models range from about 4 GB quantized to 60 GB at full precision) |
| GPU (optional) | NVIDIA GPU with 6+ GB VRAM or Apple Silicon |
| Internet | Required for initial model download |
Step 1: Install Ollama
macOS
# Install via Homebrew
brew install ollama
# Or download the desktop app from ollama.com/download
Linux
curl -fsSL https://ollama.com/install.sh | sh
Windows
Download the installer from ollama.com/download or use winget:
winget install Ollama.Ollama
Verify the installation:
ollama --version
# Should output something like: ollama version 0.6.x
Step 2: Pull a GPT-OSS Model
Ollama's model library includes GPT-OSS models. Pull the one that fits your hardware:
# Pull the 7B model (smallest, runs on most hardware)
ollama pull gpt-oss:7b
# Pull the 30B model (needs 20+ GB VRAM or 32 GB RAM for CPU)
ollama pull gpt-oss:30b
# Pull a quantized version for lower VRAM
ollama pull gpt-oss:30b-q4_K_M
The download will take a few minutes depending on your connection speed. Models are cached locally in ~/.ollama/models/.
Available Quantizations
If the full model does not fit in your VRAM, use a quantized version:
| Quantization | Size (7B) | Size (30B) | Quality Impact |
|---|---|---|---|
| f16 (full) | 14 GB | 60 GB | None |
| q8_0 | 7.5 GB | 32 GB | Minimal |
| q4_K_M | 4.5 GB | 18 GB | Small |
| q4_0 | 4 GB | 16 GB | Moderate |
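These sizes follow roughly from parameters times bytes per weight: 2 bytes at f16, 1 byte at q8_0, and a bit over half a byte at the 4-bit levels, plus some file overhead. A back-of-envelope estimator (the helper below is our own, and the figures are approximate):

```python
def approx_model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough size estimate: parameters x bytes per weight.

    Real GGUF files add some overhead (metadata, embedding tables),
    so treat the result as a lower bound.
    """
    return params_billion * bits_per_weight / 8

print(approx_model_size_gb(7, 16))    # f16 7B: 14.0 GB, matching the table
print(approx_model_size_gb(30, 4.5))  # ~4.5-bit 30B: just under the listed 18 GB
```

The same arithmetic explains why q8_0 comes in at roughly half the f16 size with minimal quality loss: the weight count is unchanged, only the bytes per weight shrink.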
Step 3: Run the Model
Start an interactive chat session:
ollama run gpt-oss:7b
You will see a prompt where you can type messages:
>>> What are the key differences between REST and GraphQL?
REST uses fixed endpoints that return predetermined data structures, while GraphQL
exposes a single endpoint where clients specify exactly what data they need...
Press Ctrl+D or type /bye to exit.
Step 4: Use the API
Ollama starts an HTTP server on localhost:11434 automatically. You can use it with any HTTP client.
Using cURL
curl http://localhost:11434/api/chat -d '{
"model": "gpt-oss:7b",
"messages": [
{"role": "user", "content": "Write a Python function to merge two sorted lists."}
],
"stream": false
}'
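The same request can be made from Python with nothing but the standard library. A minimal sketch (the helper functions are our own, and the commented call assumes the Ollama server is running on the default port):

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str) -> bytes:
    """Encode a single-turn request body for Ollama's /api/chat endpoint."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one complete JSON reply, not a stream
    }).encode()

def ollama_chat(model: str, prompt: str, host: str = "http://localhost:11434") -> str:
    """POST the request and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{host}/api/chat",
        data=build_chat_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

# print(ollama_chat("gpt-oss:7b", "Write a Python function to merge two sorted lists."))
```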
Using the OpenAI-Compatible Endpoint
Ollama exposes an OpenAI-compatible API at /v1/, so you can use the standard OpenAI SDK:
from openai import OpenAI
client = OpenAI(
api_key="ollama", # any string works
base_url="http://localhost:11434/v1"
)
response = client.chat.completions.create(
model="gpt-oss:7b",
messages=[
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "Write a binary search function in Python with type hints."}
],
temperature=0.7
)
print(response.choices[0].message.content)
Using JavaScript/TypeScript
import OpenAI from "openai";
const client = new OpenAI({
apiKey: "ollama",
baseURL: "http://localhost:11434/v1",
});
const response = await client.chat.completions.create({
model: "gpt-oss:7b",
messages: [
{ role: "user", content: "Explain closures in JavaScript with examples." },
],
});
console.log(response.choices[0].message.content);
Step 5: Customize the Model with a Modelfile
You can create a custom version of GPT-OSS with a specific system prompt, parameters, or LoRA adapters using a Modelfile:
# Modelfile
FROM gpt-oss:7b
SYSTEM "You are a senior software engineer. Always provide production-ready code with error handling, type hints, and docstrings."
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
Build and run your custom model:
# Create the custom model
ollama create gpt-oss-coder -f Modelfile
# Run it
ollama run gpt-oss-coder
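Because a Modelfile is plain text, it is easy to generate from code when you maintain several prompt variants. A small sketch (the helper is our own, not part of Ollama):

```python
def build_modelfile(base: str, system: str, **params) -> str:
    """Render a Modelfile string: base model, system prompt, then parameters."""
    lines = [f"FROM {base}", f'SYSTEM "{system}"']
    lines += [f"PARAMETER {name} {value}" for name, value in params.items()]
    return "\n".join(lines) + "\n"

text = build_modelfile(
    "gpt-oss:7b",
    "You are a senior software engineer.",
    temperature=0.3,
    num_ctx=8192,
)
print(text)
# Write it out, then build as usual:
#   with open("Modelfile", "w") as f:
#       f.write(text)
#   ollama create gpt-oss-coder -f Modelfile
```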
Step 6: Manage Your Models
Useful commands for managing your local models:
# List all downloaded models
ollama list
# Show model details (size, quantization, parameters)
ollama show gpt-oss:7b
# Remove a model to free disk space
ollama rm gpt-oss:30b
# Copy a model (useful before customizing)
ollama cp gpt-oss:7b gpt-oss-backup:7b
Performance Tips
GPU Acceleration
Ollama auto-detects your GPU. To verify GPU usage:
# Check if GPU is being used (NVIDIA)
nvidia-smi
# Print per-response timing stats; a low tokens-per-second rate suggests CPU fallback
ollama run gpt-oss:7b --verbose
Running Multiple Models
Ollama can serve requests for several models from one server; it loads models on demand and swaps them as memory allows. Each request specifies which model to use:
# Pull multiple models
ollama pull gpt-oss:7b
ollama pull gpt-oss:30b
# The API handles routing automatically
curl http://localhost:11434/api/chat -d '{"model": "gpt-oss:7b", "messages": [...]}'
curl http://localhost:11434/api/chat -d '{"model": "gpt-oss:30b", "messages": [...]}'
Increasing Context Length
By default, Ollama uses a 2048-token context window. For longer conversations or documents:
# Set context length inside an interactive session
ollama run gpt-oss:7b
>>> /set parameter num_ctx 16384
# Or set it in your Modelfile
# PARAMETER num_ctx 16384
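When sizing num_ctx to a document, a common rule of thumb is that one token is roughly four characters of English text. A rough estimator (heuristic only, not the model's real tokenizer; the helpers are our own):

```python
def rough_token_count(text: str) -> int:
    """Crude token estimate: ~4 characters per token for English prose."""
    return max(1, len(text) // 4)

def pick_num_ctx(document: str, reply_budget: int = 1024) -> int:
    """Round the needed window up to the next power of two, minimum 2048."""
    needed = rough_token_count(document) + reply_budget
    ctx = 2048
    while ctx < needed:
        ctx *= 2
    return ctx

print(pick_num_ctx("word " * 2000))  # ~2500 tokens of input + reply -> 4096
```

Keep in mind that a larger context window also increases memory use, so do not set it higher than your documents actually need.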
GPT-OSS vs. Other Open Models
| Model | Parameters | License | Coding | Reasoning | Speed |
|---|---|---|---|---|---|
| GPT-OSS 7B | 7B | Apache 2.0 | Good | Good | Fast |
| Llama 3.3 70B | 70B | Llama License | Excellent | Excellent | Slow |
| Mistral Large | 123B | Apache 2.0 | Very Good | Very Good | Slow |
| Qwen 2.5 72B | 72B | Apache 2.0 | Excellent | Very Good | Slow |
| Gemma 3 27B | 27B | Gemma License | Good | Good | Medium |
| GPT-OSS 30B | 30B | Apache 2.0 | Very Good | Very Good | Medium |
Troubleshooting
"Model not found" error
Make sure you pulled the model first with ollama pull gpt-oss:7b. Run ollama list to see available models.
Slow inference on CPU
If you do not have a GPU, use the smallest quantized model: ollama pull gpt-oss:7b-q4_0. Consider upgrading to a system with a GPU for real-time inference.
Out of memory errors
Switch to a smaller quantization. If using the 30B model, try gpt-oss:30b-q4_0 or drop down to the 7B variant.
Port already in use
If port 11434 is taken, start the server on a different port (use 127.0.0.1 instead of 0.0.0.0 if you do not want to expose the server to your network):
OLLAMA_HOST=0.0.0.0:11435 ollama serve
Wrapping Up
Running GPT-OSS models locally with Ollama gives you full control over your AI stack -- no API keys, no rate limits, no data leaving your machine. The setup takes under 10 minutes and the OpenAI-compatible API means you can plug it into almost any existing application.
