How to Run GPT-OSS Using Ollama (2026)
Run open-source GPT models locally with a few terminal commands
OpenAI has released open-source model weights under the GPT-OSS initiative, making it possible to run GPT-class models on your own hardware without sending data to OpenAI's servers. Ollama is the easiest way to get these models running locally. This guide walks you through the full setup, from installation to API integration.
What Is GPT-OSS?
GPT-OSS refers to the family of open-weight GPT models that OpenAI has released for the community. These models are available under permissive licenses and can be downloaded, modified, and deployed freely. The open-source releases include:
| Model | Parameters | Context Window | VRAM Required | Best For |
|---|---|---|---|---|
| GPT-OSS Small | 7B | 32K | 6 GB | Fast inference, edge devices |
| GPT-OSS Medium | 30B | 64K | 20 GB | Balanced quality and speed |
| GPT-OSS Large | 70B | 128K | 48 GB | Maximum quality, server deployments |
These are not the same as GPT-4o or GPT-5 -- they are purpose-built open models that share architectural DNA with OpenAI's flagship products but are designed for local and self-hosted deployment.
Why Use Ollama for GPT-OSS?
You could run GPT-OSS models with Hugging Face Transformers, vLLM, or llama.cpp directly, but Ollama simplifies the process dramatically:
- One-command model download and setup -- no manual weight conversion
- Automatic quantization -- run larger models on less VRAM
- OpenAI-compatible API -- swap into existing applications with a base URL change
- GPU auto-detection -- NVIDIA CUDA, AMD ROCm, and Apple Metal supported automatically
- Model management -- list, pull, delete, and customize models easily
Prerequisites
Before starting, make sure your system is ready:
| Requirement | Details |
|---|---|
| OS | macOS 12+, Linux (Ubuntu 20.04+), or Windows 10+ |
| RAM | 8 GB minimum, 16+ GB recommended |
| Storage | At least 10 GB free (models range from about 4 GB quantized to 60 GB at full precision) |
| GPU (optional) | NVIDIA GPU with 6+ GB VRAM or Apple Silicon |
| Internet | Required for initial model download |
Step 1: Install Ollama
macOS
# Install via Homebrew
brew install ollama
# Or download the desktop app from ollama.com/download
Linux
curl -fsSL https://ollama.com/install.sh | sh
Windows
Download the installer from ollama.com/download or use winget:
winget install Ollama.Ollama
Verify the installation:
ollama --version
# Should output something like: ollama version 0.6.x
Step 2: Pull a GPT-OSS Model
Ollama's model library includes GPT-OSS models. Pull the one that fits your hardware:
# Pull the 7B model (smallest, runs on most hardware)
ollama pull gpt-oss:7b
# Pull the 30B model (needs 20+ GB VRAM or 32 GB RAM for CPU)
ollama pull gpt-oss:30b
# Pull a quantized version for lower VRAM
ollama pull gpt-oss:30b-q4_K_M
The download will take a few minutes depending on your connection speed. Models are cached locally in ~/.ollama/models/.
Available Quantizations
If the full model does not fit in your VRAM, use a quantized version:
| Quantization | Size (7B) | Size (30B) | Quality Impact |
|---|---|---|---|
| f16 (full) | 14 GB | 60 GB | None |
| q8_0 | 7.5 GB | 32 GB | Minimal |
| q4_K_M | 4.5 GB | 18 GB | Small |
| q4_0 | 4 GB | 16 GB | Moderate |
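These sizes follow roughly from parameters times bytes per weight: 2 bytes at f16, 1 byte at q8_0, and a bit over half a byte at the 4-bit levels, plus some file overhead. A back-of-envelope estimator (the helper below is our own, and the figures are approximate):

```python
def approx_model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough size estimate: parameters x bytes per weight.

    Real GGUF files add some overhead (metadata, embedding tables),
    so treat the result as a lower bound.
    """
    return params_billion * bits_per_weight / 8

print(approx_model_size_gb(7, 16))    # f16 7B: 14.0 GB, matching the table
print(approx_model_size_gb(30, 4.5))  # ~4.5-bit 30B: just under the listed 18 GB
```

The same arithmetic explains why q8_0 comes in at roughly half the f16 size with minimal quality loss: the weight count is unchanged, only the bytes per weight shrink.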
Step 3: Run the Model
Start an interactive chat session:
ollama run gpt-oss:7b
You will see a prompt where you can type messages:
>>> What are the key differences between REST and GraphQL?
REST uses fixed endpoints that return predetermined data structures, while GraphQL
exposes a single endpoint where clients specify exactly what data they need...
Press Ctrl+D or type /bye to exit.
Step 4: Use the API
Ollama starts an HTTP server on localhost:11434 automatically. You can use it with any HTTP client.
Using cURL
curl http://localhost:11434/api/chat -d '{
"model": "gpt-oss:7b",
"messages": [
{"role": "user", "content": "Write a Python function to merge two sorted lists."}
],
"stream": false
}'
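The same request can be made from Python with nothing but the standard library. A minimal sketch (the helper functions are our own, and the commented call assumes the Ollama server is running on the default port):

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str) -> bytes:
    """Encode a single-turn request body for Ollama's /api/chat endpoint."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one complete JSON reply, not a stream
    }).encode()

def ollama_chat(model: str, prompt: str, host: str = "http://localhost:11434") -> str:
    """POST the request and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{host}/api/chat",
        data=build_chat_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

# print(ollama_chat("gpt-oss:7b", "Write a Python function to merge two sorted lists."))
```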
Using the OpenAI-Compatible Endpoint
Ollama exposes an OpenAI-compatible API at /v1/, so you can use the standard OpenAI SDK:
from openai import OpenAI
client = OpenAI(
api_key="ollama", # any string works
base_url="http://localhost:11434/v1"
)
response = client.chat.completions.create(
model="gpt-oss:7b",
messages=[
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "Write a binary search function in Python with type hints."}
],
temperature=0.7
)
print(response.choices[0].message.content)
Using JavaScript/TypeScript
import OpenAI from "openai";
const client = new OpenAI({
apiKey: "ollama",
baseURL: "http://localhost:11434/v1",
});
const response = await client.chat.completions.create({
model: "gpt-oss:7b",
messages: [
{ role: "user", content: "Explain closures in JavaScript with examples." },
],
});
console.log(response.choices[0].message.content);
Step 5: Customize the Model with a Modelfile
You can create a custom version of GPT-OSS with a specific system prompt, parameters, or LoRA adapters using a Modelfile:
# Modelfile
FROM gpt-oss:7b
SYSTEM "You are a senior software engineer. Always provide production-ready code with error handling, type hints, and docstrings."
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
Build and run your custom model:
# Create the custom model
ollama create gpt-oss-coder -f Modelfile
# Run it
ollama run gpt-oss-coder
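Because a Modelfile is plain text, it is easy to generate from code when you maintain several prompt variants. A small sketch (the helper is our own, not part of Ollama):

```python
def build_modelfile(base: str, system: str, **params) -> str:
    """Render a Modelfile string: base model, system prompt, then parameters."""
    lines = [f"FROM {base}", f'SYSTEM "{system}"']
    lines += [f"PARAMETER {name} {value}" for name, value in params.items()]
    return "\n".join(lines) + "\n"

text = build_modelfile(
    "gpt-oss:7b",
    "You are a senior software engineer.",
    temperature=0.3,
    num_ctx=8192,
)
print(text)
# Write it out, then build as usual:
#   with open("Modelfile", "w") as f:
#       f.write(text)
#   ollama create gpt-oss-coder -f Modelfile
```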
Step 6: Manage Your Models
Useful commands for managing your local models:
# List all downloaded models
ollama list
# Show model details (size, quantization, parameters)
ollama show gpt-oss:7b
# Remove a model to free disk space
ollama rm gpt-oss:30b
# Copy a model (useful before customizing)
ollama cp gpt-oss:7b gpt-oss-backup:7b
Performance Tips
GPU Acceleration
Ollama auto-detects your GPU. To verify GPU usage:
# Check if GPU is being used (NVIDIA)
nvidia-smi
# Print per-response timing stats; a low tokens-per-second rate suggests CPU fallback
ollama run gpt-oss:7b --verbose
Running Multiple Models
Ollama can serve requests for several models from one server; it loads models on demand and swaps them as memory allows. Each request specifies which model to use:
# Pull multiple models
ollama pull gpt-oss:7b
ollama pull gpt-oss:30b
# The API handles routing automatically
curl http://localhost:11434/api/chat -d '{"model": "gpt-oss:7b", "messages": [...]}'
curl http://localhost:11434/api/chat -d '{"model": "gpt-oss:30b", "messages": [...]}'
Increasing Context Length
By default, Ollama uses a 2048-token context window. For longer conversations or documents:
# Set context length inside an interactive session
ollama run gpt-oss:7b
>>> /set parameter num_ctx 16384
# Or set it in your Modelfile
# PARAMETER num_ctx 16384
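When sizing num_ctx to a document, a common rule of thumb is that one token is roughly four characters of English text. A rough estimator (heuristic only, not the model's real tokenizer; the helpers are our own):

```python
def rough_token_count(text: str) -> int:
    """Crude token estimate: ~4 characters per token for English prose."""
    return max(1, len(text) // 4)

def pick_num_ctx(document: str, reply_budget: int = 1024) -> int:
    """Round the needed window up to the next power of two, minimum 2048."""
    needed = rough_token_count(document) + reply_budget
    ctx = 2048
    while ctx < needed:
        ctx *= 2
    return ctx

print(pick_num_ctx("word " * 2000))  # ~2500 tokens of input + reply -> 4096
```

Keep in mind that a larger context window also increases memory use, so do not set it higher than your documents actually need.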
GPT-OSS vs. Other Open Models
| Model | Parameters | License | Coding | Reasoning | Speed |
|---|---|---|---|---|---|
| GPT-OSS 7B | 7B | Apache 2.0 | Good | Good | Fast |
| Llama 3.3 70B | 70B | Llama License | Excellent | Excellent | Slow |
| Mistral Large | 123B | Apache 2.0 | Very Good | Very Good | Slow |
| Qwen 2.5 72B | 72B | Apache 2.0 | Excellent | Very Good | Slow |
| Gemma 3 27B | 27B | Gemma License | Good | Good | Medium |
| GPT-OSS 30B | 30B | Apache 2.0 | Very Good | Very Good | Medium |
Troubleshooting
"Model not found" error
Make sure you pulled the model first with ollama pull gpt-oss:7b. Run ollama list to see available models.
Slow inference on CPU
If you do not have a GPU, use the smallest quantized model: ollama pull gpt-oss:7b-q4_0. Consider upgrading to a system with a GPU for real-time inference.
Out of memory errors
Switch to a smaller quantization. If using the 30B model, try gpt-oss:30b-q4_0 or drop down to the 7B variant.
Port already in use
If port 11434 is taken, start the server on a different port (use 127.0.0.1 instead of 0.0.0.0 if you do not want to expose the server to your network):
OLLAMA_HOST=0.0.0.0:11435 ollama serve
Wrapping Up
Running GPT-OSS models locally with Ollama gives you full control over your AI stack -- no API keys, no rate limits, no data leaving your machine. The setup takes under 10 minutes and the OpenAI-compatible API means you can plug it into almost any existing application.
