How to Use LiteLLM with Ollama (2026)
Run local LLMs through a unified API proxy with LiteLLM and Ollama
LiteLLM is a Python library and proxy server that provides a unified OpenAI-compatible API for over 100 LLM providers. Ollama is the easiest way to run large language models locally on your machine. Combining them gives you a local LLM setup that any OpenAI-compatible tool can connect to -- VS Code extensions, web UIs, CI pipelines, and custom applications all work without modification.
This guide covers everything from initial setup to advanced routing configurations.
Why Use LiteLLM with Ollama?
Running Ollama alone is great for local inference, but LiteLLM adds several important capabilities:
- OpenAI-compatible API: Any tool that supports the OpenAI API format works automatically.
- Model routing: Send different requests to different models based on rules you define.
- Load balancing: Distribute requests across multiple Ollama instances.
- Fallback chains: If a local model fails, fall back to a cloud provider.
- Usage tracking: Monitor token usage, latency, and costs across all models.
- Rate limiting: Protect your local resources from being overwhelmed.
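The first point is the practical payoff: swapping between a cloud provider and your local proxy is purely a connection-detail change, with no edits to the calling code. A minimal sketch (the keys and URLs below are illustrative placeholders):

```python
# Illustrative only: the same OpenAI-style client code works against either
# endpoint; only base_url and api_key differ.
CLOUD = {"base_url": "https://api.openai.com/v1", "api_key": "sk-..."}
LOCAL = {"base_url": "http://localhost:4000/v1", "api_key": "sk-local-dev-key-12345"}

def endpoint_for(use_local: bool) -> dict:
    """Select connection details; the rest of the application is unchanged."""
    return LOCAL if use_local else CLOUD
```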
Prerequisites
You need the following installed:
- Python 3.9+
- Ollama (version 0.4 or later)
- At least 8GB RAM for the smaller models below (16GB+ recommended); a 70B model such as Llama 3.3 needs roughly 40GB or more of free memory
Step 1: Install Ollama
If you have not installed Ollama yet:
macOS:
brew install ollama
Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows: Download the installer from ollama.com.
Start the Ollama server:
ollama serve
By default, Ollama listens on http://localhost:11434.
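If you want to verify the server from a script rather than the terminal, a small standard-library sketch works; it probes Ollama's /api/tags model-listing endpoint and returns False on any connection failure:

```python
import json
import urllib.error
import urllib.request

def ollama_is_up(base_url="http://localhost:11434", timeout=2):
    """Return True if an Ollama server answers at base_url."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return "models" in json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return False
```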
Step 2: Pull Models
Download the models you want to use:
# A solid general-purpose model
ollama pull llama3.3
# A fast, lightweight model for simple tasks
ollama pull phi4
# A coding-focused model
ollama pull deepseek-coder-v2
# A small model for quick responses
ollama pull gemma2:2b
Verify your models are available:
ollama list
Expected output:
NAME                        SIZE      MODIFIED
llama3.3:latest             42 GB     2 minutes ago
phi4:latest                 9.1 GB    5 minutes ago
deepseek-coder-v2:latest    8.9 GB    7 minutes ago
gemma2:2b                   1.6 GB    10 minutes ago
Step 3: Install LiteLLM
Install LiteLLM with the proxy extras:
pip install 'litellm[proxy]'
Verify the installation:
litellm --version
Step 4: Create the LiteLLM Configuration
Create a configuration file at ~/litellm_config.yaml:
model_list:
  - model_name: gpt-4
    litellm_params:
      model: ollama/llama3.3
      api_base: http://localhost:11434
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: ollama/phi4
      api_base: http://localhost:11434
  - model_name: code-model
    litellm_params:
      model: ollama/deepseek-coder-v2
      api_base: http://localhost:11434
  - model_name: fast-model
    litellm_params:
      model: ollama/gemma2:2b
      api_base: http://localhost:11434

litellm_settings:
  drop_params: true
  set_verbose: false

general_settings:
  master_key: sk-local-dev-key-12345
The model_name field is the alias that clients will use. By mapping Ollama models to OpenAI model names like gpt-4, existing tools work without any configuration changes.
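Conceptually, the alias layer is just a lookup from the client-facing name to the backend model. A sketch of the idea (not LiteLLM's actual internals):

```python
# Conceptual sketch of alias resolution -- not LiteLLM's real code.
ALIASES = {
    "gpt-4": "ollama/llama3.3",
    "gpt-3.5-turbo": "ollama/phi4",
    "code-model": "ollama/deepseek-coder-v2",
    "fast-model": "ollama/gemma2:2b",
}

def resolve(client_model: str) -> str:
    """Map the model name a client sends to the backend model it reaches."""
    return ALIASES[client_model]
```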
Step 5: Start the LiteLLM Proxy
litellm --config ~/litellm_config.yaml --port 4000
You should see output like:
INFO: LiteLLM Proxy running on http://0.0.0.0:4000
INFO: Loaded 4 models
Step 6: Test the Setup
Test with a curl request:
curl http://localhost:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-local-dev-key-12345" \
-d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "Hello, what model are you?"}]
}'
Test with Python:
from openai import OpenAI

client = OpenAI(
    api_key="sk-local-dev-key-12345",
    base_url="http://localhost:4000/v1",
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write a Python function to reverse a string"}],
)
print(response.choices[0].message.content)
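If you would rather not depend on the openai package, the same request can be assembled with the standard library; a sketch that builds (but does not send) the curl request shown above:

```python
import json
import urllib.request

def build_chat_request(base_url, api_key, model, prompt):
    """Build an OpenAI-style chat completion request for the proxy."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

# Send with urllib.request.urlopen(req) while the proxy is running.
req = build_chat_request("http://localhost:4000", "sk-local-dev-key-12345",
                         "gpt-4", "Hello")
```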
Advanced Configuration: Model Routing
You can set up rules to route requests based on the task. Create a more advanced config:
model_list:
  # Primary coding model
  - model_name: code-model
    litellm_params:
      model: ollama/deepseek-coder-v2
      api_base: http://localhost:11434
    model_info:
      max_tokens: 16384
      input_cost_per_token: 0
      output_cost_per_token: 0

  # Fast model for autocomplete and simple tasks
  - model_name: fast-model
    litellm_params:
      model: ollama/gemma2:2b
      api_base: http://localhost:11434
    model_info:
      max_tokens: 8192
      input_cost_per_token: 0
      output_cost_per_token: 0

  # General-purpose model served locally
  - model_name: general
    litellm_params:
      model: ollama/llama3.3
      api_base: http://localhost:11434

  # Cloud fallback, registered under its own name
  - model_name: general-cloud
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY

router_settings:
  routing_strategy: least-busy
  num_retries: 2
  timeout: 120
  fallbacks:
    - general: [general-cloud]

litellm_settings:
  drop_params: true
This configuration routes general requests to Llama 3.3 locally, and if that fails, falls back to GPT-4o-mini in the cloud.
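The fallback behavior is easy to reason about: it is an ordered retry across backends, where the first one that succeeds wins. A conceptual sketch of that idea (not LiteLLM's implementation; the backend functions below are stand-ins):

```python
def complete_with_fallback(prompt, backends):
    """Try each (name, callable) backend in order; return the first success."""
    errors = {}
    for name, backend in backends:
        try:
            return backend(prompt)
        except Exception as exc:  # a real router catches specific error types
            errors[name] = exc
    raise RuntimeError(f"all backends failed: {errors}")

def flaky_local(prompt):
    """Stand-in for a local Ollama deployment that is down."""
    raise ConnectionError("Ollama is unreachable")

def cloud_backend(prompt):
    """Stand-in for a cloud fallback that answers."""
    return "hello from fallback"

result = complete_with_fallback("hi", [("ollama", flaky_local),
                                       ("cloud", cloud_backend)])
```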
Load Balancing Across Multiple Machines
If you have multiple machines with GPUs, point LiteLLM at all of them:
model_list:
  - model_name: llama
    litellm_params:
      model: ollama/llama3.3
      api_base: http://192.168.1.10:11434
  - model_name: llama
    litellm_params:
      model: ollama/llama3.3
      api_base: http://192.168.1.11:11434
  - model_name: llama
    litellm_params:
      model: ollama/llama3.3
      api_base: http://192.168.1.12:11434

router_settings:
  routing_strategy: least-busy
LiteLLM will automatically distribute requests across all three machines.
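The least-busy strategy can be pictured as choosing the deployment with the fewest in-flight requests at the moment a request arrives. A conceptual sketch, not LiteLLM's code:

```python
def pick_least_busy(deployments, in_flight):
    """Return the deployment with the fewest in-flight requests (ties -> first)."""
    return min(deployments, key=lambda d: in_flight.get(d, 0))

hosts = [
    "http://192.168.1.10:11434",
    "http://192.168.1.11:11434",
    "http://192.168.1.12:11434",
]
# Hosts .10 and .11 are busy; .12 has no tracked requests, so it wins.
choice = pick_least_busy(hosts, {"http://192.168.1.10:11434": 3,
                                 "http://192.168.1.11:11434": 1})
```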
Connecting Popular Tools
Here is how to connect common development tools to your LiteLLM + Ollama setup:
| Tool | Configuration |
|---|---|
| Cursor | Settings > Models > OpenAI API Base: http://localhost:4000/v1 |
| Continue.dev | Set apiBase to http://localhost:4000/v1 in config |
| Open WebUI | Set OpenAI URL to http://localhost:4000/v1 |
| Aider | export OPENAI_API_BASE=http://localhost:4000/v1 |
| Claude Code | ANTHROPIC_BASE_URL=http://localhost:4000 claude --model gpt-4 |
Monitoring and Debugging
LiteLLM provides a built-in dashboard. Access it at:
http://localhost:4000/ui
You can also check model health:
curl http://localhost:4000/health
View spend logs (requires a database configured for the proxy):
curl http://localhost:4000/spend/logs \
-H "Authorization: Bearer sk-local-dev-key-12345"
Running as a Background Service
For a persistent setup, create a systemd service (Linux):
# /etc/systemd/system/litellm.service
[Unit]
Description=LiteLLM Proxy
After=network.target ollama.service
[Service]
Type=simple
User=your-username
ExecStart=/usr/local/bin/litellm --config /home/your-username/litellm_config.yaml --port 4000
Restart=always
[Install]
WantedBy=multi-user.target
On macOS, use a LaunchAgent or run it in a tmux session:
tmux new -d -s litellm 'litellm --config ~/litellm_config.yaml --port 4000'
Troubleshooting
| Problem | Solution |
|---|---|
| "Connection refused" on port 11434 | Run ollama serve first |
| Model downloads stuck | Check disk space with df -h |
| Slow responses | Use a smaller model or check GPU memory with nvidia-smi |
| Out of memory | Reduce context window or use a quantized model variant |
| LiteLLM cannot find model | Run ollama list and verify the model name matches your config |
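For the last row, a hypothetical helper sketches the check: compare the model names in your config against the tags from ollama list, treating a bare name like llama3.3 as matching the installed tag llama3.3:latest:

```python
def missing_models(config_models, installed_tags):
    """Return config entries (e.g. 'ollama/llama3.3') with no matching
    installed Ollama tag (e.g. 'llama3.3:latest')."""
    installed = {tag.split(":")[0] for tag in installed_tags} | set(installed_tags)
    missing = []
    for entry in config_models:
        name = entry.split("/", 1)[-1]  # drop the 'ollama/' prefix
        if name not in installed and name.split(":")[0] not in installed:
            missing.append(entry)
    return missing
```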
Conclusion
LiteLLM and Ollama together create a powerful local LLM stack that is free, private, and compatible with the entire OpenAI ecosystem. The setup takes under 10 minutes and gives you model routing, load balancing, and monitoring out of the box.
