How to Use LiteLLM with Ollama (2026)
Run local LLMs through a unified API proxy with LiteLLM and Ollama
LiteLLM is a Python library and proxy server that provides a unified OpenAI-compatible API for over 100 LLM providers. Ollama is the easiest way to run large language models locally on your machine. Combining them gives you a local LLM setup that any OpenAI-compatible tool can connect to -- VS Code extensions, web UIs, CI pipelines, and custom applications all work without modification.
This guide covers everything from initial setup to advanced routing configurations.
Why Use LiteLLM with Ollama?
Running Ollama alone is great for local inference, but LiteLLM adds several important capabilities:
- OpenAI-compatible API: Any tool that supports the OpenAI API format works automatically.
- Model routing: Send different requests to different models based on rules you define.
- Load balancing: Distribute requests across multiple Ollama instances.
- Fallback chains: If a local model fails, fall back to a cloud provider.
- Usage tracking: Monitor token usage, latency, and costs across all models.
- Rate limiting: Protect your local resources from being overwhelmed.
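The first point is the practical payoff: swapping between a cloud provider and your local proxy is purely a connection-detail change, with no edits to the calling code. A minimal sketch (the keys and URLs below are illustrative placeholders):

```python
# Illustrative only: the same OpenAI-style client code works against either
# endpoint; only base_url and api_key differ.
CLOUD = {"base_url": "https://api.openai.com/v1", "api_key": "sk-..."}
LOCAL = {"base_url": "http://localhost:4000/v1", "api_key": "sk-local-dev-key-12345"}

def endpoint_for(use_local: bool) -> dict:
    """Select connection details; the rest of the application is unchanged."""
    return LOCAL if use_local else CLOUD
```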
Prerequisites
You need the following installed:
- Python 3.9+
- Ollama (version 0.4 or later)
- At least 8GB RAM for the smaller models below (16GB+ recommended); a 70B model such as Llama 3.3 needs roughly 40GB or more of free memory
Step 1: Install Ollama
If you have not installed Ollama yet:
macOS:
brew install ollama
Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows: Download the installer from ollama.com.
Start the Ollama server:
ollama serve
By default, Ollama listens on http://localhost:11434.
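If you want to verify the server from a script rather than the terminal, a small standard-library sketch works; it probes Ollama's /api/tags model-listing endpoint and returns False on any connection failure:

```python
import json
import urllib.error
import urllib.request

def ollama_is_up(base_url="http://localhost:11434", timeout=2):
    """Return True if an Ollama server answers at base_url."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return "models" in json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return False
```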
Step 2: Pull Models
Download the models you want to use:
# A solid general-purpose model
ollama pull llama3.3
# A fast, lightweight model for simple tasks
ollama pull phi4
# A coding-focused model
ollama pull deepseek-coder-v2
# A small model for quick responses
ollama pull gemma2:2b
Verify your models are available:
ollama list
Expected output:
NAME                        SIZE      MODIFIED
llama3.3:latest             42 GB     2 minutes ago
phi4:latest                 9.1 GB    5 minutes ago
deepseek-coder-v2:latest    8.9 GB    7 minutes ago
gemma2:2b                   1.6 GB    10 minutes ago
Step 3: Install LiteLLM
Install LiteLLM with the proxy extras:
pip install 'litellm[proxy]'
Verify the installation:
litellm --version
Step 4: Create the LiteLLM Configuration
Create a configuration file at ~/litellm_config.yaml:
model_list:
  - model_name: gpt-4
    litellm_params:
      model: ollama/llama3.3
      api_base: http://localhost:11434
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: ollama/phi4
      api_base: http://localhost:11434
  - model_name: code-model
    litellm_params:
      model: ollama/deepseek-coder-v2
      api_base: http://localhost:11434
  - model_name: fast-model
    litellm_params:
      model: ollama/gemma2:2b
      api_base: http://localhost:11434

litellm_settings:
  drop_params: true
  set_verbose: false

general_settings:
  master_key: sk-local-dev-key-12345
The model_name field is the alias that clients will use. By mapping Ollama models to OpenAI model names like gpt-4, existing tools work without any configuration changes.
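Conceptually, the alias layer is just a lookup from the client-facing name to the backend model. A sketch of the idea (not LiteLLM's actual internals):

```python
# Conceptual sketch of alias resolution -- not LiteLLM's real code.
ALIASES = {
    "gpt-4": "ollama/llama3.3",
    "gpt-3.5-turbo": "ollama/phi4",
    "code-model": "ollama/deepseek-coder-v2",
    "fast-model": "ollama/gemma2:2b",
}

def resolve(client_model: str) -> str:
    """Map the model name a client sends to the backend model it reaches."""
    return ALIASES[client_model]
```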
Step 5: Start the LiteLLM Proxy
litellm --config ~/litellm_config.yaml --port 4000
You should see output like:
INFO: LiteLLM Proxy running on http://0.0.0.0:4000
INFO: Loaded 4 models
Step 6: Test the Setup
Test with a curl request:
curl http://localhost:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-local-dev-key-12345" \
-d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "Hello, what model are you?"}]
}'
Test with Python:
from openai import OpenAI

client = OpenAI(
    api_key="sk-local-dev-key-12345",
    base_url="http://localhost:4000/v1",
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write a Python function to reverse a string"}],
)
print(response.choices[0].message.content)
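If you would rather not depend on the openai package, the same request can be assembled with the standard library; a sketch that builds (but does not send) the curl request shown above:

```python
import json
import urllib.request

def build_chat_request(base_url, api_key, model, prompt):
    """Build an OpenAI-style chat completion request for the proxy."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

# Send with urllib.request.urlopen(req) while the proxy is running.
req = build_chat_request("http://localhost:4000", "sk-local-dev-key-12345",
                         "gpt-4", "Hello")
```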
Advanced Configuration: Model Routing
You can set up rules to route requests based on the task. Create a more advanced config:
model_list:
  # Primary coding model
  - model_name: code-model
    litellm_params:
      model: ollama/deepseek-coder-v2
      api_base: http://localhost:11434
    model_info:
      max_tokens: 16384
      input_cost_per_token: 0
      output_cost_per_token: 0

  # Fast model for autocomplete and simple tasks
  - model_name: fast-model
    litellm_params:
      model: ollama/gemma2:2b
      api_base: http://localhost:11434
    model_info:
      max_tokens: 8192
      input_cost_per_token: 0
      output_cost_per_token: 0

  # General-purpose model served locally
  - model_name: general
    litellm_params:
      model: ollama/llama3.3
      api_base: http://localhost:11434

  # Cloud fallback, registered under its own name
  - model_name: general-cloud
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY

router_settings:
  routing_strategy: least-busy
  num_retries: 2
  timeout: 120
  fallbacks:
    - general: [general-cloud]

litellm_settings:
  drop_params: true
This configuration routes general requests to Llama 3.3 locally, and if that fails, falls back to GPT-4o-mini in the cloud.
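The fallback behavior is easy to reason about: it is an ordered retry across backends, where the first one that succeeds wins. A conceptual sketch of that idea (not LiteLLM's implementation; the backend functions below are stand-ins):

```python
def complete_with_fallback(prompt, backends):
    """Try each (name, callable) backend in order; return the first success."""
    errors = {}
    for name, backend in backends:
        try:
            return backend(prompt)
        except Exception as exc:  # a real router catches specific error types
            errors[name] = exc
    raise RuntimeError(f"all backends failed: {errors}")

def flaky_local(prompt):
    """Stand-in for a local Ollama deployment that is down."""
    raise ConnectionError("Ollama is unreachable")

def cloud_backend(prompt):
    """Stand-in for a cloud fallback that answers."""
    return "hello from fallback"

result = complete_with_fallback("hi", [("ollama", flaky_local),
                                       ("cloud", cloud_backend)])
```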
Load Balancing Across Multiple Machines
If you have multiple machines with GPUs, point LiteLLM at all of them:
model_list:
  - model_name: llama
    litellm_params:
      model: ollama/llama3.3
      api_base: http://192.168.1.10:11434
  - model_name: llama
    litellm_params:
      model: ollama/llama3.3
      api_base: http://192.168.1.11:11434
  - model_name: llama
    litellm_params:
      model: ollama/llama3.3
      api_base: http://192.168.1.12:11434

router_settings:
  routing_strategy: least-busy
LiteLLM will automatically distribute requests across all three machines.
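The least-busy strategy can be pictured as choosing the deployment with the fewest in-flight requests at the moment a request arrives. A conceptual sketch, not LiteLLM's code:

```python
def pick_least_busy(deployments, in_flight):
    """Return the deployment with the fewest in-flight requests (ties -> first)."""
    return min(deployments, key=lambda d: in_flight.get(d, 0))

hosts = [
    "http://192.168.1.10:11434",
    "http://192.168.1.11:11434",
    "http://192.168.1.12:11434",
]
# Hosts .10 and .11 are busy; .12 has no tracked requests, so it wins.
choice = pick_least_busy(hosts, {"http://192.168.1.10:11434": 3,
                                 "http://192.168.1.11:11434": 1})
```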
Connecting Popular Tools
Here is how to connect common development tools to your LiteLLM + Ollama setup:
| Tool | Configuration |
|---|---|
| Cursor | Settings > Models > OpenAI API Base: http://localhost:4000/v1 |
| Continue.dev | Set apiBase to http://localhost:4000/v1 in config |
| Open WebUI | Set OpenAI URL to http://localhost:4000/v1 |
| Aider | export OPENAI_API_BASE=http://localhost:4000/v1 |
| Claude Code | ANTHROPIC_BASE_URL=http://localhost:4000 claude --model gpt-4 |
Monitoring and Debugging
LiteLLM provides a built-in dashboard. Access it at:
http://localhost:4000/ui
You can also check model health:
curl http://localhost:4000/health
View spend logs (requires a database configured for the proxy):
curl http://localhost:4000/spend/logs \
-H "Authorization: Bearer sk-local-dev-key-12345"
Running as a Background Service
For a persistent setup, create a systemd service (Linux):
# /etc/systemd/system/litellm.service
[Unit]
Description=LiteLLM Proxy
After=network.target ollama.service
[Service]
Type=simple
User=your-username
ExecStart=/usr/local/bin/litellm --config /home/your-username/litellm_config.yaml --port 4000
Restart=always
[Install]
WantedBy=multi-user.target
On macOS, use a LaunchAgent or run it in a tmux session:
tmux new -d -s litellm 'litellm --config ~/litellm_config.yaml --port 4000'
Troubleshooting
| Problem | Solution |
|---|---|
| "Connection refused" on port 11434 | Run ollama serve first |
| Model downloads stuck | Check disk space with df -h |
| Slow responses | Use a smaller model or check GPU memory with nvidia-smi |
| Out of memory | Reduce context window or use a quantized model variant |
| LiteLLM cannot find model | Run ollama list and verify the model name matches your config |
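For the last row, a hypothetical helper sketches the check: compare the model names in your config against the tags from ollama list, treating a bare name like llama3.3 as matching the installed tag llama3.3:latest:

```python
def missing_models(config_models, installed_tags):
    """Return config entries (e.g. 'ollama/llama3.3') with no matching
    installed Ollama tag (e.g. 'llama3.3:latest')."""
    installed = {tag.split(":")[0] for tag in installed_tags} | set(installed_tags)
    missing = []
    for entry in config_models:
        name = entry.split("/", 1)[-1]  # drop the 'ollama/' prefix
        if name not in installed and name.split(":")[0] not in installed:
            missing.append(entry)
    return missing
```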
Conclusion
LiteLLM and Ollama together create a powerful local LLM stack that is free, private, and compatible with the entire OpenAI ecosystem. The setup takes under 10 minutes and gives you model routing, load balancing, and monitoring out of the box.
