How to Use vLLM: Fast LLM Inference Engine Guide (2026)
Deploy and optimize LLMs with the fastest open-source serving engine
vLLM is the fastest open-source inference engine for serving large language models. Developed at UC Berkeley, it combines a memory-management technique called PagedAttention with continuous batching to deliver up to 24x higher throughput than HuggingFace Transformers. If you need to serve an LLM in production, self-host a model for your team, or just run local inference efficiently, vLLM is the tool to learn.
This guide covers installation, basic serving, advanced optimization, and production deployment.
Why vLLM?
Before diving in, here is how vLLM compares to other serving options:
| Feature | vLLM | TGI (HuggingFace) | Ollama | llama.cpp | TensorRT-LLM |
|---|---|---|---|---|---|
| Throughput | Excellent | Good | Medium | Medium | Excellent |
| Latency | Low | Low | Medium | Low | Very Low |
| GPU support | NVIDIA, AMD, TPU | NVIDIA | NVIDIA, Apple | CPU, NVIDIA, Apple | NVIDIA only |
| Model support | HuggingFace, GGUF | HuggingFace | GGUF | GGUF | Custom format |
| OpenAI-compatible API | Yes | Yes | Yes | Yes (server mode) | Yes |
| Tensor parallelism | Yes | Yes | No | No | Yes |
| Speculative decoding | Yes | Limited | No | Yes | Yes |
| Production-ready | Yes | Yes | No | No | Yes |
| Setup difficulty | Medium | Medium | Easy | Easy | Hard |
vLLM excels at high-throughput serving with multiple concurrent users. If you are building an API that needs to handle hundreds or thousands of requests per minute, vLLM is the right choice.
Step 1: Installation
Prerequisites
- NVIDIA GPU with CUDA 12.1+ (or AMD GPU with ROCm 6.0+)
- Python 3.9-3.12
- At least 16GB of GPU VRAM for 7-8B models, 24GB+ for 13B, and roughly 160GB total (e.g., 2x 80GB or 4x 40GB) for 70B at 16-bit precision; quantization lowers these requirements
Install via pip
# Create a virtual environment
python -m venv vllm-env
source vllm-env/bin/activate
# Install vLLM
pip install vllm
# Verify installation
python -c "import vllm; print(vllm.__version__)"
Install via Docker (Recommended for Production)
# Pull the official vLLM Docker image
docker pull vllm/vllm-openai:latest
# Run with GPU access (HF token is needed for gated models like Llama; host IPC gives NCCL enough shared memory for tensor parallelism)
docker run --runtime nvidia --gpus all \
--ipc=host \
--env "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 4
Install from Source (Latest Features)
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
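Before starting a server, you can sanity-check the installation with vLLM's offline Python API, which batches prompts directly through the engine. A minimal sketch using a small open placeholder model (swap in any HuggingFace model you have access to):
# quick_check.py - offline batch inference with vLLM's Python API
from vllm import LLM, SamplingParams

# A small model keeps the sanity check fast and low on VRAM.
llm = LLM(model="facebook/opt-125m")
sampling = SamplingParams(temperature=0.7, max_tokens=64)

prompts = [
    "The key advantage of PagedAttention is",
    "Continuous batching improves throughput because",
]

# generate() runs all prompts through the engine as one batch
outputs = llm.generate(prompts, sampling)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text.strip())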
Step 2: Serve Your First Model
Start the OpenAI-Compatible Server
# Serve Llama 3.1 8B (fits on a single 24GB GPU)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--port 8000
# The server starts and listens on http://localhost:8000
Test the Server
# Using curl
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.3-8B-Instruct",
"messages": [
{"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}
],
"temperature": 0.7,
"max_tokens": 512
}'
# Using the OpenAI Python SDK
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="dummy" # vLLM does not require auth by default
)
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "Explain async/await in Python with examples"}
],
temperature=0.7,
max_tokens=1024
)
print(response.choices[0].message.content)
// Using the OpenAI Node.js SDK
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "http://localhost:8000/v1",
apiKey: "dummy",
});
const response = await client.chat.completions.create({
model: "meta-llama/Llama-3.1-8B-Instruct",
messages: [
{ role: "user", content: "Write a React hook for debouncing input" },
],
temperature: 0.7,
max_tokens: 1024,
});
console.log(response.choices[0].message.content);
Step 3: Optimize Performance
Tensor Parallelism (Multiple GPUs)
Split a large model across multiple GPUs for faster inference:
# Serve a 70B model across 4 GPUs
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 4 \
--port 8000
| Model Size | GPUs Needed | TP Size | Approx. Weight VRAM per GPU |
|---|---|---|---|
| 7-8B | 1x 24GB | 1 | ~16GB |
| 13B | 1x 40GB or 2x 24GB | 1-2 | ~13-26GB |
| 34B | 2x 40GB or 4x 24GB | 2-4 | ~17-34GB |
| 70B | 2x 80GB or 4x 40GB | 2-4 | ~35-70GB |
| 405B (FP8) | 8x 80GB | 8 | ~51GB |
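The arithmetic behind the table is simple: weight memory is parameter count times bytes per parameter, divided across the tensor-parallel group. Here is an illustrative sketch of that rule of thumb (weights only; KV cache, activations, and framework overhead come on top):
# vram_estimate.py - back-of-the-envelope weight memory per GPU
BYTES_PER_PARAM = {"fp16": 2, "bf16": 2, "fp8": 1, "int4": 0.5}

def weight_gb_per_gpu(params_billion: float, dtype: str, tp_size: int) -> float:
    """Approximate weight memory per GPU (in GB) for a given tensor-parallel size."""
    total_gb = params_billion * BYTES_PER_PARAM[dtype]
    return total_gb / tp_size

# Llama 3.3 70B in bf16 across 4 GPUs: ~35 GB of weights per GPU,
# leaving the rest of each card for KV cache.
print(round(weight_gb_per_gpu(70, "bf16", 4), 1))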
Quantization
Serve quantized models to reduce VRAM usage while maintaining quality:
# Serve a GPTQ-quantized model (4-bit)
python -m vllm.entrypoints.openai.api_server \
--model TheBloke/Llama-3.3-70B-Instruct-GPTQ \
--quantization gptq \
--tensor-parallel-size 2 \
--port 8000
# Serve an AWQ-quantized model
python -m vllm.entrypoints.openai.api_server \
--model TheBloke/Llama-3.3-70B-Instruct-AWQ \
--quantization awq \
--tensor-parallel-size 2 \
--port 8000
# FP8 quantization (best quality/size tradeoff on H100)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.3-70B-Instruct \
--quantization fp8 \
--tensor-parallel-size 2 \
--port 8000
Speculative Decoding
Use a smaller "draft" model to speed up generation from the larger "target" model:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.3-70B-Instruct \
--speculative-model meta-llama/Llama-3.1-8B-Instruct \
--num-speculative-tokens 5 \
--tensor-parallel-size 4 \
--port 8000
Speculative decoding can improve throughput by 1.5-2x for many workloads with no loss in output quality: draft tokens are verified by the target model and rejected ones are regenerated, so the output matches what the target model would produce on its own.
Prefix Caching
Enable automatic prefix caching for workloads with repeated system prompts:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.3-70B-Instruct \
--enable-prefix-caching \
--tensor-parallel-size 4 \
--port 8000
This significantly reduces prefill time for requests that share the same system prompt or context prefix, because the shared portion's KV cache is reused instead of recomputed.
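One way to see the effect is to send several requests that share a long, identical system prompt and compare latency with and without the flag. A rough sketch with the OpenAI SDK (the support-agent prompt and the numbers you will see are placeholders; they depend on your hardware and prompt length):
# prefix_cache_demo.py - repeated system prompts benefit from prefix caching
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# A long, shared system prompt is exactly what prefix caching accelerates.
system_prompt = "You are a support agent for Acme Corp. " * 200

for question in ["How do I reset my password?", "How do I cancel my plan?"]:
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        max_tokens=64,
    )
    # The second request should be noticeably faster once the prefix is cached.
    print(f"{time.perf_counter() - start:.2f}s", resp.choices[0].message.content[:60])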
Chunked Prefill
For long-context workloads, enable chunked prefill to keep latency low: long prompt prefills are split into smaller chunks and batched alongside decode steps, so one very long prompt does not stall generation for other requests:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.3-70B-Instruct \
--enable-chunked-prefill \
--max-num-batched-tokens 8192 \
--port 8000
Step 4: Production Configuration
Recommended Production Launch Command
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 4 \
--port 8000 \
--host 0.0.0.0 \
--max-model-len 32768 \
--max-num-seqs 256 \
--enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--disable-log-requests \
--api-key "your-secret-api-key"
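With --api-key set, vLLM rejects requests that do not carry the key as a standard bearer token, which the OpenAI SDKs send automatically. A short client sketch (replace the host and key placeholders with your own values):
# client_with_auth.py - connecting to a vLLM server launched with --api-key
from openai import OpenAI

client = OpenAI(
    base_url="https://your-vllm-host:8000/v1",  # placeholder: your server endpoint
    api_key="your-secret-api-key",              # must match the --api-key value
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=8,
)
print(resp.choices[0].message.content)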
Key Production Parameters
| Parameter | Description | Default | Recommended |
|---|---|---|---|
| `--gpu-memory-utilization` | Fraction of GPU memory to use | 0.9 | 0.90-0.95 |
| `--max-model-len` | Maximum sequence length | Model default | Set based on use case |
| `--max-num-seqs` | Max concurrent sequences | 256 | Based on VRAM |
| `--enable-prefix-caching` | Cache common prefixes | Off | On (production) |
| `--disable-log-requests` | Reduce logging overhead | Off | On (production) |
| `--api-key` | Require API key auth | None | Always set |
Docker Compose for Production
# docker-compose.yml
version: "3.8"
services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    ipc: host  # PyTorch/NCCL need host shared memory for tensor parallelism
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 4
              capabilities: [gpu]
    ports:
      - "8000:8000"
    volumes:
      - huggingface-cache:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    command: >
      --model meta-llama/Llama-3.3-70B-Instruct
      --tensor-parallel-size 4
      --port 8000
      --host 0.0.0.0
      --max-model-len 32768
      --enable-prefix-caching
      --gpu-memory-utilization 0.92
      --api-key ${VLLM_API_KEY}
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
  nginx:
    image: nginx:alpine
    ports:
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
      - ./certs:/etc/nginx/certs
    depends_on:
      - vllm
volumes:
  huggingface-cache:
Monitoring with Prometheus
vLLM exposes Prometheus metrics out of the box:
# Metrics are available at /metrics
curl http://localhost:8000/metrics
Key metrics to monitor:
| Metric | Description |
|---|---|
| `vllm:num_requests_running` | Active requests |
| `vllm:num_requests_waiting` | Queued requests |
| `vllm:avg_generation_throughput_toks_per_s` | Tokens per second |
| `vllm:avg_prompt_throughput_toks_per_s` | Prompt processing speed |
| `vllm:gpu_cache_usage_perc` | KV cache utilization |
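For a quick look without setting up Prometheus, a small script can poll the endpoint and print just the gauges above. A throwaway sketch using only the Python standard library:
# poll_metrics.py - print selected vLLM gauges from the /metrics endpoint
import time
import urllib.request

WATCH = (
    "vllm:num_requests_running",
    "vllm:num_requests_waiting",
    "vllm:gpu_cache_usage_perc",
)

while True:
    body = urllib.request.urlopen("http://localhost:8000/metrics").read().decode()
    for line in body.splitlines():
        # HELP/TYPE comment lines start with '#', so they never match the prefixes
        if line.startswith(WATCH):
            print(line)
    print("---")
    time.sleep(5)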
Step 5: Serve Popular Models
Qwen 2.5 Coder (Best for Coding)
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-Coder-32B-Instruct \
--tensor-parallel-size 2 \
--port 8000
DeepSeek V3 (Best Open-Source General Model)
python -m vllm.entrypoints.openai.api_server \
--model deepseek-ai/DeepSeek-V3 \
--tensor-parallel-size 8 \
--trust-remote-code \
--port 8000
Mistral Large (Strong European Model)
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-Large-Instruct-2411 \
--tensor-parallel-size 4 \
--port 8000
Benchmarking Your Deployment
The vLLM repository ships a serving benchmark script you can point at a running server:
# Benchmark serving throughput (run from a checkout of the vLLM repo)
git clone https://github.com/vllm-project/vllm.git
python vllm/benchmarks/benchmark_serving.py \
--backend vllm \
--base-url http://localhost:8000 \
--model meta-llama/Llama-3.3-70B-Instruct \
--dataset-name random \
--num-prompts 100 \
--request-rate 10
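If you would rather not clone the repo, a rough throughput number can be had with a short asyncio script that fires concurrent requests at the server. This is only an illustrative sketch, not a replacement for the real benchmark:
# quick_bench.py - crude concurrent-request benchmark against a running server
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

async def one_request() -> int:
    resp = await client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct",
        messages=[{"role": "user", "content": "Summarize what vLLM does in three sentences."}],
        max_tokens=128,
    )
    # vLLM reports token usage in the response, like the OpenAI API
    return resp.usage.completion_tokens

async def main(concurrency: int = 20) -> None:
    start = time.perf_counter()
    tokens = await asyncio.gather(*[one_request() for _ in range(concurrency)])
    elapsed = time.perf_counter() - start
    print(f"{sum(tokens)} tokens in {elapsed:.1f}s "
          f"-> {sum(tokens) / elapsed:.0f} tok/s across {concurrency} requests")

asyncio.run(main())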
Expected Throughput (Llama 3.3 70B, 4x A100 80GB)
| Concurrent Users | Throughput | Inter-token Latency (p50) | Inter-token Latency (p99) |
|---|---|---|---|
| 1 | 85 tok/s | 12ms/tok | 18ms/tok |
| 10 | 620 tok/s | 16ms/tok | 35ms/tok |
| 50 | 2,100 tok/s | 24ms/tok | 65ms/tok |
| 100 | 3,200 tok/s | 31ms/tok | 120ms/tok |
Frequently Asked Questions
What GPUs does vLLM support? NVIDIA GPUs with compute capability 7.0+ (V100, T4, A10, A100, H100, L40S, RTX 3090/4090). AMD GPUs with ROCm 6.0+ (MI210, MI250, MI300X). Google TPUs via JAX backend.
Can I serve multiple models on one GPU? Not directly in a single vLLM instance. You can run multiple vLLM instances on different ports, each serving a different model, and use a load balancer to route requests.
How does vLLM compare to Ollama for personal use? Ollama is easier to set up and is great for single-user local inference. vLLM is designed for multi-user serving with high throughput. Use Ollama for personal use and vLLM for team/production deployments.
Does vLLM support streaming? Yes. Set "stream": true in your API request to get token-by-token streaming responses.
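With the OpenAI Python SDK that looks like the following short sketch (swap in whichever model your server is actually serving):
# streaming.py - token-by-token output from a vLLM server
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention in two sentences."}],
    stream=True,
)

for chunk in stream:
    # Each chunk carries a delta with zero or more new characters
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()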
Can I fine-tune models and then serve them with vLLM? Yes. Train with any framework (HuggingFace, Axolotl, etc.) and point vLLM at the saved model directory or LoRA adapter.
Wrapping Up
vLLM is the gold standard for self-hosted LLM serving in 2026. Its PagedAttention architecture, tensor parallelism, and speculative decoding deliver unmatched throughput for production deployments. Start with a simple single-GPU setup, benchmark your workload, then scale to multi-GPU configurations as needed.
For applications that combine LLM text generation with AI-generated media, Hypereal AI provides production-ready APIs for video generation, AI avatars, and image synthesis. Pair your self-hosted vLLM instance with Hypereal's media APIs for a complete AI stack. Get started with 35 free credits.