How to Use vLLM: Fast LLM Inference Engine Guide (2026)
Deploy and optimize LLMs with the fastest open-source serving engine
vLLM is the fastest open-source inference engine for serving large language models. Developed at UC Berkeley, it combines a memory-management technique called PagedAttention with continuous batching to deliver up to 24x higher throughput than HuggingFace Transformers. If you need to serve an LLM in production, self-host a model for your team, or just run local inference efficiently, vLLM is the tool to learn.
This guide covers installation, basic serving, advanced optimization, and production deployment.
Why vLLM?
Before diving in, here is how vLLM compares to other serving options:
| Feature | vLLM | TGI (HuggingFace) | Ollama | llama.cpp | TensorRT-LLM |
|---|---|---|---|---|---|
| Throughput | Excellent | Good | Medium | Medium | Excellent |
| Latency | Low | Low | Medium | Low | Very Low |
| GPU support | NVIDIA, AMD, TPU | NVIDIA | NVIDIA, Apple | CPU, NVIDIA, Apple | NVIDIA only |
| Model support | HuggingFace, GGUF | HuggingFace | GGUF | GGUF | Custom format |
| OpenAI-compatible API | Yes | Yes | Yes | Yes (server mode) | Yes |
| Tensor parallelism | Yes | Yes | No | No | Yes |
| Speculative decoding | Yes | Limited | No | Yes | Yes |
| Production-ready | Yes | Yes | No | No | Yes |
| Setup difficulty | Medium | Medium | Easy | Easy | Hard |
vLLM excels at high-throughput serving with multiple concurrent users. If you are building an API that needs to handle hundreds or thousands of requests per minute, vLLM is the right choice.
Step 1: Installation
Prerequisites
- NVIDIA GPU with CUDA 12.1+ (or AMD GPU with ROCm 6.0+)
- Python 3.9-3.12
- At least 16GB of GPU VRAM for 7-8B models, 24GB+ for 13B, and roughly 160GB total (e.g., 2x 80GB or 4x 40GB) for 70B at 16-bit precision; quantization lowers these requirements
Install via pip
# Create a virtual environment
python -m venv vllm-env
source vllm-env/bin/activate
# Install vLLM
pip install vllm
# Verify installation
python -c "import vllm; print(vllm.__version__)"
Install via Docker (Recommended for Production)
# Pull the official vLLM Docker image
docker pull vllm/vllm-openai:latest
# Run with GPU access (HF token is needed for gated models like Llama; host IPC gives NCCL enough shared memory for tensor parallelism)
docker run --runtime nvidia --gpus all \
--ipc=host \
--env "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 4
Install from Source (Latest Features)
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
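Before starting a server, you can sanity-check the installation with vLLM's offline Python API, which batches prompts directly through the engine. A minimal sketch using a small open placeholder model (swap in any HuggingFace model you have access to):
# quick_check.py - offline batch inference with vLLM's Python API
from vllm import LLM, SamplingParams

# A small model keeps the sanity check fast and low on VRAM.
llm = LLM(model="facebook/opt-125m")
sampling = SamplingParams(temperature=0.7, max_tokens=64)

prompts = [
    "The key advantage of PagedAttention is",
    "Continuous batching improves throughput because",
]

# generate() runs all prompts through the engine as one batch
outputs = llm.generate(prompts, sampling)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text.strip())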
Step 2: Serve Your First Model
Start the OpenAI-Compatible Server
# Serve Llama 3.1 8B (fits on a single 24GB GPU)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--port 8000
# The server starts and listens on http://localhost:8000
Test the Server
# Using curl
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.3-8B-Instruct",
"messages": [
{"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}
],
"temperature": 0.7,
"max_tokens": 512
}'
# Using the OpenAI Python SDK
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="dummy" # vLLM does not require auth by default
)
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "Explain async/await in Python with examples"}
],
temperature=0.7,
max_tokens=1024
)
print(response.choices[0].message.content)
// Using the OpenAI Node.js SDK
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "http://localhost:8000/v1",
apiKey: "dummy",
});
const response = await client.chat.completions.create({
model: "meta-llama/Llama-3.1-8B-Instruct",
messages: [
{ role: "user", content: "Write a React hook for debouncing input" },
],
temperature: 0.7,
max_tokens: 1024,
});
console.log(response.choices[0].message.content);
Step 3: Optimize Performance
Tensor Parallelism (Multiple GPUs)
Split a large model across multiple GPUs for faster inference:
# Serve a 70B model across 4 GPUs
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 4 \
--port 8000
| Model Size | GPUs Needed | TP Size | Approx. Weight VRAM per GPU |
|---|---|---|---|
| 7-8B | 1x 24GB | 1 | ~16GB |
| 13B | 1x 40GB or 2x 24GB | 1-2 | ~13-26GB |
| 34B | 2x 40GB or 4x 24GB | 2-4 | ~17-34GB |
| 70B | 2x 80GB or 4x 40GB | 2-4 | ~35-70GB |
| 405B (FP8) | 8x 80GB | 8 | ~51GB |
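The arithmetic behind the table is simple: weight memory is parameter count times bytes per parameter, divided across the tensor-parallel group. Here is an illustrative sketch of that rule of thumb (weights only; KV cache, activations, and framework overhead come on top):
# vram_estimate.py - back-of-the-envelope weight memory per GPU
BYTES_PER_PARAM = {"fp16": 2, "bf16": 2, "fp8": 1, "int4": 0.5}

def weight_gb_per_gpu(params_billion: float, dtype: str, tp_size: int) -> float:
    """Approximate weight memory per GPU (in GB) for a given tensor-parallel size."""
    total_gb = params_billion * BYTES_PER_PARAM[dtype]
    return total_gb / tp_size

# Llama 3.3 70B in bf16 across 4 GPUs: ~35 GB of weights per GPU,
# leaving the rest of each card for KV cache.
print(round(weight_gb_per_gpu(70, "bf16", 4), 1))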
Quantization
Serve quantized models to reduce VRAM usage while maintaining quality:
# Serve a GPTQ-quantized model (4-bit)
python -m vllm.entrypoints.openai.api_server \
--model TheBloke/Llama-3.3-70B-Instruct-GPTQ \
--quantization gptq \
--tensor-parallel-size 2 \
--port 8000
# Serve an AWQ-quantized model
python -m vllm.entrypoints.openai.api_server \
--model TheBloke/Llama-3.3-70B-Instruct-AWQ \
--quantization awq \
--tensor-parallel-size 2 \
--port 8000
# FP8 quantization (best quality/size tradeoff on H100)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.3-70B-Instruct \
--quantization fp8 \
--tensor-parallel-size 2 \
--port 8000
Speculative Decoding
Use a smaller "draft" model to speed up generation from the larger "target" model:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.3-70B-Instruct \
--speculative-model meta-llama/Llama-3.1-8B-Instruct \
--num-speculative-tokens 5 \
--tensor-parallel-size 4 \
--port 8000
Speculative decoding can improve throughput by 1.5-2x for many workloads with no loss in output quality: draft tokens are verified by the target model and rejected ones are regenerated, so the output matches what the target model would produce on its own.
Prefix Caching
Enable automatic prefix caching for workloads with repeated system prompts:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.3-70B-Instruct \
--enable-prefix-caching \
--tensor-parallel-size 4 \
--port 8000
This significantly reduces prefill time for requests that share the same system prompt or context prefix, because the shared portion's KV cache is reused instead of recomputed.
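One way to see the effect is to send several requests that share a long, identical system prompt and compare latency with and without the flag. A rough sketch with the OpenAI SDK (the support-agent prompt and the numbers you will see are placeholders; they depend on your hardware and prompt length):
# prefix_cache_demo.py - repeated system prompts benefit from prefix caching
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# A long, shared system prompt is exactly what prefix caching accelerates.
system_prompt = "You are a support agent for Acme Corp. " * 200

for question in ["How do I reset my password?", "How do I cancel my plan?"]:
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        max_tokens=64,
    )
    # The second request should be noticeably faster once the prefix is cached.
    print(f"{time.perf_counter() - start:.2f}s", resp.choices[0].message.content[:60])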
Chunked Prefill
For long-context workloads, enable chunked prefill to keep latency low: long prompt prefills are split into smaller chunks and batched alongside decode steps, so one very long prompt does not stall generation for other requests:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.3-70B-Instruct \
--enable-chunked-prefill \
--max-num-batched-tokens 8192 \
--port 8000
Step 4: Production Configuration
Recommended Production Launch Command
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 4 \
--port 8000 \
--host 0.0.0.0 \
--max-model-len 32768 \
--max-num-seqs 256 \
--enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--disable-log-requests \
--api-key "your-secret-api-key"
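With --api-key set, vLLM rejects requests that do not carry the key as a standard bearer token, which the OpenAI SDKs send automatically. A short client sketch (replace the host and key placeholders with your own values):
# client_with_auth.py - connecting to a vLLM server launched with --api-key
from openai import OpenAI

client = OpenAI(
    base_url="https://your-vllm-host:8000/v1",  # placeholder: your server endpoint
    api_key="your-secret-api-key",              # must match the --api-key value
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=8,
)
print(resp.choices[0].message.content)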
Key Production Parameters
| Parameter | Description | Default | Recommended |
|---|---|---|---|
| `--gpu-memory-utilization` | Fraction of GPU memory to use | 0.9 | 0.90-0.95 |
| `--max-model-len` | Maximum sequence length | Model default | Set based on use case |
| `--max-num-seqs` | Max concurrent sequences | 256 | Based on VRAM |
| `--enable-prefix-caching` | Cache common prefixes | Off | On (production) |
| `--disable-log-requests` | Reduce logging overhead | Off | On (production) |
| `--api-key` | Require API key auth | None | Always set |
Docker Compose for Production
# docker-compose.yml
version: "3.8"
services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    ipc: host  # PyTorch/NCCL need host shared memory for tensor parallelism
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 4
              capabilities: [gpu]
    ports:
      - "8000:8000"
    volumes:
      - huggingface-cache:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    command: >
      --model meta-llama/Llama-3.3-70B-Instruct
      --tensor-parallel-size 4
      --port 8000
      --host 0.0.0.0
      --max-model-len 32768
      --enable-prefix-caching
      --gpu-memory-utilization 0.92
      --api-key ${VLLM_API_KEY}
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
  nginx:
    image: nginx:alpine
    ports:
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
      - ./certs:/etc/nginx/certs
    depends_on:
      - vllm
volumes:
  huggingface-cache:
Monitoring with Prometheus
vLLM exposes Prometheus metrics out of the box:
# Metrics are available at /metrics
curl http://localhost:8000/metrics
Key metrics to monitor:
| Metric | Description |
|---|---|
| `vllm:num_requests_running` | Active requests |
| `vllm:num_requests_waiting` | Queued requests |
| `vllm:avg_generation_throughput_toks_per_s` | Tokens per second |
| `vllm:avg_prompt_throughput_toks_per_s` | Prompt processing speed |
| `vllm:gpu_cache_usage_perc` | KV cache utilization |
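For a quick look without setting up Prometheus, a small script can poll the endpoint and print just the gauges above. A throwaway sketch using only the Python standard library:
# poll_metrics.py - print selected vLLM gauges from the /metrics endpoint
import time
import urllib.request

WATCH = (
    "vllm:num_requests_running",
    "vllm:num_requests_waiting",
    "vllm:gpu_cache_usage_perc",
)

while True:
    body = urllib.request.urlopen("http://localhost:8000/metrics").read().decode()
    for line in body.splitlines():
        # HELP/TYPE comment lines start with '#', so they never match the prefixes
        if line.startswith(WATCH):
            print(line)
    print("---")
    time.sleep(5)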
Step 5: Serve Popular Models
Qwen 2.5 Coder (Best for Coding)
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-Coder-32B-Instruct \
--tensor-parallel-size 2 \
--port 8000
DeepSeek V3 (Best Open-Source General Model)
python -m vllm.entrypoints.openai.api_server \
--model deepseek-ai/DeepSeek-V3 \
--tensor-parallel-size 8 \
--trust-remote-code \
--port 8000
Mistral Large (Strong European Model)
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-Large-Instruct-2411 \
--tensor-parallel-size 4 \
--port 8000
Benchmarking Your Deployment
The vLLM repository ships a serving benchmark script you can point at a running server:
# Benchmark serving throughput (run from a checkout of the vLLM repo)
git clone https://github.com/vllm-project/vllm.git
python vllm/benchmarks/benchmark_serving.py \
--backend vllm \
--base-url http://localhost:8000 \
--model meta-llama/Llama-3.3-70B-Instruct \
--dataset-name random \
--num-prompts 100 \
--request-rate 10
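If you would rather not clone the repo, a rough throughput number can be had with a short asyncio script that fires concurrent requests at the server. This is only an illustrative sketch, not a replacement for the real benchmark:
# quick_bench.py - crude concurrent-request benchmark against a running server
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

async def one_request() -> int:
    resp = await client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct",
        messages=[{"role": "user", "content": "Summarize what vLLM does in three sentences."}],
        max_tokens=128,
    )
    # vLLM reports token usage in the response, like the OpenAI API
    return resp.usage.completion_tokens

async def main(concurrency: int = 20) -> None:
    start = time.perf_counter()
    tokens = await asyncio.gather(*[one_request() for _ in range(concurrency)])
    elapsed = time.perf_counter() - start
    print(f"{sum(tokens)} tokens in {elapsed:.1f}s "
          f"-> {sum(tokens) / elapsed:.0f} tok/s across {concurrency} requests")

asyncio.run(main())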
Expected Throughput (Llama 3.3 70B, 4x A100 80GB)
| Concurrent Users | Throughput | Inter-token Latency (p50) | Inter-token Latency (p99) |
|---|---|---|---|
| 1 | 85 tok/s | 12ms/tok | 18ms/tok |
| 10 | 620 tok/s | 16ms/tok | 35ms/tok |
| 50 | 2,100 tok/s | 24ms/tok | 65ms/tok |
| 100 | 3,200 tok/s | 31ms/tok | 120ms/tok |
Frequently Asked Questions
What GPUs does vLLM support? NVIDIA GPUs with compute capability 7.0+ (V100, T4, A10, A100, H100, L40S, RTX 3090/4090). AMD GPUs with ROCm 6.0+ (MI210, MI250, MI300X). Google TPUs via JAX backend.
Can I serve multiple models on one GPU? Not directly in a single vLLM instance. You can run multiple vLLM instances on different ports, each serving a different model, and use a load balancer to route requests.
How does vLLM compare to Ollama for personal use? Ollama is easier to set up and is great for single-user local inference. vLLM is designed for multi-user serving with high throughput. Use Ollama for personal use and vLLM for team/production deployments.
Does vLLM support streaming? Yes. Set "stream": true in your API request to get token-by-token streaming responses.
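With the OpenAI Python SDK that looks like the following short sketch (swap in whichever model your server is actually serving):
# streaming.py - token-by-token output from a vLLM server
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention in two sentences."}],
    stream=True,
)

for chunk in stream:
    # Each chunk carries a delta with zero or more new characters
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()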
Can I fine-tune models and then serve them with vLLM? Yes. Train with any framework (HuggingFace, Axolotl, etc.) and point vLLM at the saved model directory or LoRA adapter.
Wrapping Up
vLLM is the gold standard for self-hosted LLM serving in 2026. Its PagedAttention architecture, tensor parallelism, and speculative decoding deliver unmatched throughput for production deployments. Start with a simple single-GPU setup, benchmark your workload, then scale to multi-GPU configurations as needed.
For applications that combine LLM text generation with AI-generated media, Hypereal AI provides production-ready APIs for video generation, AI avatars, and image synthesis. Pair your self-hosted vLLM instance with Hypereal's media APIs for a complete AI stack. Get started with 35 free credits.