How to Use Qwen 3 Embedding & Reranker with Ollama (2026)
Set up Qwen 3 embedding models for RAG pipelines locally
Qwen 3 Embedding and Qwen 3 Reranker are Alibaba's latest open-source models purpose-built for retrieval-augmented generation (RAG) pipelines. They deliver state-of-the-art performance on embedding benchmarks while running efficiently on consumer hardware through Ollama.
This guide walks you through setting up both models locally, integrating them into a RAG pipeline, and optimizing performance for production use.
What Are Embedding and Reranker Models?
Before diving into setup, let us clarify what these models do in a RAG pipeline:
| Model | Role | What It Does |
|---|---|---|
| Embedding model | Retrieval | Converts text into numerical vectors for semantic search |
| Reranker model | Refinement | Re-scores retrieved documents for relevance to the query |
| LLM | Generation | Generates the final answer using retrieved context |
A typical RAG pipeline works like this:
Query → Embedding Model → Vector Search → Top-K Results
→ Reranker → Re-ranked Top-N → LLM → Answer
The embedding model finds candidates quickly, and the reranker filters them for precision. Using both together dramatically improves answer quality compared to embeddings alone.
Qwen 3 Embedding: Model Variants
Qwen 3 offers multiple embedding model sizes:
| Model | Parameters | Dimensions | Max Tokens | MTEB Score | VRAM Required |
|---|---|---|---|---|---|
| qwen3-embedding-0.6b | 0.6B | 1024 | 8,192 | 68.2 | ~1.5 GB |
| qwen3-embedding-1.5b | 1.5B | 1536 | 8,192 | 71.8 | ~3 GB |
| qwen3-embedding-4b | 4B | 2560 | 32,768 | 74.1 | ~6 GB |
| qwen3-embedding-8b | 8B | 4096 | 32,768 | 76.3 | ~10 GB |
The 4B model hits the sweet spot between quality and performance for most use cases.
Step 1: Install Ollama
If you do not have Ollama installed:
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows
# Download from https://ollama.com/download
Start the Ollama server:
ollama serve
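If you want to confirm the server is reachable before pulling models, you can query the local API. The snippet below is a small sketch that hits Ollama's /api/tags endpoint (the same listing that ollama list uses); it assumes the default port 11434 and that the requests package is installed:
# Quick connectivity check against the local Ollama server (default port 11434)
import requests

try:
    resp = requests.get("http://localhost:11434/api/tags", timeout=5)
    resp.raise_for_status()
    models = [m["name"] for m in resp.json().get("models", [])]
    print("Ollama is running. Local models:", models or "none yet")
except requests.ConnectionError:
    print("Ollama is not reachable - is `ollama serve` running?")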
Step 2: Pull the Qwen 3 Embedding Model
# Pull the recommended 4B model
ollama pull qwen3-embedding:4b
# Or pull a smaller model for lower VRAM systems
ollama pull qwen3-embedding:1.5b
# Or the largest for maximum quality
ollama pull qwen3-embedding:8b
Verify the model is available:
ollama list
Step 3: Generate Embeddings
Using the Ollama API
Generate embeddings via the REST API:
curl http://localhost:11434/api/embed -d '{
"model": "qwen3-embedding:4b",
"input": "How does retrieval-augmented generation work?"
}'
Response:
{
"model": "qwen3-embedding:4b",
"embeddings": [
[0.0123, -0.0456, 0.0789, ...]
]
}
Using Python
import ollama
# Single text embedding
response = ollama.embed(
model="qwen3-embedding:4b",
input="How does retrieval-augmented generation work?"
)
embedding = response["embeddings"][0]
print(f"Dimensions: {len(embedding)}") # 2560 for 4B model
# Batch embeddings
documents = [
"RAG combines retrieval with generation for accurate answers.",
"Vector databases store embeddings for fast similarity search.",
"Rerankers improve retrieval precision by re-scoring candidates.",
]
response = ollama.embed(
model="qwen3-embedding:4b",
input=documents
)
embeddings = response["embeddings"]
print(f"Generated {len(embeddings)} embeddings")
Step 4: Set Up the Qwen 3 Reranker
Pull the reranker model:
ollama pull qwen3-reranker:4b
The reranker works differently from the embedding model. Instead of producing vectors, it takes a query-document pair and returns a relevance score.
Using the Reranker
import ollama
def rerank(query: str, documents: list[str], model: str = "qwen3-reranker:4b") -> list[dict]:
"""Rerank documents by relevance to the query."""
scored = []
for doc in documents:
# The reranker expects a specific prompt format
prompt = f"Query: {query}\nDocument: {doc}\nRelevance:"
response = ollama.generate(
model=model,
prompt=prompt,
options={"temperature": 0}
)
# Parse the relevance score from the response
try:
score = float(response["response"].strip())
except ValueError:
score = 0.0
scored.append({"document": doc, "score": score})
# Sort by score descending
scored.sort(key=lambda x: x["score"], reverse=True)
return scored
# Example usage
query = "What are the benefits of RAG?"
documents = [
"RAG reduces hallucinations by grounding responses in retrieved data.",
"The weather in Tokyo is currently 22 degrees celsius.",
"Retrieval-augmented generation improves factual accuracy of LLM outputs.",
"Python is a popular programming language for data science.",
]
results = rerank(query, documents)
for r in results:
print(f"Score: {r['score']:.3f} | {r['document'][:60]}")
Step 5: Build a Complete RAG Pipeline
Here is a complete example using Qwen 3 Embedding for retrieval and Qwen 3 Reranker for refinement:
import ollama
import numpy as np
class SimpleRAG:
def __init__(
self,
embed_model: str = "qwen3-embedding:4b",
rerank_model: str = "qwen3-reranker:4b",
llm_model: str = "qwen3:8b",
):
self.embed_model = embed_model
self.rerank_model = rerank_model
self.llm_model = llm_model
self.documents: list[str] = []
self.embeddings: list[list[float]] = []
def add_documents(self, documents: list[str]):
"""Add documents to the knowledge base."""
self.documents.extend(documents)
response = ollama.embed(model=self.embed_model, input=documents)
self.embeddings.extend(response["embeddings"])
print(f"Added {len(documents)} documents. Total: {len(self.documents)}")
def _cosine_similarity(self, a: list[float], b: list[float]) -> float:
a, b = np.array(a), np.array(b)
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
def retrieve(self, query: str, top_k: int = 10) -> list[dict]:
"""Retrieve top-k documents by embedding similarity."""
query_embedding = ollama.embed(
model=self.embed_model, input=query
)["embeddings"][0]
scored = []
for i, doc_embedding in enumerate(self.embeddings):
score = self._cosine_similarity(query_embedding, doc_embedding)
scored.append({"index": i, "document": self.documents[i], "score": score})
scored.sort(key=lambda x: x["score"], reverse=True)
return scored[:top_k]
def rerank(self, query: str, candidates: list[dict], top_n: int = 3) -> list[dict]:
"""Rerank candidates using the reranker model."""
for candidate in candidates:
prompt = f"Query: {query}\nDocument: {candidate['document']}\nRelevance:"
response = ollama.generate(
model=self.rerank_model,
prompt=prompt,
options={"temperature": 0}
)
try:
candidate["rerank_score"] = float(response["response"].strip())
except ValueError:
candidate["rerank_score"] = 0.0
candidates.sort(key=lambda x: x["rerank_score"], reverse=True)
return candidates[:top_n]
def query(self, question: str, top_k: int = 10, top_n: int = 3) -> str:
"""Full RAG pipeline: retrieve, rerank, generate."""
# Step 1: Retrieve
candidates = self.retrieve(question, top_k=top_k)
# Step 2: Rerank
reranked = self.rerank(question, candidates, top_n=top_n)
# Step 3: Generate
context = "\n\n".join([r["document"] for r in reranked])
prompt = f"""Answer the question based on the provided context.
Context:
{context}
Question: {question}
Answer:"""
response = ollama.generate(model=self.llm_model, prompt=prompt)
return response["response"]
# Usage
rag = SimpleRAG()
# Add your knowledge base
rag.add_documents([
"Qwen 3 Embedding produces high-quality vector representations.",
"Ollama lets you run models locally on consumer hardware.",
"Rerankers improve precision by re-scoring retrieved documents.",
"RAG pipelines combine retrieval and generation for better answers.",
"Vector databases like ChromaDB and Qdrant store embeddings efficiently.",
])
# Ask a question
answer = rag.query("How do rerankers improve RAG pipelines?")
print(answer)
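Re-embedding the whole knowledge base on every run gets slow as it grows. One simple option is to persist the documents and embeddings to disk and reload them on startup. The helpers below are a sketch, not part of the class above; the save_index/load_index names and the rag_index file prefix are just illustrative:
# Persist and reload the knowledge base so documents are only embedded once
import json
import numpy as np

def save_index(rag: SimpleRAG, path: str = "rag_index"):
    np.save(f"{path}_embeddings.npy", np.array(rag.embeddings))
    with open(f"{path}_documents.json", "w") as f:
        json.dump(rag.documents, f)

def load_index(rag: SimpleRAG, path: str = "rag_index"):
    rag.embeddings = np.load(f"{path}_embeddings.npy").tolist()
    with open(f"{path}_documents.json") as f:
        rag.documents = json.load(f)

save_index(rag)  # after add_documents()
# later, in a new process:
# rag = SimpleRAG(); load_index(rag)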
Step 6: Use with LangChain or LlamaIndex
LangChain Integration
from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
# Create embeddings instance
embeddings = OllamaEmbeddings(
model="qwen3-embedding:4b",
base_url="http://localhost:11434",
)
# Use with ChromaDB
vectorstore = Chroma.from_texts(
texts=["doc1", "doc2", "doc3"],
embedding=embeddings,
collection_name="my_collection",
)
# Search
results = vectorstore.similarity_search("my query", k=5)
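LangChain does not ship a built-in wrapper for the Qwen 3 reranker, but you can apply the rerank() helper from Step 4 on top of the retrieved candidates. A sketch, assuming the vectorstore above and the earlier rerank() function are both in scope:
# Retrieve a wide candidate set from Chroma, then refine it with the Qwen 3 reranker
query = "my query"
candidates = vectorstore.similarity_search(query, k=10)
reranked = rerank(query, [doc.page_content for doc in candidates])
top_docs = [r["document"] for r in reranked[:3]]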
LlamaIndex Integration
from llama_index.embeddings.ollama import OllamaEmbedding
embed_model = OllamaEmbedding(
model_name="qwen3-embedding:4b",
base_url="http://localhost:11434",
)
# Use in a LlamaIndex pipeline
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(
documents,
embed_model=embed_model,
)
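If you only need retrieval (for example, to feed the candidates into your own reranking step), you can pull a retriever from the index instead of a full query engine. A minimal sketch using LlamaIndex's retriever API:
# Retrieve the most similar chunks directly, without invoking an LLM
retriever = index.as_retriever(similarity_top_k=10)
nodes = retriever.retrieve("How do rerankers improve RAG pipelines?")
for node in nodes:
    print(f"{node.score:.3f}  {node.get_content()[:60]}")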
Performance Tuning
Batch Size Optimization
For large document sets, adjust batch sizes to balance speed and memory:
# Process documents in batches and collect the results
batch_size = 32
all_embeddings = []
for i in range(0, len(documents), batch_size):
    batch = documents[i:i + batch_size]
    response = ollama.embed(model="qwen3-embedding:4b", input=batch)
    all_embeddings.extend(response["embeddings"])
Hardware Recommendations
| VRAM | Recommended Setup |
|---|---|
| 4 GB | Embedding: 0.6B, Reranker: none, LLM: 1.5B |
| 8 GB | Embedding: 1.5B, Reranker: 1.5B, LLM: 4B |
| 16 GB | Embedding: 4B, Reranker: 4B, LLM: 8B |
| 24 GB | Embedding: 8B, Reranker: 4B, LLM: 14B |
Quantization
Ollama models are typically available in quantized formats. For embedding models, higher precision matters more:
# Use Q8 quantization for embedding models (better quality)
ollama pull qwen3-embedding:4b-q8_0
# Q4 is fine for the reranker and LLM
ollama pull qwen3-reranker:4b-q4_K_M
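To see how much a given quantization actually shifts the vectors, embed the same text with two variants and compare them. A quick sketch, assuming you have pulled both the default qwen3-embedding:4b tag and the q8_0 variant shown above:
# Measure how closely the q8_0 variant tracks the default quantization
import ollama
import numpy as np

text = "Rerankers improve retrieval precision by re-scoring candidates."
a = np.array(ollama.embed(model="qwen3-embedding:4b", input=text)["embeddings"][0])
b = np.array(ollama.embed(model="qwen3-embedding:4b-q8_0", input=text)["embeddings"][0])

similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Cosine similarity between variants: {similarity:.4f}")  # closer to 1.0 means less drift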
Troubleshooting
| Issue | Solution |
|---|---|
| "Model not found" | Run ollama pull qwen3-embedding:4b |
| Out of memory | Use a smaller model variant or increase swap |
| Slow embedding speed | Reduce batch size or use GPU acceleration |
| Low retrieval quality | Upgrade to the 8B embedding model |
| Inconsistent reranker scores | Ensure temperature is set to 0 |
Conclusion
Qwen 3 Embedding and Reranker models paired with Ollama give you a fully local, privacy-respecting RAG pipeline that rivals cloud-based solutions. The 4B variants offer an excellent balance of quality and performance, and the larger 8B model pushes into state-of-the-art territory.
If your RAG pipeline involves generating media content like images, videos, or audio based on retrieved information, Hypereal AI provides fast, affordable APIs for AI media generation. Combine local retrieval with cloud-based generation for a powerful end-to-end AI pipeline.