Vector Search retrieval quality guide
This guide provides a systematic approach to improving retrieval quality for real-time RAG, search, and matching applications using Mosaic AI Vector Search. The recommendations are ordered from highest impact/lowest effort to lowest impact/highest effort.
Prerequisites: Establish evaluation framework
Before optimizing retrieval quality, you must have a reproducible evaluation system.
If you don't have evaluation in place, stop here and set it up first. Optimizing without measurement is guesswork.
Define latency requirements
Establish clear latency targets based on your use case:
- RAG agents: Time to First Token (TTFT) target (for example, <2 sec)
- Search bars: End-to-end latency to display results (for example, <100 msec)
Any optimization you try must meet these requirements.
Set up automated evaluation
Use one or more of the following approaches:
- Existing golden dataset: Use your labeled query-answer pairs.
- Synthetic evaluation set: Use Databricks synthetic data generation to auto-generate test cases from your documents.
- Ground-truth free evaluation: Use Databricks Agent Evaluation judges to assess quality without labels.
The key is having some automated way to measure changes - perfect data isn't required. Focus on relative improvements as you test different strategies, not absolute scores. Even a small synthetic dataset can tell you if reranking improves quality by 15% or if hybrid search helps your specific use case.
Choose quality metrics
Choose your quality metrics based on your use case:
If recall matters most (need all relevant information):
- RAG agents: Missing key context leads to incorrect answers or hallucinations.
- Pharma clinical trial matching: Cannot miss eligible patients or relevant studies.
- Financial compliance search: Need all relevant regulations, risk factors, or precedents.
- Manufacturing root cause analysis: Must surface all related incidents and failure patterns.
- Metric to track: Recall@k (for example, recall@10, recall@50).
If precision matters most (need only the most relevant results):
- Entity resolution/fuzzy matching: Matching customer records, supplier names, or product SKUs across systems.
- Financial services deduplication: Identifying duplicate transactions or accounts with high confidence.
- Supply chain part matching: Finding exact or compatible components across catalogs.
- Tech support knowledge base: Engineers need the exact solution in top results.
- Metric to track: Precision@k (for example, precision@3, precision@10).
Balanced use cases (need both good recall and precision):
- M&A due diligence: Can't miss risks (recall) but need relevant docs first (precision).
- Patent prior art search: Comprehensive coverage with most relevant patents prioritized.
- Customer 360 matching: Unifying customer data across multiple systems.
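If you compute these metrics yourself, a minimal sketch (with illustrative helper names and data, not a Databricks API) looks like the following:

def recall_at_k(retrieved_ids, relevant_ids, k):
    # Fraction of all relevant documents that appear in the top-k results
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def precision_at_k(retrieved_ids, relevant_ids, k):
    # Fraction of the top-k results that are relevant
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in set(relevant_ids))
    return hits / k

# Example: one query from a labeled evaluation set
retrieved = ["doc_7", "doc_2", "doc_9", "doc_4", "doc_1"]
relevant = ["doc_2", "doc_4", "doc_8"]
print(recall_at_k(retrieved, relevant, k=5))     # 2 of 3 relevant found -> 0.67
print(precision_at_k(retrieved, relevant, k=5))  # 2 of 5 results relevant -> 0.40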
Step 1: Enable hybrid search
Combine keyword precision with semantic understanding.
When to use:
- Users search with specific terms (product codes, technical terms).
- Need exact match for certain queries.
- Want fallback when semantic search misses obvious keyword matches.
Impact on metrics:
- Improves recall by catching both semantic and keyword matches.
- Improves precision for queries with specific terms.
Implementation: One-line change in Mosaic AI Vector Search.
# Enable hybrid search
results = index.similarity_search(
    query_text="error code E404",
    query_type="HYBRID"  # Combines vector and keyword search
)
For more information, see Query a vector search index.
Step 2: Implement metadata filtering
This is your biggest lever for retrieval quality.
Filtering dramatically reduces search space and improves both precision and recall.
Impact on metrics:
- Dramatically improves precision by eliminating irrelevant results.
- Improves recall within the filtered subset.
- Can reduce search space by 90%+.
Examples
- Technical documentation: Filter by product version, component, or module.
- Car manuals: Filter by make, model, year.
- Customer support: Filter by product line, region, issue category.
Implementation
# Vector Search with metadata filtering
results = index.similarity_search(
    query_text="brake system maintenance",
    filters='make = "Toyota" AND model = "Camry" AND year = 2023',
    num_results=10
)
Dynamic filter selection
Programmatic approach:
# Parse query for filter criteria
def extract_filters(user_query):
    filter_parts = []
    if "Toyota" in user_query:
        filter_parts.append('make = "Toyota"')
    if "2023" in user_query:
        filter_parts.append('year = 2023')
    return " AND ".join(filter_parts) if filter_parts else None
Agent-based filtering with Databricks:
from databricks_ai_bridge.agents.tools.vector_search import VectorSearchTool

# Create the vector search tool
vector_search_tool = VectorSearchTool(
    index_name="catalog.schema.car_manuals_index",
    # Optional: specify columns to return
    columns=["content", "make", "model", "year", "chunk_id"],
    # Optional: set number of results
    num_results=10,
    # Optional: add additional parameters as needed
    additional_parameters={
        "query_type": "HYBRID"  # Enable hybrid search
    }
)

# The tool automatically handles filter generation based on the agent's understanding
# Agent analyzes "brake issues in my 2023 Toyota Camry" and generates appropriate filters

# For LangChain agents:
from langchain.agents import create_react_agent

agent = create_react_agent(
    tools=[vector_search_tool],
    llm=your_llm,
    prompt=your_prompt
)
The agent automatically:
- Extracts relevant entities from the query.
- Generates appropriate SQL-like filter strings.
- Executes the search with both semantic understanding and precise filtering.
Impact: Can reduce search space by 90%+ while improving relevance.
Step 3: Add reranking
One-line change for ~15% quality improvement.
Databricks provides a built-in reranker that's perfect for RAG agents.
Impact on metrics:
- Boosts precision by achieving high recall with fewer candidates.
- Works best when combined with techniques like hybrid search and filtering.
Implementation
# Python SDK
results = index.similarity_search(
    query_text="How to create a Vector Search index",
    num_results=10,
    columns=["id", "text", "parent_doc_summary"],
    reranker={
        "model": "databricks_reranker",
        "parameters": {
            "columns_to_rerank": ["text", "parent_doc_summary"]
        }
    }
)
For more information, see Rerank query results.
When to use
Perfect for:
- RAG agents (latency is dominated by LLM generation).
- Quality-first applications.
- Low-to-moderate QPS (~5 QPS out of the box).
Built-in reranker not suitable for:
- High QPS applications (>5 QPS without additional scaling).
- Real-time search bars requiring <100 msec latency.
- Applications where 1.5s reranking time is unacceptable.
Performance: Reranks 50 results in ~1.5 seconds in typical workloads. As fast as ~250 msec for shorter chunks.
For low-latency/non-RAG use cases
Reranking can still provide significant quality improvements for search bars and high-QPS applications - you just need a faster reranker. Consider deploying a lightweight reranking model (for example, cross-encoder/ms-marco-TinyBERT-L-2-v2) as a custom model on Databricks Model Serving for sub-100 msec reranking.
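The following is a minimal sketch of that approach using the sentence-transformers library; the rerank helper and the expected candidate format are illustrative, and you would wrap this logic in a custom model for Model Serving:

from sentence_transformers import CrossEncoder

# Small cross-encoder; latency depends on chunk length and serving hardware
reranker = CrossEncoder("cross-encoder/ms-marco-TinyBERT-L-2-v2")

def rerank(query, candidates, top_k=10):
    # Score each (query, passage) pair, then sort candidates by relevance
    pairs = [(query, candidate["text"]) for candidate in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [candidate for candidate, _ in ranked[:top_k]]

# candidates would be the parsed results of index.similarity_search(...)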
Step 4: Improve data preparation
This section describes some techniques you can use to improve data preparation: chunking, parsing, adding semantic context, and cleaning data.
Chunking strategy
Chunk size optimization remains an active area of research. Recent work from DeepMind (LIMIT) shows embeddings can fail to capture basic information in long contexts, making this a nuanced decision.
Starting points for experimentation:
# Common configurations to test
small_chunks = 256 # Better for precise fact retrieval
medium_chunks = 512 # Balanced approach
large_chunks = 1024 # More context per chunk
Key trade-offs to consider:
- Smaller chunks: Better localization of specific information, but may lose context.
- Larger chunks: More context preserved, but harder to pinpoint relevant information.
- Context limits: Must fit within LLM context window when retrieving multiple chunks.
More impactful optimizations: Instead of over-optimizing chunk size, focus on:
- Information extraction for metadata: Extract entities, topics, and categories to enable precise filtering.
- High-quality parsing: Use ai_parse_document for clean, structured text.
- Semantic metadata: Add document summaries and section headers to chunks.
Also consider the following advanced approaches. These techniques require more effort but can have a bigger impact:
Semantic chunking: Group sentences by similarity rather than fixed size.
- Use embeddings to find natural semantic boundaries.
- Keeps related ideas together.
- Better context preservation.
- See The ultimate guide to chunking strategies for RAG applications.
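A minimal sketch of semantic chunking, assuming an embed() helper that returns one vector per sentence (for example, a sentence-transformers model or a Databricks embedding endpoint):

import numpy as np

def semantic_chunks(sentences, embed, similarity_threshold=0.75, max_sentences=12):
    # Start a new chunk when adjacent sentences are semantically dissimilar
    if not sentences:
        return []
    embeddings = [np.asarray(embed(s), dtype=float) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, curr, sentence in zip(embeddings, embeddings[1:], sentences[1:]):
        similarity = np.dot(prev, curr) / (np.linalg.norm(prev) * np.linalg.norm(curr))
        if similarity < similarity_threshold or len(current) >= max_sentences:
            chunks.append(" ".join(current))
            current = [sentence]
        else:
            current.append(sentence)
    chunks.append(" ".join(current))
    return chunks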
Parent-child chunking (small-to-big retrieval):
# Record child and parent chunks in your source table
for parent_chunk in create_chunks(doc, size=2048):  # Large for context
    for child_chunk in create_chunks(parent_chunk, size=512):  # Small for precision
        source_table.append({"text": child_chunk, "parent_text": parent_chunk})

# Search children, return parents
results = index.similarity_search(
    query_text="Is attention all you need?",
    num_results=10,
    columns=["text", "parent_text"]
)
See LangChain parent document retriever docs.
Document parsing
For PDFs and complex documents, Databricks recommends using ai_parse_document for high-quality parsing. Poor parsing (missing tables, broken formatting) directly impacts retrieval quality.
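As a rough sketch only (verify the exact ai_parse_document options and output schema for your Databricks Runtime; the volume path is an example):

# The volume path is an example; ai_parse_document returns structured output
# whose exact schema depends on your Databricks Runtime version
parsed = spark.sql("""
    SELECT
        path,
        ai_parse_document(content) AS parsed
    FROM READ_FILES('/Volumes/catalog/schema/raw_docs/', format => 'binaryFile')
""")
display(parsed)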
Enrich with semantic metadata
Add semantic context to improve retrieval.
Why this works:
- Provides additional semantic signal for embedding models.
- Gives rerankers more context for scoring.
- Helps with queries that reference document-level concepts.
Option 1: Include metadata in chunks
# Prepend document summary to each chunk
chunk_with_context = f"""
Document: {doc_title}
Summary: {doc_summary}
Section: {section_name}
{chunk_content}
"""
Option 2: Store as separate metadata columns
# Store semantic metadata for reranker to use
metadata = {
    "doc_summary": "Technical manual for brake system maintenance",
    "section": "Emergency brake adjustment procedures",
    "keywords": ["brake", "safety", "adjustment"]
}
This approach requires downstream processing to leverage the metadata:
- For semantic metadata: Use reranking with the columns_to_rerank parameter to consider these columns.
- For keyword-only metadata: Use hybrid search (full-text mode) to match against these fields.
Data cleaning
- Remove boilerplate (headers, footers, page numbers).
- Preserve document structure (headings, lists, tables).
- Maintain semantic boundaries when chunking.
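A minimal sketch of this kind of cleanup; the regex patterns are examples only and must be tuned to your documents:

import re

def clean_page_text(text):
    # Drop standalone page numbers (for example, "Page 12" or "12")
    text = re.sub(r"^\s*(Page\s+)?\d+\s*$", "", text, flags=re.MULTILINE)
    # Drop a repeated footer observed across pages (example pattern only)
    text = re.sub(r"^\s*Confidential - Do not distribute\s*$", "", text, flags=re.MULTILINE)
    # Collapse runs of blank lines while preserving paragraph boundaries
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()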
Step 5: Query optimization
Query expansion
Generate multiple query variations to improve recall. See LangChain guide.
Impact: Improves recall by finding documents with different terminology.
# Use LLM to expand query with synonyms and related terms
def expand_query(user_query):
    prompt = f"""Generate 3 variations of this search query including synonyms:
Query: {user_query}
Return only the variations, one per line."""
    variations = llm.generate(prompt).split('\n')

    # Search with original + variations
    all_results = []
    for query in [user_query] + variations:
        results = index.similarity_search(query_text=query, num_results=10)
        all_results.extend(results)

    # Deduplicate and return
    return deduplicate_results(all_results)
Example: "car maintenance" also searches "automobile repair", "vehicle servicing", "auto maintenance"
Query reformulation
For complex queries, break down or rephrase. See OpenAI RAG strategies.
- Multi-hop questions → Sequential searches
- Ambiguous queries → Multiple specific searches
- See Decomposition techniques
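A minimal sketch of decomposing a multi-hop question into sequential searches, reusing the illustrative llm.generate and deduplicate_results helpers from the query expansion example:

def decompose_and_search(user_query, num_results=5):
    prompt = f"""Break this question into 2-3 simpler search queries,
one per line, that together answer it:
Question: {user_query}"""
    sub_queries = [q.strip() for q in llm.generate(prompt).split("\n") if q.strip()]

    # Run the sub-queries sequentially and pool the evidence
    all_results = []
    for sub_query in sub_queries:
        results = index.similarity_search(query_text=sub_query, num_results=num_results)
        all_results.extend(results)
    return deduplicate_results(all_results)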
Step 6: Advanced prompting techniques
Prompt optimization
Use automatic prompt optimization techniques such as MIPROv2 or GEPA (available in DSPy) to improve the prompts used for data preparation, query rewriting, or any other step in your retrieval system. Agent Bricks incorporates GEPA for large performance improvements at low cost. See Building state-of-the-art enterprise agents 90x cheaper with automated prompt optimization.
For more information, see Reflective Prompt Evolution with GEPA.
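As a hedged sketch only (DSPy APIs change between releases, so check the current DSPy documentation), optimizing a query-rewriting prompt with MIPROv2 might look roughly like this; the endpoint name and the score_query helper are examples, not part of any Databricks API:

import dspy

# Example endpoint name; any LM supported by DSPy works here
dspy.configure(lm=dspy.LM("databricks/databricks-meta-llama-3-3-70b-instruct"))

class RewriteQuery(dspy.Signature):
    """Rewrite a user question into a concise retrieval query."""
    question = dspy.InputField()
    search_query = dspy.OutputField()

program = dspy.Predict(RewriteQuery)

def retrieval_metric(example, prediction, trace=None):
    # Hypothetical helper: score the rewritten query with your own evaluation,
    # for example recall@k against your labeled evaluation set
    return score_query(prediction.search_query, example)

# trainset is a list of dspy.Example objects built from your evaluation set
optimizer = dspy.MIPROv2(metric=retrieval_metric, auto="light")
optimized_program = optimizer.compile(program, trainset=trainset)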
Step 7: Adaptive retrieval strategies
React agent pattern
Build agents that can intelligently orchestrate retrieval:
- Agent reasons about whether retrieval is needed.
- Can reformulate queries based on initial results.
- Combines retrieval with other tools (calculators, APIs, etc.).
- Retry failed retrievals with modified queries.
- Implement using Databricks Mosaic AI Agent Framework.
Agentic retrieval examples
# Agent decides when to search and what filters to apply
# Based on conversation context and user intent
agent = create_agent(
    tools=[vector_search_tool, calculator, web_search],
    instructions="Retrieve relevant docs only when needed, apply appropriate filters"
)
Step 8: Fine-tune embedding models
First: Diagnose if you have an embedding problem
Quick test: Compare GTE vs OpenAI embeddings on Databricks.
# Test with both embedding models
# Databricks native: gte-large-en-v1.5
gte_results = gte_index.similarity_search(query_text=query)

# OpenAI: text-embedding-3-large (3072 dims)
openai_results = openai_index.similarity_search(query_text=query)

# If OpenAI text-embedding-3-large significantly outperforms GTE:
# - Fine-tuning a smaller model could match or exceed OpenAI quality
# - You have an embedding model problem, not a data problem
Interpretation:
- If text-embedding-3-large performs much better than gte-large-en, consider fine-tuning. You can achieve similar quality with a smaller model.
- If text-embedding-3-large performs approximately the same as gte-large-en, your problem isn't the embedding model. Focus on other optimizations.
When to fine-tune
Fine-tuning is a last resort. Consider it only when all of the following criteria are met:
- You've tried Steps 1-7.
- OpenAI significantly outperforms GTE in your tests.
- You have domain-specific vocabulary or use case.
You don't need labeled training data - you can use synthetic data generation as shown in Databricks' embedding fine-tuning blog.
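A minimal sketch of that synthetic-data idea: generate a (query, passage) pair from each of your chunks with an LLM, then use the pairs as positives for contrastive fine-tuning. The document_chunks and llm.generate helpers and the loss choice are illustrative:

# Generate one synthetic training query per chunk
training_pairs = []
for chunk in document_chunks:
    prompt = f"""Write one realistic user question that this passage answers:
{chunk}"""
    synthetic_query = llm.generate(prompt).strip()
    training_pairs.append({"query": synthetic_query, "positive_passage": chunk})

# training_pairs can then feed a contrastive fine-tuning job, for example with
# sentence-transformers and MultipleNegativesRankingLoss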