Vector Search retrieval quality guide

This guide provides a systematic approach to improving retrieval quality for real-time RAG, search, and matching applications using Mosaic AI Vector Search. The recommendations are ordered from highest impact/lowest effort to lowest impact/highest effort.

Prerequisites: Establish evaluation framework

Before optimizing retrieval quality, you must have a reproducible evaluation system.

important

If you don't have evaluation in place, stop here and set it up first. Optimizing without measurement is guesswork.

Define latency requirements

Establish clear latency targets based on your use case:

  • RAG agents: Time to First Token (TTFT) target (for example, <2 sec)
  • Search bars: End-to-end latency to display results (for example, <100 msec)

Any optimization you try must meet these requirements.
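As a minimal sketch of how to check this, you can time retrieval end to end in a notebook before and after each change. This assumes an existing Vector Search index object named index, as in the examples later in this guide; the query text and result count are placeholders.

Python
import time

# Measure end-to-end retrieval latency for a representative query (placeholder values)
start = time.perf_counter()
results = index.similarity_search(
    query_text="brake system maintenance",
    num_results=10
)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Retrieval latency: {elapsed_ms:.0f} msec")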

Set up automated evaluation

Set up an automated way to measure retrieval quality as you iterate; perfect evaluation data isn't required. Focus on relative improvements as you test different strategies, not absolute scores. Even a small synthetic dataset can tell you whether reranking improves quality by 15% or whether hybrid search helps your specific use case.
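As a minimal sketch, the following harness computes recall@k and precision@k over a small, hand-built evaluation set. The queries, chunk IDs, and the search_fn helper are hypothetical placeholders; swap in your own index call and ID extraction logic.

Python
# Minimal evaluation harness for comparing retrieval strategies (hypothetical data)
def recall_at_k(retrieved_ids, relevant_ids, k=10):
    return len(set(retrieved_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def precision_at_k(retrieved_ids, relevant_ids, k=10):
    return len(set(retrieved_ids[:k]) & set(relevant_ids)) / k

# A handful of synthetic query -> relevant chunk ID pairs is enough to compare strategies
eval_set = [
    {"query": "brake system maintenance", "relevant_ids": ["chunk_12", "chunk_87"]},
    {"query": "replace cabin air filter", "relevant_ids": ["chunk_233"]},
]

def evaluate(search_fn, k=10):
    # search_fn(query, k) should return a ranked list of chunk IDs from your index
    recalls, precisions = [], []
    for example in eval_set:
        retrieved_ids = search_fn(example["query"], k)
        recalls.append(recall_at_k(retrieved_ids, example["relevant_ids"], k))
        precisions.append(precision_at_k(retrieved_ids, example["relevant_ids"], k))
    return sum(recalls) / len(recalls), sum(precisions) / len(precisions)

Run evaluate once per strategy (for example, with and without hybrid search) and compare the relative scores.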

Choose quality metrics

Choose your quality metrics based on your use case:

If recall matters most (need all relevant information):

  • RAG agents: Missing key context leads to incorrect answers or hallucinations.
  • Pharma clinical trial matching: Cannot miss eligible patients or relevant studies.
  • Financial compliance search: Need all relevant regulations, risk factors, or precedents.
  • Manufacturing root cause analysis: Must surface all related incidents and failure patterns.
  • Metric to track: Recall@k (for example, recall@10, recall@50).

If precision matters most (need only the most relevant results):

  • Entity resolution/fuzzy matching: Matching customer records, supplier names, or product SKUs across systems.
  • Financial services deduplication: Identifying duplicate transactions or accounts with high confidence.
  • Supply chain part matching: Finding exact or compatible components across catalogs.
  • Tech support knowledge base: Engineers need the exact solution in top results.
  • Metric to track: Precision@k (for example, precision@3, precision@10).

Balanced use cases (need both good recall and precision):

  • M&A due diligence: Can't miss risks (recall) but need relevant docs first (precision).
  • Patent prior art search: Comprehensive coverage with most relevant patents prioritized.
  • Customer 360 matching: Unifying customer data across multiple systems.

Step 1: Enable hybrid search

Hybrid search combines keyword precision with semantic understanding.

When to use:

  • Users search with specific terms (product codes, technical terms).
  • Need exact match for certain queries.
  • Want fallback when semantic search misses obvious keyword matches.

Impact on metrics:

  • Improves recall by catching both semantic and keyword matches.
  • Improves precision for queries with specific terms.

Implementation: One-line change in Mosaic AI Vector Search.

Python
# Enable hybrid search
results = index.similarity_search(
    query_text="error code E404",
    query_type="HYBRID"  # Combines vector and keyword search
)

For more information, see Query a vector search index.

Step 2: Implement metadata filtering

This is your biggest lever for retrieval quality.

Filtering dramatically reduces search space and improves both precision and recall.

Impact on metrics:

  • Dramatically improves precision by eliminating irrelevant results.
  • Improves recall within the filtered subset.
  • Can reduce search space by 90%+.

Examples

  • Technical documentation: Filter by product version, component, or module.
  • Car manuals: Filter by make, model, year.
  • Customer support: Filter by product line, region, issue category.

Implementation

Python
# Vector Search with metadata filtering
results = index.similarity_search(
    query_text="brake system maintenance",
    filters='make = "Toyota" AND model = "Camry" AND year = 2023',
    num_results=10
)

Dynamic filter selection

Programmatic approach:

Python
# Parse query for filter criteria
def extract_filters(user_query):
    filter_parts = []
    if "Toyota" in user_query:
        filter_parts.append('make = "Toyota"')
    if "2023" in user_query:
        filter_parts.append('year = 2023')
    return " AND ".join(filter_parts) if filter_parts else None

Agent-based filtering with Databricks:

Python
from databricks_ai_bridge.agents.tools.vector_search import VectorSearchTool

# Create the vector search tool
vector_search_tool = VectorSearchTool(
    index_name="catalog.schema.car_manuals_index",
    # Optional: specify columns to return
    columns=["content", "make", "model", "year", "chunk_id"],
    # Optional: set number of results
    num_results=10,
    # Optional: add additional parameters as needed
    additional_parameters={
        "query_type": "HYBRID"  # Enable hybrid search
    }
)

# The tool automatically handles filter generation based on the agent's understanding
# Agent analyzes "brake issues in my 2023 Toyota Camry" and generates appropriate filters

# For LangChain agents:
from langchain.agents import create_react_agent

agent = create_react_agent(
    tools=[vector_search_tool],
    llm=your_llm,
    prompt=your_prompt
)

The agent automatically:

  1. Extracts relevant entities from the query.
  2. Generates appropriate SQL-like filter strings.
  3. Executes the search with both semantic understanding and precise filtering.

Impact: Can reduce search space by 90%+ while improving relevance.

Step 3: Add reranking

One-line change for ~15% quality improvement.

Databricks provides a built-in reranker that's perfect for RAG agents.

Impact on metrics:

  • Boosts precision at the top of the results, so you can achieve high recall with fewer candidates.
  • Works best when combined with techniques like hybrid search and filtering.

Implementation

Python
# Python SDK
results = index.similarity_search(
    query_text="How to create a Vector Search index",
    num_results=10,
    columns=["id", "text", "parent_doc_summary"],
    reranker={
        "model": "databricks_reranker",
        "parameters": {
            "columns_to_rerank": ["text", "parent_doc_summary"]
        }
    }
)

For more information, see Rerank query results.

When to use

Perfect for:

  • RAG agents (latency is dominated by LLM generation).
  • Quality-first applications.
  • Low-to-moderate QPS (~5 QPS out of the box).

Built-in reranker not suitable for:

  • High QPS applications (>5 QPS without additional scaling).
  • Real-time search bars requiring <100 msec latency.
  • Applications where 1.5s reranking time is unacceptable.

Performance: The built-in reranker processes 50 results in ~1.5 seconds in typical workloads, and as fast as ~250 msec for shorter chunks.

For low-latency/non-RAG use cases

Reranking can still provide significant quality improvements for search bars and high-QPS applications - you just need a faster reranker. Consider deploying a lightweight reranking model (for example, cross-encoder/ms-marco-TinyBERT-L-2-v2) as a custom model on Databricks Model Serving for sub-100 msec reranking.
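As a rough sketch of the scoring logic such a model would serve, the following uses the sentence-transformers CrossEncoder class locally; in production you would package this behind a Model Serving endpoint and call it after retrieval. The query and candidate documents below are placeholders.

Python
from sentence_transformers import CrossEncoder

# Lightweight cross-encoder suitable for low-latency reranking
reranker = CrossEncoder("cross-encoder/ms-marco-TinyBERT-L-2-v2")

def rerank(query, docs, top_k=5):
    # Score each (query, document) pair and keep the highest-scoring documents
    scores = reranker.predict([(query, doc) for doc in docs])
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

candidates = ["Adjust the parking brake cable under the center console.",
              "Replace the cabin air filter behind the glove box."]
top_docs = rerank("emergency brake adjustment", candidates)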

Step 4: Improve data preparation

This section describes some techniques you can use to improve data preparation: chunking, parsing, adding semantic context, and cleaning data.

Chunking strategy

Chunk size optimization remains an active area of research. Recent work from DeepMind (LIMIT) shows embeddings can fail to capture basic information in long contexts, making this a nuanced decision.

Starting points for experimentation:

Python
# Common configurations to test
small_chunks = 256 # Better for precise fact retrieval
medium_chunks = 512 # Balanced approach
large_chunks = 1024 # More context per chunk

Key trade-offs to consider:

  • Smaller chunks: Better localization of specific information, but may lose context.
  • Larger chunks: More context preserved, but harder to pinpoint relevant information.
  • Context limits: Must fit within LLM context window when retrieving multiple chunks.

More impactful optimizations: Instead of over-optimizing chunk size, focus on:

  1. Information extraction for metadata: Extract entities, topics, and categories to enable precise filtering (see the sketch after this list).
  2. High-quality parsing: Use ai_parse_document for clean, structured text.
  3. Semantic metadata: Add document summaries and section headers to chunks.
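As a sketch of the first item, you can use the ai_extract SQL function to pull entities out of chunk text and store them as filterable metadata columns. The table name and label list below are hypothetical.

Python
# Extract filterable entities from chunk text (hypothetical table and labels)
enriched = spark.sql("""
    SELECT
      chunk_id,
      text,
      ai_extract(text, array('make', 'model', 'component')) AS entities
    FROM catalog.schema.car_manual_chunks
""")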

Also consider the following advanced approaches. These techniques require more effort but can have a bigger impact:

Semantic chunking: Group sentences by semantic similarity rather than by a fixed size, as in the sketch below.
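A minimal sketch of this idea, assuming a sentence-transformers embedding model and a simple greedy grouping rule (both are illustrative choices, not a prescribed implementation):

Python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

def semantic_chunks(sentences, threshold=0.6):
    # Greedily start a new chunk when consecutive sentences drift apart semantically
    embeddings = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = float(np.dot(embeddings[i - 1], embeddings[i]))
        if similarity < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks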

Parent-child chunking (small-to-big retrieval):

Python
# Record child and parent chunks in your source table
for parent_chunk in create_chunks(doc, size=2048):  # Large for context
    for child_chunk in create_chunks(parent_chunk, size=512):  # Small for precision
        source_table.append({"text": child_chunk, "parent_text": parent_chunk})

# Search children, return parents
results = index.similarity_search(
    query_text="Is attention all you need?",
    num_results=10,
    columns=["text", "parent_text"]
)

See LangChain parent document retriever docs.

Document parsing

For PDFs and complex documents, Databricks recommends using ai_parse_document for high-quality parsing. Poor parsing (missing tables, broken formatting) directly impacts retrieval quality.
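For example, a batch parsing job might look like the following sketch, which reads PDFs from a Unity Catalog volume (the path is a placeholder) and applies ai_parse_document:

Python
# Parse PDFs from a Unity Catalog volume (placeholder path)
parsed = spark.sql("""
    SELECT
      path,
      ai_parse_document(content) AS parsed
    FROM READ_FILES('/Volumes/catalog/schema/raw_docs/', format => 'binaryFile')
""")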

Enrich with semantic metadata

Add semantic context to improve retrieval.

Why this works:

  • Provides additional semantic signal for embedding models.
  • Gives rerankers more context for scoring.
  • Helps with queries that reference document-level concepts.

Option 1: Include metadata in chunks

Python
# Prepend document summary to each chunk
chunk_with_context = f"""
Document: {doc_title}
Summary: {doc_summary}
Section: {section_name}
{chunk_content}
"""

Option 2: Store as separate metadata columns

Python
# Store semantic metadata for reranker to use
metadata = {
    "doc_summary": "Technical manual for brake system maintenance",
    "section": "Emergency brake adjustment procedures",
    "keywords": ["brake", "safety", "adjustment"]
}

important

This approach requires downstream processing to leverage the metadata:

  • For semantic metadata: Use reranking with the columns_to_rerank parameter to consider these columns.
  • For keyword-only metadata: Use hybrid search (full-text mode) to match against these fields.

Data cleaning

  • Remove boilerplate (headers, footers, page numbers); see the cleaning sketch after this list.
  • Preserve document structure (headings, lists, tables).
  • Maintain semantic boundaries when chunking.
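A minimal cleaning sketch, assuming simple regex patterns for page numbers and a repeated header (both patterns are hypothetical and should be tailored to your documents):

Python
import re

def clean_page(text):
    # Drop standalone page numbers (hypothetical pattern)
    text = re.sub(r"^\s*Page \d+ of \d+\s*$", "", text, flags=re.MULTILINE)
    # Drop a repeated header line (hypothetical pattern)
    text = re.sub(r"^ACME Corp Confidential\s*$", "", text, flags=re.MULTILINE)
    # Collapse excess blank lines while preserving paragraph breaks
    return re.sub(r"\n{3,}", "\n\n", text).strip()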

Step 5: Query optimization

Query expansion

Generate multiple query variations to improve recall. See the LangChain guide.

Impact: Improves recall by finding documents with different terminology.

Python
# Use LLM to expand query with synonyms and related terms
def expand_query(user_query):
    prompt = f"""Generate 3 variations of this search query including synonyms:
Query: {user_query}
Return only the variations, one per line."""

    variations = llm.generate(prompt).split('\n')

    # Search with original + variations
    all_results = []
    for query in [user_query] + variations:
        results = index.similarity_search(query_text=query, num_results=10)
        all_results.extend(results)

    # Deduplicate and return
    return deduplicate_results(all_results)

Example: "car maintenance" also searches "automobile repair", "vehicle servicing", "auto maintenance"

Query reformulation

For complex queries, break them down or rephrase them. See OpenAI RAG strategies. A minimal decomposition sketch follows the list below.

  • Multi-hop questions → Sequential searches
  • Ambiguous queries → Multiple specific searches
  • See Decomposition techniques
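The following sketch reuses the hypothetical llm helper from the query expansion example above to split a multi-hop question into sequential searches:

Python
# Break a complex question into sequential sub-queries, then search each one
def decompose_and_search(user_query):
    prompt = f"""Break this question into 2-3 simpler search queries, one per line:
Question: {user_query}"""
    sub_queries = [q.strip() for q in llm.generate(prompt).split("\n") if q.strip()]

    all_results = []
    for sub_query in sub_queries:
        results = index.similarity_search(query_text=sub_query, num_results=5)
        all_results.append((sub_query, results))
    return all_results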

Step 6: Advanced prompting techniques

Prompt optimization

Use automatic prompt optimization techniques like MIPROv2 or GEPA (available in DSPy) to improve your prompts used for data preparation, query rewriting, or anywhere in your retrieval system. Agent Bricks incorporates GEPA for large performance improvements at low cost. See Building state-of-the-art enterprise agents 90x cheaper with automated prompt optimization.

For more information, see Reflective Prompt Evolution with GEPA.
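As an illustrative sketch only, assuming a recent DSPy release, a retrieval-quality metric function, and a small training set of example queries (retrieval_metric, trainset, and the endpoint name are placeholders not defined here):

Python
import dspy
from dspy.teleprompt import MIPROv2

# Placeholder Databricks-served model endpoint
dspy.configure(lm=dspy.LM("databricks/databricks-meta-llama-3-3-70b-instruct"))

class RewriteQuery(dspy.Signature):
    """Rewrite a user question into a retrieval-friendly search query."""
    user_question: str = dspy.InputField()
    search_query: str = dspy.OutputField()

program = dspy.Predict(RewriteQuery)

# retrieval_metric is assumed to follow DSPy's metric signature and score retrieval
# quality (for example, recall@k from your evaluation harness) for a rewritten query
optimizer = MIPROv2(metric=retrieval_metric, auto="light")
optimized_program = optimizer.compile(program, trainset=trainset)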

Step 7: Adaptive retrieval strategies

ReAct agent pattern

Build agents that can intelligently orchestrate retrieval:

  • Agent reasons about whether retrieval is needed.
  • Can reformulate queries based on initial results.
  • Combines retrieval with other tools (calculators, APIs, etc.).
  • Retries failed retrievals with modified queries.
  • Implement using Databricks Mosaic AI Agent Framework.

Agentic retrieval examples

Python
# Agent decides when to search and what filters to apply
# based on conversation context and user intent
agent = create_agent(
    tools=[vector_search_tool, calculator, web_search],
    instructions="Retrieve relevant docs only when needed, apply appropriate filters"
)

Step 8: Fine-tune embedding models

First: Diagnose if you have an embedding problem

Quick test: Compare GTE vs OpenAI embeddings on Databricks.

Python
# Test with both embedding models
# Databricks native: gte-large-en-v1.5
gte_results = gte_index.similarity_search(query)

# OpenAI: text-embedding-3-large (3072 dims)
openai_results = openai_index.similarity_search(query)

# If OpenAI text-embedding-3-large significantly outperforms GTE:
# - Fine-tuning a smaller model could match or exceed OpenAI quality
# - You have an embedding model problem, not a data problem

Interpretation:

  • If text-embedding-3-large performs much better than gte-large-en, consider fine-tuning. You can achieve similar quality with a smaller model.
  • If text-embedding-3-large performs approximately the same as gte-large-en, your problem isn't the embedding model. Focus on other optimizations.

When to fine-tune

important

Fine-tuning should be treated as a last resort. Consider it only when all of the following criteria are met:

  1. You've tried Steps 1-7.
  2. OpenAI significantly outperforms GTE in your tests.
  3. You have a domain-specific vocabulary or use case.
note

You don't need labeled training data - you can use synthetic data generation as shown in Databricks' embedding fine-tuning blog.
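As a rough sketch of what that looks like with synthetic (query, passage) pairs and the sentence-transformers training API (the model name, pairs, and hyperparameters below are illustrative):

Python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Synthetic (query, relevant passage) pairs, for example generated by an LLM from your corpus
train_examples = [
    InputExample(texts=["how do I adjust the emergency brake",
                        "Emergency brake adjustment: loosen the locknut, then turn the adjuster."]),
    InputExample(texts=["torque spec for lug nuts",
                        "Install the wheel and tighten the lug nuts to the specified torque."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

model = SentenceTransformer("Alibaba-NLP/gte-large-en-v1.5", trust_remote_code=True)
loss = losses.MultipleNegativesRankingLoss(model)  # uses in-batch negatives

model.fit(train_objectives=[(train_dataloader, loss)], epochs=1, warmup_steps=100)
model.save("gte-finetuned-domain")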