Build and trace retriever tools for unstructured data

Use the Mosaic AI Agent Framework to build tools that let AI agents query unstructured data, such as a collection of documents. This page shows how to:

  • Locally develop and test Vector Search retriever tools with Databricks AI Bridge
  • Productionize retrievers as Unity Catalog functions
  • Build retrievers that query vector indexes hosted outside of Databricks
  • Add MLflow tracing to retrievers and set the retriever schema

To learn more about agent tools, see AI agent tools.

Locally develop Vector Search retriever tools with AI Bridge

The fastest way to start building a Databricks Vector Search retriever tool is to develop and test it locally using Databricks AI Bridge packages like databricks-langchain and databricks-openai.

Install the latest version of databricks-langchain, which includes Databricks AI Bridge.

Bash
%pip install --upgrade databricks-langchain

The following code prototypes a retriever tool that queries a hypothetical vector search index and binds it to an LLM locally so you can test its tool-calling behavior.

Provide a descriptive tool_description to help the agent understand the tool and determine when to invoke it.

Python
from databricks_langchain import VectorSearchRetrieverTool, ChatDatabricks

# Initialize the retriever tool.
vs_tool = VectorSearchRetrieverTool(
    index_name="catalog.schema.my_databricks_docs_index",
    tool_name="databricks_docs_retriever",
    tool_description="Retrieves information about Databricks products from official Databricks documentation.",
)

# Run a query against the vector search index locally for testing
vs_tool.invoke("Databricks Agent Framework?")

# Bind the retriever tool to your LangChain LLM of choice
llm = ChatDatabricks(endpoint="databricks-claude-3-7-sonnet")
llm_with_tools = llm.bind_tools([vs_tool])

# Chat with your LLM to test the tool-calling functionality
llm_with_tools.invoke("Based on the Databricks documentation, what is Databricks Agent Framework?")
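The invoke call returns an AIMessage whose tool_calls field contains any retriever calls the model decided to make. The following sketch shows the standard LangChain pattern for executing those tool calls and sending the results back to the model for a grounded answer; it reuses vs_tool and llm_with_tools from above.

Python
from langchain_core.messages import HumanMessage

question = "Based on the Databricks documentation, what is Databricks Agent Framework?"
messages = [HumanMessage(content=question)]

# Ask the model; it may respond with tool calls instead of a final answer.
ai_msg = llm_with_tools.invoke(messages)
messages.append(ai_msg)

# Execute each requested tool call; invoking a tool with a tool call
# returns a ToolMessage that can be appended to the conversation.
for tool_call in ai_msg.tool_calls:
    messages.append(vs_tool.invoke(tool_call))

# Ask the model again, now grounded in the retrieved documents.
print(llm_with_tools.invoke(messages).content)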

For direct-access indexes or Delta Sync indexes with self-managed embeddings, you must configure the VectorSearchRetrieverTool with a custom embedding model and text column. See Options for providing embeddings.

The following example shows you how to configure a VectorSearchRetrieverTool with columns and embedding keys.

Python
from databricks_langchain import DatabricksEmbeddings, VectorSearchRetrieverTool

embedding_model = DatabricksEmbeddings(
    endpoint="databricks-bge-large-en",
)

vs_tool = VectorSearchRetrieverTool(
    index_name="catalog.schema.index_name",  # Index name in the format 'catalog.schema.index'
    num_results=5,  # Max number of documents to return
    columns=["primary_key", "text_column"],  # List of columns to include in the search
    filters={"text_column LIKE": "Databricks"},  # Filters to apply to the query
    query_type="ANN",  # Query type ("ANN" or "HYBRID")
    tool_name="name_of_the_tool",  # Used by the LLM to identify and select the tool
    tool_description="Purpose of the tool",  # Used by the LLM to understand the purpose of the tool
    text_column="text_column",  # Text column for embeddings. Required for direct-access indexes or Delta Sync indexes with self-managed embeddings.
    embedding=embedding_model,  # Embedding model. Required for direct-access indexes or Delta Sync indexes with self-managed embeddings.
)

For additional details, see the API docs for VectorSearchRetrieverTool.

Once your local tool is ready, you can productionize it directly as part of your agent code, or migrate it to a Unity Catalog function, which provides better discoverability and governance but has certain limitations (see the caveats later in this section).

The following section shows you how to migrate the retriever to a Unity Catalog function.

Vector Search retriever tool with Unity Catalog functions

You can create a Unity Catalog function that wraps a Mosaic AI Vector Search index query. This approach:

  • Supports production use cases with governance and discoverability
  • Uses the vector_search() SQL function under the hood
  • Supports automatic MLflow tracing
    • You must align the function's output to the MLflow retriever schema by using the page_content and metadata aliases.
    • Any additional metadata columns must be added to the metadata column using the SQL map function, rather than as top-level output keys.

Run the following code in a notebook or SQL editor to create the function:

SQL
CREATE OR REPLACE FUNCTION main.default.databricks_docs_vector_search (
  -- The agent uses this comment to determine how to generate the query string parameter.
  query STRING
    COMMENT 'The query string for searching Databricks documentation.'
) RETURNS TABLE
-- The agent uses this comment to determine when to call this tool. It describes the types of documents and information contained within the index.
COMMENT 'Executes a search on Databricks documentation to retrieve text documents most relevant to the input query.'
RETURN
  SELECT
    chunked_text AS page_content,
    map('doc_uri', url, 'chunk_id', chunk_id) AS metadata
  FROM
    vector_search(
      -- Specify your Vector Search index name here
      index => 'catalog.schema.databricks_docs_index',
      query => query,
      num_results => 5
    )
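Before wiring the function into an agent, you can sanity-check it directly. A minimal sketch, assuming a Databricks notebook where spark and display are available:

Python
# Call the table-valued function directly to inspect its output.
results = spark.sql(
    "SELECT * FROM main.default.databricks_docs_vector_search('What is the Agent Framework?')"
)
display(results)  # Expect page_content and metadata columns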

To use this retriever tool in your AI agent, wrap it with UCFunctionToolkit. The toolkit enables automatic MLflow tracing by generating RETRIEVER span types in MLflow logs.

Python
from unitycatalog.ai.langchain.toolkit import UCFunctionToolkit

toolkit = UCFunctionToolkit(
    function_names=["main.default.databricks_docs_vector_search"]
)
tools = toolkit.tools
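The resulting tools are standard LangChain tools, so you can bind them to an LLM and test tool calling just as in the local development example above. A brief sketch:

Python
from databricks_langchain import ChatDatabricks

llm = ChatDatabricks(endpoint="databricks-claude-3-7-sonnet")
llm_with_tools = llm.bind_tools(tools)

llm_with_tools.invoke("Based on the Databricks documentation, what is Databricks Agent Framework?")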

Unity Catalog retriever tools have the following caveats:

  • SQL clients may limit the maximum number of rows or bytes returned. To prevent data truncation, you should truncate column values returned by the UDF. For example, you could use substring(chunked_text, 0, 8192) to reduce the size of large content columns and avoid row truncation during execution.
  • Since this tool is a wrapper for the vector_search() function, it is subject to the same limitations as the vector_search() function. See Limitations.

For more information about UCFunctionToolkit, see the Unity Catalog documentation.

Retriever that queries a vector index hosted outside of Databricks

If your vector index is hosted outside of Databricks, you can create a Unity Catalog connection to connect to the external service and use the connection in your agent code. See Connect AI agent tools to external services.

The following example creates a retriever for a PyFunc-flavored agent that calls a vector index hosted outside of Databricks.

  1. Create a Unity Catalog connection to the external service, in this case, Azure AI Search.

    SQL
    CREATE CONNECTION ${connection_name}
    TYPE HTTP
    OPTIONS (
      host 'https://example.search.windows.net',
      base_path '/',
      bearer_token secret('<secret-scope>', '<secret-key>')
    );
  2. Define the retriever tool in agent code using the Unity Catalog connection. This example uses MLflow decorators to enable agent tracing.

    note

    To conform to the MLflow retriever schema, the retriever function should return a List[Document] object and use the metadata field in the Document class to add additional attributes to the returned document, such as doc_uri and similarity_score. See MLflow Document.

    Python
    import json
    from dataclasses import asdict
    from typing import Any, List

    import mlflow
    from mlflow.entities import Document


    class VectorSearchRetriever:
        """
        Class using Databricks Vector Search to retrieve relevant documents.
        """

        def __init__(self):
            self.azure_search_index = "hotels_vector_index"
            # Name of the Unity Catalog connection created in the previous step
            self.connection_name = "<connection_name>"

        @mlflow.trace(span_type="RETRIEVER", name="vector_search")
        def __call__(self, query_vector: List[Any], score_threshold=None) -> List[Document]:
            """
            Performs vector search to retrieve relevant chunks.

            Args:
                query_vector: Embedding vector for the search query.
                score_threshold: Minimum score a result must have to be returned.

            Returns:
                List of retrieved Documents.
            """
            from databricks.sdk import WorkspaceClient
            from databricks.sdk.service.serving import ExternalFunctionRequestHttpMethod

            payload = {
                "count": True,
                "select": "HotelId, HotelName, Description, Category",
                "vectorQueries": [
                    {
                        "vector": query_vector,
                        "k": 7,
                        "fields": "DescriptionVector",
                        "kind": "vector",
                        "exhaustive": True,
                    }
                ],
            }

            response = (
                WorkspaceClient()
                .serving_endpoints.http_request(
                    conn=self.connection_name,
                    method=ExternalFunctionRequestHttpMethod.POST,
                    path=f"indexes/{self.azure_search_index}/docs/search?api-version=2023-07-01-Preview",
                    json=payload,
                )
                .text
            )

            documents = self.convert_vector_search_to_documents(
                json.loads(response), score_threshold
            )
            return [asdict(doc) for doc in documents]

        @mlflow.trace(span_type="PARSER")
        def convert_vector_search_to_documents(
            self, vs_results, score_threshold
        ) -> List[Document]:
            docs = []

            for item in vs_results.get("value", []):
                score = item.get("@search.score", 0)

                if score_threshold is None or score >= score_threshold:
                    metadata = {
                        "score": score,
                        "HotelName": item.get("HotelName"),
                        "Category": item.get("Category"),
                    }

                    doc = Document(
                        page_content=item.get("Description", ""),
                        metadata=metadata,
                        id=item.get("HotelId"),
                    )
                    docs.append(doc)

            return docs
  3. To run the retriever, execute the following Python code. You can optionally include Vector Search filters in the request to filter results.

    Python
    retriever = VectorSearchRetriever()
    query = [0.01944167, 0.0040178085, ...]  # Full embedding vector trimmed for brevity
    results = retriever(query, score_threshold=0.1)

Add tracing to a retriever

Add MLflow tracing to monitor and debug your retriever. Tracing lets you view inputs, outputs, and metadata for each step of execution.

The previous example adds the @mlflow.trace decorator to both the __call__ and parsing methods. The decorator creates a span that starts when the function is invoked and ends when it returns. MLflow automatically records the function's input and output and any exceptions raised.

note

LangChain, LlamaIndex, and OpenAI library users can use MLflow auto logging in addition to manually defining traces with the decorator. See Add MLflow Tracing to AI agents.
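For example, LangChain users can enable autologging with a single call. A minimal sketch:

Python
import mlflow

# Automatically generate traces for LangChain components, including retriever tools
mlflow.langchain.autolog()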

Python
import mlflow
from mlflow.entities import Document

## This code snippet has been truncated for brevity, see the full retriever example above
class VectorSearchRetriever:
    ...

    # Create a RETRIEVER span. The span name must match the retriever schema name.
    @mlflow.trace(span_type="RETRIEVER", name="vector_search")
    def __call__(...) -> List[Document]:
        ...

    # Create a PARSER span.
    @mlflow.trace(span_type="PARSER")
    def convert_vector_search_to_documents(...) -> List[Document]:
        ...

To ensure downstream applications such as Agent Evaluation and the AI Playground render the retriever trace correctly, make sure the decorator meets the following requirements:

  • Use span_type="RETRIEVER" and ensure the function returns a List[Document] object. See Retriever spans.
  • The trace name and the retriever_schema name must match to configure the trace correctly. See the following section to learn how to set the retriever schema.

Set retriever schema to ensure MLflow compatibility

If the trace returned from the retriever (or any span with span_type="RETRIEVER") does not conform to MLflow's standard retriever schema, you must manually map the returned fields to MLflow's expected fields. This ensures that MLflow can properly trace your retriever and render traces in downstream applications.

To set the retriever schema manually:

  1. Call mlflow.models.set_retriever_schema when you define your agent. Use set_retriever_schema to map the column names in the returned table to MLflow's expected fields such as primary_key, text_column, and doc_uri.

    Python
    # Define the retriever's schema by providing your column names
    mlflow.models.set_retriever_schema(
        name="vector_search",
        primary_key="chunk_id",
        text_column="text_column",
        doc_uri="doc_uri",
        # other_columns=["column1", "column2"],
    )
  2. Specify additional columns in your retriever's schema by providing a list of column names with the other_columns field.

  3. If you have multiple retrievers, define a schema for each one by using a unique name per retriever schema, as in the sketch following this list.
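The following sketch defines schemas for two retrievers; the schema names and column names are illustrative and must match the corresponding RETRIEVER span names in your agent code:

Python
import mlflow

# Schema for a documentation retriever
mlflow.models.set_retriever_schema(
    name="docs_vector_search",
    primary_key="chunk_id",
    text_column="chunked_text",
    doc_uri="doc_uri",
)

# Schema for a second, independent retriever
mlflow.models.set_retriever_schema(
    name="support_tickets_search",
    primary_key="ticket_id",
    text_column="ticket_body",
    doc_uri="ticket_url",
)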

The retriever schema set during agent creation affects downstream applications and workflows, such as the review app and evaluation sets. Specifically, the doc_uri column serves as the primary identifier for documents returned by the retriever.

  • The review app displays the doc_uri to help reviewers assess responses and trace document origins. See Review App UI.
  • Evaluation sets use doc_uri to compare retriever results against predefined evaluation datasets to determine the retriever's recall and precision. See Evaluation sets.

Next steps

After building your retriever, the final step is to integrate it into an AI agent definition. To learn how to add a tool to an agent, see Add Unity Catalog tools to agents.