Build and trace retriever tools for unstructured data
Use the Mosaic AI Agent Framework to build tools that let AI agents query unstructured data such as a collection of documents. This page shows how to:
- Develop retrievers locally
- Create a retriever using Unity Catalog functions
- Query external vector indexes
- Add MLflow tracing for observability
To learn more about agent tools, see AI agent tools.
Locally develop Vector Search retriever tools with AI Bridge
The fastest way to start building a Databricks Vector Search retriever tool is to develop and test it locally using Databricks AI Bridge packages like `databricks-langchain` and `databricks-openai`.
- LangChain/LangGraph
- OpenAI
Install the latest version of `databricks-langchain`, which includes Databricks AI Bridge.
```
%pip install --upgrade databricks-langchain
```
The following code prototypes a retriever tool that queries a hypothetical vector search index and binds it to an LLM locally so you can test its tool-calling behavior. Provide a descriptive `tool_description` to help the agent understand the tool and determine when to invoke it.
```python
from databricks_langchain import VectorSearchRetrieverTool, ChatDatabricks

# Initialize the retriever tool.
vs_tool = VectorSearchRetrieverTool(
    index_name="catalog.schema.my_databricks_docs_index",
    tool_name="databricks_docs_retriever",
    tool_description="Retrieves information about Databricks products from official Databricks documentation.",
)

# Run a query against the vector search index locally for testing
vs_tool.invoke("Databricks Agent Framework?")

# Bind the retriever tool to your LangChain LLM of choice
llm = ChatDatabricks(endpoint="databricks-claude-3-7-sonnet")
llm_with_tools = llm.bind_tools([vs_tool])

# Chat with your LLM to test the tool calling functionality
llm_with_tools.invoke("Based on the Databricks documentation, what is Databricks Agent Framework?")
```
For scenarios that use either direct-access indexes or Delta Sync indexes using self-managed embeddings, you must configure the `VectorSearchRetrieverTool` and specify a custom embedding model and text column. See options for providing embeddings.

The following example shows you how to configure a `VectorSearchRetrieverTool` with `columns` and `embedding` keys.
```python
from databricks_langchain import VectorSearchRetrieverTool, DatabricksEmbeddings

embedding_model = DatabricksEmbeddings(
    endpoint="databricks-bge-large-en",
)

vs_tool = VectorSearchRetrieverTool(
    index_name="catalog.schema.index_name",  # Index name in the format 'catalog.schema.index'
    num_results=5,  # Max number of documents to return
    columns=["primary_key", "text_column"],  # List of columns to include in the search
    filters={"text_column LIKE": "Databricks"},  # Filters to apply to the query
    query_type="ANN",  # Query type ("ANN" or "HYBRID")
    tool_name="name_of_the_tool",  # Used by the LLM to identify the tool
    tool_description="Purpose of the tool",  # Used by the LLM to understand the purpose of the tool
    text_column="text_column",  # Text column for embeddings. Required for direct-access index or Delta Sync index with self-managed embeddings.
    embedding=embedding_model,  # The embedding model. Required for direct-access index or Delta Sync index with self-managed embeddings.
)
```
For additional details, see the API docs for `VectorSearchRetrieverTool`.
Install the latest version of `databricks-openai`, which includes Databricks AI Bridge.
```
%pip install --upgrade databricks-openai
```
The following code prototypes a retriever that queries a hypothetical vector search index and integrates it with OpenAI's GPT models. Provide a descriptive `tool_description` to help the agent understand the tool and determine when to invoke it. For more information on OpenAI recommendations for tools, see the OpenAI Function Calling documentation.
```python
from databricks_openai import VectorSearchRetrieverTool
from openai import OpenAI
import json

# Initialize OpenAI client
client = OpenAI(api_key="<your_API_key>")

# Initialize the retriever tool
dbvs_tool = VectorSearchRetrieverTool(
    index_name="catalog.schema.my_databricks_docs_index",
    tool_name="databricks_docs_retriever",
    tool_description="Retrieves information about Databricks products from official Databricks documentation",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": "Using the Databricks documentation, answer what is Spark?",
    },
]
first_response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=[dbvs_tool.tool],
)

# Parse the model's response and execute the requested tool call.
tool_call = first_response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)
result = dbvs_tool.execute(query=args["query"])  # For self-managed embeddings, optionally pass in openai_client=client

# Supply the model with the retrieval results so it can incorporate them into its final response.
messages.append(first_response.choices[0].message)
messages.append({
    "role": "tool",
    "tool_call_id": tool_call.id,
    "content": json.dumps(result),
})
second_response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=[dbvs_tool.tool],
)
```
For scenarios that use either direct-access indexes or Delta Sync indexes using self-managed embeddings, you must configure the `VectorSearchRetrieverTool` and specify a custom embedding model and text column. See options for providing embeddings.

The following example shows you how to configure a `VectorSearchRetrieverTool` with `columns` and `embedding_model_name` keys.
```python
from databricks_openai import VectorSearchRetrieverTool

vs_tool = VectorSearchRetrieverTool(
    index_name="catalog.schema.index_name",  # Index name in the format 'catalog.schema.index'
    num_results=5,  # Max number of documents to return
    columns=["primary_key", "text_column"],  # List of columns to include in the search
    filters={"text_column LIKE": "Databricks"},  # Filters to apply to the query
    query_type="ANN",  # Query type ("ANN" or "HYBRID")
    tool_name="name_of_the_tool",  # Used by the LLM to identify the tool
    tool_description="Purpose of the tool",  # Used by the LLM to understand the purpose of the tool
    text_column="text_column",  # Text column for embeddings. Required for direct-access index or Delta Sync index with self-managed embeddings.
    embedding_model_name="databricks-bge-large-en",  # The embedding model. Required for direct-access index or Delta Sync index with self-managed embeddings.
)
```
For additional details, see the API docs for `VectorSearchRetrieverTool`.
Once your local tool is ready, you can directly productionize it as part of your agent code, or migrate it to a Unity Catalog function, which provides better discoverability and governance but has certain limitations.
The following section shows you how to migrate the retriever to a Unity Catalog function.
Vector Search retriever tool with Unity Catalog functions
You can create a Unity Catalog function that wraps a Mosaic AI Vector Search index query. This approach:
- Supports production use cases with governance and discoverability
- Uses the vector_search() SQL function under the hood
- Supports automatic MLflow tracing
- You must align the function's output to the MLflow retriever schema by using the `page_content` and `metadata` aliases.
- Any additional metadata columns must be added to the `metadata` column using the SQL `map()` function, rather than as top-level output keys.
Run the following code in a notebook or SQL editor to create the function:
```sql
CREATE OR REPLACE FUNCTION main.default.databricks_docs_vector_search (
  -- The agent uses this comment to determine how to generate the query string parameter.
  query STRING
  COMMENT 'The query string for searching Databricks documentation.'
) RETURNS TABLE
-- The agent uses this comment to determine when to call this tool. It describes the types of documents and information contained within the index.
COMMENT 'Executes a search on Databricks documentation to retrieve text documents most relevant to the input query.' RETURN
SELECT
  chunked_text as page_content,
  map('doc_uri', url, 'chunk_id', chunk_id) as metadata
FROM
  vector_search(
    -- Specify your Vector Search index name here
    index => 'catalog.schema.databricks_docs_index',
    query => query,
    num_results => 5
  )
```
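Before wiring the function into an agent, you can smoke-test it in a notebook or the SQL editor like any other Unity Catalog table function (the query string below is only an example):

```sql
SELECT * FROM main.default.databricks_docs_vector_search('What is the Databricks Agent Framework?');
```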
To use this retriever tool in your AI agent, wrap it with `UCFunctionToolkit`. This enables automatic MLflow tracing by generating `RETRIEVER` span types in MLflow logs.
```python
from unitycatalog.ai.langchain.toolkit import UCFunctionToolkit

toolkit = UCFunctionToolkit(
    function_names=[
        "main.default.databricks_docs_vector_search"
    ]
)
tools = toolkit.tools
```
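The resulting tools are standard LangChain tools, so you can bind them to an LLM just like the locally developed retriever. A minimal sketch, reusing the `ChatDatabricks` endpoint from the earlier example (an assumption; substitute any tool-calling LLM):

```python
from databricks_langchain import ChatDatabricks

# Bind the Unity Catalog retriever tool to a tool-calling LLM.
llm = ChatDatabricks(endpoint="databricks-claude-3-7-sonnet")
llm_with_tools = llm.bind_tools(tools)

# The LLM can now decide when to invoke the retriever function.
llm_with_tools.invoke("Based on the Databricks documentation, what is the Databricks Agent Framework?")
```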
Unity Catalog retriever tools have the following caveats:
- SQL clients may limit the maximum number of rows or bytes returned. To prevent data truncation, you should truncate column values returned by the UDF. For example, you could use `substring(chunked_text, 0, 8192)` to reduce the size of large content columns and avoid row truncation during execution.
- Since this tool is a wrapper for the `vector_search()` function, it is subject to the same limitations as the `vector_search()` function. See Limitations.
For more information about `UCFunctionToolkit`, see the Unity Catalog documentation.
Retriever that queries a vector index hosted outside of Databricks
If your vector index is hosted outside of Databricks, you can create a Unity Catalog connection to connect to the external service and use the connection in your agent code. See Connect AI agent tools to external services.
The following example creates a retriever that calls a vector index hosted outside of Databricks for a PyFunc-flavored agent.
1. Create a Unity Catalog connection to the external service, in this case, Azure AI Search.

   ```sql
   CREATE CONNECTION ${connection_name}
   TYPE HTTP
   OPTIONS (
     host 'https://example.search.windows.net',
     base_path '/',
     bearer_token secret ('<secret-scope>','<secret-key>')
   );
   ```
2. Define the retriever tool in agent code using the Unity Catalog connection. This example uses MLflow decorators to enable agent tracing.

   Note: To conform to the MLflow retriever schema, the retriever function should return a `List[Document]` object and use the `metadata` field in the Document class to add additional attributes to the returned document, such as `doc_uri` and `similarity_score`. See MLflow Document.

   ```python
   import json
   from dataclasses import asdict
   from typing import Any, List

   import mlflow
   from mlflow.entities import Document

   # Name of the Unity Catalog connection created in the previous step.
   connection_name = "<connection_name>"


   class VectorSearchRetriever:
       """
       Class using Databricks Vector Search to retrieve relevant documents.
       """

       def __init__(self):
           self.azure_search_index = "hotels_vector_index"

       @mlflow.trace(span_type="RETRIEVER", name="vector_search")
       def __call__(self, query_vector: List[Any], score_threshold=None) -> List[Document]:
           """
           Performs vector search to retrieve relevant chunks.

           Args:
               query_vector: Embedding vector for the search query.
               score_threshold: Score threshold to use for the query.

           Returns:
               List of retrieved Documents.
           """
           from databricks.sdk import WorkspaceClient
           from databricks.sdk.service.serving import ExternalFunctionRequestHttpMethod

           # Request payload for the Azure AI Search vector query.
           payload = {
               "count": True,
               "select": "HotelId, HotelName, Description, Category",
               "vectorQueries": [
                   {
                       "vector": query_vector,
                       "k": 7,
                       "fields": "DescriptionVector",
                       "kind": "vector",
                       "exhaustive": True,
                   }
               ],
           }
           response = (
               WorkspaceClient()
               .serving_endpoints.http_request(
                   conn=connection_name,
                   method=ExternalFunctionRequestHttpMethod.POST,
                   path=f"indexes/{self.azure_search_index}/docs/search?api-version=2023-07-01-Preview",
                   json=payload,
               )
               .text
           )
           documents = self.convert_vector_search_to_documents(
               json.loads(response), score_threshold
           )
           return [asdict(doc) for doc in documents]

       @mlflow.trace(span_type="PARSER")
       def convert_vector_search_to_documents(
           self, vs_results, score_threshold
       ) -> List[Document]:
           docs = []
           for item in vs_results.get("value", []):
               score = item.get("@search.score", 0)
               # A score_threshold of None keeps all results.
               if score_threshold is None or score >= score_threshold:
                   metadata = {
                       "score": score,
                       "HotelName": item.get("HotelName"),
                       "Category": item.get("Category"),
                   }
                   doc = Document(
                       page_content=item.get("Description", ""),
                       metadata=metadata,
                       id=item.get("HotelId"),
                   )
                   docs.append(doc)
           return docs
   ```
3. To run the retriever, run the following Python code. You can optionally include Vector Search filters in the request to filter results. A sketch of generating the query vector follows these steps.

   ```python
   retriever = VectorSearchRetriever()
   query = [0.01944167, 0.0040178085 . . . TRIMMED FOR BREVITY 010858015, -0.017496133]
   results = retriever(query, score_threshold=0.1)
   ```
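In practice, the query vector must come from the same embedding model used to populate the external index. A minimal sketch of producing one, assuming a `databricks-bge-large-en` serving endpoint and the `databricks-langchain` package (both are assumptions; use the embedding model that matches your index's dimensions):

```python
from databricks_langchain import DatabricksEmbeddings

# Assumption: the external index was populated with embeddings from this endpoint.
embedding_model = DatabricksEmbeddings(endpoint="databricks-bge-large-en")

query_vector = embedding_model.embed_query("hotels with a beachfront view")
results = retriever(query_vector, score_threshold=0.1)
```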
Add tracing to a retriever
Add MLflow tracing to monitor and debug your retriever. Tracing lets you view inputs, outputs, and metadata for each step of execution.
The previous example adds the `@mlflow.trace` decorator to both the `__call__` and parsing methods. The decorator creates a span that starts when the function is invoked and ends when it returns. MLflow automatically records the function's input, output, and any exceptions raised.
LangChain, LlamaIndex, and OpenAI library users can use MLflow auto logging in addition to manually defining traces with the decorator. See Add MLflow Tracing to AI agents.
```python
import mlflow
from mlflow.entities import Document

## This code snippet has been truncated for brevity, see the full retriever example above
class VectorSearchRetriever:
    ...

    # Create a RETRIEVER span. The span name must match the retriever schema name.
    @mlflow.trace(span_type="RETRIEVER", name="vector_search")
    def __call__(...) -> List[Document]:
        ...

    # Create a PARSER span.
    @mlflow.trace(span_type="PARSER")
    def parse_results(...) -> List[Document]:
        ...
```
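For example, if your retriever is built with LangChain, enabling autologging is a single call; a minimal sketch (the same pattern applies to `mlflow.llama_index.autolog()` and `mlflow.openai.autolog()`):

```python
import mlflow

# Automatically generate traces for LangChain components (LLMs, retrievers, chains)
# in addition to any spans defined manually with @mlflow.trace.
mlflow.langchain.autolog()
```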
To ensure downstream applications such as Agent Evaluation and the AI Playground render the retriever trace correctly, make sure the decorator meets the following requirements:
- Use `span_type="RETRIEVER"` and ensure the function returns a `List[Document]` object. See Retriever spans.
- The trace name and the `retriever_schema` name must match to configure the trace correctly. See the following section to learn how to set the retriever schema.
Set retriever schema to ensure MLflow compatibility
If the trace returned from the retriever or `span_type="RETRIEVER"` span does not conform to MLflow's standard retriever schema, you must manually map the returned schema to MLflow's expected fields. This ensures that MLflow can properly trace your retriever and render traces in downstream applications.
To set the retriever schema manually:
1. Call `mlflow.models.set_retriever_schema` when you define your agent. Use `set_retriever_schema` to map the column names in the returned table to MLflow's expected fields, such as `primary_key`, `text_column`, and `doc_uri`.

   ```python
   # Define the retriever's schema by providing your column names
   mlflow.models.set_retriever_schema(
       name="vector_search",
       primary_key="chunk_id",
       text_column="text_column",
       doc_uri="doc_uri",
       # other_columns=["column1", "column2"],
   )
   ```
2. Specify additional columns in your retriever's schema by providing a list of column names with the `other_columns` field.

3. If you have multiple retrievers, you can define multiple schemas by using unique names for each retriever schema (see the sketch following this list).
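For example, an agent with two retrievers can register one schema per retriever. A sketch, assuming hypothetical retrievers whose trace names are `vector_search` and `keyword_search`:

```python
import mlflow

# One schema per retriever; each `name` must match the corresponding trace name.
mlflow.models.set_retriever_schema(
    name="vector_search",
    primary_key="chunk_id",
    text_column="chunk_text",
    doc_uri="doc_uri",
)
mlflow.models.set_retriever_schema(
    name="keyword_search",
    primary_key="article_id",
    text_column="body",
    doc_uri="url",
)
```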
The retriever schema set during agent creation affects downstream applications and workflows, such as the review app and evaluation sets. Specifically, the `doc_uri` column serves as the primary identifier for documents returned by the retriever.

- The review app displays the `doc_uri` to help reviewers assess responses and trace document origins. See Review App UI.
- Evaluation sets use `doc_uri` to compare retriever results against predefined evaluation datasets to determine the retriever's recall and precision. See Evaluation sets.
Next steps
After building your retriever, the final step is to integrate it into an AI agent definition. To learn how to add a tool to an agent, see Add Unity Catalog tools to agents.