Mosaic AI Agent Framework & Agent Evaluation demo
This tutorial shows you how to build, deploy, and evaluate a RAG application using Mosaic AI Agent Framework (AWS | Azure) and Mosaic AI Agent Evaluation (AWS | Azure). In this tutorial, you:
- Build a vector search index using sample data chunks.
- Deploy a RAG application built with Agent Framework.
- Evaluate the quality of the application with Agent Evaluation and MLflow.
In this example, you build a RAG chatbot that can answer questions using information from Databricks public documentation (AWS | Azure).
Requirements
- This notebook requires a single-user cluster (AWS | Azure) running Databricks Runtime 14.3 or above.
- Agent Framework and Agent Evaluation are only available on Amazon Web Services and Azure cloud platforms.
Databricks features used in this demo:
- Agent Framework (AWS | Azure) - An SDK used to quickly and safely build high-quality RAG applications.
- Agent Evaluation (AWS | Azure) - AI-assisted tools that help evaluate whether outputs are high-quality. Includes an intuitive UI-based review app for collecting feedback from human stakeholders.
- Mosaic AI Model Serving (AWS | Azure) - Hosts the application's logic as a production-ready, scalable REST API.
- MLflow (AWS | Azure) - Tracks and manages the application lifecycle, including evaluation results and application code and configuration.
Install dependencies
Install the necessary dependencies and specify versions for compatibility.
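A minimal install cell might look like the following; the package list is illustrative, so pin versions to what your workspace supports:

```python
# Illustrative packages; pin versions for compatibility with your workspace.
%pip install -U -qqqq databricks-agents mlflow langchain databricks-vectorsearch

# Restart Python so the newly installed packages are picked up.
dbutils.library.restartPython()
```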
Setup: Load the necessary data and code from the Databricks Cookbook repo
Clone the Generative AI cookbook repo from https://github.com/databricks/genai-cookbook into a folder `genai-cookbook` in the same folder as this notebook using a Git Folder (AWS | Azure).

Alternatively, you can manually clone the Git repo https://github.com/databricks/genai-cookbook to a folder `genai-cookbook`.
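If you prefer to create the Git Folder programmatically, a sketch using the Databricks SDK might look like this; the target path is a placeholder you should adjust to sit next to this notebook:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Create a Git Folder backed by the cookbook repo.
# "<your-user>" is a placeholder; use your own workspace path.
w.repos.create(
    url="https://github.com/databricks/genai-cookbook",
    provider="gitHub",
    path="/Workspace/Users/<your-user>/genai-cookbook",
)
```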
Unity Catalog and application setup
Set the catalog and schema where the following resources will be registered:
- `UC_CATALOG` and `UC_SCHEMA`: Unity Catalog (AWS | Azure) catalog and schema where the output Delta tables and Vector Search indexes are stored
- `UC_MODEL_NAME`: Unity Catalog location to log and store the chain's model
- `VECTOR_SEARCH_ENDPOINT`: Vector Search endpoint (AWS | Azure) to host the vector index

You must have `USE CATALOG` privilege on the catalog, and `CREATE MODEL` and `USE SCHEMA` privileges on the schema.
Change the catalog and schema here if necessary. Any missing resources will be created in the next step.
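A minimal configuration cell might look like the following; all values are placeholders to replace with locations you can write to:

```python
# Placeholder values; point these at a catalog and schema you can write to.
UC_CATALOG = "main"
UC_SCHEMA = "rag_demo"

# Fully qualified Unity Catalog name under which the chain's model is registered.
UC_MODEL_NAME = f"{UC_CATALOG}.{UC_SCHEMA}.rag_chatbot"

# Vector Search endpoint that will host the index.
VECTOR_SEARCH_ENDPOINT = "rag-demo-endpoint"
```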
Create the UC Catalog, UC Schema, and Vector Search endpoint
Check if the UC resources exist. Create the resources if they don't exist.
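A sketch of that check-and-create logic, assuming the variables above and the `databricks-vectorsearch` client, might look like this:

```python
from databricks.vector_search.client import VectorSearchClient

# Create the catalog and schema if they don't already exist.
spark.sql(f"CREATE CATALOG IF NOT EXISTS {UC_CATALOG}")
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {UC_CATALOG}.{UC_SCHEMA}")

# Create the Vector Search endpoint if it doesn't already exist.
vsc = VectorSearchClient()
endpoint_names = [e["name"] for e in vsc.list_endpoints().get("endpoints", [])]
if VECTOR_SEARCH_ENDPOINT not in endpoint_names:
    vsc.create_endpoint(name=VECTOR_SEARCH_ENDPOINT, endpoint_type="STANDARD")
```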
Build and deploy the application
The following is a high-level overview of the architecture you will deploy:
- Data preparation
  - Copy the sample data to a Delta table.
  - Create a Vector Search index using the `databricks-gte-large-en` foundation embedding model (see the sketch after this list).
- Inference
  - Configure the chain, register the chain as an MLflow model, and set up trace logging.
  - Register the application in Unity Catalog.
  - Deploy the chain.

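As referenced in the list above, a Delta Sync index over the chunked documents might be created roughly as follows; the table, column, and index names are hypothetical:

```python
from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()

# Hypothetical names for the chunked-docs table and the resulting index.
chunks_table = f"{UC_CATALOG}.{UC_SCHEMA}.databricks_docs_chunked"
index_name = f"{UC_CATALOG}.{UC_SCHEMA}.databricks_docs_index"

# Delta Sync index that embeds the text column with the
# databricks-gte-large-en foundation model serving endpoint.
# The source Delta table must have Change Data Feed enabled.
vsc.create_delta_sync_index(
    endpoint_name=VECTOR_SEARCH_ENDPOINT,
    index_name=index_name,
    source_table_name=chunks_table,
    pipeline_type="TRIGGERED",
    primary_key="chunk_id",
    embedding_source_column="chunk_text",
    embedding_model_endpoint_name="databricks-gte-large-en",
)
```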
Configure chain parameters
Databricks makes it easy to parameterize your chain with MLflow Model Configurations. Later, you can tune your application by adjusting parameters such as the system prompt or retrieval settings.
This demo keeps configurations to a minimum, but most applications will include many more parameters to tune.
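As a sketch, a small configuration dictionary read through MLflow Model Config might look like this; the parameter names are illustrative:

```python
import mlflow

# Illustrative parameters; real applications typically expose many more
# (retriever settings, prompt templates, model endpoints, ...).
chain_config = {
    "llm_system_prompt": "Answer questions using only the provided context.",
    "vector_search_num_results": 3,
}

# Inside the chain code, read values through MLflow Model Config so they can
# be overridden when the model is logged, without editing the chain itself.
model_config = mlflow.models.ModelConfig(development_config=chain_config)
system_prompt = model_config.get("llm_system_prompt")
```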
Evaluate the application
Once the application is deployed, you can evaluate its quality.
- Human reviewers can use the review app to interact with the application and provide feedback on responses.
- Chain metrics provide quantitative measures such as latency and token usage.
- LLM judges use external large language models to analyze the output of your application and judge the quality of retrieved chunks and generated responses.
Get feedback from human reviewers
Have domain experts test the bot by chatting with it and providing correct answers when the bot doesn't respond properly. This is a critical step to build or improve your evaluation dataset.
Your evaluation dataset forms the basis of your development workflow to improve quality: identify the root causes of quality issues and then objectively measure the impact of your fixes.
The application automatically captures all stakeholder questions, stakeholder feedback, bot responses, and MLflow traces into Delta tables.
Your domain experts do NOT need to have Databricks workspace access - you can assign permissions to any user in your SSO if you have enabled SCIM (AWS | Azure).
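A sketch of deploying the chain and granting review-app access with the `databricks-agents` SDK, assuming the model was registered earlier as `UC_MODEL_NAME`, might look like this:

```python
from databricks import agents

# Deploy the registered chain; this also stands up the review app.
model_version = 1  # placeholder; use the version from the registration step
deployment = agents.deploy(UC_MODEL_NAME, model_version)

# Grant review-app access to domain experts; with SCIM enabled these can be
# any SSO users, no workspace access required ("expert@example.com" is a
# placeholder).
agents.set_permissions(
    model_name=UC_MODEL_NAME,
    users=["expert@example.com"],
    permission_level=agents.PermissionLevel.CAN_QUERY,
)
```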

Run Evaluate to get chain metrics
Use Agent Evaluation's specialized AI evaluators to assess chain performance without the need for human reviewers. Agent Evaluation is integrated into `mlflow.evaluate(...)`; all you need to do is pass `model_type="databricks-agent"`.
There are three types of evaluation metrics:
- Ground truth based: Assess performance against known correct answers by comparing the RAG application's retrieved documents and generated outputs to the ground truth documents and answers recorded in the evaluation set.
- LLM judge-based: A separate LLM acts as a judge to evaluate the RAG application's retrieval and response quality. This approach automates evaluation across numerous dimensions.
- Trace metrics: Quantitative metrics, such as agent cost and latency, computed from the agent trace.

Run evaluation
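A minimal sketch of that call, using a tiny hand-written evaluation set for illustration (a real evaluation set would come from the review app or a curated Delta table):

```python
import mlflow
import pandas as pd

# Tiny illustrative evaluation set; "request" and "expected_response" follow
# the Agent Evaluation input schema.
eval_set = pd.DataFrame(
    {
        "request": ["What is Databricks Vector Search?"],
        "expected_response": [
            "A serverless similarity search engine integrated with Unity Catalog."
        ],
    }
)

with mlflow.start_run():
    # model_type="databricks-agent" switches on Agent Evaluation's LLM judges
    # and trace metrics; "1" is a placeholder model version.
    results = mlflow.evaluate(
        model=f"models:/{UC_MODEL_NAME}/1",
        data=eval_set,
        model_type="databricks-agent",
    )

# Per-question judge ratings and metrics land in the results tables.
display(results.tables["eval_results"])
```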
Next steps
Code-based quickstarts
| Time required | Outcome | Link |
|---|---|---|
| 🕧🕧 30 minutes | Comprehensive quality/cost/latency evaluation of your proof of concept app | - Evaluate your proof of concept (AWS \| Azure)<br>- Identify the root causes of quality issues (AWS \| Azure) |
Browse the code samples
Open the `./genai-cookbook/rag_app_sample_code` folder that was synced to your workspace alongside this notebook. Documentation here (AWS | Azure).
Read the Generative AI Cookbook (AWS | Azure)
The Databricks Generative AI Cookbook is a definitive how-to guide for building high-quality generative AI applications. High-quality applications are applications that are:
- Accurate: provide correct responses
- Safe: do not deliver harmful or insecure responses
- Governed: respect data permissions and access controls and track lineage