Mosaic AI Agent Framework & Agent Evaluation demo
This tutorial shows you how to build, deploy, and evaluate a RAG application using Mosaic AI Agent Framework (AWS | Azure) and Mosaic AI Agent Evaluation (AWS | Azure). In this tutorial, you:
- Build a vector search index using sample data chunks.
- Deploy a RAG application built with Agent Framework.
- Evaluate the quality of the application with Agent Evaluation and MLflow.
In this example, you build a RAG chatbot that can answer questions using information from Databricks public documentation (AWS | Azure).
Requirements
- This notebook requires a single-user cluster (AWS | Azure) running Databricks Runtime 14.3 or above.
- Agent Framework and Agent Evaluation are only available on Amazon Web Services and Azure cloud platforms.
Databricks features used in this demo:
- Agent Framework (AWS | Azure) - An SDK used to quickly and safely build high-quality RAG applications.
- Agent Evaluation (AWS | Azure) - AI-assisted tools that help evaluate whether outputs are high-quality. Includes an intuitive UI-based review app for collecting feedback from human stakeholders.
- Mosaic AI Model Serving (AWS | Azure) - Hosts the application's logic as a production-ready, scalable REST API.
- MLflow (AWS | Azure) - Tracks and manages the application lifecycle, including evaluation results and application code and configuration.
Install dependencies
Install the necessary dependencies and specify versions for compatibility.
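A minimal install cell might look like the following; the package list is illustrative, so pin versions to what your workspace supports:

```python
# Illustrative packages; pin versions for compatibility with your workspace.
%pip install -U -qqqq databricks-agents mlflow langchain databricks-vectorsearch

# Restart Python so the newly installed packages are picked up.
dbutils.library.restartPython()
```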
Setup: Load the necessary data and code from the Databricks Cookbook repo
Clone the Generative AI cookbook repo from https://github.com/databricks/genai-cookbook into a folder `genai-cookbook` in the same folder as this notebook using a Git Folder (AWS | Azure).

Alternatively, you can manually clone the Git repo https://github.com/databricks/genai-cookbook to a folder `genai-cookbook`.
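If you prefer to create the Git Folder programmatically, a sketch using the Databricks SDK might look like this; the target path is a placeholder you should adjust to sit next to this notebook:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Create a Git Folder backed by the cookbook repo.
# "<your-user>" is a placeholder; use your own workspace path.
w.repos.create(
    url="https://github.com/databricks/genai-cookbook",
    provider="gitHub",
    path="/Workspace/Users/<your-user>/genai-cookbook",
)
```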
Unity Catalog and application setup
Set the catalog and schema where the following resources will be registered:
- `UC_CATALOG` and `UC_SCHEMA`: Unity Catalog (AWS | Azure) catalog and schema where the output Delta tables and Vector Search indexes are stored
- `UC_MODEL_NAME`: Unity Catalog location to log and store the chain's model
- `VECTOR_SEARCH_ENDPOINT`: Vector Search endpoint (AWS | Azure) to host the vector index

You must have `USE CATALOG` privilege on the catalog, and `CREATE MODEL` and `USE SCHEMA` privileges on the schema.
Change the catalog and schema here if necessary. Any missing resources will be created in the next step.
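A minimal configuration cell might look like the following; all values are placeholders to replace with locations you can write to:

```python
# Placeholder values; point these at a catalog and schema you can write to.
UC_CATALOG = "main"
UC_SCHEMA = "rag_demo"

# Fully qualified Unity Catalog name under which the chain's model is registered.
UC_MODEL_NAME = f"{UC_CATALOG}.{UC_SCHEMA}.rag_chatbot"

# Vector Search endpoint that will host the index.
VECTOR_SEARCH_ENDPOINT = "rag-demo-endpoint"
```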
Create the UC Catalog, UC Schema, and Vector Search endpoint
Check if the UC resources exist. Create the resources if they don't exist.
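A sketch of that check-and-create logic, assuming the variables above and the `databricks-vectorsearch` client, might look like this:

```python
from databricks.vector_search.client import VectorSearchClient

# Create the catalog and schema if they don't already exist.
spark.sql(f"CREATE CATALOG IF NOT EXISTS {UC_CATALOG}")
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {UC_CATALOG}.{UC_SCHEMA}")

# Create the Vector Search endpoint if it doesn't already exist.
vsc = VectorSearchClient()
endpoint_names = [e["name"] for e in vsc.list_endpoints().get("endpoints", [])]
if VECTOR_SEARCH_ENDPOINT not in endpoint_names:
    vsc.create_endpoint(name=VECTOR_SEARCH_ENDPOINT, endpoint_type="STANDARD")
```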
Build and deploy the application
The following is a high-level overview of the architecture you will deploy:
- Data preparation
  - Copy the sample data to a Delta table.
  - Create a Vector Search index using the `databricks-gte-large-en` foundation embedding model (see the sketch after this list).
- Inference
  - Configure the chain, register the chain as an MLflow model, and set up trace logging.
  - Register the application in Unity Catalog.
  - Deploy the chain.

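As referenced in the list above, a Delta Sync index over the chunked documents might be created roughly as follows; the table, column, and index names are hypothetical:

```python
from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()

# Hypothetical names for the chunked-docs table and the resulting index.
chunks_table = f"{UC_CATALOG}.{UC_SCHEMA}.databricks_docs_chunked"
index_name = f"{UC_CATALOG}.{UC_SCHEMA}.databricks_docs_index"

# Delta Sync index that embeds the text column with the
# databricks-gte-large-en foundation model serving endpoint.
# The source Delta table must have Change Data Feed enabled.
vsc.create_delta_sync_index(
    endpoint_name=VECTOR_SEARCH_ENDPOINT,
    index_name=index_name,
    source_table_name=chunks_table,
    pipeline_type="TRIGGERED",
    primary_key="chunk_id",
    embedding_source_column="chunk_text",
    embedding_model_endpoint_name="databricks-gte-large-en",
)
```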
Configure chain parameters
Databricks makes it easy to parameterize your chain with MLflow Model Configurations. Later, you can tune your application by adjusting parameters such as the system prompt or retrieval settings.
This demo keeps configurations to a minimum, but most applications will include many more parameters to tune.
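As a sketch, a small configuration dictionary read through MLflow Model Config might look like this; the parameter names are illustrative:

```python
import mlflow

# Illustrative parameters; real applications typically expose many more
# (retriever settings, prompt templates, model endpoints, ...).
chain_config = {
    "llm_system_prompt": "Answer questions using only the provided context.",
    "vector_search_num_results": 3,
}

# Inside the chain code, read values through MLflow Model Config so they can
# be overridden when the model is logged, without editing the chain itself.
model_config = mlflow.models.ModelConfig(development_config=chain_config)
system_prompt = model_config.get("llm_system_prompt")
```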
Evaluate the application
Once the application is deployed, you can evaluate its quality.
- Human reviewers can use the review app to interact with the application and provide feedback on responses.
- Chain metrics provide quantitative measures such as latency and token usage.
- LLM judges use external large language models to analyze the output of your application and judge the quality of retrieved chunks and generated responses.
Get feedback from human reviewers
Have domain experts test the bot by chatting with it and providing correct answers when the bot doesn't respond properly. This is a critical step to build or improve your evaluation dataset.
Your evaluation dataset forms the basis of your development workflow to improve quality: identify the root causes of quality issues and then objectively measure the impact of your fixes.
The application automatically captures all stakeholder questions, stakeholder feedback, bot responses, and MLflow traces into Delta tables.
Your domain experts do NOT need to have Databricks workspace access - you can assign permissions to any user in your SSO if you have enabled SCIM (AWS | Azure).
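A sketch of deploying the chain and granting review-app access with the `databricks-agents` SDK, assuming the model was registered earlier as `UC_MODEL_NAME`, might look like this:

```python
from databricks import agents

# Deploy the registered chain; this also stands up the review app.
model_version = 1  # placeholder; use the version from the registration step
deployment = agents.deploy(UC_MODEL_NAME, model_version)

# Grant review-app access to domain experts; with SCIM enabled these can be
# any SSO users, no workspace access required ("expert@example.com" is a
# placeholder).
agents.set_permissions(
    model_name=UC_MODEL_NAME,
    users=["expert@example.com"],
    permission_level=agents.PermissionLevel.CAN_QUERY,
)
```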

Run Evaluate to get chain metrics
Use Agent Evaluation's specialized AI evaluators to assess chain performance without the need for human reviewers. Agent Evaluation is integrated into `mlflow.evaluate(...)`; all you need to do is pass `model_type="databricks-agent"`.
There are three types of evaluation metrics:
- Ground truth based: Assess performance against known correct answers by comparing the RAG application's retrieved documents and generated outputs to the ground truth documents and answers recorded in the evaluation set.
- LLM judge-based: A separate LLM acts as a judge to evaluate the RAG application's retrieval and response quality. This approach automates evaluation across numerous dimensions.
- Trace metrics: Quantitative metrics, such as agent cost and latency, computed from the agent trace.

Run evaluation
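A minimal sketch of that call, using a tiny hand-written evaluation set for illustration (a real evaluation set would come from the review app or a curated Delta table):

```python
import mlflow
import pandas as pd

# Tiny illustrative evaluation set; "request" and "expected_response" follow
# the Agent Evaluation input schema.
eval_set = pd.DataFrame(
    {
        "request": ["What is Databricks Vector Search?"],
        "expected_response": [
            "A serverless similarity search engine integrated with Unity Catalog."
        ],
    }
)

with mlflow.start_run():
    # model_type="databricks-agent" switches on Agent Evaluation's LLM judges
    # and trace metrics; "1" is a placeholder model version.
    results = mlflow.evaluate(
        model=f"models:/{UC_MODEL_NAME}/1",
        data=eval_set,
        model_type="databricks-agent",
    )

# Per-question judge ratings and metrics land in the results tables.
display(results.tables["eval_results"])
```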
Next steps
Code-based quickstarts
| Time required | Outcome | Link |
|---|---|---|
| 🕧🕧 30 minutes | Comprehensive quality/cost/latency evaluation of your proof of concept app | - Evaluate your proof of concept (AWS \| Azure)<br>- Identify the root causes of quality issues (AWS \| Azure) |
Browse the code samples
Open the `./genai-cookbook/rag_app_sample_code` folder that was synced to your workspace alongside this notebook. Documentation here (AWS | Azure).
Read the Generative AI Cookbook (AWS | Azure)
The Databricks Generative AI Cookbook is a definitive how-to guide for building high-quality generative AI applications. High-quality applications are applications that are:
- Accurate: provide correct responses
- Safe: do not deliver harmful or insecure responses
- Governed: respect data permissions and access controls and track lineage