mosaic-ai-agents-demo (Python)

Mosaic AI Agent Framework & Agent Evaluation demo

This tutorial shows you how to build, deploy, and evaluate a RAG application using Mosaic AI Agent Framework (AWS | Azure) and Mosaic AI Agent Evaluation (AWS | Azure). In this tutorial, you:

  1. Build a vector search index using sample data chunks.
  2. Deploy a RAG application built with Agent Framework.
  3. Evaluate the quality of the application with Agent Evaluation and MLflow.

In this example, you build a RAG chatbot that can answer questions using information from Databricks public documentation (AWS | Azure).

Requirements

  • This notebook requires a single-user cluster (AWS | Azure) running Databricks Runtime 14.3 or above.
  • Agent Framework and Agent Evaluation are only available on Amazon Web Services and Azure cloud platforms.

Databricks features used in this demo:

  • Agent Framework (AWS | Azure) - An SDK used to quickly and safely build high-quality RAG applications.
  • Agent Evaluation (AWS | Azure) - AI-assisted tools that help evaluate whether outputs are high quality. Includes an intuitive UI-based review app for collecting feedback from human stakeholders.
  • Mosaic AI Model Serving (AWS | Azure) - Hosts the application's logic as a production-ready, scalable REST API.
  • MLflow (AWS | Azure) - Tracks and manages the application lifecycle, including evaluation results and application code and configuration.

Install dependencies

Install the necessary dependencies and specify versions for compatibility.
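
The demo notebook pins its own package versions in its install cell; purely as a sketch (the package list and missing version pins below are assumptions, not the demo's exact dependencies), a typical install cell for an agent demo looks like this:

```python
# Placeholder install cell -- the real demo pins specific, compatible versions.
%pip install -U -qqqq databricks-agents mlflow databricks-vectorsearch databricks-sdk langchain langchain-community

# Restart the Python process so the freshly installed packages are importable.
dbutils.library.restartPython()
```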


Setup: Load the necessary data and code from the Databricks Cookbook repo

Clone the Generative AI cookbook repo from https://github.com/databricks/genai-cookbook into a folder genai-cookbook in the same folder as this notebook using a Git Folder (AWS | Azure).

Alternatively, you can manually clone the Git repo https://github.com/databricks/genai-cookbook to a folder genai-cookbook.
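
If you prefer to create the Git Folder programmatically rather than through the UI, a minimal sketch using the Databricks SDK might look like the following; the target path is an assumption and should be adjusted so the genai-cookbook folder ends up next to this notebook:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
me = w.current_user.me().user_name

# Create a Git Folder pointing at the cookbook repo.
# Assumed path -- adjust it so the clone sits alongside this notebook.
w.repos.create(
    url="https://github.com/databricks/genai-cookbook",
    provider="gitHub",
    path=f"/Repos/{me}/genai-cookbook",
)
```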


Unity Catalog and application setup

Set the catalog and schema where the following resources will be registered:

  • UC_CATALOG and UC_SCHEMA: Unity Catalog (AWS | Azure) catalog and schema where the output Delta tables and Vector Search indexes are stored
  • UC_MODEL_NAME: Unity Catalog location to log and store the chain's model
  • VECTOR_SEARCH_ENDPOINT: Vector Search Endpoint (AWS | Azure) to host the vector index

You must have USE CATALOG privilege on the catalog, and CREATE MODEL and USE SCHEMA privileges on the schema.

Change the catalog and schema here if necessary. Any missing resources will be created in the next step.
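
A minimal sketch of what this configuration cell looks like; every name below is a placeholder to replace with a catalog, schema, and endpoint you can use:

```python
# Placeholder names -- replace with a catalog and schema you have privileges on.
UC_CATALOG = "main"
UC_SCHEMA = "rag_demo"

# Fully qualified Unity Catalog name under which the chain's MLflow model is registered.
UC_MODEL_NAME = f"{UC_CATALOG}.{UC_SCHEMA}.agent_demo_chain"

# Vector Search endpoint that will host the index built later in this notebook.
VECTOR_SEARCH_ENDPOINT = "rag_demo_endpoint"
```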


Create the UC Catalog, UC Schema, and Vector Search endpoint

Check if the UC resources exist. Create the resources if they don't exist.
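
One way to sketch this idempotent check-and-create step, using Spark SQL for the catalog and schema and the databricks-vectorsearch client for the endpoint (the endpoint type and the exact existence checks are assumptions, not the demo's code):

```python
from databricks.vector_search.client import VectorSearchClient

# Create the catalog and schema if they don't already exist
# (spark is predefined in Databricks notebooks).
spark.sql(f"CREATE CATALOG IF NOT EXISTS {UC_CATALOG}")
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {UC_CATALOG}.{UC_SCHEMA}")

# Create the Vector Search endpoint if it doesn't already exist.
vsc = VectorSearchClient()
existing = [e["name"] for e in vsc.list_endpoints().get("endpoints", [])]
if VECTOR_SEARCH_ENDPOINT not in existing:
    vsc.create_endpoint(name=VECTOR_SEARCH_ENDPOINT, endpoint_type="STANDARD")
```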


Build and deploy the application

The following is a high-level overview of the architecture you will deploy:

  1. Data preparation
    • Copy the sample data to a Delta table.
    • Create a Vector Search index using the databricks-gte-large-en foundation embedding model (see the sketch after this list).
  2. Inference
    • Configure the chain, register the chain as an MLflow model, and set up trace logging.
    • Register the application in Unity Catalog.
    • Deploy the chain.
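
As a rough illustration of the data-preparation step (the table, column, and index names here are hypothetical; the demo derives its own from the configuration above), a Delta Sync index over a chunked documentation table can be created like this:

```python
from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()

# Hypothetical names -- the demo builds these from UC_CATALOG and UC_SCHEMA.
source_table = f"{UC_CATALOG}.{UC_SCHEMA}.databricks_docs_chunked"
index_name = f"{UC_CATALOG}.{UC_SCHEMA}.databricks_docs_chunked_index"

# Delta Sync index: Databricks keeps the index in sync with the source Delta table
# (which must have Change Data Feed enabled) and computes embeddings with the
# databricks-gte-large-en foundation model.
vsc.create_delta_sync_index(
    endpoint_name=VECTOR_SEARCH_ENDPOINT,
    index_name=index_name,
    source_table_name=source_table,
    pipeline_type="TRIGGERED",
    primary_key="chunk_id",
    embedding_source_column="chunk_text",
    embedding_model_endpoint_name="databricks-gte-large-en",
)
```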

Configure chain parameters

Databricks makes it easy to parameterize your chain with MLflow Model Configurations. Later, you can tune your application by adjusting parameters such as the system prompt or retrieval settings.

This demo keeps configurations to a minimum, but most applications will include many more parameters to tune.
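
As a minimal sketch of an MLflow Model Configuration (the parameter names and values below are placeholders, not the demo's actual configuration):

```python
from mlflow.models import ModelConfig

# Placeholder configuration -- the demo's real parameters (system prompt, retriever
# settings, LLM endpoint, and so on) live in its own config cell or YAML file.
# In recent MLflow versions, development_config accepts a dict or a path to a YAML file.
model_config = ModelConfig(
    development_config={
        "llm_endpoint_name": "databricks-meta-llama-3-70b-instruct",
        "llm_parameters": {"temperature": 0.01, "max_tokens": 500},
        "retriever_config": {"k": 3},
        "system_prompt": "Answer questions using only the retrieved context.",
    }
)

# The chain code reads parameters at run time, so they can be tuned without code changes.
llm_endpoint = model_config.get("llm_endpoint_name")
```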


Evaluate the application

Once the application is deployed, you can evaluate its quality.

  • Human reviewers can use the review app to interact with the application and provide feedback on responses.
  • Chain metrics capture operational measurements such as latency and token usage.
  • LLM judges use external large language models to analyze the output of your application and judge the quality of retrieved chunks and generated responses.

Get feedback from human reviewers

Have domain experts test the bot by chatting with it and providing correct answers when the bot doesn't respond properly. This is a critical step in building or improving your evaluation dataset.

Your evaluation dataset forms the basis of your development workflow to improve quality: identify the root causes of quality issues and then objectively measure the impact of your fixes.

The application automatically captures all stakeholder questions, stakeholder feedback, bot responses, and MLflow traces into Delta tables.

Your domain experts do NOT need to have Databricks workspace access - you can assign permissions to any user in your SSO if you have enabled SCIM (AWS | Azure).
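
A rough sketch of how the review app is typically shared, assuming the chain has already been registered to Unity Catalog and deployed with the databricks-agents SDK (the model version, reviewer email, and exact permission helper are assumptions that may differ across SDK versions):

```python
from databricks import agents

# Deploy the registered chain; this creates both a serving endpoint and the review app.
deployment = agents.deploy(UC_MODEL_NAME, 1)  # placeholder model version

# Share this URL with your domain experts.
print(deployment.review_app_url)

# Reviewers do not need workspace access -- grant them query access by email.
# Helper and enum names follow the databricks-agents SDK; adjust for your version.
agents.set_permissions(
    model_name=UC_MODEL_NAME,
    users=["expert@example.com"],
    permission_level=agents.PermissionLevel.CAN_QUERY,
)
```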


Run Evaluate to get chain metrics

Use Agent Evaluation's specialized AI evaluators to assess chain performance without the need for human reviewers. Agent Evaluation is integrated into mlflow.evaluate(...); all you need to do is pass model_type="databricks-agent".

There are three types of evaluation metrics:

  • Ground truth based: Assess performance based on known correct answers. Compare the RAG application’s retrieved documents or generated outputs to the ground truth documents and answers recorded in the evaluation set.

  • LLM judge-based: A separate LLM acts as a judge to evaluate the RAG application’s retrieval and response quality. This approach automates evaluation across numerous dimensions.

  • Trace metrics: Computed from the agent trace, these capture quantitative measures such as agent cost and latency.


Define the evaluation set

This demo uses a toy 4-question evaluation dataset. To learn more about evaluation best practices, see best practices (AWS | Azure).
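
The questions themselves are defined in the next cell; as a format sketch, an Agent Evaluation dataset is a table (or list of dictionaries) with a request per row and optional ground-truth fields such as expected_response (the rows below are illustrative, not the demo's questions):

```python
import pandas as pd

# Illustrative rows -- the demo defines its own four questions.
eval_set = pd.DataFrame(
    [
        {
            "request": "What is Databricks Vector Search?",
            "expected_response": "A managed vector search service integrated with Unity Catalog ...",
        },
        {
            "request": "How do I enable Change Data Feed on a Delta table?",
            "expected_response": "Set the table property delta.enableChangeDataFeed to true ...",
        },
    ]
)
```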


Run evaluation
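
A minimal sketch of running the evaluation with model_type="databricks-agent" (the model URI and run name below are placeholders):

```python
import mlflow

with mlflow.start_run(run_name="eval_rag_chain"):
    eval_results = mlflow.evaluate(
        model=f"models:/{UC_MODEL_NAME}/1",  # placeholder URI of the registered chain
        data=eval_set,                       # the evaluation set defined above
        model_type="databricks-agent",       # enables Agent Evaluation's LLM judges and trace metrics
    )

# Aggregate metrics, plus per-question judge assessments and traces.
print(eval_results.metrics)
display(eval_results.tables["eval_results"])
```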


Next steps

Code-based quickstarts

Time required: 🕧🕧 30 minutes
Outcome: Comprehensive quality/cost/latency evaluation of your proof of concept app
Links: Evaluate your proof of concept (AWS | Azure); Identify the root causes of quality issues (AWS | Azure)

Browse the code samples

Open the ./genai-cookbook/rag_app_sample_code folder that this notebook synced to your workspace. Documentation is available here (AWS | Azure).

Read the Generative AI Cookbook (AWS | Azure)

The Databricks Generative AI Cookbook is a definitive how-to guide for building high-quality generative AI applications. High-quality applications are applications that are:

  1. Accurate: provide correct responses
  2. Safe: do not deliver harmful or insecure responses
  3. Governed: respect data permissions and access controls and track lineage