10-minute-mosaic-ai-agent-demo (Python)

Create, evaluate, and deploy an AI agent

This notebook demonstrates how to use Mosaic AI to evaluate and improve the quality, cost, and latency of a tool-calling agent. It also shows you how to deploy the resulting agent to a web-based chat UI.

Using Mosaic AI Agent Evaluation (AWS | Azure), Agent Framework (AWS | Azure), MLflow (AWS | Azure), and Model Serving (AWS | Azure), this notebook:

  1. Generates synthetic evaluation data from a document corpus.
  2. Creates a tool-calling agent with a retriever tool.
  3. Evaluates the agent's quality, cost, and latency across several foundational models.
  4. Deploys the agent to a web-based chat app.

Requirements:

  • Use serverless compute or a cluster running Databricks Runtime 14.3 or above.
  • Databricks Serverless and Unity Catalog enabled.
  • CREATE MODEL access to a Unity Catalog schema.
  • Permission to create Model Serving endpoints.

For videos that go deeper into the capabilities, see this YouTube channel.

Want to use your own data?

Alternatively, if you already have a Databricks Vector Search index set up, you can use the version of this notebook designed to use your own data (AWS | Azure).

Setup

Step 1. Generate synthetic evaluation data to measure quality

Challenges addressed

  1. How to start quality evaluation with diverse, representative data without SMEs spending months labeling?

What is happening?

  • We pass the documents to the Synthetic API along with num_evals and the prompt-like agent_description and question_guidelines parameters to tailor the generated questions for our use case. This API uses a proprietary synthetic generation pipeline developed by Mosaic AI Research.
  • The API produces num_evals questions, each coupled with the source document and a list of facts generated from that document. Each fact must be present in the agent's response for it to be considered correct.

Why does the API generate a list of facts rather than a fully written answer? This approach:

  • Makes SME review more efficient: by focusing on facts rather than a full response, they can review and edit more quickly.
  • Improves the accuracy of our proprietary LLM judges.

Interested in having your SMEs review the data? Check out a video demo of the Eval Set UI.

Load the docs corpus

First, load the documents (Databricks documentation) used by the agent, filtering for a subset of the documentation.

For your agent, replace this step to instead load your parsed documents.
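As a rough sketch, loading might look like the following. This assumes your parsed documents live in a Unity Catalog table with content and doc_uri columns (the schema the synthetic data API expects); the table name and filter below are placeholders.

```python
# Minimal sketch: load parsed documents into a pandas DataFrame with
# `content` and `doc_uri` columns. The table name and filter are placeholders.
docs = (
    spark.read.table("main.default.databricks_documentation")  # hypothetical UC table
    .select("content", "doc_uri")
    .filter("doc_uri LIKE '%generative-ai%'")  # keep only a subset of the docs
    .toPandas()
)
display(docs)
```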

Call API to generate synthetic evaluation data
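A minimal sketch of the call, assuming the docs DataFrame loaded above; the num_evals value, agent_description, and question_guidelines are illustrative and should be tailored to your use case.

```python
from databricks.agents.evals import generate_evals_df

# Generate a synthetic evaluation set from the document corpus.
evals = generate_evals_df(
    docs,  # DataFrame with `content` and `doc_uri` columns
    num_evals=25,  # number of question / expected-facts pairs to generate
    agent_description="A chatbot that answers questions about Databricks documentation.",
    question_guidelines="Ask questions a data engineer new to Databricks might ask.",
)
display(evals)
```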

Step 2. Write the agent's code

Function-calling agent that uses a retriever tool

Challenges addressed

  • How do I track different versions of my agent's code or configuration?
  • How do I enable observability, monitoring, and debugging of my agent’s logic?

What is happening?

First, create a function-calling agent with access to a retriever tool using the OpenAI SDK and Python code. To keep the demo simple, the retriever is a function that performs keyword lookup rather than querying a vector search index.

When creating your agent, you can either:

  1. Generate template agent code from the AI Playground
  2. Use a template from our Cookbook
  3. Start from an example in popular frameworks such as LangGraph, AutoGen, LlamaIndex, and others.

NOTE: It is not necessary to understand how this agent works to understand the rest of this demo notebook.

A few things to note about the code:

  1. The code is written to fc_agent.py in order to use MLflow Models from Code for logging, enabling easy tracking of each iteration as you tune the agent for quality.
  2. The code is parameterized with an MLflow Model Configuration (AWS | Azure), enabling easy tuning of these parameters for quality improvement.
  3. The code is wrapped in an MLflow ChatModel, making the agent's code deployment-ready so any iteration can be shared with stakeholders for testing.
  4. The code implements MLflow Tracing (AWS | Azure) for unified observability during development and production. The same trace defined here will be logged for every production request post-deployment. For agent authoring frameworks like LangChain and LlamaIndex, you can perform tracing with one line of code: mlflow.langchain.autolog() or mlflow.llama_index.autolog()
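To illustrate points 1-4 above, here is a heavily simplified sketch of what fc_agent.py could look like. It is not the notebook's actual agent: the config values and toy keyword retriever are placeholders, the real agent additionally uses the OpenAI SDK to call an LLM endpoint for tool selection and answer generation, and some MLflow dataclass fields may differ slightly across MLflow versions.

```python
# fc_agent.py -- minimal sketch, not the full agent used by this notebook
import mlflow
from mlflow.pyfunc import ChatModel
from mlflow.types.llm import ChatCompletionResponse, ChatChoice, ChatMessage

# (2) Parameters come from an MLflow Model Configuration; values here are placeholders.
config = mlflow.models.ModelConfig(
    development_config={"endpoint_name": "databricks-meta-llama-3-1-70b-instruct", "temperature": 0.01}
)

class FunctionCallingAgent(ChatModel):  # (3) deployment-ready MLflow ChatModel
    @mlflow.trace(span_type="RETRIEVER")  # (4) MLflow Tracing on the tool call
    def retrieve(self, query: str) -> str:
        # Toy keyword lookup standing in for a real retriever / vector search index.
        docs = {"mlflow": "MLflow is an open source platform for the ML lifecycle."}
        return "\n".join(text for kw, text in docs.items() if kw in query.lower())

    @mlflow.trace(span_type="AGENT")
    def predict(self, context, messages, params=None) -> ChatCompletionResponse:
        question = messages[-1].content
        retrieved = self.retrieve(question)
        # The real agent asks the LLM on config.get("endpoint_name") to decide whether
        # to call the retriever tool and to compose the answer; this sketch skips that.
        answer = retrieved or "No relevant documents found."
        return ChatCompletionResponse(
            choices=[ChatChoice(index=0, message=ChatMessage(role="assistant", content=answer))]
        )

# (1) Required so MLflow Models from Code can log this file as the model.
mlflow.models.set_model(FunctionCallingAgent())
```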

Create an empty __init__.py to allow the FunctionCallingAgent() to be imported.
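For example, one way to create the empty file next to fc_agent.py:

```python
from pathlib import Path

# Create an empty __init__.py alongside fc_agent.py so FunctionCallingAgent can be imported.
Path("__init__.py").touch()
```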

Vibe check the agent

Test the agent for a sample query to see the MLflow Trace.
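For example, assuming the simplified fc_agent.py sketch above, a quick smoke test might look like this:

```python
from mlflow.types.llm import ChatMessage
from fc_agent import FunctionCallingAgent

# Instantiate the agent and send one test message; the MLflow Trace appears in the notebook output.
agent = FunctionCallingAgent()
response = agent.predict(
    context=None,
    messages=[ChatMessage(role="user", content="What is MLflow used for?")],
)
print(response.choices[0].message.content)
```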

Step 3. Evaluate the agent

Initial evaluation

Challenges addressed

  • What are the right metrics to evaluate quality? How do I trust the outputs of these metrics?
  • I need to evaluate many ideas - how do I…
    • …run evaluation quickly so the majority of my time isn’t spent waiting?
    • …quickly compare these different versions of my agent on quality, cost, and latency?
  • How do I quickly identify the root cause of any quality problems?

What is happening?

Now, run Agent Evaluation's proprietary LLM judges using the synthetic evaluation set to see the quality, cost, and latency of the agent and identify any root causes of quality issues. Agent Evaluation is tightly integrated with mlflow.evaluate().

Mosaic AI Research has invested significantly in the quality AND speed of the LLM judges, optimizing the judges to agree with human raters. Read more details in our blog about how our judges outperform the competition.
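A minimal sketch of this step follows. The run name, model_config values, and the evals DataFrame from Step 1 are assumptions, and the notebook's actual logging parameters may differ.

```python
import mlflow

with mlflow.start_run(run_name="baseline"):
    # Log the agent with MLflow Models from Code.
    logged_agent = mlflow.pyfunc.log_model(
        python_model="fc_agent.py",
        artifact_path="agent",
        model_config={"endpoint_name": "databricks-meta-llama-3-1-70b-instruct", "temperature": 0.01},
        input_example={"messages": [{"role": "user", "content": "What is MLflow?"}]},
    )
    # Run Agent Evaluation's LLM judges over the synthetic evaluation set.
    eval_results = mlflow.evaluate(
        model=logged_agent.model_uri,
        data=evals,                      # synthetic evaluation set from Step 1
        model_type="databricks-agent",   # enables Agent Evaluation
    )
```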

After evaluation runs, click View Evaluation Results to open the MLflow UI for this Run. This lets you:

  • See summary metrics
  • See root cause analysis that identifies the most important issues to fix
  • Inspect individual responses to gain intuition about how the agent is performing
  • See the judge outputs to understand why the responses were graded as pass or fail
  • Compare between multiple runs to see how quality changed between experiments

You can also inspect the other tabs:

  • Overview lets you see the agent's configuration and parameters
  • Artifacts lets you see the agent's code

These UIs, coupled with the speed of evaluation, help you efficiently test your hypotheses to improve quality, letting you reach the production quality bar in less time.

Compare multiple LLMs on quality, cost, and latency

Challenges addressed

  • How to determine the foundational model that offers the right balance of quality, cost, and latency?

What is happening?

Normally, you would use the evaluation results to inform your hypotheses to improve quality, iteratively implementing, evaluating, and comparing each idea to the baseline. This demo assumes that you have fixed any root causes identified above and now want to optimize the agent for quality, cost, and latency.

Here, you run evaluation for several LLMs. After the evaluation runs, click View Evaluation Results to open the MLflow UI for one of the runs. In the MLflow Evaluations UI, use the Compare to Run dropdown to select another run name. This comparison view helps you quickly identify where the agent got better, worse, or stayed the same.

Then, go to the MLflow Experiment page and click the chart icon in the upper left corner by Runs. Here, you can compare the models quantitatively across quality, cost, and latency metrics. The number of tokens used serves as a proxy for cost.

This helps you make informed tradeoffs in partnership with your business stakeholders about quality, cost, and latency. Further, you can use this view to provide quantitative updates to your stakeholders so they can follow your progress improving quality.
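As a sketch, the comparison loop could look like this. The endpoint names are examples; use whichever foundation model endpoints are available in your workspace, and reuse the evals DataFrame and fc_agent.py from the earlier steps.

```python
llm_endpoints = [
    "databricks-meta-llama-3-1-70b-instruct",
    "databricks-meta-llama-3-1-8b-instruct",
]

for endpoint in llm_endpoints:
    with mlflow.start_run(run_name=endpoint):
        # Log one agent version per LLM endpoint, varying only the model configuration.
        logged_agent = mlflow.pyfunc.log_model(
            python_model="fc_agent.py",
            artifact_path="agent",
            model_config={"endpoint_name": endpoint, "temperature": 0.01},
            input_example={"messages": [{"role": "user", "content": "What is MLflow?"}]},
        )
        # Evaluate each version against the same synthetic evaluation set.
        mlflow.evaluate(
            model=logged_agent.model_uri,
            data=evals,
            model_type="databricks-agent",
        )
```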

Step 4. [Optional] Deploy the agent

Deploy to pre-production for stakeholder testing

Challenges addressed

  • How do I quickly create a Chat UI for stakeholders to test the agent?
  • How do I track each piece of feedback and have it linked to what is happening in the bot so I can debug issues – without resorting to spreadsheets?

What is happening?

First, register one of the agent models that you logged above to Unity Catalog. Then, use Agent Framework to deploy the agent to Model Serving using one line of code: agents.deploy().

The resulting Model Serving endpoint:

  • Is connected to the review app, which is a lightweight chat UI that can be shared with any user in your company, even if they don't have Databricks workspace access
  • Is integrated with AI Gateway so every request and response and its accompanying MLflow trace and user feedback is stored in an Inference Table

Optionally, you can turn on Agent Evaluation’s monitoring capabilities, which are unified with the offline experience used above, and get a ready-to-go dashboard that runs judges on a sample of the traffic.
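A minimal sketch of registration and deployment follows. The Unity Catalog model name is a placeholder; use a catalog and schema where you have CREATE MODEL access, and reuse the logged_agent from the evaluation step above.

```python
import mlflow
from databricks import agents

# Register the logged agent model to Unity Catalog.
mlflow.set_registry_uri("databricks-uc")
UC_MODEL_NAME = "main.default.fc_agent"  # placeholder catalog.schema.model
registered = mlflow.register_model(logged_agent.model_uri, UC_MODEL_NAME)

# One line of code: deploys the agent to a Model Serving endpoint with the review app attached.
deployment = agents.deploy(UC_MODEL_NAME, registered.version)
print(deployment)  # includes the serving endpoint and review app details
```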

Step 5. Deploy to production and monitor

Challenges addressed

  • How do I host my agent as a production-ready, scalable service?
  • How do I execute tool code securely and ensure it respects my governance policies?
  • How do I enable telemetry or observability in development and production?
  • How do I monitor my agent’s quality at-scale in production? How do I quickly investigate and fix any quality issues?

With Agent Framework, production deployment is the same for pre-production and production: you already have a highly scalable REST API that can be integrated into your application. This API provides an endpoint to get agent responses and to pass back user feedback so you can use that feedback to improve quality.
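For example, an application could query the deployed agent through the MLflow Deployments client; the endpoint name below is a placeholder standing in for whatever endpoint agents.deploy() created in Step 4.

```python
from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")

response = client.predict(
    endpoint="agents_main-default-fc_agent",  # placeholder; use your deployed endpoint's name
    inputs={"messages": [{"role": "user", "content": "How do I enable MLflow Tracing?"}]},
)
print(response)
```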

To learn more about how monitoring works (in summary, Databricks has adapted a version of the above UIs and LLM judges for monitoring), read the documentation (AWS | Azure) or watch this 2-minute video.