agent-monitoring-example-no-feedback (Python)

Use Lakehouse Monitoring for GenAI to monitor your production agent

This notebook demonstrates how to monitor a deployed GenAI app / Agent using Lakehouse Monitoring for GenAI. It will:

  1. Deploy a "hello world" agent using Agent Framework.
  2. Configure quality monitoring using Agent Evaluation's LLM judges.
  3. Send sample traffic to the deployed endpoint.

Lakehouse Monitoring for GenAI allows you to:

  • Track quality and operational performance (latency, request volume, errors, etc.).
  • Run LLM-based evaluations on production traffic to detect drift or regressions using Agent Evaluation's LLM judges.
  • Deep dive into individual requests to debug and improve agent responses.
  • Transform real-world logs into evaluation sets to drive continuous improvements.

Note: When you deploy agents authored with ChatAgent using Agent Framework's agents.deploy(...), basic monitoring is automatically configured with operational metrics (request volume, latency, error rate, etc.). You can optionally configure quality metrics using Agent Evaluation's proprietary LLM judges.

Install dependencies
%pip install -U -qqqq databricks-agents>=0.17.0 databricks-sdk[openai] backoff uv
dbutils.library.restartPython()
Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.

Select a Unity Catalog schema

Ensure you have CREATE TABLE and CREATE MODEL access in this schema. By default, these values are set to your workspace's default catalog & schema.

# Get the workspace default UC catalog / schema
uc_default_location = spark.sql("select current_catalog() as current_catalog, current_schema() as current_schema").collect()[0]
current_catalog = uc_default_location["current_catalog"]
current_schema = uc_default_location["current_schema"]


# Modify the UC catalog / schema here or at the top of the notebook in the widget editor
dbutils.widgets.text("uc_catalog", current_catalog)
dbutils.widgets.text("uc_schema", current_schema)
UC_CATALOG = dbutils.widgets.get("uc_catalog")
UC_SCHEMA = dbutils.widgets.get("uc_schema")
UC_PREFIX = f"{UC_CATALOG}.{UC_SCHEMA}"
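
Optionally, run a quick sanity check on the selected location. This is a minimal sketch: it only confirms that the schema resolves and prints where the notebook will create assets; it does not verify your CREATE TABLE or CREATE MODEL grants.

# Optional sanity check (a minimal sketch). This does NOT verify CREATE TABLE /
# CREATE MODEL permissions; it only confirms the schema exists.
print(f"Using Unity Catalog location: {UC_PREFIX}")
display(spark.sql(f"DESCRIBE SCHEMA {UC_CATALOG}.{UC_SCHEMA}"))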

Agent creation and deployment

In this section, we will:

  1. Create a simple agent using the Llama 3.3 70B Instruct model serving endpoint.
  2. Log the agent using MLflow.
  3. Deploy the agent. This will automatically set up basic monitoring that tracks request volume, latency, and errors.

You can skip this step if you already have a deployed agent.

%%writefile hello_world_agent.py
from typing import Any, Generator, Optional

import mlflow
from databricks.sdk import WorkspaceClient
from mlflow.entities import SpanType
from mlflow.pyfunc.model import ChatAgent
from mlflow.types.agent import (
    ChatAgentChunk,
    ChatAgentMessage,
    ChatAgentResponse,
    ChatContext,
)

mlflow.openai.autolog()

# Optional: Replace with any model serving endpoint
LLM_ENDPOINT_NAME = "databricks-meta-llama-3-3-70b-instruct"


class SimpleChatAgent(ChatAgent):
    def __init__(self):
        self.workspace_client = WorkspaceClient()
        self.client = self.workspace_client.serving_endpoints.get_open_ai_client()
        self.llm_endpoint = LLM_ENDPOINT_NAME

        # Fake documents to simulate the retriever
        self.documents = [
            mlflow.entities.Document(
                metadata={"doc_uri": "uri1.txt"},
                page_content="""Lakehouse Monitoring for GenAI helps you monitor the quality, cost, and latency of production GenAI apps.  Lakehouse Monitoring for GenAI allows you to:\n- Track quality and operational performance (latency, request volume, errors, etc.).\n- Run LLM-based evaluations on production traffic to detect drift or regressions using Agent Evaluation's LLM judges.\n- Deep dive into individual requests to debug and improve agent responses.\n- Transform real-world logs into evaluation sets to drive continuous improvements.""",
            ),
            # This is a new document about spark.
            mlflow.entities.Document(
                metadata={"doc_uri": "uri2.txt"},
                page_content="The latest spark version in databricks in 3.5.0",
            ),
        ]

        # Tell Agent Evaluation's judges and review app about the schema of your retrieved documents
        mlflow.models.set_retriever_schema(
            name="fake_vector_search",
            primary_key="doc_uri",
            text_column="page_content",
            doc_uri="doc_uri"
            # other_columns=["column1", "column2"],
        )

    @mlflow.trace(span_type=SpanType.RETRIEVER)
    def dummy_retriever(self):
        # Fake retriever
        return self.documents

    def prepare_messages_for_llm(
        self, messages: list[ChatAgentMessage]
    ) -> list[dict[str, Any]]:
        """Filter out ChatAgentMessage fields that are not compatible with LLM message formats"""
        compatible_keys = ["role", "content", "name", "tool_calls", "tool_call_id"]
        return [
            {
                k: v
                for k, v in m.model_dump_compat(exclude_none=True).items()
                if k in compatible_keys
            }
            for m in messages
        ]

    @mlflow.trace(span_type=SpanType.PARSER)
    def prepare_rag_prompt(self, messages):

        docs = self.dummy_retriever()

        messages = self.prepare_messages_for_llm(messages)

        messages[-1]['content'] = f"Answer the user's question based on the documents.\nDocuments: <documents>{docs}</documents>.\nUser's question: <user_question>{messages[-1]['content']}</user_question>"

        return messages

    @mlflow.trace(span_type=SpanType.AGENT)
    def predict(
        self,
        messages: list[ChatAgentMessage],
        context: Optional[ChatContext] = None,
        custom_inputs: Optional[dict[str, Any]] = None,
    ) -> ChatAgentResponse:
        
        messages = self.prepare_rag_prompt(messages)

        resp = self.client.chat.completions.create(
            model=self.llm_endpoint,
            messages=messages,
        )

        return ChatAgentResponse(
            messages=[
                ChatAgentMessage(**resp.choices[0].message.to_dict(), id=resp.id)
            ],
        )

    @mlflow.trace(span_type=SpanType.AGENT)
    def predict_stream(
        self,
        messages: list[ChatAgentMessage],
        context: Optional[ChatContext] = None,
        custom_inputs: Optional[dict[str, Any]] = None,
    ) -> Generator[ChatAgentChunk, None, None]:
        
        messages = self.prepare_rag_prompt(messages)

        for chunk in self.client.chat.completions.create(
            model=self.llm_endpoint,
            messages=messages,
            stream=True,
        ):
            if not chunk.choices or not chunk.choices[0].delta.content:
                continue

            yield ChatAgentChunk(
                delta=ChatAgentMessage(
                    **{
                        "role": "assistant",
                        "content": chunk.choices[0].delta.content,
                        "id": chunk.id,
                    }
                )
            )


from mlflow.models import set_model

AGENT = SimpleChatAgent()
set_model(AGENT)
Writing hello_world_agent.py

Test the agent locally

%load_ext autoreload 
%autoreload 2
from hello_world_agent import AGENT
AGENT.predict({
        "messages": [{"role": "user", "content": "How do I monitor my genai app?"}]
    })
ChatAgentResponse(messages=[ChatAgentMessage(role='assistant', content='To monitor your GenAI app, you can use Lakehouse Monitoring for GenAI, which allows you to track quality and operational performance, including latency, request volume, and errors. Additionally, you can run LLM-based evaluations on production traffic to detect drift or regressions, and deep dive into individual requests to debug and improve agent responses. This feature also enables you to transform real-world logs into evaluation sets to drive continuous improvements.', name=None, id='chatcmpl_986a61f3-873b-4ebd-8f9b-8de066e3c9f8', tool_calls=None, tool_call_id=None, attachments=None)], finish_reason=None, custom_outputs=None, usage=None)
Trace(request_id=tr-6259e3860b964eeab635f07fe72e82db)
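
The streaming path defined in the agent (predict_stream) can also be exercised locally. A minimal sketch, assuming predict_stream accepts the same dict-style input that predict accepted above; chunk.delta follows the ChatAgentChunk schema defined in the agent code.

# Exercise the streaming path locally (a minimal sketch; see assumptions above).
for chunk in AGENT.predict_stream({
        "messages": [{"role": "user", "content": "How do I monitor my genai app?"}]
    }):
    print(chunk.delta.content, end="")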
import mlflow
from mlflow.models.resources import DatabricksServingEndpoint

with mlflow.start_run():
  model_info = mlflow.pyfunc.log_model(
    python_model="hello_world_agent.py",
    artifact_path="agent",
    input_example={
        "messages": [{"role": "user", "content": "How do I monitor my genai app?"}]
    },  
    resources=[DatabricksServingEndpoint(endpoint_name="databricks-meta-llama-3-3-70b-instruct")],
    pip_requirements=["databricks-sdk[openai]", "mlflow", "databricks-agents", "backoff"],    
  )
2025/03/13 13:19:58 INFO mlflow.pyfunc: Predicting on input example to validate output
# Let's validate that the model can be loaded, and try invoking it.
mlflow.models.predict(
    model_uri=model_info.model_uri,
    input_data={"messages": [{"role": "user", "content": "How do I monitor my genai app?"}]},
    env_manager="uv",
)
2025/03/13 13:20:08 INFO mlflow.models.flavor_backend_registry: Selected backend for flavor 'python_function'
2025/03/13 13:20:09 INFO mlflow.utils.virtualenv: Creating a new environment in /local_disk0/.ephemeral_nfs/repl_tmp_data/ReplId-1958f-7bf95-0/mlflow/envs/virtualenv_envs/mlflow-97a0af8652bd98e99256b2e6911ea0243c4c1e43 with python version 3.12.3 using uv Using CPython 3.12.3 interpreter at: /usr/bin/python3.12 Creating virtual environment at: /local_disk0/.ephemeral_nfs/repl_tmp_data/ReplId-1958f-7bf95-0/mlflow/envs/virtualenv_envs/mlflow-97a0af8652bd98e99256b2e6911ea0243c4c1e43 Activate with: source /local_disk0/.ephemeral_nfs/repl_tmp_data/ReplId-1958f-7bf95-0/mlflow/envs/virtualenv_envs/mlflow-97a0af8652bd98e99256b2e6911ea0243c4c1e43/bin/activate 2025/03/13 13:20:10 INFO mlflow.utils.virtualenv: Installing dependencies Using Python 3.12.3 environment at: /local_disk0/.ephemeral_nfs/repl_tmp_data/ReplId-1958f-7bf95-0/mlflow/envs/virtualenv_envs/mlflow-97a0af8652bd98e99256b2e6911ea0243c4c1e43 Resolved 3 packages in 84ms Downloading pip (2.0MiB) Downloading setuptools (1.2MiB) Downloaded pip Downloaded setuptools Prepared 3 packages in 173ms warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. If the cache and target directories are on different filesystems, hardlinking may not be supported. If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. Installed 3 packages in 44ms + pip==24.0 + setuptools==74.0.0 + wheel==0.43.0 Using Python 3.12.3 environment at: /local_disk0/.ephemeral_nfs/repl_tmp_data/ReplId-1958f-7bf95-0/mlflow/envs/virtualenv_envs/mlflow-97a0af8652bd98e99256b2e6911ea0243c4c1e43 Resolved 120 packages in 850ms Downloading mlflow (26.6MiB) Downloading matplotlib (8.2MiB) Downloading cryptography (4.0MiB) Downloading kiwisolver (1.4MiB) Downloading grpcio (5.6MiB) Downloading pandas (12.1MiB) Downloading fonttools (4.6MiB) Downloading numpy (17.1MiB) Downloading pillow (4.3MiB) Downloading scipy (35.6MiB) Downloading pyarrow (40.1MiB) Downloading zstandard (5.2MiB) Downloading sqlalchemy (3.1MiB) Downloading botocore (12.8MiB) Downloading tiktoken (1.1MiB) Downloading scikit-learn (12.5MiB) Downloading mlflow-skinny (5.8MiB) Downloading pydantic-core (1.9MiB) Downloading databricks-connect (2.3MiB) Downloaded tiktoken Downloaded kiwisolver Downloaded pydantic-core Downloaded sqlalchemy Downloaded databricks-connect Downloaded pillow Downloaded cryptography Downloaded fonttools Downloaded zstandard Downloaded grpcio Downloaded matplotlib Downloaded mlflow-skinny Downloaded scikit-learn Downloaded numpy Downloaded pandas Downloaded mlflow Downloaded botocore Downloaded scipy Downloaded pyarrow Prepared 119 packages in 3.88s warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. If the cache and target directories are on different filesystems, hardlinking may not be supported. If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 
Installed 119 packages in 668ms + alembic==1.15.1 + annotated-types==0.7.0 + anyio==4.8.0 + azure-core==1.32.0 + azure-storage-blob==12.25.0 + azure-storage-file-datalake==12.19.0 + backoff==2.2.1 + blinker==1.9.0 + boto3==1.37.11 + botocore==1.37.11 + cachetools==5.5.2 + certifi==2025.1.31 + cffi==1.17.1 + charset-normalizer==3.4.1 + click==8.1.8 + cloudpickle==3.1.1 + contourpy==1.3.1 + cryptography==44.0.2 + cycler==0.12.1 + databricks-agents==0.17.2 + databricks-connect==16.2.0 + databricks-sdk==0.46.0 + dataclasses-json==0.6.7 + deprecated==1.2.18 + distro==1.9.0 + docker==7.1.0 + fastapi==0.115.11 + flask==3.1.0 + fonttools==4.56.0 + gitdb==4.0.12 + gitpython==3.1.44 + google-api-core==2.24.2 + google-auth==2.38.0 + google-cloud-core==2.4.3 + google-cloud-storage==3.1.0 + google-crc32c==1.6.0 + google-resumable-media==2.7.2 + googleapis-common-protos==1.69.1 + graphene==3.4.3 + graphql-core==3.2.6 + graphql-relay==3.2.0 + greenlet==3.1.1 + grpcio==1.71.0 + grpcio-status==1.71.0 + gunicorn==23.0.0 + h11==0.14.0 + httpcore==1.0.7 + httpx==0.28.1 + idna==3.10 + importlib-metadata==8.6.1 + isodate==0.7.2 + itsdangerous==2.2.0 + jinja2==3.1.6 + jiter==0.9.0 + jmespath==1.0.1 + joblib==1.4.2 + jsonpatch==1.33 + jsonpointer==3.0.0 + kiwisolver==1.4.8 + langchain-core==0.3.45rc1 + langchain-openai==0.3.9rc1 + langsmith==0.3.14rc1 + mako==1.3.9 + markdown==3.7 + markupsafe==3.0.2 + marshmallow==3.26.1 + matplotlib==3.10.1 + mlflow==2.21.0rc0 + mlflow-skinny==2.21.0rc0 + mypy-extensions==1.0.0 + numpy==1.26.4 + openai==1.66.3 + opentelemetry-api==1.31.0 + opentelemetry-sdk==1.31.0 + opentelemetry-semantic-conventions==0.52b0 + orjson==3.10.15 + packaging==24.2 + pandas==2.2.3 + pillow==11.1.0 + proto-plus==1.26.1 + protobuf==5.29.3 + py4j==0.10.9.7 + pyarrow==19.0.1 + pyasn1==0.6.1 + pyasn1-modules==0.4.1 + pycparser==2.22 + pydantic==2.11.0b1 + pydantic-core==2.31.1 + pyparsing==3.2.1 + python-dateutil==2.9.0.post0 + pytz==2025.1 + pyyaml==6.0.2 + regex==2024.11.6 + requests==2.32.3 + requests-toolbelt==1.0.0 + rsa==4.9 + s3transfer==0.11.4 + scikit-learn==1.6.1 + scipy==1.15.2 + six==1.17.0 + smmap==5.0.2 + sniffio==1.3.1 + sqlalchemy==2.0.39 + sqlparse==0.5.3 + starlette==0.46.1 + tenacity==9.0.0 + threadpoolctl==3.5.0 + tiktoken==0.9.0 + tqdm==4.67.1 + typing-extensions==4.12.2 + typing-inspect==0.9.0 + typing-inspection==0.4.0 + tzdata==2025.1 + urllib3==2.3.0 + uvicorn==0.34.0 + werkzeug==3.1.3 + wrapt==1.17.2 + zipp==3.21.0 + zstandard==0.23.0 2025/03/13 13:20:16 INFO mlflow.utils.environment: === Running command '['bash', '-c', 'source /local_disk0/.ephemeral_nfs/repl_tmp_data/ReplId-1958f-7bf95-0/mlflow/envs/virtualenv_envs/mlflow-97a0af8652bd98e99256b2e6911ea0243c4c1e43/bin/activate && python -c ""']' 2025/03/13 13:20:16 INFO mlflow.utils.environment: === Running command '['bash', '-c', 'source /local_disk0/.ephemeral_nfs/repl_tmp_data/ReplId-1958f-7bf95-0/mlflow/envs/virtualenv_envs/mlflow-97a0af8652bd98e99256b2e6911ea0243c4c1e43/bin/activate && python /local_disk0/.ephemeral_nfs/envs/pythonEnv-86a069c3-b27e-497b-b901-ef7491e429e3/lib/python3.12/site-packages/mlflow/pyfunc/_mlflow_pyfunc_backend_predict.py --model-uri file:///local_disk0/repl_tmp_data/ReplId-1958f-7bf95-0/tmpjz26fq6b/agent --content-type json --input-path /local_disk0/repl_tmp_data/ReplId-1958f-7bf95-0/tmpk7cne3jf/input.json']' 2025/03/13 13:20:23 WARNING mlflow.tracing.processor.mlflow: Creating a trace within the default experiment with id '0'. 
It is strongly recommended to not use the default experiment to log traces due to ambiguous search results and probable performance issues over time due to directory table listing performance degradation with high volumes of directories within a specific path. To avoid performance and disambiguation issues, set the experiment for your environment using `mlflow.set_experiment()` API. 2025/03/13 13:20:23 WARNING mlflow.tracking.client: Failed to start trace Completions: RESOURCE_DOES_NOT_EXIST: Experiment with id '0' does not exist.. For full traceback, set logging level to debug. {"messages": [{"role": "assistant", "content": "To monitor your GenAI app, you can use Lakehouse Monitoring for GenAI, which allows you to track quality and operational performance, including latency, request volume, and errors. Additionally, you can run LLM-based evaluations on production traffic to detect drift or regressions and deep dive into individual requests to debug and improve agent responses. This tool also enables you to transform real-world logs into evaluation sets to drive continuous improvements.", "id": "chatcmpl_6458e833-bf5d-416f-a66f-8eaed74b99ed"}]}
from databricks.agents import deploy

# Set the name of the model to use in your Unity Catalog schema defined at the top of this notebook

MODEL_NAME = "my_demo_agent"

# Register the model in Unity Catalog and deploy it as a serving endpoint
mlflow.set_registry_uri("databricks-uc")
uc_model_info = mlflow.register_model(
    model_uri=model_info.model_uri, name=f"{UC_CATALOG}.{UC_SCHEMA}.{MODEL_NAME}"
)
deployment = deploy(model_name=uc_model_info.name, model_version=uc_model_info.version)
Successfully registered model 'agents_demo.playground.my_demo_agent'.
Created version '1' of model 'agents_demo.playground.my_demo_agent'.
/local_disk0/.ephemeral_nfs/envs/pythonEnv-86a069c3-b27e-497b-b901-ef7491e429e3/lib/python3.12/site-packages/mlflow/pyfunc/utils/data_validation.py:168: UserWarning: Add type hints to the `predict` method to enable data validation and automatic signature inference during model logging. Check https://mlflow.org/docs/latest/model/python_model.html#type-hint-usage-in-pythonmodel for more details. color_warning( Deployment of agents_demo.playground.my_demo_agent version 1 initiated. This can take up to 15 minutes and the Review App & Query Endpoint will not work until this deployment finishes. View status: https://e2-demo-field-eng.cloud.databricks.com/ml/endpoints/agents_agents_demo-playground-my_demo_agent Review App: https://e2-demo-field-eng.cloud.databricks.com/ml/review-v2/a4aefeb7246c4dca87409eeb1ed2575b/chat Monitor: https://e2-demo-field-eng.cloud.databricks.com/ml/experiments/1315887243144609/evaluation-monitoring?endpointName=agents_agents_demo-playground-my_demo_agent
from databricks.sdk.service.serving import EndpointStateReady, EndpointStateConfigUpdate
from databricks.sdk import WorkspaceClient
import time

print("\nWaiting for endpoint to deploy.  This can take 10 - 20 minutes.", end="")
w = WorkspaceClient()
while w.serving_endpoints.get(deployment.endpoint_name).state.ready == EndpointStateReady.NOT_READY or w.serving_endpoints.get(deployment.endpoint_name).state.config_update == EndpointStateConfigUpdate.IN_PROGRESS:
    print(".", end="")
    time.sleep(30)

print("\nREADY!")
Waiting for endpoint to deploy. This can take 10 - 20 minutes. READY!
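
As an alternative to the polling loop above, the Databricks SDK provides waiter helpers for serving endpoints. A sketch, assuming wait_get_serving_endpoint_not_updating is available in your databricks-sdk version:

import datetime

# Block until the endpoint config update finishes (assumed SDK waiter helper).
w.serving_endpoints.wait_get_serving_endpoint_not_updating(
    name=deployment.endpoint_name,
    timeout=datetime.timedelta(minutes=30),
)
print("READY!")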

Configuring Quality Monitoring Metrics

Since our agent was deployed using agents.deploy, basic monitoring (request volume, latency, errors) is already set up automatically. The monitor is attached to this notebook's MLflow Experiment by default.

Now we'll add quality metrics, computed by Agent Evaluation's LLM judges, to the monitor. The monitoring configuration specified here will:

  • Sample 100% of requests for evaluation.
  • Evaluate responses against safety, relevance to the query, chunk relevance, groundedness (lack of hallucinations), and custom guidelines.

Agent Evaluation's built-in judges

  • Judges that run without ground-truth labels or retrieval in traces:
    • guideline_adherence: lets developers write plain-language checklists or rubrics for their evaluation, improving transparency and trust with business stakeholders through easy-to-understand, structured grading rubrics.
    • safety: checks that the response does not contain harmful or toxic content.
    • relevance_to_query: checks that the response is relevant to the user's request.
  • For traces with retrieved documents (spans of type RETRIEVER):
    • groundedness: detects hallucinations by checking that the response is supported by the retrieved documents.
    • chunk_relevance: checks the relevance of each retrieved chunk to the query.

See the full list of built-in judges (AWS | Azure).

from databricks.agents.evals.monitors import create_monitor, get_monitor, update_monitor, delete_monitor

# Get the current monitor configuration 
monitor = get_monitor(endpoint_name=deployment.endpoint_name)
# Update the monitor to add evaluation metrics
monitor = update_monitor(
    endpoint_name=deployment.endpoint_name,
    monitoring_config={
        "sample": 1,  # Sample 100% of requests - this can be any number from 0 (0%) to 1 (100%).
        # Select 0+ of Agent Evaluation's built-in judges
        "metrics": ['guideline_adherence', 'groundedness', 'safety', 'relevance_to_query', 'chunk_relevance'],
        # Customize these guidelines based on your business requirements. These guidelines will be analyzed using Agent Evaluation's built-in guideline_adherence judge.
        "global_guidelines": {
            "english": ["The response must be in English."],
            "clarity": ["The response must be clear, coherent, and concise."],
            "relevant_if_not_refusal": ["Determine if the response provides an answer to the user's request.  A refusal to answer is considered relevant.  However, if the response is NOT a refusal BUT also doesn't provide relevant information, then the answer is not relevant."],
            "no_answer_if_no_docs": ["If the agent can not find a relevant document, it should refuse to answer the question and not discuss the reasons why it could not answer."]
        }
    }
)
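
To confirm the update took effect, you can inspect the returned monitor object. A minimal sketch; evaluated_traces_table is the Delta table of evaluated requests that is queried at the end of this notebook.

# Inspect the updated monitor (a minimal sketch).
print("Evaluated traces table:", monitor.evaluated_traces_table)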

Generate Sample Traffic

Now that our endpoint is deployed, we'll send some sample questions to generate traffic for monitoring.

Send simulated traffic to the endpoint

from mlflow import deployments

client = deployments.get_deploy_client("databricks")

questions = [
    "What is Mosaic AI Agent Evaluation?",
    "How do you use MLflow with Databricks for experiment tracking?",
    "What should I use Databricks Feature Store for?",
    "How does AutoML work in Databricks?",
    "What is Model Serving in Databricks and what are its deployment options?",
    "How does Databricks handle distributed deep learning training?",
    "Does Unity Catalog support models?",
    "What is the Databricks Lakehouse?",
    "Which Llama models are supported on Databricks?",
    "How does Databricks integrate with popular ML frameworks like PyTorch and TensorFlow?"
]

for i, question in enumerate(questions, 1):
    print(f"\nQuestion {i}: {question}")  
    response = client.predict(
        endpoint=deployment.endpoint_name,
        inputs={
            "messages": [
                {"role": "user", "content": question}
            ]
        }
    )
    print(response)
    
Question 1: What is Mosaic AI Agent Evaluation? Attaching feedback: {"dataframe_records": [{"source": {"id": "fake.user@databricks.com", "type": "human"}, "request_id": "0693d8bc-fc31-4536-b81f-4ce4dd6a45a2", "text_assessments": [{"ratings": {"answer_correct": {"value": "positive"}}, "free_text_comment": "", "suggested_output": ""}], "retrieval_assessments": []}]} Question 2: How do you use MLflow with Databricks for experiment tracking? Attaching feedback: {"dataframe_records": [{"source": {"id": "fake.user@databricks.com", "type": "human"}, "request_id": "4eccc481-eeb6-4bdf-978e-4eb296396fc0", "text_assessments": [{"ratings": {"answer_correct": {"value": "positive"}}, "free_text_comment": "", "suggested_output": ""}], "retrieval_assessments": []}]} Question 3: What should I use Databricks Feature Store for? Attaching feedback: {"dataframe_records": [{"source": {"id": "fake.user@databricks.com", "type": "human"}, "request_id": "3d4c781b-b20a-4709-8fcc-26f68e48108b", "text_assessments": [{"ratings": {"answer_correct": {"value": "positive"}}, "free_text_comment": "", "suggested_output": ""}], "retrieval_assessments": []}]} Question 4: How does AutoML work in Databricks? Attaching feedback: {"dataframe_records": [{"source": {"id": "fake.user@databricks.com", "type": "human"}, "request_id": "90bd6f75-96da-44c5-a3e9-aa9d2532ade6", "text_assessments": [{"ratings": {"answer_correct": {"value": "positive"}}, "free_text_comment": "", "suggested_output": ""}], "retrieval_assessments": []}]} Question 5: What is Model Serving in Databricks and what are its deployment options? Attaching feedback: {"dataframe_records": [{"source": {"id": "fake.user@databricks.com", "type": "human"}, "request_id": "799e227c-164f-4ae8-8311-bf6d104ad6ee", "text_assessments": [{"ratings": {"answer_correct": {"value": "positive"}}, "free_text_comment": "", "suggested_output": ""}], "retrieval_assessments": []}]} Question 6: How does Databricks handle distributed deep learning training? Attaching feedback: {"dataframe_records": [{"source": {"id": "fake.user@databricks.com", "type": "human"}, "request_id": "4ea9668a-b41a-4fe3-805f-92e3ff401ae4", "text_assessments": [{"ratings": {"answer_correct": {"value": "positive"}}, "free_text_comment": "", "suggested_output": ""}], "retrieval_assessments": []}]} Question 7: Does Unity Catalog support models? Attaching feedback: {"dataframe_records": [{"source": {"id": "fake.user@databricks.com", "type": "human"}, "request_id": "36e7e158-c1dc-4a30-8d95-acb76dfd8468", "text_assessments": [{"ratings": {"answer_correct": {"value": "positive"}}, "free_text_comment": "", "suggested_output": ""}], "retrieval_assessments": []}]} Question 8: What is the Databricks Lakehouse? Attaching feedback: {"dataframe_records": [{"source": {"id": "fake.user@databricks.com", "type": "human"}, "request_id": "226fc72e-7cec-4a5a-86a7-3e4f21ea46a1", "text_assessments": [{"ratings": {"answer_correct": {"value": "positive"}}, "free_text_comment": "", "suggested_output": ""}], "retrieval_assessments": []}]} Question 9: Which Llama models are supported on Databricks? Attaching feedback: {"dataframe_records": [{"source": {"id": "fake.user@databricks.com", "type": "human"}, "request_id": "3ca3f1ea-d192-493d-bb41-64da548dc67c", "text_assessments": [{"ratings": {"answer_correct": {"value": "positive"}}, "free_text_comment": "", "suggested_output": ""}], "retrieval_assessments": []}]} Question 10: How does Databricks integrate with popular ML frameworks like PyTorch and TensorFlow? 
Attaching feedback: {"dataframe_records": [{"source": {"id": "fake.user@databricks.com", "type": "human"}, "request_id": "e8b382bb-4001-4e8d-91bd-9e1373a2839f", "text_assessments": [{"ratings": {"answer_correct": {"value": "positive"}}, "free_text_comment": "", "suggested_output": ""}], "retrieval_assessments": []}]}

[Optional] Enable integration with Review App and Evaluation Sets

To fix any quality issues identified in the monitoring dashboard, you can:

  1. Copy the production trace to an evaluation dataset to use it as a test case in mlflow.evaluate(...)
  2. Send the production trace to the Review App to collect domain expert input/labels

To enable these features, you need to create an Evaluation Set and Labeling Session. For more information on these concepts, see the documentation.

1. Create an evaluation dataset

Your monitor will show all evaluation datasets linked to the MLflow Experiment where the monitor is configured. By default, this is the notebook's MLflow Experiment.

from databricks.agents import datasets
from databricks.sdk.errors.platform import NotFound

# Make sure you have updated the uc_catalog & uc_schema widgets to a valid catalog/schema where you have CREATE TABLE permissions.
EVAL_DATASET_NAME = "agent_evaluation_set"

UC_TABLE_NAME = f'{UC_CATALOG}.{UC_SCHEMA}.{EVAL_DATASET_NAME}'

# Remove the evaluation dataset if it already exists
try:
  datasets.delete_dataset(UC_TABLE_NAME)
except NotFound:
  pass

# Create the evaluation dataset
dataset = datasets.create_dataset(UC_TABLE_NAME)
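
Once the monitor has evaluated some traffic, selected production traces can be copied into this dataset (step 1 above). The sketch below is illustrative only: it assumes the Dataset object returned by create_dataset exposes a merge_records method that accepts the trace records returned by mlflow.search_traces, and that monitored traces are logged to this notebook's MLflow Experiment.

# Illustrative sketch (see assumptions above): copy a few recent production
# traces from the MLflow Experiment into the evaluation dataset.
import mlflow

recent_traces = mlflow.search_traces(max_results=5)  # pandas DataFrame of traces
dataset.merge_records(recent_traces)                 # assumed API for adding records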

2. Create a labeling session

Your monitor will show all labeling sessions linked to the MLflow Experiment where the monitor is configured. By default, this is the notebook's MLflow Experiment.

from databricks.agents import review_app

# OPTIONAL: Add a list of the emails of domain experts who will provide feedback/labels.
# If not provided, only the user running this notebook will be granted access to the review app.
DOMAIN_EXPERT_EMAILS = []

# Get the Review App from the current MLflow Experiment
my_review_app = review_app.get_review_app()

# Optional: Add a custom question for your domain experts
my_review_app.create_label_schema(
  name="good_response",
  # Type can be "expectation" or "feedback".
  type="feedback",
  title="Is this a good response?",
  input=review_app.label_schemas.InputCategorical(options=["Yes", "No"]),
  instruction="Optional: provide a rationale below.",
  enable_comment=True,
  overwrite=True
)

my_session = my_review_app.create_labeling_session(
    name="collect_facts",
    assigned_users=DOMAIN_EXPERT_EMAILS, # If not provided, only the user running this notebook will be granted access
    # Built-in labeling schemas: EXPECTED_FACTS, GUIDELINES, EXPECTED_RESPONSE
    label_schemas=[review_app.label_schemas.GUIDELINES,  "good_response"],
)

# URLs to share with the SME.
print("Review App URL:", my_review_app.url)
print("Labeling session URL: ", my_session.url)

Viewing Monitoring Results

The monitoring results are stored in Delta tables and can be accessed in two ways:

  1. Through the MLflow UI (click the link generated above)
  2. Directly querying the Delta table containing evaluated traces

Below, we'll query the Delta table to see the evaluation results, filtering out skipped evaluations.

If you do not see monitoring results, wait until the next run of the monitoring job.

# Read evaluated traces from Delta.
# Re-fetch the monitor in case the `monitor` variable from the earlier cell is no longer defined in this session.
monitor = get_monitor(endpoint_name=deployment.endpoint_name)
display(spark.table(monitor.evaluated_traces_table).filter("evaluation_status != 'skipped'"))

Cleanup

When you're done with the demo, you can delete the endpoint and the monitor using the code below.

from databricks.agents import delete_deployment

delete_deployment(model_name=uc_model_info.name, model_version=uc_model_info.version)
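
If you also want to remove the monitoring and evaluation assets created in this notebook, a minimal sketch using the APIs imported earlier is below. Assumption: delete_monitor accepts endpoint_name, matching get_monitor and update_monitor above.

# Optional cleanup of monitoring / evaluation assets (a minimal sketch).
from databricks.agents.evals.monitors import delete_monitor
from databricks.agents import datasets

delete_monitor(endpoint_name=deployment.endpoint_name)  # assumed to take endpoint_name
datasets.delete_dataset(UC_TABLE_NAME)                   # remove the evaluation dataset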