Deploy an agent for generative AI application
Preview
This feature is in Public Preview.
This article shows how to deploy your AI agent using the deploy()
function from the Agent Framework Python API.
Requirements
MLflow 2.13.1 or above to deploy agents using the the
deploy()
API fromdatabricks.agents
.Register an AI agent to Unity Catalog. See Register the chain to Unity Catalog.
Install the the
databricks-agents
SDK.%pip install databricks-agents dbutils.library.restartPython()
Deploy an agent using deploy()
The deploy() function does the following:
Creates CPU model serving endpoints for your agent that can be integrated into your user-facing application.
To reduce cost for idle endpoints (at the expense of increased time to serve initial queries), you can enable scale to zero for your serving endpoint by passing
scale_to_zero_enabled=True
todeploy()
. See Endpoint scaling expectations.Inference tables are enabled on these model serving endpoints. See Inference tables for monitoring and debugging models.
Authentication credentials are automatically passed to all Databricks-managed resources required by the agent as specified when logging the model. Databricks creates a service principal that has access to these resources, and automatically passes that into the endpoint. See Authentication for dependent resources.
If you have resource dependencies that are not Databricks-managed, for example using Pinecone, you can pass in environment variables with secrets to the
deploy()
API. See Configure access to resources from model serving endpoints.
Enables the Review App for your agent. The Review App allows your stakeholders to chat with the agent and give feedback using the Review App UI.
Logs every request to the Review App or REST API to an inference table. The data logged includes query requests, responses, and intermediate trace data from MLflow Tracing.
Creates a feedback model with the same catalog and schema as the agent you are trying to deploy. This feedback model is the mechanism that makes it possible to accept feedback from the Review App and log it to an inference table. This model is served in the same CPU model serving endpoint as your deployed agent. Because this serving endpoint has inference tables enabled, it is possible to log feedback from the Review App to an inference table.
Note
Deployments can take up to 15 minutes to complete. Raw JSON payloads take 10 - 30 minutes to arrive, and the formatted logs are processed from the raw payloads about every hour.
from databricks.agents import deploy
from mlflow.utils import databricks_utils as du
deployment = deploy(model_fqn, uc_model_info.version)
# query_endpoint is the URL that can be used to make queries to the app
deployment.query_endpoint
# Copy deployment.rag_app_url to browser and start interacting with your RAG application.
deployment.rag_app_url
Agent-enhanced inference tables
The deploy()
creates three inference tables for each deployment to log requests and responses to and from the agent serving endpoint. Users can expect the data to be in the payload table within an hour of interacting with their deployment.
Payload request logs and assessment logs might take longer to populate, but are ultimately derived from the raw payload table. You can extract request and assessment logs from the payload table yourself. Deletions and updates to the payload table are not reflected in the payload request logs or the payload assessment logs.
Table |
Example Unity Catalog table name |
What is in each table |
---|---|---|
Payload |
|
Raw JSON request and response payloads |
Payload request logs |
|
Formatted request and responses, MLflow traces |
Payload assessment logs |
|
Formatted feedback, as provided in the Review App, for each request |
The following shows the schema for the request logs table.
Column name |
Type |
Description |
---|---|---|
|
String |
Client request ID, usually |
|
String |
Databricks request ID. |
|
Date |
Date of request. |
|
Long |
Timestamp in milliseconds. |
|
Timestamp |
Timestamp of the request. |
|
Integer |
Status code of endpoint. |
|
Long |
Total execution milliseconds. |
|
String |
Conversation id extracted from request logs. |
|
String |
The last user query from the user’s conversation. This is extracted from the RAG request. |
|
String |
The last response to the user. This is extracted from the RAG request. |
|
String |
String representation of request. |
|
String |
String representation of response. |
|
String |
String representation of trace extracted from the |
|
Double |
Sampling fraction. |
|
Map[String, String] |
A map of metadata related to the model serving endpoint associated with the request. This map contains the endpoint name, model name, and model version used for your endpoint. |
|
String |
Integer for the schema version. |
The following is the schema for the assessment logs table.
Column name |
Type |
Description |
---|---|---|
|
String |
Databricks request ID. |
|
String |
Derived from retrieval assessment. |
|
Struct |
A struct field containing the information on who created the assessment. |
|
Timestamp |
Timestamp of request. |
|
Struct |
A struct field containing the data for any feedback on the agent’s responses from the review app. |
|
Struct |
A struct field containing the data for any feedback on the documents retrieved for a response. |
Permission requirements for dependent resources
When deploying a model with dependent resources, the creator of the endpoint must have the following permissions depending on the resource type:
Resource type |
Permission |
---|---|
Sql Warehouse |
Use Endpoint |
Model Serving Endpoint |
Can Query |
Unity Catalog Function |
Execute |
Genie space |
Execute |
Vector Search Index |
ReadVectorIndex |
Unity Catalog Table |
Can Read |
Authentication for dependent resources
When creating the model serving endpoint for agent deployment, Databricks verifies that the creator of the endpoint has permissions to access all resources on which the agent depends.
For LangChain flavored agents, dependent resources are automatically inferred during agent creation and logging. Those resources are logged in the resources.yaml
file in the logged model artifact. During deployment, databricks.agents.deploy
automatically creates the M2M OAuth tokens required to access and communicate with these inferred resource dependencies.
For PyFunc flavored agents, you must manually specify any resource dependencies during logging of the deployed agent in the resources
parameter. See Specify resources for PyFunc or LangChain agent.
During deployment, databricks.agents.deploy
creates an M2M OAuth token with access to the resources specified in the resources
parameter, and deploys it to the deployed agent.
Automatic authentication passthrough
The following table lists the features that support automatic authentication passthrough. Automatic authentication passthrough uses the credentials of the deployment creator to automatically authenticate against supported features.
Feature |
Minimum |
---|---|
Vector search indexes |
Requires |
Model Serving endpoints |
Requires |
SQL warehouses |
Requires |
Unity Catalog Functions |
Requires |
Manual authentication
If you have a dependent resource that does not support automatic authentication passthrough, or if you want to use credentials other than those of the deployment creator, you can manually provide credentials using secrets-based environment variables. For example, if using the Databricks SDK in your agent to access other types of dependent resources, you can set the environment variables described in Databricks client unified authentication.
Get deployed applications
The following shows how to get your deployed agents.
from databricks.agents import list_deployments, get_deployments
# Get the deployment for specific model_fqn and version
deployment = get_deployments(model_name=model_fqn, model_version=model_version.version)
deployments = list_deployments()
# Print all the current deployments
deployments
Provide feedback on a deployed agent (experimental)
When you deploy your agent with agents.deploy()
, agent framework also creates and deploys a “feedback” model version within the same endpoint, which you can query to provide feedback on your agent application. Feedback entries appear as request rows within the inference table associated with your agent serving endpoint.
Note that this behavior is experimental: Databricks may provide a first-class API for providing feedback on a deployed agent in the future, and future functionality may require migrating to this API.
Limitations of this API include:
The feedback API lacks input validation - it always responds successfully, even if passed invalid input.
The feedback API requires passing in the Databricks-generated
request_id
of the agent endpoint request on which you wish to provide feedback. To get thedatabricks_request_id
, include{"databricks_options": {"return_trace": True}}
in your original request to the agent serving endpoint. The agent endpoint response will then include thedatabricks_request_id
associated with the request, so that you can pass that request ID back to the feedback API when providing feedback on the agent response.Feedback is collected using inference tables. See inference table limitations.
The following example request provides feedback on the agent endpoint named “your-agent-endpoint-name”, and assumes that the DATABRICKS_TOKEN
environment variable is set to a Databricks REST API token.
curl \
-u token:$DATABRICKS_TOKEN \
-X POST \
-H "Content-Type: application/json" \
-d '
{
"dataframe_records": [
{
"source": {
"id": "user@company.com",
"type": "human"
},
"request_id": "573d4a61-4adb-41bd-96db-0ec8cebc3744",
"text_assessments": [
{
"ratings": {
"answer_correct": {
"value": "positive"
},
"accurate": {
"value": "positive"
}
},
"free_text_comment": "The answer used the provided context to talk about Delta Live Tables"
}
],
"retrieval_assessments": [
{
"ratings": {
"groundedness": {
"value": "positive"
}
}
}
]
}
]
}' \
https://<workspace-host>.databricks.com/serving-endpoints/<your-agent-endpoint-name>/served-models/feedback/invocations
You can pass additional or different key-value pairs in the text_assessments.ratings
and retrieval_assessments.ratings
fields to provide different types of feedback. In the example, the feedback payload indicates that the agent’s response to the request with ID 573d4a61-4adb-41bd-96db-0ec8cebc3744
was correct, accurate, and grounded in context fetched by a retriever tool.