Debug a deployed AI agent

This page covers how to debug common issues with AI agents deployed on Databricks.

Most debugging sections on this page apply to agents deployed to Databricks Apps. However, you can also find debugging information for agents deployed on Model Serving (legacy) using the tab selectors.

Author agents using best practices

Use the following best practices when authoring agents:

  • Enable MLflow tracing: Follow the best practices in Author an AI agent and deploy it on Databricks Apps. Enable MLflow trace autologging to make your agents easier to debug.
  • Document tools clearly: Clear tool and parameter descriptions ensure your agent understands your tools and uses them appropriately. See Improve tool-calling with clear documentation.
  • Add timeouts and token limits to LLM calls: Add timeouts and token limits to the LLM calls in your code to avoid delays caused by long-running steps.
    • If your agent uses the OpenAI client to query a Databricks LLM serving endpoint, set custom timeouts on the serving endpoint calls as needed.
  • Validate configuration before deployment: Run databricks bundle validate before you deploy to catch YAML configuration issues early. This helps identify mismatched resource references, invalid permissions, and syntax errors.
  • Test locally first: Use local development to catch issues before you deploy. Start your agent server locally, test with sample requests, and verify that MLflow traces appear correctly before you deploy to Databricks Apps.
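The timeout advice above can also be enforced generically in agent code. The following is a minimal standard-library sketch, not a Databricks or OpenAI API: it wraps any zero-argument callable (for example, a lambda around an LLM client request) in a hard deadline so one hung request cannot stall the whole agent.

```python
import concurrent.futures

def call_with_timeout(fn, timeout_s=30.0):
    """Run fn() in a worker thread; raise TimeoutError if it exceeds timeout_s.

    fn is any zero-argument callable that wraps an LLM request. Note that an
    abandoned call keeps running in its thread, so prefer the client's own
    timeout setting where one exists and use this only as a backstop.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn).result(timeout=timeout_s)
    finally:
        # Don't block waiting for a hung call to finish.
        pool.shutdown(wait=False)
```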

Debug local development issues

Test your agent locally to identify issues before deployment.

Before you run your agent locally, verify that your environment is configured correctly:

  1. Check Databricks CLI version: Run databricks -v to verify that you have version 0.283.0 or later.

  2. Verify CLI profiles: Run databricks auth profiles to see the configured authentication profiles.

  3. Validate environment configuration: Check that your .env file contains the required variables, especially MLFLOW_TRACKING_URI, which must use the format databricks://PROFILE_NAME to include your CLI profile.
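A hypothetical .env sketch that satisfies step 3 (the values are placeholders; PROFILE_NAME must match a profile listed by databricks auth profiles):

```shell
# .env — placeholder values
MLFLOW_TRACKING_URI=databricks://PROFILE_NAME
MLFLOW_EXPERIMENT_ID=<your-experiment-id>
```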

Common local development errors

| Error | Cause | Solution |
| --- | --- | --- |
| The provided MLFLOW_EXPERIMENT_ID does not exist | Wrong tracking URI format, or the experiment was deleted | Verify that MLFLOW_TRACKING_URI uses the databricks://PROFILE_NAME format with your CLI profile name |
| Module not found | Dependencies not installed | Run uv sync to install dependencies |
| Port already in use | Another process using the port | Use the --port flag to specify a different port (for example, uv run start-app --port 8001) |
| Authentication errors when running locally | The environment is not configured | Run the quickstart script or manually configure the .env file with your CLI profile |

Test the agent locally

To test your agent before deployment:

  1. Start the agent server locally:

    Bash
    uv run start-app
  2. In another terminal, send a test request:

    Bash
    curl -X POST http://localhost:8000/invocations \
    -H "Content-Type: application/json" \
    -d '{"input": [{"role": "user", "content": "hello"}]}'
  3. View MLflow traces in the Databricks UI to verify your agent is logging traces correctly.

Debug configuration issues

Configuration errors in databricks.yml and app.yaml are common sources of deployment failures.

Validate the Databricks Asset Bundles configuration

Validate the Databricks Asset Bundles configuration before deploying the app:

Bash
databricks bundle validate

This command checks your configuration for:

  • YAML syntax errors
  • Missing required fields
  • Invalid resource references
  • Permission configuration issues

Common configuration mismatches

| Configuration point | Rule | How to debug |
| --- | --- | --- |
| valueFrom references in app.yaml | Must exactly match a resource name in databricks.yml | Search for the exact string in both files to verify they match |
| App name | Must start with the agent- prefix (e.g., agent-data-analyst) | Check the name field under resources.apps in databricks.yml |
| Genie space ID | Must be the 32-character hex string from the Genie URL | Extract it from the URL path: https://workspace.cloud.databricks.com/genie/rooms/{SPACE_ID} |
| Unity Catalog function reference | Must use the format catalog.schema.function_name | Verify the function exists using databricks functions get catalog.schema.function_name |
| Lakebase instance reference | Must use value (not valueFrom) in the app.yaml file | The instance name is a literal string, not a resource reference |
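To illustrate the valueFrom and Lakebase rules above, here is a hypothetical pairing (the resource and endpoint names are invented): the valueFrom string in app.yaml must equal a resource name declared in databricks.yml, while a Lakebase instance name is passed as a literal value.

```yaml
# app.yaml (hypothetical names)
env:
  - name: 'SERVING_ENDPOINT'
    valueFrom: 'llm_endpoint'      # must match the resource name below exactly
  - name: 'LAKEBASE_INSTANCE_NAME'
    value: 'my-lakebase-instance'  # literal string, not a resource reference

# databricks.yml (excerpt, shown as comments)
# resources:
#   apps:
#     my_agent:
#       resources:
#         - name: 'llm_endpoint'
#           serving_endpoint:
#             name: '<endpoint-name>'
#             permission: 'CAN_QUERY'
```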

Debug deployment issues

App already exists error

If you see Error: failed to create app - An app with the same name already exists, you have two options:

Option 1: Bind to existing app (recommended)

Bash
# Get existing app configuration
databricks apps get <app-name> --output json

# Sync the configuration to your databricks.yml, then bind
databricks bundle deployment bind <bundle-name> <app-name> --auto-approve

# Deploy
databricks bundle deploy
databricks bundle run <bundle-name>

Option 2: Delete and recreate

Bash
databricks apps delete <app-name>
databricks bundle deploy
databricks bundle run <bundle-name>

App not updating after deployment

databricks bundle deploy only uploads files to the workspace. You must also run databricks bundle run <bundle-name> to restart the app with the new code.

Always deploy using both commands:

Bash
databricks bundle deploy && databricks bundle run <bundle-name>

View deployment status and logs

To check your app's deployment status:

Bash
databricks apps get <app-name>

To view app logs in real-time:

Bash
databricks apps logs <app-name> --follow

Debug runtime errors

Use app logs and request testing to identify issues with your deployed agent.

Analyze app logs

View real-time logs from your deployed app:

Bash
databricks apps logs <app-name> --follow

Look for:

  • Stack traces indicating code errors
  • Permission denied messages for resources
  • Connection errors to external services
  • Timeout messages
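A small triage sketch for those four patterns (the regexes are illustrative assumptions, not documented Databricks log formats):

```python
import re

# Hypothetical patterns, one per category listed above
PATTERNS = {
    "stack trace": re.compile(r"Traceback \(most recent call last\)"),
    "permission": re.compile(r"permission denied", re.IGNORECASE),
    "connection": re.compile(r"connection (refused|reset|error)", re.IGNORECASE),
    "timeout": re.compile(r"timed? ?out", re.IGNORECASE),
}

def classify_log_line(line):
    """Return the categories a log line matches, if any."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(line)]
```

Pipe `databricks apps logs <app-name>` output through a loop over `classify_log_line` to surface only the lines worth reading.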

Common runtime errors

| Error | Cause | Solution |
| --- | --- | --- |
| 302 redirect when querying app | Using a Personal Access Token instead of OAuth | Get an OAuth token with databricks auth token |
| Agent not using available tools | Tools not returned from MCP client | Verify the MCP server URL is correct and the resource has proper permissions in databricks.yml |
| Streaming response breaks mid-response | Connection timeout | Increase the CHAT_PROXY_TIMEOUT_SECONDS environment variable in app.yaml |
| Agent returning "Memory not available" | Missing user_id in request | Pass custom_inputs.user_id in the request payload |
| Empty or error responses despite 200 status | Error occurred within streamed response | Check the actual stream content and app logs, not just the HTTP status code |
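Because errors can arrive inside a 200-OK stream, inspect the streamed lines themselves rather than the status code. A minimal sketch, assuming an SSE-style stream of `data:` lines with an optional top-level error field (the exact event format is an assumption, not a documented response schema):

```python
import json

def find_stream_errors(lines):
    """Scan SSE-style 'data: {...}' lines for embedded error payloads."""
    errors = []
    for line in lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            break
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue
        if "error" in event:
            errors.append(event["error"])
    return errors
```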

Debug authentication errors

OAuth token authentication required

You must use a Databricks OAuth token to query agents deployed to Apps. Using a Personal Access Token (PAT) results in a 302 redirect error.

To get an OAuth token:

Bash
databricks auth token

Use the token in requests to your deployed app:

Bash
TOKEN=$(databricks auth token | jq -r '.access_token')
curl -X POST <app-url>/invocations \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"input": [{"role": "user", "content": "hello"}]}'
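The same token flow can be scripted in Python. This sketch assumes only that `databricks auth token` prints JSON containing an access_token field, as the jq pipeline above does:

```python
import json

def oauth_headers(token_json):
    """Build request headers from the JSON printed by `databricks auth token`."""
    token = json.loads(token_json)["access_token"]
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }

# In practice, capture the CLI output first, for example:
# token_json = subprocess.check_output(["databricks", "auth", "token"], text=True)
```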

Resource permission errors

When your agent cannot access workspace resources, verify the resource is properly configured in databricks.yml. Each resource type requires specific permissions:

| Error | Cause | Solution |
| --- | --- | --- |
| Permission denied on Genie space | Missing genie_space resource | Add a genie_space resource with permission: 'CAN_RUN' |
| Vector search index not accessible | Missing uc_securable resource for the index | Add a uc_securable resource with securable_type: 'TABLE' and permission: 'SELECT' |
| Unity Catalog function execution denied | Missing uc_securable resource for the function | Add a uc_securable resource with securable_type: 'FUNCTION' and permission: 'EXECUTE' |
| Serving endpoint access denied | Missing serving_endpoint resource | Add a serving_endpoint resource with permission: 'CAN_QUERY' |
| SQL warehouse access denied | Missing sql_warehouse resource | Add a sql_warehouse resource with permission: 'CAN_USE' |

Example resource configuration in databricks.yml:

YAML
resources:
  apps:
    my_agent:
      name: 'agent-my-app'
      resources:
        - name: 'my_genie_space'
          genie_space:
            space_id: '01234567890abcdef01234567890abcd'
            permission: 'CAN_RUN'
        - name: 'my_vector_index'
          uc_securable:
            securable_full_name: 'catalog.schema.index_name'
            securable_type: 'TABLE'
            permission: 'SELECT'

Custom MCP server permissions

If your agent connects to a custom MCP server running as a Databricks app, you must manually grant permissions since apps are not yet supported as resource dependencies in databricks.yml.

Bash
# Get your agent app's service principal
AGENT_SP=$(databricks apps get <agent-app-name> --output json | jq -r '.service_principal_name')

# Grant permission on the MCP server app
databricks apps update-permissions <mcp-server-app-name> \
--json "{\"access_control_list\": [{\"service_principal_name\": \"$AGENT_SP\", \"permission_level\": \"CAN_USE\"}]}"

Debug memory and storage issues

For agents using Lakebase for memory storage, the following issues are common:

| Error | Cause | Solution |
| --- | --- | --- |
| relation 'store' does not exist | Memory tables not initialized | Run await store.setup() locally before deploying to create the required tables |
| Unable to resolve Lakebase instance | Wrong instance name or incorrect configuration | Verify that LAKEBASE_INSTANCE_NAME uses value (not valueFrom) in app.yaml and matches the instance_name in databricks.yml |
| permission denied for table store | Missing Lakebase permissions | Add a database resource in databricks.yml with permission: 'CAN_CONNECT_AND_CREATE' |
| Memory not persisting across conversations | Different user_id per request | Ensure you pass a consistent user_id in custom_inputs for each user |
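Memory keying depends on a stable user_id, so it helps to build request payloads in one place. A minimal sketch, following the input and custom_inputs fields used in the request examples on this page:

```python
def build_payload(message, user_id):
    """Invocation payload that pins memory to one user via custom_inputs."""
    return {
        "input": [{"role": "user", "content": message}],
        "custom_inputs": {"user_id": user_id},
    }
```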

Example Lakebase resource configuration:

YAML
resources:
  apps:
    my_agent:
      resources:
        - name: 'memory_database'
          database:
            instance_name: '<lakebase-instance-name>'
            database_name: 'postgres'
            permission: 'CAN_CONNECT_AND_CREATE'

Before deploying an agent with memory, initialize the tables locally:

Python
import asyncio

from databricks_langchain import AsyncDatabricksStore

async def setup_memory():
    async with AsyncDatabricksStore(
        instance_name='your-lakebase-instance',
        embedding_endpoint='databricks-gte-large-en',
        embedding_dims=1024,
    ) as store:
        await store.setup()

asyncio.run(setup_memory())