Monitor Model Serving endpoints with Prometheus and Datadog

This article shows how to use the metrics export API to set up endpoint metric collection and monitoring with Prometheus and Datadog.

Requirements

  • Read access to the target endpoint, and a personal access token (PAT) to authenticate requests to it. You can generate a PAT in User Settings in the Databricks Machine Learning UI.

  • An existing Model Serving endpoint. You can validate this by checking the endpoint health with the following:

    curl -n -X GET -H "Authorization: Bearer [PAT]" https://[DATABRICKS_HOST]/api/2.0/serving-endpoints/[ENDPOINT_NAME]
    
  • Access to the metrics export API. You can validate this with the following:

    curl -n -X GET -H "Authorization: Bearer [PAT]" https://[DATABRICKS_HOST]/api/2.0/serving-endpoints/[ENDPOINT_NAME]/metrics
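
The two curl checks above can also be scripted. The following is a minimal Python sketch, not an official client: the function names (endpoint_url, build_request, check_status) are hypothetical, and the host, endpoint name, and PAT are values you supply. It only uses the URL paths and Bearer header shown in the curl commands.

```python
import urllib.request

# Path prefix taken from the curl commands in this article.
API_PREFIX = "/api/2.0/serving-endpoints"


def endpoint_url(host: str, endpoint_name: str, metrics: bool = False) -> str:
    """Return the endpoint URL, or its metrics export URL when metrics=True."""
    url = f"https://{host}{API_PREFIX}/{endpoint_name}"
    return f"{url}/metrics" if metrics else url


def build_request(host: str, endpoint_name: str, pat: str,
                  metrics: bool = False) -> urllib.request.Request:
    """Build a GET request that carries the PAT as a Bearer token."""
    return urllib.request.Request(
        endpoint_url(host, endpoint_name, metrics),
        headers={"Authorization": f"Bearer {pat}"},
        method="GET",
    )


def check_status(host: str, endpoint_name: str, pat: str,
                 metrics: bool = False, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code.

    urllib raises urllib.error.HTTPError for 4xx and 5xx responses,
    for example when the PAT has expired or the endpoint name is wrong.
    """
    request = build_request(host, endpoint_name, pat, metrics)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling check_status once with metrics=False and once with metrics=True mirrors the two checks in the list above; a return value of 200 for both suggests the endpoint and its metrics export are reachable with your token.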
    

Prometheus integration

Note

Regardless of which type of deployment you have in your production environment, the scraping configuration should be similar.

The guidance in this section follows the Prometheus documentation to start a Prometheus service locally using Docker.

  1. Write a YAML configuration file and name it prometheus.yml. The following is an example:

     global:
       scrape_interval: 1m
       scrape_timeout: 10s
     scrape_configs:
       - job_name: "prometheus"
         metrics_path: "/api/2.0/serving-endpoints/[ENDPOINT_NAME]/metrics"
         scheme: "https"
         authorization:
           type: "Bearer"
           credentials: "[PAT]"
         static_configs:
           # Example workspace hostname; replace it with your own [DATABRICKS_HOST].
           - targets: ["dbc-741cfa95-12d1.dev.databricks.com"]
    
  2. Start Prometheus locally with the following command:

       docker run \
       -p 9090:9090 \
       -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
       prom/prometheus
    
  3. Navigate to http://localhost:9090 to check if your local Prometheus service is up and running.

  4. Check the Prometheus scraper status and debug errors from: http://localhost:9090/targets?search=

  5. Once the target is fully up and running, you can query the provided metrics, like cpu_usage_percentage or mem_usage_percentage, in the UI.
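
Besides the UI, the scraped metrics can be read through the Prometheus HTTP API. The following is a minimal sketch that assumes the local service from the steps above is listening on http://localhost:9090 and that the metric has already been scraped at least once. The function names are hypothetical; the /api/v1/query path and the response layout are those of the Prometheus instant query API.

```python
import json
import urllib.parse
import urllib.request


def query_url(metric: str, base_url: str = "http://localhost:9090") -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": metric})


def latest_values(payload: dict) -> list[tuple[dict, float]]:
    """Extract (labels, value) pairs from an instant-query response.

    Each series in an instant vector carries "value": [timestamp, "value"],
    where the sample value is encoded as a string.
    """
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload.get('error')}")
    return [
        (series["metric"], float(series["value"][1]))
        for series in payload["data"]["result"]
    ]


def fetch(metric: str, base_url: str = "http://localhost:9090") -> list[tuple[dict, float]]:
    """Query the local Prometheus service and return the latest samples."""
    with urllib.request.urlopen(query_url(metric, base_url), timeout=10) as response:
        return latest_values(json.load(response))
```

For example, fetch("cpu_usage_percentage") returns one (labels, value) pair per series. An empty list usually means the target has not been scraped successfully yet, which you can confirm on the targets page from step 4.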

Datadog integration

Note

The preliminary setup for this example is based on the free edition.

Datadog has a variety of agents that can be deployed in different environments. For demonstration purposes, the following launches a macOS agent locally that scrapes the metrics endpoint in your Databricks host. The configuration for other agents should follow a similar pattern.

  1. Register a Datadog account.

  2. Install the OpenMetrics integration from your account dashboard so that Datadog can accept and process OpenMetrics data.

  3. Follow the Datadog documentation to get your Datadog agent up and running. For this example, use the DMG package option to have everything installed including launchctl and datadog-agent.

  4. Locate your OpenMetrics configuration. For this example, the configuration is at ~/.datadog-agent/conf.d/openmetrics.d/conf.yaml.default. The following is an example YAML configuration file.

     instances:
       - openmetrics_endpoint: https://[DATABRICKS_HOST]/api/2.0/serving-endpoints/[ENDPOINT_NAME]/metrics

         metrics:
           - cpu_usage_percentage:
               name: cpu_usage_percentage
               type: gauge
           - mem_usage_percentage:
               name: mem_usage_percentage
               type: gauge
           - provisioned_concurrent_requests_total:
               name: provisioned_concurrent_requests_total
               type: gauge
           - request_4xx_count_total:
               name: request_4xx_count_total
               type: gauge
           - request_5xx_count_total:
               name: request_5xx_count_total
               type: gauge
           - request_count_total:
               name: request_count_total
               type: gauge
           - request_latency_ms:
               name: request_latency_ms
               type: histogram

         tag_by_endpoint: false

         send_distribution_buckets: true

         headers:
           Authorization: Bearer [PAT]
           Content-Type: application/openmetrics-text
    
  5. Start the Datadog agent with launchctl start com.datadoghq.agent.

  6. Each time you change the configuration, restart the agent so that it picks up the change:

     launchctl stop com.datadoghq.agent
     launchctl start com.datadoghq.agent
    
  7. Check the agent health with datadog-agent health.

  8. Check the agent status with datadog-agent status. The output should include a section like the following. If it does not, use the error message to debug. Common causes are an expired PAT or an incorrect URL.

     openmetrics (2.2.2)
     -------------------
       Instance ID: openmetrics: xxxxxxxxxxxxxxxx [OK]
       Configuration Source: file:/opt/datadog-agent/etc/conf.d/openmetrics.d/conf.yaml.default
       Total Runs: 1
       Metric Samples: Last Run: 2, Total: 2
       Events: Last Run: 0, Total: 0
       Service Checks: Last Run: 1, Total: 1
       Average Execution Time : 274ms
       Last Execution Date : 2022-09-21 23:00:41 PDT / 2022-09-22 06:00:41 UTC (xxxxxxxx)
       Last Successful Execution Date : 2022-09-21 23:00:41 PDT / 2022-09-22 06:00:41 UTC (xxxxxxx)
    
  9. You can also view the agent status in the UI at http://127.0.0.1:5002/.

    If your agent is fully up and running, you can navigate back to your Datadog dashboard to query the metrics. You can also create a monitor or alert based on the metric data at https://app.datadoghq.com/monitors/create/metric.
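
If the agent reports fewer metric samples than expected, it can help to compare the metric names in your OpenMetrics configuration against what the endpoint actually exposes. The following is a rough sketch that assumes the metrics export API returns the Prometheus/OpenMetrics text format, with one sample per line and histograms expanded into _bucket, _sum, and _count series. The function names are hypothetical, and this is a simple line scan rather than a full parser.

```python
# Suffixes that the text format appends to histogram sample names.
HISTOGRAM_SUFFIXES = ("_bucket", "_sum", "_count")


def metric_names(exposition_text: str) -> set[str]:
    """Collect the sample names found in a text-format metrics payload."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and "# HELP" / "# TYPE" comments
        # A sample line looks like: name{label="x"} value [timestamp]
        names.add(line.split("{", 1)[0].split()[0])
    return names


def is_exposed(configured_name: str, names: set[str]) -> bool:
    """True if the name appears directly or as a histogram series."""
    if configured_name in names:
        return True
    return any(configured_name + suffix in names for suffix in HISTOGRAM_SUFFIXES)


def missing_metrics(configured: list[str], exposition_text: str) -> list[str]:
    """Return configured metric names that the payload does not expose."""
    names = metric_names(exposition_text)
    return sorted(name for name in configured if not is_exposed(name, names))
```

You could pass the body returned by the metrics export API (for example, saved from the curl check in the Requirements section) together with the names listed under metrics in conf.yaml.default. Any name that comes back is one the agent cannot map, so it is a reasonable first place to look.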