Monitor Model Serving endpoints with Prometheus and Datadog
This article shows how to use the metrics export API to set up endpoint metric collection and monitoring with Prometheus and Datadog.
Requirements
Read access to the desired endpoint, and a personal access token (PAT) to access it. You can generate a PAT in User Settings in the Databricks Machine Learning UI.
An existing Model Serving endpoint. You can validate this by checking the endpoint health with the following:
curl -n -X GET -H "Authorization: Bearer [PAT]" https://[DATABRICKS_HOST]/api/2.0/serving-endpoints/[ENDPOINT_NAME]
Validate the export metrics API:
curl -n -X GET -H "Authorization: Bearer [PAT]" https://[DATABRICKS_HOST]/api/2.0/serving-endpoints/[ENDPOINT_NAME]/metrics
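The same validation can be scripted. The following is a minimal Python sketch, assuming the metrics endpoint returns Prometheus/OpenMetrics text (one sample per line, with `#` comment lines). The function names `metrics_url`, `metric_names`, and `fetch_metrics` are hypothetical helpers introduced for illustration, not part of any Databricks SDK.

```python
import urllib.request


def metrics_url(host: str, endpoint_name: str) -> str:
    """Build the metrics export URL used in the curl command above."""
    return f"https://{host}/api/2.0/serving-endpoints/{endpoint_name}/metrics"


def metric_names(exposition_text: str) -> set[str]:
    """Collect sample names from Prometheus/OpenMetrics text output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" / "# EOF" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line is "name{labels} value" or "name value".
        head = line.split("{", 1)[0]
        names.add(head.split()[0])
    return names


def fetch_metrics(host: str, endpoint_name: str, pat: str) -> str:
    """GET the metrics endpoint with a Bearer token (performs a network call)."""
    request = urllib.request.Request(
        metrics_url(host, endpoint_name),
        headers={"Authorization": f"Bearer {pat}"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")
```

Calling `metric_names(fetch_metrics(host, endpoint_name, pat))` should list names such as `cpu_usage_percentage` if the endpoint and token are valid.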
Prometheus integration
Note
Regardless of which type of deployment you have in your production environment, the scraping configuration should be similar.
The guidance in this section follows the Prometheus documentation to start a Prometheus service locally using Docker.
Write a YAML config file and name it prometheus.yml. The following is an example:

global:
  scrape_interval: 1m
  scrape_timeout: 10s
scrape_configs:
  - job_name: "prometheus"
    metrics_path: "/api/2.0/serving-endpoints/[ENDPOINT_NAME]/metrics"
    scheme: "https"
    authorization:
      type: "Bearer"
      credentials: "[PAT_TOKEN]"
    static_configs:
      - targets: ["dbc-741cfa95-12d1.dev.databricks.com"]
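If you manage several endpoints, it may help to generate this file rather than edit it by hand. The following sketch fills the example configuration above from three values. The names `PROMETHEUS_TEMPLATE` and `render_prometheus_config` are hypothetical and used only for illustration; the host passed in should be the bare host name without a scheme, as in the `targets` entry.

```python
# Template mirroring the example prometheus.yml; placeholders are filled by str.format.
PROMETHEUS_TEMPLATE = """\
global:
  scrape_interval: 1m
  scrape_timeout: 10s
scrape_configs:
  - job_name: "prometheus"
    metrics_path: "/api/2.0/serving-endpoints/{endpoint_name}/metrics"
    scheme: "https"
    authorization:
      type: "Bearer"
      credentials: "{pat}"
    static_configs:
      - targets: ["{host}"]
"""


def render_prometheus_config(host: str, endpoint_name: str, pat: str) -> str:
    """Return the text of a prometheus.yml for one Model Serving endpoint."""
    return PROMETHEUS_TEMPLATE.format(host=host, endpoint_name=endpoint_name, pat=pat)
```

The returned string can be written to prometheus.yml. Because the file then contains the PAT in plain text, treat it like any other secret.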
Start Prometheus locally with the following command:
docker run \
  -p 9090:9090 \
  -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus
Navigate to http://localhost:9090 to check if your local Prometheus service is up and running.
Check the Prometheus scraper status and debug errors from: http://localhost:9090/targets?search=
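The scraper status shown on the targets page is also available from the Prometheus HTTP API at `/api/v1/targets`. The following sketch summarizes each active target's health and last error, assuming Prometheus is running locally on port 9090 as in the steps above. The helper names `summarize_targets` and `fetch_targets` are hypothetical.

```python
import json
import urllib.request


def summarize_targets(payload: dict) -> list[tuple[str, str, str]]:
    """Return (scrape URL, health, last error) for each active target."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (t.get("scrapeUrl", ""), t.get("health", "unknown"), t.get("lastError", ""))
        for t in targets
    ]


def fetch_targets(base_url: str = "http://localhost:9090") -> dict:
    """Call the Prometheus targets API (performs a network call)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=10) as response:
        return json.load(response)
```

A target whose health is not `up` usually carries a non-empty last error, which is a reasonable starting point for debugging the token or the metrics path.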
Once the target is fully up and running, you can query the provided metrics, like cpu_usage_percentage or mem_usage_percentage, in the UI.
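The same metrics can be read programmatically through the Prometheus instant query API at `/api/v1/query`. The following is a sketch under the assumption that Prometheus is reachable at http://localhost:9090 and that the query returns an instant vector; `query_url`, `latest_values`, and `run_query` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request


def query_url(expression: str, base_url: str = "http://localhost:9090") -> str:
    """Build an instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expression})


def latest_values(payload: dict) -> list[tuple[dict, float]]:
    """Extract (labels, value) pairs from an instant-vector response."""
    results = payload.get("data", {}).get("result", [])
    # Each result carries "value": [unix_timestamp, "value-as-string"].
    return [(r["metric"], float(r["value"][1])) for r in results]


def run_query(expression: str, base_url: str = "http://localhost:9090") -> dict:
    """Execute the query against Prometheus (performs a network call)."""
    with urllib.request.urlopen(query_url(expression, base_url), timeout=10) as response:
        return json.load(response)
```

For example, `latest_values(run_query("cpu_usage_percentage"))` returns the most recent scraped value for each matching series.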
Datadog integration
Note
The preliminary setup for this example is based on the free edition.
Datadog has a variety of agents that can be deployed in different environments. For demonstration purposes, the following launches a macOS agent locally that scrapes the metrics endpoint in your Databricks host. The configuration for other agents should follow a similar pattern.
Register a Datadog account.
Install the OpenMetrics integration in your account dashboard, so Datadog can accept and process OpenMetrics data.
Follow the Datadog documentation to get your Datadog agent up and running. For this example, use the DMG package option to have everything installed, including launchctl and datadog-agent.
Locate your OpenMetrics configuration. For this example, the configuration is at ~/.datadog-agent/conf.d/openmetrics.d/conf.yaml.default. The following is an example configuration yaml file.

instances:
  - openmetrics_endpoint: https://[DATABRICKS_HOST]/api/2.0/serving-endpoints/[ENDPOINT_NAME]/metrics
    metrics:
      - cpu_usage_percentage:
          name: cpu_usage_percentage
          type: gauge
      - mem_usage_percentage:
          name: mem_usage_percentage
          type: gauge
      - provisioned_concurrent_requests_total:
          name: provisioned_concurrent_requests_total
          type: gauge
      - request_4xx_count_total:
          name: request_4xx_count_total
          type: gauge
      - request_5xx_count_total:
          name: request_5xx_count_total
          type: gauge
      - request_count_total:
          name: request_count_total
          type: gauge
      - request_latency_ms:
          name: request_latency_ms
          type: histogram
    tag_by_endpoint: false
    send_distribution_buckets: true
    headers:
      Authorization: Bearer [PAT]
      Content-Type: application/openmetrics-text
Start the Datadog agent using launchctl start com.datadoghq.agent.
Every time you change your config, restart the agent to pick up the change:

launchctl stop com.datadoghq.agent
launchctl start com.datadoghq.agent
Check the agent health with datadog-agent health.
Check agent status with datadog-agent status. You should see a response like the following. If not, debug with the error message. Likely causes include an expired PAT or an incorrect URL.

openmetrics (2.2.2)
-------------------
  Instance ID: openmetrics: xxxxxxxxxxxxxxxx [OK]
  Configuration Source: file:/opt/datadog-agent/etc/conf.d/openmetrics.d/conf.yaml.default
  Total Runs: 1
  Metric Samples: Last Run: 2, Total: 2
  Events: Last Run: 0, Total: 0
  Service Checks: Last Run: 1, Total: 1
  Average Execution Time : 274ms
  Last Execution Date : 2022-09-21 23:00:41 PDT / 2022-09-22 06:00:41 UTC (xxxxxxxx)
  Last Successful Execution Date : 2022-09-21 23:00:41 PDT / 2022-09-22 06:00:41 UTC (xxxxxxx)
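When the agent reports an error, it can help to call the metrics URL directly and look at the HTTP status. The following sketch maps common status codes to the two causes mentioned above. The mapping is an assumption based on general HTTP semantics (401/403 for authentication or permission problems, 404 for a wrong path), not documented Databricks behavior, and `likely_cause` and `check_endpoint` are hypothetical helper names.

```python
import urllib.error
import urllib.request


def likely_cause(status_code: int) -> str:
    """Map an HTTP status code to a probable cause (assumed mapping)."""
    if status_code == 200:
        return "OK: the endpoint returned metrics."
    if status_code in (401, 403):
        return "Check the PAT: it may be expired, mistyped, or lack read access."
    if status_code == 404:
        return "Check the URL: the host, endpoint name, or path may be incorrect."
    return f"Unexpected HTTP status {status_code}; inspect the response body."


def check_endpoint(url: str, pat: str) -> str:
    """Request the metrics URL and describe the outcome (performs a network call)."""
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {pat}"})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return likely_cause(response.status)
    except urllib.error.HTTPError as error:
        # HTTPError is raised for 4xx/5xx responses and carries the status code.
        return likely_cause(error.code)
    except urllib.error.URLError as error:
        # URLError covers DNS and connection failures.
        return f"Could not reach the host ({error.reason}); check the host name in the URL."
```

Pass the same URL that appears in `openmetrics_endpoint` so the check exercises exactly what the agent scrapes.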
Agent status can also be seen from the UI at: http://127.0.0.1:5002/.
If your agent is fully up and running, you can navigate back to your Datadog dashboard to query the metrics. You can also create a monitor or alert based on the metric data: https://app.datadoghq.com/monitors/create/metric.