Tracing FAQ
Q: What is the latency overhead introduced by Tracing?
Traces are written asynchronously to minimize performance impact. However, tracing still adds a small amount of latency, particularly when traces are large. MLflow recommends testing your application to understand tracing latency impacts before deploying to production.
The following table provides rough estimates for latency impact by trace size:
| Trace size per request | Latency impact (ms) |
|---|---|
| ~10 KB | ~1 ms |
| ~1 MB | 50-100 ms |
| ~10 MB | 150 ms or more |
Q: What are the rate limits and quotas for MLflow Tracing in Databricks?
When using MLflow Tracing within a Databricks workspace, the following quotas and rate limits apply to ensure service stability and fair usage. These limits are per workspace.
- Maximum number of traces per experiment:
  - Default: 100,000 traces.
  - This limit can be raised substantially (e.g., above 1 million traces per experiment) upon request. Contact Databricks support to request an increase.
- Trace creation rate:
  - Limit: 200 queries per second (QPS) per workspace.
  - This is the rate at which new traces (and their initial spans) can be created and logged.
- Trace download rate:
  - Limit: 200 QPS per workspace.
  - This applies to operations that fetch full trace data, such as `mlflow.get_trace()`.
- Trace search rate:
  - Limit: 25 QPS per workspace.
  - This applies to operations such as `mlflow.search_traces()` that query for lists of traces based on filter criteria.
Exceeding these limits may result in throttled requests or errors. If you anticipate needing higher limits for your production workloads, please discuss your requirements with Databricks support.
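If your workload can exceed these rates, client-side retries with exponential backoff keep bursts under the limit. The sketch below is a generic helper, not an MLflow API; the bare `except Exception` is a placeholder that you should narrow to whatever exception your client raises on throttling:

```python
import random
import time


def with_backoff(fn, max_retries=5, base_delay=1.0):
    """Call fn, retrying with exponential backoff and jitter on failure."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:  # narrow this to your client's throttling error
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter: 1s, 2s, 4s, ... plus noise
            time.sleep(base_delay * (2**attempt) + random.uniform(0, 0.5))


# Hypothetical usage: wrap a search call that may be throttled at 25 QPS
# traces = with_backoff(lambda: mlflow.search_traces(experiment_ids=["1"]))
```

Jitter spreads retries from concurrent clients over time, which avoids synchronized retry storms against the same quota.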
Q: I cannot open my trace in the MLflow UI. What should I do?
There are multiple possible reasons why a trace may not be viewable in the MLflow UI.
- The trace is not completed yet: If the trace is still being collected, MLflow cannot display its spans in the UI. Ensure that all spans are properly ended with either an "OK" or "ERROR" status.
- The browser cache is outdated: When you upgrade MLflow to a new version, the browser cache may contain outdated data and prevent the UI from displaying traces correctly. Clear your browser cache (Shift+F5) and refresh the page.
Q: The model execution gets stuck and my trace is "in progress" forever.
Sometimes a model or an agent gets stuck in a long-running operation or an infinite loop, causing the trace to be stuck in the "in progress" state.
To prevent this, you can set a timeout for the trace using the `MLFLOW_TRACE_TIMEOUT_SECONDS` environment variable. If the trace exceeds the timeout, MLflow automatically halts it with an "ERROR" status and exports it to the backend, so that you can analyze the spans to identify the issue. By default, no timeout is set.
The timeout applies only to the MLflow trace; the main program, model, or agent continues to run even if the trace is halted.
For example, the following code sets the timeout to 5 seconds and simulates how MLflow handles a long-running operation:
```python
import mlflow
import os
import time

# Set the timeout to 5 seconds for demonstration purposes
os.environ["MLFLOW_TRACE_TIMEOUT_SECONDS"] = "5"


# Simulate a long-running operation
@mlflow.trace
def long_running():
    for _ in range(10):
        child()


@mlflow.trace
def child():
    time.sleep(1)


long_running()
```
MLflow monitors trace execution time and expiration in a background thread. By default, this check runs every second, and its resource consumption is negligible. To adjust the interval, set the `MLFLOW_TRACE_TIMEOUT_CHECK_INTERVAL_SECONDS` environment variable.
Q: My trace is split into multiple traces when doing multi-threading. How can I combine them into a single trace?
Because MLflow Tracing relies on Python's `ContextVar`, each thread has its own trace context by default. However, it is possible to generate a single trace for a multi-threaded application with a few additional steps. Refer to the Multi-threading section for more information.
Q: How do I temporarily disable tracing?
To disable tracing, call the `mlflow.tracing.disable()` API. It stops the collection of trace data from within MLflow and does not log any trace data to the MLflow Tracking service.
To re-enable tracing (if it had been temporarily disabled), call the `mlflow.tracing.enable()` API. It restores tracing functionality for instrumented models that are invoked.
Next steps
Continue your journey with these recommended actions and tutorials.
- Instrument your app with tracing - Learn how to add tracing to your application
- Production observability with tracing - Set up tracing for production environments
- Debug & observe your app - Use traces to troubleshoot issues
Reference guides
Explore detailed documentation about related concepts.
- Tracing data model - Understand the structure of traces and spans
- Tracing concepts - Learn the fundamentals of MLflow Tracing
- Manual tracing APIs - Explore advanced tracing techniques