Tracing FAQ

Q: What is the latency overhead introduced by Tracing?

Traces are written asynchronously to minimize performance impact. However, tracing still adds a small amount of latency, particularly when the trace payload is large. MLflow recommends testing your application to understand tracing latency before deploying to production.

The following table provides rough estimates for latency impact by trace size:

| Trace size per request | Latency impact per response |
| --- | --- |
| ~10 KB | ~1 ms |
| ~1 MB | 50–100 ms |
| 10 MB | 150 ms or more |

Q: What are the rate limits and quotas for MLflow Tracing in Databricks?

When using MLflow Tracing within a Databricks workspace, the following quotas and rate limits apply to ensure service stability and fair usage. These limits are per workspace.

  • Maximum Number of Traces per Experiment:

    • Default: 100,000 traces.
    • This limit can be raised substantially (e.g., above 1 million traces per experiment) upon request. Please contact Databricks support to request an increase.
  • Trace Creation Rate:

    • Limit: 200 queries per second (QPS) per workspace.
    • This is the rate at which new traces (and their initial spans) can be created and logged.
  • Trace Download Rate:

    • Limit: 200 QPS per workspace.
    • This applies to operations that fetch full trace data, such as mlflow.get_trace().
  • Trace Search Rate:

    • Limit: 25 QPS per workspace.
    • This applies to operations like mlflow.search_traces() that query for lists of traces based on filter criteria.

Exceeding these limits may result in throttled requests or errors. If you anticipate needing higher limits for your production workloads, please discuss your requirements with Databricks support.
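When a burst of requests exceeds these limits, clients typically see throttling errors. One common way to smooth over transient throttling is to retry with exponential backoff and jitter. The helper below is a hypothetical sketch (it is not part of the MLflow API), shown with a generic callable:

```python
import random
import time


def call_with_backoff(fn, max_retries=5, base_delay=0.5, max_delay=8.0):
    """Retry fn() with exponential backoff and jitter on any exception.

    Hypothetical helper for smoothing over throttled requests; real code
    should catch only the specific throttling error type it expects.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            # Double the delay on each attempt, capped at max_delay,
            # plus random jitter so concurrent clients do not retry in sync.
            delay = min(base_delay * (2**attempt), max_delay)
            time.sleep(delay + random.uniform(0, delay / 2))
```

For example, `call_with_backoff(lambda: mlflow.search_traces(filter_string=...))` would retry a throttled search a few times before giving up.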

Q: I cannot open my trace in the MLflow UI. What should I do?

There are multiple possible reasons why a trace may not be viewable in the MLflow UI.

  1. The trace has not completed yet: If the trace is still being collected, MLflow cannot display its spans in the UI. Ensure that all spans are properly ended with either "OK" or "ERROR" status.

  2. The browser cache is outdated: When you upgrade MLflow to a new version, the browser cache may contain outdated data and prevent the UI from displaying traces correctly. Clear your browser cache (Shift+F5) and refresh the page.

Q: The model execution gets stuck and my trace is "in progress" forever.

Sometimes a model or an agent gets stuck in a long-running operation or an infinite loop, causing the trace to be stuck in the "in progress" state.

To prevent this, you can set a timeout for the trace using the MLFLOW_TRACE_TIMEOUT_SECONDS environment variable. If the trace exceeds the timeout, MLflow will automatically halt the trace with ERROR status and export it to the backend, so that you can analyze the spans to identify the issue. By default, the timeout is not set.

note

The timeout only applies to the MLflow trace. The main program, model, or agent will continue to run even if the trace is halted.

For example, the following code sets the timeout to 5 seconds and simulates how MLflow handles a long-running operation:

```python
import mlflow
import os
import time

# Set the timeout to 5 seconds for demonstration purposes
os.environ["MLFLOW_TRACE_TIMEOUT_SECONDS"] = "5"


# Simulate a long-running operation
@mlflow.trace
def long_running():
    for _ in range(10):
        child()


@mlflow.trace
def child():
    time.sleep(1)


long_running()
```

note

MLflow monitors the trace execution time and expiration in a background thread. By default, this check is performed every second and resource consumption is negligible. If you want to adjust the interval, you can set the MLFLOW_TRACE_TIMEOUT_CHECK_INTERVAL_SECONDS environment variable.

Q: My trace is split into multiple traces when doing multi-threading. How can I combine them into a single trace?

Because MLflow Tracing relies on Python's ContextVar, each thread has its own trace context by default. It is still possible to generate a single trace for a multi-threaded application with a few additional steps; refer to the Multi-threading section for more information.
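MLflow stores the active trace in a ContextVar, and a newly started thread begins with a fresh context, which is why each worker produces its own trace. The usual fix is to copy the parent's context with contextvars.copy_context() and run the worker task through it. The sketch below demonstrates the mechanism with a plain ContextVar standing in for MLflow's internal trace context:

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Stand-in for MLflow's internal trace-context variable.
trace_ctx = contextvars.ContextVar("trace_ctx", default=None)


def worker():
    # Returns whatever trace context is visible inside the worker thread.
    return trace_ctx.get()


def run_without_propagation():
    trace_ctx.set("trace-123")
    with ThreadPoolExecutor() as ex:
        # The pool thread starts with a fresh context, so it sees None:
        # this is why the trace splits across threads.
        return ex.submit(worker).result()


def run_with_propagation():
    trace_ctx.set("trace-123")
    # Snapshot the parent thread's context, including the trace context...
    ctx = contextvars.copy_context()
    with ThreadPoolExecutor() as ex:
        # ...and run the worker under that snapshot, so it joins the
        # parent's trace instead of starting its own.
        return ex.submit(ctx.run, worker).result()
```

The same `ctx.run` pattern applies when the worker calls functions decorated with `@mlflow.trace`: submitting `ctx.run, traced_fn` instead of `traced_fn` lets the child spans attach to the parent trace.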

Q: How do I temporarily disable tracing?

To temporarily disable tracing, call the mlflow.tracing.disable API. It stops the collection of trace data from within MLflow and does not log any traces to the MLflow Tracking service.

To re-enable tracing after it has been disabled, call the mlflow.tracing.enable API; instrumented models will be traced again on their next invocation.
