Skip to main content

Dataset optimization and caching

AI/BI dashboards are valuable data analysis and decision-making tools, and efficient load times can significantly improve the user experience. This article explains how caching and dataset optimizations make dashboards more performant and efficient.

Query performance

You can inspect queries and their performance in the workspace query history. The query history shows SQL queries performed using SQL warehouses. Click History icon. Query History in the sidebar to view the query history. See Query history.

For dashboard datasets, Databricks applies performance optimizations depending on the result size of the dataset. For information about dataset performance thresholds, see Dataset performance thresholds.

Dataset optimizations

Your dashboards optimize for speed by performing filtering and aggregation operations, driven by filters or visualization settings, directly in your browser when possible. These performance optimizations have the following limits:

Dataset Size

Processing Behavior

Small (≤ 100K rows and ≤ 100MB)

For optimal dashboard speed, filtering and aggregation run in your browser after the initial dataset loads. Because these operations are processed locally, they avoid further interaction with the data warehouse and don't appear in the query history.

Large (> 100K rows or > 100MB)

Filtering and aggregation are handled on the backend server instead of in your browser. The initial dataset query is wrapped in a SQL WITH clause, and the resulting query appears in the query history.

Combined queries (large datasets)

For visualization queries sent to the backend, separate visualization queries against the same dataset that share the same GROUP BY clauses and filter predicates are combined into a single query for processing. In this case, users may see a combined query in the query history that fetches results for multiple visualizations or filters.

note

Parameters substitute values directly into a query at runtime, so these operations always appear in the query history.

Caching and data freshness

Dashboards maintain a 24-hour result cache to optimize initial loading times, operating on a best-effort basis. This means that while the system always attempts to use historical query results linked to dashboard credentials to enhance performance, there are some cases where cached results cannot be created or maintained. Cached data has no specific memory limit or fixed query count.

To improve loading times, dashboards first check the dashboard cache. If no cache results are available, they check the generic query result cache. While the dashboard cache can return stale results for up to 24 hours, the query result cache never returns stale data. When the underlying data changes, all query result cache entries are invalidated.

For multi-page dashboards, the following apply:

  • Editing a draft dashboard loads and caches all datasets.
  • When viewers open a published dashboard, only datasets that support the active page are run and cached.
  • If a schedule is set, all datasets refresh according to the schedule, and those results are cached.

The following table explains how caching varies by dashboard status and credentials:

Dashboard type

Caching type

Published dashboard with embedded credentials

Shared cache. All viewers see the same results.

Draft dashboard or published dashboard without embedded credentials

Per user cache. Viewers see results based on their data permissions.

Dashboards automatically use cached query results if the underlying data remains unchanged after the last query or if the results were retrieved less than 24 hours ago. If stale results exist and parameters are applied to the dashboard, queries will rerun unless the same parameters were used in the past 24 hours. Similarly, applying filters to datasets exceeding 100,000 rows prompts queries to rerun unless the same filters were previously applied in the last 24 hours.

Scheduled queries

Adding a schedule to a published dashboard with embedded credentials can significantly speed up the initial loading process for all dashboard viewers.

For each scheduled dashboard update, the following occurs:

  • All SQL logic that defines datasets runs on the designated time interval.
  • Results populate the query result cache and help to improve initial dashboard load time.