Dataset optimization and caching

AI/BI dashboards are valuable data analysis and decision-making tools, and efficient load times can significantly improve the user experience. This article explains how caching and dataset optimizations make dashboards more performant and efficient.

Query performance

You can inspect queries and their performance in the workspace query history. The query history shows SQL queries performed using SQL warehouses. Click History Icon Query History in the sidebar to view the query history. See Query history.

For dashboard datasets, Databricks applies performance optimizations depending on the result size of the dataset.

Dataset optimizations

AI/BI dashboard datasets include the following performance optimizations:

  • If the dataset result size is small (less than 64K rows or 10MB, whichever is smaller), the dataset result is pulled to the client, and visualization-specific filtering and aggregation are performed on the client. Filtering and aggregating data for small datasets is very fast, and ensuring that your dataset is small can help you optimize dashboard performance. With small datasets, only the dataset query appears in the query history.

  • If the dataset result size is large (greater than or equal to 64K rows or over 10MB), the dataset query text is wrapped in a SQL WITH clause, and the visualization-specific filtering and aggregation is performed in a query on the backend rather than in the client. With large datasets, the visualization query appears in the query history.

  • For visualization queries sent to the backend, separate visualization queries against the same dataset that share the same GROUP BY clauses and filter predicates are combined into a single query for processing. In this case, users might see one combined query in the query history that is fetching results for multiple visualizations.

Caching and data freshness

Dashboards maintain a 24-hour result cache to optimize initial loading times, operating on a best-effort basis. This means that while the system always attempts to use historical query results linked to dashboard credentials to enhance performance, there are some cases where cached results cannot be created or maintained.

The following table explains how caching varies by dashboard status and credentials:

Dashboard type

Caching type

Published dashboard with embedded credentials

Shared cache. All viewers see the same results.

Draft dashboard or published dashboard without embedded credentials

Per user cache. Viewers see results based on their data permissions.

Dashboards automatically use cached query results if the underlying data remains unchanged after the last query or if the results were retrieved less than 24 hours ago. If stale results exist and parameters are applied to the dashboard, queries will rerun unless the same parameters were used in the past 24 hours. Similarly, applying filters to datasets exceeding 64,000 rows prompts queries to rerun unless the same filters were previously applied in the last 24 hours.

Scheduled queries

Adding a schedule to a published dashboard with embedded credentials can significantly speed up the initial loading process for all dashboard viewers.

For each scheduled dashboard update, the following occurs:

  • All SQL logic that defines datasets runs on the designated time interval.

  • Results populate the query result cache and help to improve initial dashboard load time.