Optimize join performance in Databricks
With Databricks you can create joins across your batch or streaming tables. Some joins can be expensive. The following can help you optimize your joins.
For more information about joins, see Work with joins on Databricks.
Compute with Photon enabled always selects the best join type. See What is Photon?. Using a recent Databricks Runtime version with Photon enabled generally provides good join performance, but you should also consider the following recommendations:
-
Cross joins are very expensive. Remove cross joins from workloads and queries that require low latency or frequent recomputation.
-
Join order matters. When performing multiple joins, always join your smallest tables first and then join the result with larger tables.
-
The optimizer can struggle on queries with many joins and aggregations. Saving out intermediate results can accelerate query planning and computing results.
-
Keep fresh statistics to improve performance. Predictive optimization automatically updates and maintains statistics. See Predictive optimization for Unity Catalog managed tables.
You can also run the query
ANALYZE TABLE table_name COMPUTE STATISTICS
to update statistics in the query planner.
In Databricks Runtime 14.3 LTS and above, you can modify the columns that Delta Lake collects stats on for data skipping and then recompute existing statistics in the Delta log. See Specify Delta statistics columns.