Skew join optimization using skew hints

This article describes how to use skew hints to ameliorate data skew in a table, a condition that can downgrade query performance.

Note

Skew join hints are not required. Skew is automatically taken care of if adaptive query execution (AQE) and spark.sql.adaptive.skewJoin.enabled are both enabled. See Adaptive query execution.

What is data skew?

Data skew is a condition in which a table’s data is unevenly distributed among partitions in the cluster. Data skew can severely downgrade performance of queries, especially those with joins. Joins between big tables require shuffling data and the skew can lead to an extreme imbalance of work in the cluster. It’s likely that data skew is affecting a query if a query appears to be stuck finishing very few tasks (for example, the last 3 tasks out of 200). To verify that data skew is affecting a query:

  1. Click the stage that is stuck and verify that it is doing a join.

  2. After the query finishes, find the stage that does a join and check the task duration distribution.

  3. Sort the tasks by decreasing duration and check the first few tasks. If one task took much longer to complete than the other tasks, there is skew.

To ameliorate skew, Delta Lake on Databricks SQL accepts skew hints in queries. With the information from a skew hint, Databricks Runtime can construct a better query plan, one that does not suffer from data skew.

Configure skew hint with relation name

A skew hint must contain at least the name of the relation with skew. A relation is a table, view, or a subquery. All joins with this relation then use skew join optimization.

-- table with skew
SELECT /*+ SKEW('orders') */
  *
  FROM orders, customers
  WHERE c_custId = o_custId

-- subquery with skew
SELECT /*+ SKEW('C1') */
  *
  FROM (SELECT * FROM customers WHERE c_custId < 100) C1, orders
  WHERE C1.c_custId = o_custId

Configure skew hint with relation name and column names

There might be multiple joins on a relation and only some of them will suffer from skew. Skew join optimization has some overhead so it is better to use it only when needed. For this purpose, the skew hint accepts column names. Only joins with these columns use skew join optimization.

-- single column
SELECT /*+ SKEW('orders', 'o_custId') */
  *
  FROM orders, customers
  WHERE o_custId = c_custId

-- multiple columns
SELECT /*+ SKEW('orders', ('o_custId', 'o_storeRegionId')) */
  *
  FROM orders, customers
  WHERE o_custId = c_custId AND o_storeRegionId = c_regionId

Configure skew hint with relation name, column names, and skew values

You can also specify skew values in the hint. Depending on the query and data, the skew values might be known (for example, because they never change) or might be easy to find out. Doing this reduces the overhead of skew join optimization. Otherwise, Delta Lake detects them automatically.

-- single column, single skew value
SELECT /*+ SKEW('orders', 'o_custId', 0) */
  *
  FROM orders, customers
  WHERE o_custId = c_custId

-- single column, multiple skew values
SELECT /*+ SKEW('orders', 'o_custId', (0, 1, 2)) */
  *
  FROM orders, customers
  WHERE o_custId = c_custId

-- multiple columns, multiple skew values
SELECT /*+ SKEW('orders', ('o_custId', 'o_storeRegionId'), ((0, 1001), (1, 1002))) */
  *
  FROM orders, customers
  WHERE o_custId = c_custId AND o_storeRegionId = c_regionId