Skew Join Optimization

Data skew is a condition in which a table’s data is unevenly distributed among partitions in the cluster. Data skew can severely downgrade performance of queries, especially those with joins. Joins between big tables require shuffling data and the skew can lead to an extreme imbalance of work in the cluster. It’s likely that data skew is affecting a query if a query appears to be stuck finishing very few tasks (for example, the last 3 tasks out of 200). To verify:

  1. Click the stage that is stuck and verify that it is doing a join.
  2. After the query finishes, find the stage that does a join and check the task duration distribution.
  3. Sort the tasks by decreasing duration and check the first few tasks. If one task took much longer to complete than the other tasks, there is skew.

To ameliorate skew, Databricks SQL accepts skew hints in queries. With the information from these hints, Spark can construct a better query plan, one that does not suffer from data skew.

Only relation name

A skew hint must contain at least the name of the relation with skew. A relation is a table, view, or a subquery. All joins with this relation then use skew join optimization.

-- table with skew
SELECT /*+ SKEW('orders') */ * FROM orders, customers WHERE c_custId = o_custId

-- subquery with skew
SELECT /*+ SKEW('C1') */ *
  FROM (SELECT * FROM customers WHERE c_custId < 100) C1, orders
  WHERE C1.c_custId = o_custId

Relation and columns

There might be multiple joins on a relation and only some of them will suffer from skew. Skew join optimization has some overhead so it is better to use it only when needed. For this purpose, the skew hint accepts column names. Only joins with these columns use skew join optimization.

-- single column
SELECT /*+ SKEW('orders', 'o_custId') */ *
  FROM orders, customers
  WHERE o_custId = c_custId

-- multiple columns
SELECT /*+ SKEW('orders', ('o_custId', 'o_storeRegionId')) */ *
  FROM orders, customers
  WHERE o_custId = c_custId AND o_storeRegionId = c_regionId

Relation, columns, and skew values

You can also specify skew values in the hint. Depending on the query and data, the skew values might be known (for example, because they never change) or might be easy to find out. Doing this reduces the overhead of skew join optimization. Otherwise, Databricks will detect them automatically.

-- single column, single skew value
SELECT /*+ SKEW('orders', 'o_custId', 0) */ *
  FROM orders, customers
  WHERE o_custId = c_custId

-- single column, multiple skew values
SELECT /*+ SKEW('orders', 'o_custId', (0, 1, 2)) */ *
  FROM orders, customers
  WHERE o_custId = c_custId

-- multiple columns, multiple skew values
SELECT /*+ SKEW('orders', ('o_custId', 'o_storeRegionId'), ((0, 1001), (1, 1002))) */ *
  FROM orders, customers
  WHERE o_custId = c_custId AND o_storeRegionId = c_regionId