Skew Join OptimizationΒΆ

Data skew can severly downgrade performance of queries, especially those with joins. Joins between big tables require shuffling data and the skew can lead to an extreme imbalance of work in the cluster.

To help with this, Spark SQL accepts skew hints in queries. With the information from these hints, Spark can construct a better query plan, one that does not suffer from data skew.

Only relation name

A skew hint must contain at least the name of the relation with skew. A relation is a table, view or a subquery. All joins with this relation will then use the Skew Join Optimization.

-- table with skew
SELECT /*+ SKEW('orders') */ * FROM orders, customers WHERE c_custId = o_custId

-- subquery with skew
SELECT /*+ SKEW('C1') */ *
        FROM (SELECT * FROM customers WHERE c_custId < 100) C1, orders
        WHERE C1.c_custId = o_custId

Relation and columns

There might be multiple joins on a relation and only some of them will suffer from skew. The Skew Join Optimization has some overhead so it is better to use it only when needed. For this purpose, the skew hint can also accept column names. Only joins with these columns will use the Skew Join Optimization.

-- single column
SELECT /*+ SKEW('orders', 'o_custId') */ *
        FROM orders, customers
        WHERE o_custId = c_custId

-- multiple columns
SELECT /*+ SKEW('orders', ('o_custId', 'o_storeRegionId')) */ *
        FROM orders, customers
        WHERE o_custId = c_custId AND o_storeRegionId = c_regionId

Relation, columns and skew values

Finally, the actual skew values can also be specified in the hint. Depending on the query and data, the skew values might be known (for example, because they never change) or might be easy to find out. Doing this reduces the overhead of the Skew Join Optimization. Otherwise, Spark will detect them automatically.

-- single column, single skew value
SELECT /*+ SKEW('orders', 'o_custId', 0) */ *
        FROM orders, customers
        WHERE o_custId = c_custId

-- single column, multiple skew values
SELECT /*+ SKEW('orders', 'o_custId', (0, 1, 2)) */ *
        FROM orders, customers
        WHERE o_custId = c_custId

-- multiple columns, multiple skew values
SELECT /*+ SKEW('orders', ('o_custId', 'o_storeRegionId'), ((0, 1001), (1, 1002))) */ *
        FROM orders, customers
        WHERE o_custId = c_custId AND o_storeRegionId = c_regionId