How to Specify Skew Hints in Dataset and DataFrame-based Join Commands

When you perform a join command with DataFrame or Dataset objects, if you find that the query is stuck on finishing a small number of tasks due to data skew, you can specify the skew hint with the hint("skew") method: df.hint("skew"). The skew join optimization is performed on the DataFrame for which you specify the skew hint.

In addition to the basic hint, you can specify the hint method with the following combinations of parameters: column name, list of column names, and column name and skew value.

  • DataFrame and column name. The skew join optimization is performed on the specified column of the DataFrame.

    df.hint("skew", "col1")
    
  • DataFrame and multiple columns. The skew join optimization is performed for multiple columns in the DataFrame.

    df.hint("skew", ["col1","col2"])
    
  • DataFrame, column name, and skew value. The skew join optimization is performed on the data in the column with the skew value.

    df.hint("skew", "col1", "value")
    

Example

This example shows how to specify the skew hint for multiple DataFrame objects involved in a join operation:

val joinResults = ds1.hint("skew").as("L").join(ds2.hint("skew").as("R"), $"L.col1" === $"R.col1")