Hints

Applies to: check marked yes Databricks SQL check marked yes Databricks Runtime

Suggest specific approaches to generate an execution plan.

Syntax

/*+ { partition_hint | join_hint | skew_hint } [, ...] */

Partitioning hints

Partitioning hints allow you to suggest a partitioning strategy that Databricks should follow. COALESCE, REPARTITION, and REPARTITION_BY_RANGE hints are supported and are equivalent to coalesce, repartition, and repartitionByRange Dataset APIs, respectively. These hints give you a way to tune performance and control the number of output files. When multiple partitioning hints are specified, multiple nodes are inserted into the logical plan, but the leftmost hint is picked by the optimizer.

Syntax

partition_hint
  COALESCE ( part_num ) |
  REPARTITION ( { part_num | [ part_num , ] column_name [ , ...] } ) |
  REPARTITION_BY_RANGE ( column_name [, ...] [, part_num] ) |
  REBALANCE [ ( column_name [, ...] ) ]

Partitioning hint types

  • COALESCE ( part_num )

    Reduce the number of partitions to the specified number of partitions. It takes a partition number as a parameter.

  • REPARTITION ( { part_num | [ part_num , ] column_name [ , ...] } )

    Repartition to the specified number of partitions using the specified partitioning expressions. It takes a partition number, column names, or both as parameters.

  • REPARTITION_BY_RANGE ( column_name [, ...] [, part_num]

    Repartition to the specified number of partitions using the specified partitioning expressions. It takes column names and an optional partition number as parameters.

  • REBALANCE [ ( column_name [, ...] ) ]

    The REBALANCE hint can be used to rebalance the query result output partitions, so that every partition is of a reasonable size (not too small and not too big). It can take column names as parameters, and try its best to partition the query result by these columns. This is a best-effort: if there are skews, Spark will split the skewed partitions, to make these partitions not too big. This hint is useful when you need to write the result of this query to a table, to avoid too small/big files. This hint is ignored if AQE is not enabled.

  • column_name

    The exposed name of a column or alias to repartition or rebalance by.

  • part_num

    An integral number literal. The number of partitions to repartition to.

Examples

> SELECT /*+ COALESCE(3) */ * FROM t;

> SELECT /*+ REPARTITION(3) */ * FROM t;

> SELECT /*+ REPARTITION(c) */ * FROM t;

> SELECT /*+ REPARTITION(3, c) */ * FROM t;

> SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t;

> SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t;

> SELECT /*+ REBALANCE */ * FROM t;

> SELECT /*+ REBALANCE(c) */ * FROM t;

-- When a column name has been occluded by an alias you must refere to it by the alias name.
> SELECT /*+ REBALANCE(d) */ * FROM t AS s(d);

-- multiple partitioning hints
> EXPLAIN EXTENDED SELECT /*+ REPARTITION(100), COALESCE(500), REPARTITION_BY_RANGE(3, c) */ * FROM t;
== Parsed Logical Plan ==
'UnresolvedHint REPARTITION, [100]
+- 'UnresolvedHint COALESCE, [500]
   +- 'UnresolvedHint REPARTITION_BY_RANGE, [3, 'c]
      +- 'Project [*]
         +- 'UnresolvedRelation [t]

== Analyzed Logical Plan ==
name: string, c: int
Repartition 100, true
+- Repartition 500, false
   +- RepartitionByExpression [c#30 ASC NULLS FIRST], 3
      +- Project [name#29, c#30]
         +- SubqueryAlias spark_catalog.default.t
            +- Relation[name#29,c#30] parquet

== Optimized Logical Plan ==
Repartition 100, true
+- Relation[name#29,c#30] parquet

== Physical Plan ==
Exchange RoundRobinPartitioning(100), false, [id=#121]
+- *(1) ColumnarToRow
   +- FileScan parquet default.t[name#29,c#30] Batched: true, DataFilters: [], Format: Parquet,
      Location: CatalogFileIndex[file:/spark/spark-warehouse/t], PartitionFilters: [],
      PushedFilters: [], ReadSchema: struct<name:string>

Join hints

Join hints allow you to suggest the join strategy that Databricks SQL should use. When different join strategy hints are specified on both sides of a join, Databricks SQL prioritizes hints in the following order: BROADCAST over MERGE over SHUFFLE_HASH over SHUFFLE_REPLICATE_NL. When both sides are specified with the BROADCAST hint or the SHUFFLE_HASH hint, Databricks SQL picks the build side based on the join type and the sizes of the relations. Because a given strategy might not support all join types, Databricks SQL is not guaranteed to use the join strategy suggested by the hint.

Syntax

join_hint
  BROADCAST ( table_name ) |
  MERGE ( table_name ) |
  SHUFFLE_HASH ( table_name ) |
  SHUFFLE_REPLICATE_NL ( table_name )

BROADCASTJOIN and MAPJOIN are supported as aliases for BROADCAST.

SHUFFLE_MERGE and MERGEJOIN are supported as aliases for MERGE.

Join hint types

  • BROADCAST ( table_name )

    Use broadcast join. The join side with the hint is broadcast regardless of autoBroadcastJoinThreshold. If both sides of the join have the broadcast hints, the one with the smaller size (based on stats) is broadcast.

  • MERGE ( table_name )

    Use shuffle sort merge join.

  • SHUFFLE_HASH ( table_name )

    Use shuffle hash join. If both sides have the shuffle hash hints, Databricks SQL chooses the smaller side (based on stats) as the build side.

  • SHUFFLE_REPLICATE_NL ( table_name )

    Use shuffle-and-replicate nested loop join.

  • table_name

    The exposed name of a table or table-alias to which the hint applies.

Examples

-- Join Hints for broadcast join
> SELECT /*+ BROADCAST(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
> SELECT /*+ BROADCASTJOIN (t1) */ * FROM t1 left JOIN t2 ON t1.key = t2.key;
> SELECT /*+ MAPJOIN(t2) */ * FROM t1 right JOIN t2 ON t1.key = t2.key;

-- Join Hints for shuffle sort merge join
> SELECT /*+ SHUFFLE_MERGE(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
> SELECT /*+ MERGEJOIN(t2) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
> SELECT /*+ MERGE(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;

-- Join Hints for shuffle hash join
> SELECT /*+ SHUFFLE_HASH(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;

-- Join Hints for shuffle-and-replicate nested loop join
> SELECT /*+ SHUFFLE_REPLICATE_NL(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;

-- When different join strategy hints are specified on both sides of a join, Databricks SQL
-- prioritizes the BROADCAST hint over the MERGE hint over the SHUFFLE_HASH hint
-- over the SHUFFLE_REPLICATE_NL hint.
-- Databricks SQL will issue Warning in the following example
-- org.apache.spark.sql.catalyst.analysis.HintErrorLogger: Hint (strategy=merge)
-- is overridden by another hint and will not take effect.
> SELECT /*+ BROADCAST(t1), MERGE(t1, t2) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;

-- When a table name is occluded by an alias you must use the alias name in the hint
> SELECT /*+ BROADCAST(t1), MERGE(s1, s2) */ * FROM t1 AS s1 INNER JOIN t2 AS s2 ON s1.key = s2.key;

Skew hints

(Delta Lake) See Skew join optimization using skew hints for information about the SKEW hint.