A range join occurs when two relations are joined using a point in interval or interval overlap condition. The range join optimization support in Databricks Runtime can bring orders of magnitude improvement in query performance, but requires careful manual tuning.
A point in interval range join is a join in which the condition contains predicates specifying that a value from one relation is between two values from the other relation. For example:
-- using BETWEEN expressions SELECT * FROM points JOIN ranges ON points.p BETWEEN ranges.start and ranges.end; -- using inequality expressions SELECT * FROM points JOIN ranges ON points.p >= ranges.start AND points.p < ranges.end; -- with fixed length interval SELECT * FROM points JOIN ranges ON points.p >= ranges.start AND points.p < ranges.start + 100; -- join two sets of point values within a fixed distance from each other SELECT * FROM points1 p1 JOIN points2 p2 ON p1.p >= p2.p - 10 AND p1.p <= p2.p + 10; -- a range condition together with other join conditions SELECT * FROM points, ranges WHERE points.symbol = ranges.symbol AND points.p >= ranges.start AND points.p < ranges.end;
An interval overlap range join is a join in which the condition contains predicates specifying an overlap of intervals between two values from each relation. For example:
-- overlap of [r1.start, r1.end] with [r2.start, r2.end] SELECT * FROM r1 JOIN r2 ON r1.start < r2.end AND r2.start < r1.end; -- overlap of fixed length intervals SELECT * FROM r1 JOIN r2 ON r1.start < r2.start + 100 AND r2.start < r1.start + 100; -- a range condition together with other join conditions SELECT * FROM r1 JOIN r2 ON r1.symbol = r2.symbol AND r1.start <= r2.end AND r1.end >= r2.start;
The range join optimization is performed for joins that:
- Have a condition that can be interpreted as a point in interval or interval overlap range join.
- All values involved in the range join condition are of a numeric type (integral, floating point, decimal),
- All values involved in the range join condition are of the same type. In the case of the decimal type, the values also need to be of the same scale and precision.
- It is an
INNER JOIN, or in case of point in interval range join, a
LEFT OUTER JOINwith point value on the left side, or
RIGHT OUTER JOINwith point value on the right side.
- Have a bin size tuning parameter.
The bin size is a numeric tuning parameter that splits the values domain of the range condition into multiple bins of equal size. For example, with a bin size of 10, the optimization splits the domain into bins that are intervals of length 10.
If you have a point in range condition of
p BETWEEN start AND end, and
start is 8 and
end is 22, this value interval overlaps with three bins of length 10 – the first bin from 0 to 10, the second bin from 10 to 20, and the third bin from 20 to 30. Only the points that fall within the same three bins need to be considered as possible join matches for that interval. For example, if
p is 32, it can be ruled out as falling between
start of 8 and
end of 22, because it falls in the bin from 30 to 40.
DATEvalues, the value of the bin size is interpreted as days. For example, a bin size value of 7 represents a week.
TIMESTAMPvalues, the value of the bin size is interpreted as seconds. If a sub-second value is required, fractional values can be used. For example, a bin size value of 60 represents a minute, and a bin size value of 0.1 represents 100 milliseconds.
You can specify the bin size either by using a range join hint in the query or by setting a session configuration parameter. The range join optimization is applied only if you manually specify the bin size. Section Choose the bin size describes how to choose an optimal bin size.
To enable the range join optimization in a SQL query, you can use a range join hint to specify the bin size. The hint must contain the relation name of one of the joined relations and the numeric bin size parameter. The relation name can be a table, a view, or a subquery.
SELECT /*+ RANGE_JOIN(points, 10) */ * FROM points JOIN ranges ON points.p >= ranges.start AND points.p < ranges.end; SELECT /*+ RANGE_JOIN(r1, 0.1) */ * FROM (SELECT * FROM ranges WHERE ranges.amount < 100) r1, ranges r2 WHERE r1.start < r2.start + 100 AND r2.start < r1.start + 100; SELECT /*+ RANGE_JOIN(c, 500) */ * FROM a JOIN b ON (a.b_key = b.id) JOIN c ON (a.ts BETWEEN c.start_time AND c.end_time)
In the third example, you must place the hint on
This is because joins are left associative, so the query is interpreted as
(a JOIN b) JOIN c,
and the hint on
a applies to the join of
b and not the join with
You can also place a range join hint on one of the joined DataFrames. In that case, the hint contains just the numeric bin size parameter.
val df1 = spark.table("ranges").as("left") val df2 = spark.table("ranges").as("right") val joined = df1.hint("range_join", 10) .join(df2, $"left.type" === $"right.type" && $"left.end" > $"right.start" && $"left.start" < $"right.end") val joined2 = df1 .join(df2.hint("range_join", 0.5), $"left.type" === $"right.type" && $"left.end" > $"right.start" && $"left.start" < $"right.end")
If you don’t want to modify the query, you can specify the bin size as a configuration parameter.
This configuration parameter applies to any join with a range condition. However, a different bin size set through a range join hint always overrides the one set through the parameter.
The effectiveness of the range join optimization depends on choosing the appropriate bin size.
A small bin size results in a larger number of bins, which helps in filtering the potential matches.
However, it becomes inefficient if the bin size is significantly smaller than the encountered value intervals, and the value intervals overlap multiple bin intervals. For example, with a condition
p BETWEEN start AND end, where
start is 1,000,000 and
end is 1,999,999, and a bin size of 10, the value interval overlaps with 100,000 bins.
If the length of the interval is fairly uniform and known, we recommend that you set the bin size to the typical expected length of the value interval. However, if the length of the interval is varying and skewed, a balance must be found to set a bin size that filters the short intervals efficiently, while preventing the long intervals from overlapping too many bins. Assuming a table
ranges, with intervals that are between columns
end, you can determine different percentiles of the skewed interval length value with the following query:
SELECT APPROX_PERCENTILE(CAST(end - start AS DOUBLE), ARRAY(0.5, 0.9, 0.99, 0.999, 0.9999)) FROM ranges
A recommended setting of bin size would be the maximum of the value at the 90th percentile, or the value at the 99th percentile divided by 10, or the value at the 99.9th percentile divided by 100 and so on. The rationale is:
- If the value at the 90th percentile is the bin size, only 10% of the value interval lengths are longer than the bin interval, so span more than 2 adjacent bin intervals.
- If the value at the 99th percentile is the bin size, only 1% of the value interval lengths span more than 11 adjacent bin intervals.
- If the value at the 99.9th percentile is the bin size, only 0.1% of the value interval lengths span more than 101 adjacent bin intervals.
- The same can be repeated for the values at the 99.99th, the 99.999th percentile, and so on if needed.
The described method limits the amount of skewed long value intervals that overlap multiple bin intervals. The bin size value obtained this way is only a starting point for fine tuning; actual results may depend on the specific workload.