Lakehouse Federation performance recommendations
This article provides guidance for improving the performance of Lakehouse Federation queries.
Combine multiple predicates using the AND operator
Databricks Runtime tries to push down predicates to the remote database engine to reduce the number of records fetched over the network. If a predicate cannot be pushed down, the query executed on the remote database engine excludes it, and the filtering must be done by Databricks Runtime instead. However, when parts of a filter are combined with the AND operator, any part that can be pushed down still is, even if the other parts cannot be.
Example 1
Databricks query:
SELECT * FROM foreign_catalog.schema.table WHERE name ILIKE 'john'
The ILIKE expression can't be pushed down to the remote database (for example, MySQL) because there is no appropriate translation. Filtering has to be done using Databricks Runtime.
The query sent to the remote database returns all records:
SELECT * FROM catalog.schema.table
Example 2
Databricks query:
SELECT * FROM foreign_catalog.schema.table WHERE name ILIKE 'john' AND date > '2025-05-01'
The ILIKE expression can't be pushed down to the remote database (for example, MySQL) because there is no appropriate translation, but the date comparison can be. The name filtering still has to be done by Databricks Runtime, but the date comparison reduces the number of records fetched over the network.
The query sent to the remote database returns a subset of records:
SELECT * FROM catalog.schema.table WHERE date > '2025-05-01'
Check which query will run on the remote database
To see which query will be sent to the remote database, run the EXPLAIN FORMATTED command.
The actual query might differ from the query in the EXPLAIN FORMATTED output because of adaptive query execution.
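For example, the following command shows the plan for the query from Example 2, including the query that will be sent to the remote database (the catalog, schema, and table names are placeholders):

EXPLAIN FORMATTED
SELECT * FROM foreign_catalog.schema.table
WHERE name ILIKE 'john' AND date > '2025-05-01'

In the output, the PushedFilters entry on the scan node lists the predicates that were pushed down to the remote database.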
Set the size of batches fetched from the remote database
You can configure the following connectors, which use the JDBC transfer protocol, to control how they fetch data from remote systems.
- Databricks
- Microsoft SQL Server
- Microsoft Azure Synapse
- MySQL
- Oracle
- PostgreSQL
- Redshift
- Salesforce Data 360
- Teradata
The JDBC fetch size determines the number of rows to fetch per round trip. By default, most JDBC connectors fetch the entire result set at once, which can cause the amount of data to exceed the available memory.
To avoid out-of-memory errors, set the fetchSize parameter. When fetchSize is set to a non-zero value, the connector reads data in batches, where each batch contains at most fetchSize rows. Databricks recommends specifying a large fetchSize value (for example, 100,000), because the overall query execution time can be prolonged if the batches are too small.
This parameter allows the worker nodes to read the data in batches, but not in parallel.
Compute requirements:
- You must use compute on Databricks Runtime 16.1 or above. SQL warehouses must be Pro or Serverless and must use 2024.50.
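For example, the following query reads the remote table in batches of up to 100,000 rows: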
 
SELECT * FROM mySqlCatalog.schema.table WITH ('fetchSize' 100000)
Set the partition size parameter (Snowflake)
Snowflake allows fetching data in multiple partitions, which lets Databricks engage multiple executors and process the data in parallel.
It is important to choose an appropriate partition size by setting the partition_size_in_mb parameter.
This parameter specifies the recommended uncompressed size for each partition. To reduce the number of partitions, specify a larger value.
The default value is 100 (MB).
The partition_size_in_mb parameter sets a recommended size; the actual size of partitions may vary.
Compute requirements:
- You must use compute on Databricks Runtime 16.1 or above. SQL warehouses must be Pro or Serverless and must use 2024.50.
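For example, the following query requests partitions of approximately 1000 MB of uncompressed data each: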
 
SELECT * FROM snowflakeCatalog.schema.table WITH ('partition_size_in_mb' 1000)
Enable parallel reads for JDBC connectors
Connectors that support the JDBC transfer protocol can read data in parallel by partitioning the query. You can configure parallel reads for the following connectors:
- Databricks
- Microsoft SQL Server
- Microsoft Azure Synapse
- MySQL
- Oracle
- PostgreSQL
- Redshift
- Salesforce Data 360
- Teradata
This allows multiple executors to fetch data concurrently, significantly improving performance for large tables.
To enable parallel reads, specify the following parameters:
- numPartitions: The number of partitions to be used for parallelism
- partitionColumn: The name of a numeric column used to partition the query
- lowerBound: The minimum value of partitionColumn, used to decide the partition stride
- upperBound: The maximum value of partitionColumn, used to decide the partition stride
The lowerBound and upperBound values are used only to decide the partition stride, not for filtering the rows in the table. All rows in the table will be partitioned and returned.
The partition column should be:
- A numeric column
- Evenly distributed across the range
- An indexed column, for better performance
Compute requirements:
- You must use compute on Databricks Runtime 17.1 or above. SQL warehouses must be Pro or Serverless and must use 2025.25.
 
In the following example, the query will be split into 4 parallel partitions based on the id column, with each partition processing a range of approximately 250 IDs (assuming that there is a single record for each id between 1 and 1000).
SELECT * FROM mySqlCatalog.schema.table WITH (
  'numPartitions' 4,
  'partitionColumn' 'id',
  'lowerBound' 1,
  'upperBound' 1000
)
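Under the hood, the connector issues one remote query per partition. The following sketch approximates the per-partition queries generated for the example above; the exact SQL and boundary values are connector-specific, and the first and last partitions are left open-ended so that NULL and out-of-range values of id are still returned:

-- Partition 1: also covers NULLs and values below the lower bound
SELECT * FROM schema.table WHERE id < 250 OR id IS NULL
-- Partition 2
SELECT * FROM schema.table WHERE id >= 250 AND id < 500
-- Partition 3
SELECT * FROM schema.table WHERE id >= 500 AND id < 750
-- Partition 4: also covers values above the upper bound
SELECT * FROM schema.table WHERE id >= 750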
Join pushdown in Lakehouse Federation
This feature is in Public Preview.
Learn how join pushdown works in Databricks Lakehouse Federation.
Join pushdown overview
Join pushdown is a query optimization technique where Databricks pushes join operations to the remote database engine instead of fetching data and performing the join locally. This significantly reduces network traffic and improves query performance by leveraging the remote database's built-in join capabilities.
Supported data sources
The following data sources support join pushdown:
- Oracle
- PostgreSQL
- MySQL
- SQL Server
- Teradata
- Redshift
- Snowflake
- BigQuery
This feature is Generally Available and enabled by default for Redshift, Snowflake, and BigQuery. The following requirements and limitations apply only to the Oracle, PostgreSQL, MySQL, SQL Server, and Teradata connectors.
Requirements
- You must use compute on Databricks Runtime 17.2 or above.
- SQL warehouses must be Pro or Serverless and must use 2025.30.
- On the Previews page of the Databricks UI, you must toggle Join Pushdown for Federated Queries (Public Preview) on.
Limitations
- Only inner, left-outer, and right-outer joins are supported.
- Aliases in a join's children are only supported in DBR 17.3 and above.
Node hierarchy requirements
For a join to be pushed down, all nodes in the left and right child branches must also be pushable. The following rules apply:
- Supported child nodes: Only joins, filters, sample, and scan nodes can appear below a join in the query plan for pushdown to succeed.
- Unsupported child nodes: If limit, offset, or aggregate operations appear in either the left or right branch below a join, the join cannot be pushed down.
- Operations on top of joins: Aggregate, limit, and offset operations can be pushed down when applied on top of a join.
Examples
-- Supported: Join two table scans
SELECT *
FROM table1
INNER JOIN table2
ON col_from_table1 = col_from_table2 + 1
-- Supported: Join two table scans with a nested select query
SELECT *
FROM (SELECT a FROM table1) q1
INNER JOIN (SELECT a FROM table2) q2
ON q1.a = q2.a + 1
-- Supported: Child subqueries with aliases in projection (DBR 17.3+)
SELECT *
FROM (SELECT a AS a1 FROM table1) t1
INNER JOIN (SELECT a AS a2 FROM table2) t2
ON t1.a1 = t2.a2 + 1
-- Supported: Join with filters below
SELECT *
FROM (SELECT * FROM table1 WHERE a > 10) t1
INNER JOIN (SELECT * FROM table2 WHERE b < 20) t2
ON t1.id = t2.id
-- Supported: Aggregate on top of join
SELECT COUNT(*)
FROM table1 t1
INNER JOIN table2 t2
ON t1.id = t2.id
-- Not supported: Join on top of aggregate
SELECT *
FROM (SELECT id, COUNT(*) as cnt FROM table1 GROUP BY id) t1
INNER JOIN table2 t2
ON t1.id = t2.id
-- Not supported: Join on top of limit
SELECT *
FROM (SELECT * FROM table1 LIMIT 100) t1
INNER JOIN table2 t2
ON t1.id = t2.id
Observability
Use EXPLAIN FORMATTED to verify that your join is being pushed down:
EXPLAIN FORMATTED
SELECT *
FROM foreign_catalog.schema.table1 t1
INNER JOIN foreign_catalog.schema.table2 t2
ON t1.id = t2.id
Example output showing successful join pushdown:
== Physical Plan ==
*(1) Scan JDBCRelation
PushedFilters: [id = id_1],
PushedJoins:
   [L]: Relation: foreign_catalog.schema.table1
        PushedFilters: [ID IS NOT NULL]
   [R]: Relation: foreign_catalog.schema.table2
        PushedFilters: [ID IS NOT NULL]
In this output:
- id_1 is an alias that Databricks automatically generates to resolve ambiguity when columns have duplicate names.
- The PushedFilters above PushedJoins represent the actual join conditions being pushed to the remote database.
- The PushedFilters in each relation ([L] and [R]) show additional filter predicates applied to each table.