
Lakehouse Federation performance recommendations

This article provides guidance for improving the performance of Lakehouse Federation queries.

Combine multiple predicates using the AND operator

Databricks Runtime tries to push down predicates to the remote database engine to reduce the number of records fetched over the network. If a predicate cannot be pushed down, the query executed on the remote database engine excludes that predicate, and the filtering has to be done by Databricks Runtime instead. However, if one part of the filter cannot be pushed down, other parts can still be pushed down as long as they are combined with the AND operator.

Example 1:

Databricks query

SQL
SELECT * FROM foreign_catalog.schema.table WHERE name ILIKE 'john'

The ILIKE expression can't be pushed down to the remote database (for example, MySQL) because there is no appropriate translation, so the filtering has to be done by Databricks Runtime.

The query sent to the remote database returns all records:

SQL
SELECT * FROM catalog.schema.table

Example 2:

Databricks query

SQL
SELECT * FROM foreign_catalog.schema.table WHERE name ILIKE 'john' AND date > '2025-05-01'

The ILIKE expression can't be pushed down to the remote database (for example, MySQL) because there is no appropriate translation, but the date comparison can be. Name filtering still has to be done by Databricks Runtime, but the date comparison reduces the number of records fetched over the network.

The query sent to the remote database returns a subset of records:

SQL
SELECT * FROM catalog.schema.table WHERE date > '2025-05-01'

Check the query that should be executed on the remote database

You can use the EXPLAIN FORMATTED command to check what query would be sent to the remote database.

Note

The actual query sent to the remote database might differ from the one shown in the EXPLAIN FORMATTED output because of Adaptive Query Execution.
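
For example, to check how much of the filter from Example 2 would be pushed down, you could run the following (reusing the foreign catalog from the earlier examples):

SQL
-- The plan output includes the query that Databricks Runtime plans to send to the remote engine.
EXPLAIN FORMATTED
SELECT * FROM foreign_catalog.schema.table
WHERE name ILIKE 'john' AND date > '2025-05-01'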

Set the size of batches being fetched from the remote database

Connectors that use the JDBC transfer protocol can be configured to control how they fetch data from remote systems:

  • Databricks
  • Microsoft SQL Server
  • Microsoft Azure Synapse
  • MySQL
  • Oracle
  • PostgreSQL
  • Redshift
  • Salesforce Data Cloud
  • Teradata

The JDBC fetch size determines the number of rows fetched per round trip. By default, most JDBC connectors fetch all data atomically, which can cause the amount of data fetched to exceed the available memory.

To avoid out-of-memory errors, set the fetchSize parameter. When fetchSize is set to a non-zero value, the connector reads data in batches. The maximum number of rows per batch is equal to the value of fetchSize. Databricks recommends specifying a large fetchSize value (for example, 100,000) because the overall query execution time can be prolonged if the number of rows in batches is too small.

This parameter allows the worker nodes to read the data in batches, but not in parallel.

Compute requirements:

  • You must use compute on Databricks Runtime 16.1 or above. SQL warehouses must be Pro or Serverless and must use 2024.50.
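
For example, to fetch data in batches of at most 100,000 rows:
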
SQL
SELECT * FROM mySqlCatalog.schema.table WITH ('fetchSize' 100000)

Set the partition size parameter (Snowflake)

Snowflake allows fetching data in multiple partitions, which allows multiple executors to read the data in parallel. It is important to choose an appropriate partition size by setting the partition_size_in_mb parameter. This parameter specifies the recommended uncompressed size for each partition. To reduce the number of partitions, specify a larger value. The default value is 100 (MB).

The partition_size_in_mb parameter sets a recommended size; the actual size of partitions may vary.

Compute requirements:

  • You must use compute on Databricks Runtime 16.1 or above. SQL warehouses must be Pro or Serverless and must use 2024.50.
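
For example, to request uncompressed partitions of approximately 1000 MB:
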
SQL
SELECT * FROM snowflakeCatalog.schema.table WITH ('partition_size_in_mb' 1000)