Lakehouse Federation performance recommendations

This article provides guidance for improving the performance of Lakehouse Federation queries.

Combine multiple predicates using the AND operator

Databricks Runtime tries to push down predicates to the remote database engine to reduce the number of records fetched over the network. If a predicate cannot be pushed down, the query executed on the remote database engine excludes it, and the filtering must be done by Databricks Runtime instead. However, when parts of a filter are joined by the AND operator, any part that can be pushed down is still pushed down, even if other parts cannot be.

Example 1:

Databricks query

SQL
SELECT * FROM foreign_catalog.schema.table WHERE name ILIKE 'john'

The ILIKE expression can't be pushed down to the remote database (for example, MySQL) because there is no appropriate translation, so all filtering has to be done by Databricks Runtime.

The query sent to the remote database returns all records:

SQL
SELECT * FROM catalog.schema.table

Example 2:

Databricks query

SQL
SELECT * FROM foreign_catalog.schema.table WHERE name ILIKE 'john' AND date > '2025-05-01'

The ILIKE expression can't be pushed down to the remote database (for example, MySQL) because there is no appropriate translation, but the date comparison can be. The name filter still has to be applied by Databricks Runtime, but the pushed-down date comparison reduces the number of records fetched over the network.

The query sent to the remote database returns only a subset of the records:

SQL
SELECT * FROM catalog.schema.table WHERE date > '2025-05-01'

Check the query that should be executed on the remote database

You can use the EXPLAIN FORMATTED command to check what query would be sent to the remote database.

info

The actual query might differ from the one in the explain command output because of Adaptive Query Execution.
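
For example, running EXPLAIN FORMATTED on the query from Example 2 (reusing the hypothetical foreign_catalog.schema.table from above) shows the plan, including the query the connector intends to send to the remote database:

SQL
EXPLAIN FORMATTED
SELECT * FROM foreign_catalog.schema.table WHERE name ILIKE 'john' AND date > '2025-05-01'

Based on Example 2, the external query in the output should contain the date predicate but not the ILIKE filter.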

Set the size of batches being fetched from the remote database

Connectors that use the JDBC transfer protocol can be configured to control how they fetch data from remote systems:

  • Databricks
  • Microsoft SQL Server
  • Microsoft Azure Synapse
  • MySQL
  • Oracle
  • PostgreSQL
  • Redshift
  • Salesforce Data Cloud
  • Teradata

The JDBC fetch size determines the number of rows to fetch per round trip. By default, most JDBC connectors fetch the entire result set at once, which might cause the amount of data to exceed the available memory.

To avoid out-of-memory errors, set the fetchSize parameter. When fetchSize is set to a non-zero value, the connector reads data in batches, with at most fetchSize rows per batch. Databricks recommends specifying a large fetchSize value (for example, 100,000), because a small batch size increases the number of round trips to the remote database and can prolong the overall query execution time.

This parameter allows the worker nodes to read the data in batches, but not in parallel.

Compute requirements:

  • You must use compute on Databricks Runtime 16.1 or above. SQL warehouses must be Pro or Serverless and must use 2024.50.

SQL
SELECT * FROM mySqlCatalog.schema.table WITH ('fetchSize' 100000)

Set the partition size parameter (Snowflake)

Snowflake allows fetching data in multiple partitions, which lets multiple executors process the data in parallel. It is important to choose an appropriate partition size by setting the partition_size_in_mb parameter. This parameter specifies the recommended uncompressed size for each partition. To reduce the number of partitions, specify a larger value. The default value is 100 (MB).

The partition_size_in_mb parameter sets a recommended size; the actual size of partitions may vary.

Compute requirements:

  • You must use compute on Databricks Runtime 16.1 or above. SQL warehouses must be Pro or Serverless and must use 2024.50.

SQL
SELECT * FROM snowflakeCatalog.schema.table WITH ('partition_size_in_mb' 1000)
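
As a rough worked example: with 'partition_size_in_mb' 1000, a result set whose uncompressed size is about 10 GB would be split into roughly 10 partitions (10,000 MB / 1,000 MB), compared to roughly 100 partitions with the default value of 100.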

Enable parallel reads for JDBC connectors

Connectors that support the JDBC transfer protocol can read data in parallel by partitioning the query. You can configure parallel reads for the following connectors:

  • Databricks
  • Microsoft SQL Server
  • Microsoft Azure Synapse
  • MySQL
  • Oracle
  • PostgreSQL
  • Redshift
  • Salesforce Data Cloud
  • Teradata

This allows multiple executors to fetch data concurrently, which can significantly improve performance for large tables.

To enable parallel reads, specify the following parameters:

  • numPartitions: The number of partitions to be used for parallelism
  • partitionColumn: The name of a numeric column used to partition the query
  • lowerBound: The minimum value of partitionColumn used to decide partition stride
  • upperBound: The maximum value of partitionColumn used to decide partition stride

info

The lowerBound and upperBound values are used only to decide the partition stride, not for filtering the rows in the table. All rows in the table will be partitioned and returned.

The partition column should be:

  • A numeric column
  • Evenly distributed across the range
  • An indexed column for better performance

Compute requirements:

  • You must use compute on Databricks Runtime 17.1 or above. SQL warehouses must be Pro or Serverless and must use 2025.25.

In the following example, the query is split into 4 parallel partitions based on the id column. Each partition covers a stride of approximately 250 IDs ((1000 - 1) / 4), assuming that there is a single record for each id between 1 and 1000.

SQL
SELECT * FROM mySqlCatalog.schema.table WITH (
  'numPartitions' 4,
  'partitionColumn' 'id',
  'lowerBound' 1,
  'upperBound' 1000
)
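
To make the partitioning concrete: with a stride of roughly 250, the connector sends one query per partition to the remote database, along the lines of the following sketch. The exact boundary values and NULL handling are assumptions here and can differ by connector:

SQL
-- Partition 1: open-ended below, and also picks up NULL ids
SELECT * FROM schema.table WHERE id < 251 OR id IS NULL
-- Partitions 2 and 3: half-open ranges of about 250 ids each
SELECT * FROM schema.table WHERE id >= 251 AND id < 501
SELECT * FROM schema.table WHERE id >= 501 AND id < 751
-- Partition 4: open-ended above, so ids greater than upperBound are still returned
SELECT * FROM schema.table WHERE id >= 751

Because the first and last partitions are unbounded on one side, rows outside the [lowerBound, upperBound] range are still returned, which matches the note above: these values only decide the partition stride and never filter rows.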