Run queries using Lakehouse Federation

Preview

This feature is in Public Preview.

This article introduces Lakehouse Federation, the query federation platform that enables you to use Databricks to run queries against multiple external data sources. It also describes how to set up Lakehouse Federation connections and create foreign catalogs in your Unity Catalog metastore.

What is Lakehouse Federation?

Lakehouse Federation is the query federation platform for Databricks. The term query federation describes a collection of features that enable users and systems to run queries against multiple data sources without needing to migrate all data to a unified system.

Databricks uses Unity Catalog to manage query federation. You configure read-only connections to popular database solutions using drivers that are included on Pro SQL Warehouses, Serverless SQL Warehouses, and Databricks Runtime clusters. Unity Catalog’s data governance and data lineage tools ensure that data access is managed and audited for all federated queries made by the users in your Databricks workspaces.
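
For example, after a foreign catalog has been created (as described later in this article), access to the federated data is granted with the same SQL statements used for any other Unity Catalog securable. A minimal sketch, with hypothetical catalog, schema, table, and group names:

GRANT USE CATALOG ON CATALOG postgres_catalog TO `data-analysts`;
GRANT USE SCHEMA ON SCHEMA postgres_catalog.sales TO `data-analysts`;
GRANT SELECT ON TABLE postgres_catalog.sales.orders TO `data-analysts`;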

Why use Lakehouse Federation?

The lakehouse emphasizes central storage of data to reduce data redundancy and isolation, but your organization may have numerous data systems in production, and you might want to query data in connected systems for a number of reasons:

  • Ad hoc reporting.

  • Proof-of-concept work.

  • The exploratory phase of new ETL pipelines or reports.

  • Supporting workloads during incremental migration.

In each of these scenarios, query federation gets you to insights faster, because you can query the data in place and avoid complex and time-consuming ETL processing.

In short, Lakehouse Federation is meant for use cases when:

  • You don’t want to ingest data into Databricks.

  • You want your queries to take advantage of compute in the external database system.

  • You want the advantages of Unity Catalog interfaces and data governance, including fine-grained access control, data lineage, and search.

This section compares Lakehouse Federation to some other Databricks tools for accessing data in external data systems.

Lakehouse Federation vs. Delta Sharing

Delta Sharing is a good choice when the data provider can share the underlying data files directly. Lakehouse Federation is a better fit when direct sharing is not available, because it runs queries using compute in the external system.

Lakehouse Federation vs. Unity Catalog external tables

For data that you access directly from simple file systems, like Amazon S3, Google Cloud Storage, or Azure Data Lake Storage Gen2, you should create external tables in Unity Catalog. But if you need to run complex queries and want access to compute in the external database system, you should use Lakehouse Federation.
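
For comparison, a Unity Catalog external table over cloud object storage looks like the following sketch. The catalog, schema, columns, and path are hypothetical, and an external location covering the path must already be configured:

CREATE TABLE main.sales.raw_orders (
  order_id INT,
  amount DOUBLE
)
LOCATION 's3://my-bucket/raw/orders';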

Lakehouse Federation vs. legacy query federation

Legacy query federation configurations, like those documented in Set up query federation for non-Unity-Catalog workspaces and Interact with external data on Databricks, require that you configure connections to each table separately, rather than making entire databases available in Databricks as catalogs. They also lack Lakehouse Federation’s Unity Catalog data governance tools and its query pushdown and performance capabilities.

However, if you cannot use Unity Catalog or you’re using authentication methods other than username and password, legacy query federation might be your best option.

If you do want to migrate your query federation configurations to Lakehouse Federation, see Migrate legacy query federation connections to Lakehouse Federation.

Overview of Lakehouse Federation setup

To make a dataset available for read-only querying using Lakehouse Federation, you create the following:

  • A connection, a securable object in Unity Catalog that specifies a path and credentials for accessing an external database system.

  • A foreign catalog, a securable object in Unity Catalog that mirrors a database in an external data system, enabling you to run read-only queries on that data system from your Databricks workspace, with access managed through Unity Catalog.

Supported data sources

Lakehouse Federation supports connections to multiple database types, including MySQL and PostgreSQL, as well as catalogs in other Databricks workspaces. For the full list and configuration details, see the setup article for each connection type, listed in the table of contents to the left.

Before you begin

Workspace requirements:

  • Workspace enabled for Unity Catalog.

Compute requirements:

  • Network connectivity from your Databricks Runtime cluster or SQL warehouse to the target database systems. See Networking recommendations for Lakehouse Federation.

  • Databricks clusters must use Databricks Runtime 13.1 or above and the shared cluster access mode.

  • SQL warehouses must be Pro or Serverless.

Permissions required:

  • To create a connection, you must be a metastore admin or a user with the CREATE CONNECTION privilege on the Unity Catalog metastore attached to the workspace.

  • To create a foreign catalog, you must have the CREATE CATALOG permission on the metastore and be either the owner of the connection or have the CREATE FOREIGN CATALOG privilege on the connection.

Additional permission requirements are specified in each task-based section that follows.
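
For example, a metastore admin could grant these privileges with SQL. The principal and connection names below are hypothetical, and the connection-level grant assumes the connection already exists:

GRANT CREATE CONNECTION ON METASTORE TO `data-engineers`;
GRANT CREATE CATALOG ON METASTORE TO `data-engineers`;
GRANT CREATE FOREIGN CATALOG ON CONNECTION my_postgres_connection TO `data-engineers`;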

Create a connection

A connection specifies a path and credentials for accessing an external database system. To create a connection, you can use Catalog Explorer or the CREATE CONNECTION SQL command in a Databricks notebook or the Databricks SQL query editor.

Permissions required: Metastore admin or user with the CREATE CONNECTION privilege.

Catalog Explorer

  1. In your Databricks workspace, click Catalog.

  2. In the left pane, expand the External Data menu and select Connections.

  3. Click Create connection.

  4. Enter a user-friendly Connection name.

  5. Select the Connection type (database provider, like MySQL or PostgreSQL).

  6. Enter the connection properties (such as host information, path, and access credentials).

    Each connection type requires different connection information. See the article for your connection type, listed in the table of contents to the left.

  7. (Optional) Click Test connection to confirm that it works.

  8. (Optional) Add a comment.

  9. Click Create.

SQL

Run the following command in a notebook or the Databricks SQL query editor. This example is for connections to a PostgreSQL database. The options differ by connection type. See the article for your connection type, listed in the table of contents to the left.

CREATE CONNECTION <connection-name> TYPE postgresql
OPTIONS (
  host '<hostname>',
  port '<port>',
  user '<user>',
  password '<password>'
);

We recommend that you use Databricks secrets instead of plaintext strings for sensitive values like credentials. For example:

CREATE CONNECTION <connection-name> TYPE postgresql
OPTIONS (
  host '<hostname>',
  port '<port>',
  user secret ('<secret-scope>','<secret-key-user>'),
  password secret ('<secret-scope>','<secret-key-password>')
);

For information about setting up secrets, see Secret management.
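
You can also confirm from SQL that a connection was registered. For example:

SHOW CONNECTIONS;
DESCRIBE CONNECTION <connection-name>;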

For information about managing existing connections, see Manage connections for Lakehouse Federation.

Create a foreign catalog

A foreign catalog mirrors a database in an external data system so that you can query and manage access to data in that database using Databricks and Unity Catalog. To create a foreign catalog, you use a connection to the data source that has already been defined.

To create a foreign catalog, you can use Catalog Explorer or the CREATE FOREIGN CATALOG SQL command in a Databricks notebook or the Databricks SQL query editor.

Permissions required: CREATE CATALOG permission on the metastore and either ownership of the connection or the CREATE FOREIGN CATALOG privilege on the connection.

Catalog Explorer

  1. In your Databricks workspace, click Catalog.

  2. Click the Create Catalog button.

  3. On the Create a new catalog dialog, enter a name for the catalog and select a Type of Foreign.

  4. Select the Connection that provides access to the database that you want to mirror as a Unity Catalog catalog.

  5. Enter the name of the Database that you want to mirror as a catalog.

    Requirements differ depending on the data source:

    • MySQL uses a two-layer namespace and therefore does not require a database name.

    • For connections to a catalog in another Databricks workspace, enter the Databricks Catalog name instead of a database name.

  6. Click Create.

SQL

Run the following SQL command in a notebook or the Databricks SQL query editor. Items in brackets are optional. Replace the placeholder values:

    • <catalog-name>: Name for the catalog in Databricks.

    • <connection-name>: The connection object that specifies the data source, path, and access credentials.

    • <database-name>: Name of the database you want to mirror as a catalog in Databricks. Not required for MySQL, which uses a two-layer namespace.

    • <external-catalog-name>: Databricks-to-Databricks only: Name of the catalog in the external Databricks workspace that you are mirroring. For Databricks connections, use OPTIONS (catalog '<external-catalog-name>') in place of the database option.

    CREATE FOREIGN CATALOG [IF NOT EXISTS] <catalog-name> USING CONNECTION <connection-name>
    OPTIONS (database '<database-name>');
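
Once the foreign catalog exists, you can browse and query it like any other catalog in your workspace. For example, with hypothetical schema and table names:

SHOW SCHEMAS IN <catalog-name>;
SELECT * FROM <catalog-name>.<schema-name>.<table-name> LIMIT 10;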
    

For information about managing and working with foreign catalogs, see Manage and work with foreign catalogs.
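
For example, you can inspect a foreign catalog’s properties or remove one you no longer need. A sketch; dropping a foreign catalog removes only the Unity Catalog object, not any data in the external database:

DESCRIBE CATALOG EXTENDED <catalog-name>;
DROP CATALOG IF EXISTS <catalog-name> CASCADE;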

Lakehouse Federation and materialized views

Databricks recommends loading external data using Lakehouse Federation when you are creating materialized views. See Use materialized views in Databricks SQL.

When you use Lakehouse Federation, users can reference the federated data as follows:

CREATE MATERIALIZED VIEW xyz AS SELECT * FROM federated_catalog.federated_schema.federated_table;

Limitations

  • Queries are read-only in this release.

  • Throttling of connections is determined by the Databricks SQL concurrent query limit; there is no additional per-connection limit across warehouses.

  • Pushdown coverage may be limited during this release. A list of supported pushdowns is included in the setup article for each connection type (see the table of contents to the left).

  • Connection types: Databricks is working on adding additional connection types.

    Reach out to your Databricks contact to learn about connection types that will be added soon or to make a connection type request.

  • Private Link and static IP range support on Serverless SQL warehouses are not available in this release. Reach out to your Databricks contact for details about ongoing planning for support.

  • OAuth is not supported in this release. Reach out to your Databricks contact for details about ongoing planning for support.

  • You cannot create foreign schemas or foreign tables directly. Reach out to your Databricks contact for details about ongoing planning for support.

  • The MySQL connector does not support case-sensitive table names in this release. All table names are converted to lowercase in Unity Catalog, which means that lookups must use lowercase names. Reach out to your Databricks contact for details about ongoing planning for support.

  • Tables and schemas with names that are invalid in Unity Catalog are not supported and are ignored by Unity Catalog upon creation of a foreign catalog. See a list of naming rules and limitations in Unity Catalog limitations.