What is query federation?

The term query federation describes a collection of features that enable users and systems to run queries against multiple siloed data sources without needing to migrate all data to a unified system.

Where does query federation fit in the lakehouse?

The lakehouse emphasizes storing data centrally to reduce data redundancy and isolation, but many companies have numerous data systems in production. You might want to query data in connected systems for several reasons:

  • Ad hoc reporting.

  • Proof of concept work.

  • Developing new ETL pipelines or reports.

  • Supporting workloads during incremental migration.

You might choose not to migrate or ingest some datasets to Databricks, but still need to give access to some users for isolated use cases.

What is query federation for Databricks SQL?

Preview

This feature is Experimental. Experimental features are provided as-is and are not supported by Databricks through customer technical support channels.

Databricks SQL allows you to configure read-only connections to popular database solutions with drivers included on all serverless and pro SQL warehouses.

What is query federation on Databricks?

Apache Spark has always provided support for connecting to data in a variety of formats and from a variety of systems and data sources. Databricks builds on these open source connectors and bundles additional libraries in the Databricks Runtime to integrate with many external data sources.

Connections to many databases use the Apache Spark JDBC connector. You can specify options to tune the parallelism of these connections, and you can push queries down to source systems as needed.
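As a minimal sketch of the tuning options mentioned above, the following Python helper assembles a set of Spark JDBC reader options for a partitioned read. The option keys (url, dbtable, partitionColumn, lowerBound, upperBound, numPartitions) are standard Spark JDBC data source options. The helper name jdbc_read_options, the host db.example.com, and the orders table with its order_id and order_date columns are hypothetical and exist only for illustration.

```python
from typing import Dict, Optional


def jdbc_read_options(
    url: str,
    table: str,
    partition_column: str,
    lower_bound: int,
    upper_bound: int,
    num_partitions: int = 8,
    pushdown_filter: Optional[str] = None,
) -> Dict[str, str]:
    """Build option strings for a parallel Spark JDBC read (hypothetical helper)."""
    if pushdown_filter:
        # Wrapping the filter in a subquery makes the source database evaluate it,
        # so only matching rows cross the network. Some databases (for example
        # Oracle) do not accept the AS keyword in a table alias.
        dbtable = f"(SELECT * FROM {table} WHERE {pushdown_filter}) AS src"
    else:
        dbtable = table
    return {
        "url": url,
        "dbtable": dbtable,
        # Spark splits [lowerBound, upperBound] on partitionColumn into
        # numPartitions ranges and reads them over parallel connections.
        "partitionColumn": partition_column,
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        "numPartitions": str(num_partitions),
    }


# Hypothetical connection details, for illustration only.
options = jdbc_read_options(
    url="jdbc:postgresql://db.example.com:5432/sales",
    table="orders",
    partition_column="order_id",
    lower_bound=1,
    upper_bound=1_000_000,
    num_partitions=4,
    pushdown_filter="order_date >= '2023-01-01'",
)

# On a cluster where `spark` is defined, the options could be applied as:
# df = spark.read.format("jdbc").options(**options).load()
```

The bounds only control how the partition ranges are split; they do not filter rows. Credentials are deliberately left out of this sketch and would be supplied separately, for example from a secret scope.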

In Databricks Runtime 11.3 and above, secrets are supported in SQL as well as in Python, R, and Scala, so user-scoped credentials can be configured using redacted strings.
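To illustrate how a secret reference might appear in SQL, the sketch below uses Python to render a CREATE TABLE statement whose credentials are resolved by the Databricks SQL secret function instead of being written in plain text. It assumes a secret scope named jdbc with keys username and password. Those names, the helper create_jdbc_table_sql, and the connection details are hypothetical, and the exact options accepted depend on the data source you connect to.

```python
def create_jdbc_table_sql(
    table_name: str,
    jdbc_url: str,
    remote_table: str,
    scope: str,
    user_key: str,
    password_key: str,
) -> str:
    """Render a CREATE TABLE statement that reads credentials from secrets.

    Hypothetical helper: it only builds the SQL text and runs nothing.
    """
    return "\n".join(
        [
            f"CREATE TABLE IF NOT EXISTS {table_name}",
            "USING JDBC",
            "OPTIONS (",
            f"  url '{jdbc_url}',",
            f"  dbtable '{remote_table}',",
            # secret(scope, key) is resolved at query time, so the
            # credential values never appear in the statement text.
            f"  user secret('{scope}', '{user_key}'),",
            f"  password secret('{scope}', '{password_key}')",
            ")",
        ]
    )


# Assumed scope and key names, for illustration only.
statement = create_jdbc_table_sql(
    table_name="orders_remote",
    jdbc_url="jdbc:postgresql://db.example.com:5432/sales",
    remote_table="public.orders",
    scope="jdbc",
    user_key="username",
    password_key="password",
)
print(statement)

# In a notebook where `spark` is defined, the statement could be run with:
# spark.sql(statement)
```

In Python code, the equivalent lookup is dbutils.secrets.get(scope="jdbc", key="username"), which is available only inside a Databricks notebook or job.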

Does Databricks allow federated queries from other systems?

  • Databricks provides JDBC and ODBC drivers compatible with many BI tools.

  • Delta Sharing provides an open source protocol for sharing Delta Lake tables with users connecting from numerous supported clients.

  • Delta Lake is a fully open source storage protocol with many integrations.

  • Databricks has partnered with a number of BI and visualization tools to support querying data in the lakehouse.