Migrate ETL pipelines to Databricks

This article provides an overview of options for migrating extract, transform, load (ETL) pipelines running on other data systems to Databricks. If you are migrating Apache Spark code, see Adapt your exisiting Apache Spark code for Databricks.

For general information about moving from an enterprise data warehouse to a lakehouse, see Migrate your data warehouse to the Databricks lakehouse. For information about moving from Parquet to Delta Lake, see Migrate a Parquet data lake to Delta Lake.

Can you run Hive pipelines on Databricks?

Most Hive workloads can run on Databricks with minimal refactoring. The version of Spark SQL supported by Databricks Runtime allows many HiveQL constructs. See Apache Hive compatibility. Databricks includes a Hive metastore by default. Most Hive migrations need to address a few primary concerns:

  • Hive SerDe need to be updated to use Databricks-native file codecs. (Change DDL from STORED AS to USING to use Databricks SerDe.)

  • Hive UDFs must either be installed to clusters as libraries or refactored to native Spark. Because Hive UDFs are already in the JVM, they might provide sufficient performance for many workloads. See Performance considerations.

  • The directory structure for tables should be altered, as Databricks uses partitions differently than Hive. See When to partition tables on Databricks.

If you choose to update your tables to Delta Lake during your initial migration, a number of DDL and DML statements are unsupported. These include:

  • ROWFORMAT

  • SERDE

  • OUTPUTFORMAT

  • INPUTFORMAT

  • COMPRESSION

  • STORED AS

  • ANALYZE TABLE PARTITION

  • ALTER TABLE [ADD|DROP] PARTITION

  • ALTER TABLE RECOVER PARTITIONS

  • ALTER TABLE SET SERDEPROPERTIES

  • CREATE TABLE LIKE

  • INSERT OVERWRITE DIRECTORY

  • LOAD DATA

  • Specifying target partitions using PARTITION (part_spec) in TRUNCATE TABLE

Can you run SQL ETL pipelines on Databricks?

Migrating SQL workloads from other systems to Databricks usually requires very little refactoring, depending on the extent to which system specific protocols were used in the source code. Databricks uses Delta Lake as the default table format, so tables are created with transactional guarantees by default.

Spark SQL is mostly ANSI-compliant, but some differences in behavior might exist. See How is the Databricks Data Intelligence Platform different than an enterprise data warehouse?.

Because data systems tend to configure access to external data differently, much of the work refactoring SQL ETL pipelines might be configuring access to these data sources and then updating your logic to use these new connections. Databricks provides options for connecting to many data sources for ingestion.

Can you run dbt ETL pipelines on Databricks?

Databricks provides a native integration with dbt, allowing you to leverage existing dbt scripts with very little refactoring.

Delta Live Tables provides an optimized Databricks-native declarative SQL syntax for creating, testing, and deploying pipelines. While you can leverage dbt on Databricks, a light refactor of code to Delta Live Tables might lower your total cost to operate your pipelines on Databricks. See What is Delta Live Tables?.

Can you migrate serverless cloud functions to Databricks?

The extensibility and versatility of custom serverless cloud functions makes it difficult to provide a common recommendation, but one of the most common use cases for these functions is waiting for files or data to appear in a location or message queue and then performing some action as a result. While Databricks does not support complex logic for triggering workloads based on cloud conditions, you can use Structured Streaming in conjunction with Jobs to process data incrementally.

Use Auto Loader for optimized data ingestion from cloud object storage. Structured Streaming can process data from streaming sources in near-real time.

Can you run syntax from other data systems on Databricks?

ETL pipelines defined in languages other than SQL, Apache Spark, or Hive might need to be heavily refactored before running on Databricks. Databricks has experience helping customers migrate from most of the data systems in use today, and might have resources available to jumpstart your migration efforts.