This article provides an overview of options for migrating extract, transform, load (ETL) pipelines running on other data systems to Databricks. If you are migrating Apache Spark code, see Adapt your exisiting Apache Spark code for Databricks.
For general information about moving from an enterprise data warehouse to a lakehouse, see Migrate your data warehouse to the Databricks lakehouse. For information about moving from Parquet to Delta Lake, see Migrate a Parquet data lake to Delta Lake.
Most Hive workloads can run on Databricks with minimal refactoring. The version of Spark SQL supported by Databricks Runtime allows many HiveQL constructs. See Apache Hive compatibility. Databricks includes a Hive metastore by default. Most Hive migrations need to address a few primary concerns:
Hive SerDe need to be updated to use Databricks-native file codecs. (Change DDL from
USINGto use Databricks SerDe.)
Hive UDFs must either be installed to clusters as libraries or refactored to native Spark. Because Hive UDFs are already in the JVM, they might provide sufficient performance for many workloads. See Which UDFs are most efficient?.
The directory structure for tables should be altered, as Databricks uses partitions differently than Hive. See When to partition tables on Databricks.
If you choose to update your tables to Delta Lake during your initial migration, a number of DDL and DML statements are unsupported. These include:
ANALYZE TABLE PARTITION
ALTER TABLE [ADD|DROP] PARTITION
ALTER TABLE RECOVER PARTITIONS
ALTER TABLE SET SERDEPROPERTIES
CREATE TABLE LIKE
INSERT OVERWRITE DIRECTORY
Specifying target partitions using
Migrating SQL workloads from other systems to Databricks usually requires very little refactoring, depending on the extent to which system specific protocols were used in the source code. Databricks uses Delta Lake as the default table format, so tables are created with transactional guarantees by default.
Spark SQL is mostly ANSI-compliant, but some differences in behavior might exist. See How is the Databricks Data Intelligence Platform different than an enterprise data warehouse?.
Because data systems tend to configure access to external data differently, much of the work refactoring SQL ETL pipelines might be configuring access to these data sources and then updating your logic to use these new connections. Databricks provides options for connecting to many data sources for ingestion.
Databricks provides a native integration with dbt, allowing you to leverage existing dbt scripts with very little refactoring.
Delta Live Tables provides an optimized Databricks-native declarative SQL syntax for creating, testing, and deploying pipelines. While you can leverage dbt on Databricks, a light refactor of code to Delta Live Tables might lower your total cost to operate your pipelines on Databricks. See What is Delta Live Tables?.
The extensibility and versatility of custom serverless cloud functions makes it difficult to provide a common recommendation, but one of the most common use cases for these functions is waiting for files or data to appear in a location or message queue and then performing some action as a result. While Databricks does not support complex logic for triggering workloads based on cloud conditions, you can use Structured Streaming in conjunction with workflows to process data incrementally.
ETL pipelines defined in languages other than SQL, Apache Spark, or Hive might need to be heavily refactored before running on Databricks. Databricks has experience helping customers migrate from most of the data systems in use today, and might have resources available to jumpstart your migration efforts.