This article outlines the required changes to adapt existing Apache Spark workloads to run on Databricks. Whether you’re moving to Databricks from an on-premises cluster, custom cloud-based infrastructure, or another enterprise Apache Spark offering, most workloads require only a few changes to get into production. Databricks extends, simplifies, and improves the performance of Apache Spark by introducing custom optimizations, configuring and deploying infrastructure, and maintaining dependencies in Databricks Runtime.
Databricks recommends using Delta Lake instead of Parquet or ORC when writing data. Databricks has optimized many features for efficiency when interacting with tables backed by Delta Lake, and upgrading data and code from Parquet to Delta Lake only takes a few steps. See Migrate a Parquet data lake to Delta Lake.
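The upgrade can be as simple as a single SQL statement. A minimal sketch, assuming an active SparkSession and an existing Parquet table (the helper name and table name are illustrative):

```python
def convert_to_delta(spark, table_name):
    """Upgrade an existing Parquet table to Delta Lake in place.

    `spark` is an active SparkSession; `table_name` is illustrative.
    """
    # CONVERT TO DELTA rewrites no data files: it builds a Delta transaction
    # log over the existing Parquet files, so the table is upgraded in place.
    spark.sql(f"CONVERT TO DELTA {table_name}")
```

After conversion, existing readers and writers can use the same table name; only the format declaration in write paths needs to change from parquet to delta.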
Because Delta Lake provides ACID transaction guarantees, you might be able to simplify workloads to remove workarounds geared toward creating pseudo-transactionality in Apache Spark operations. Examples include:
- Building a directory structure or partitioning strategy that allows all files from a given operation to be discovered simultaneously as part of a partition.
- Configuring or relying on the metastore to add transactionality for how new data is discovered.
- Using MSCK REPAIR TABLE to register files written to a table with the metastore.
- Using ALTER TABLE ADD PARTITION to manually add partitions to a table.
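With Delta Lake, each of the workarounds above collapses into an ordinary write. A minimal sketch, assuming a Spark DataFrame and a target table name (both illustrative):

```python
def append_to_table(df, table_name):
    """Append `df` to a Delta table as a single ACID transaction.

    Hypothetical helper: `df` is a Spark DataFrame, `table_name` illustrative.
    """
    # The write is committed atomically to the Delta transaction log, so new
    # files are visible to readers immediately. No MSCK REPAIR TABLE or
    # ALTER TABLE ADD PARTITION step is needed afterward.
    df.write.format("delta").mode("append").saveAsTable(table_name)
```

Because the transaction log, not the directory layout, defines table contents, downstream jobs never see a partially written batch.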
You can run workloads without upgrading the data formats used, but many of the greatest performance gains on Databricks are tied directly to Delta Lake.
Each version of Databricks Runtime comes preconfigured with many of the libraries required in Apache Spark applications. You can install additional libraries on your compute as required, but whenever possible, Databricks recommends using the library versions packaged in the Databricks Runtime, which are tested for compatibility. Each Databricks Runtime release includes a list of all installed libraries. See Databricks runtime release notes.
Many legacy Apache Spark workloads explicitly declare a new SparkSession for each job. Databricks automatically creates a SparkContext for each compute cluster, and creates an isolated SparkSession for each notebook or job executed against the cluster. You can maintain the ability to compile and test code locally and then deploy to Databricks by upgrading these commands to use SparkSession.builder.getOrCreate().
Apache Spark requires programs to explicitly declare that they are complete by using commands such as
sc.stop(). Databricks automatically terminates and cleans up jobs as they reach completion, so these commands are not necessary and should be removed.
Databricks also automatically terminates and cleans up Structured Streaming workloads on run termination, so you can remove
awaitTermination() and similar commands from Structured Streaming applications.
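The resulting streaming code simply starts the query and returns. A minimal sketch, assuming a Delta source and sink (the helper name, paths, and table name are all illustrative):

```python
def start_stream(spark, source_path, checkpoint_path, target_table):
    """Start a Structured Streaming query; paths and names are illustrative."""
    query = (
        spark.readStream.format("delta").load(source_path)
        .writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .toTable(target_table)
    )
    # On Databricks, returning here is enough: the platform keeps the run
    # alive and cleans up the query when the run terminates. Outside
    # Databricks you would typically block with query.awaitTermination().
    return query
```

Keeping the awaitTermination() call behind an environment check (or dropping it entirely for Databricks-only code) lets the same module run in both places.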
Databricks automatically configures all of the settings for the driver and executors in your compute cluster to maximize resiliency and resource usage. Providing custom configurations for the executors or JVM can reduce performance. Databricks recommends setting only the Spark configurations necessary for controlling type handling or function behavior, so that logic remains consistent.
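A sketch of the distinction, assuming an active SparkSession (the config names are standard Spark SQL options; the values and the helper are illustrative):

```python
# Semantic settings like these are reasonable to pin, so that type handling
# and function behavior match your legacy environment:
SEMANTIC_CONFS = {
    "spark.sql.session.timeZone": "UTC",
    "spark.sql.ansi.enabled": "false",
}

# By contrast, resource settings such as spark.executor.memory,
# spark.executor.cores, and spark.driver.memory should be left unset so
# Databricks can tune them automatically.

def apply_confs(spark, confs=SEMANTIC_CONFS):
    # `spark` is an active SparkSession; spark.conf.set is the runtime API
    # for session-level SQL configuration.
    for key, value in confs.items():
        spark.conf.set(key, value)
```

Keeping the semantic settings in one place also makes it easy to verify that local tests and production jobs run with identical query behavior.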
Now that you’ve removed patterns, commands, and settings that might interfere with Databricks execution, you can run your workloads in a test environment and compare performance and results to your legacy infrastructure. While many of the skills your team has developed to troubleshoot and improve performance for Apache Spark workloads still apply on Databricks, you will more often see greater gains from upgrading workloads to use new features in Apache Spark, Delta Lake, or Databricks-specific products.