Adapt your exisiting Apache Spark code for Databricks

This article outlines the required changes to adapt existing Apache Spark workloads to run on Databricks. Whether you’re moving to Databricks from an on-premises cluster, custom cloud-based infrastructure, or another enterprise Apache Spark offering, most workloads require only a few changes to get into production. Databricks extends, simplifies, and improves the performance of Apache Spark by introducing custom optimizations, configuring and deploying infrastructure, and maintaining dependencies in Databricks Runtime.

Important

When you upgrade versions of Apache Spark, there might be breaking changes to syntax. See Databricks runtime release notes and the Spark migration guide.

Change parquet to delta

Databricks recommends using Delta Lake instead of Parquet or ORC when writing data. Databricks has optimized many features for efficiency when interacting with tables backed by Delta Lake, and upgrading data and code form Parquet to Delta Lake only takes a few steps. See Migrate a Parquet data lake to Delta Lake.

Because Delta Lake provides ACID transaction guarantees, you might be able to simplify workloads to remove workarounds geared toward creating pseudo-transactionality in Apache Spark operations. Examples include:

  • Building a directory structure or partitioning strategy that allows all files from a given operation to be discovered simultaneously as part of a partition.

  • Configuring or relying on the metastore to add transactionality for how new data is discovered.

  • Using MSCK repair to register files written to a table to the metastore.

  • Using alter table add partition to manually add partitions to a table.

See When to partition tables on Databricks.

Note

You can run workloads without upgrading the data formats used, but many of the greatest performance gains on Databricks are tied directly to Delta Lake.

Recompile Apache Spark code with Databricks Runtime compatible libraries

Each version of Databricks Runtime comes pre-configured with many of the libraries required in Apache Spark applications. You can install additional libraries to your compute as required, but whenenever possible, Databricks recommends using library versions packaged in the Databricks Runtime which are tested for compatability. Each Databricks Runtime release includes a list of all installed libraries. See Databricks runtime release notes.

Remove SparkSession creation commands

Many legacy Apache Spark workloads explicitly declare a new SparkSession for each job. Databricks automatically creates a SparkContext for each compute cluster, and creates an isolated SparkSession for each notebook or job executed against the cluster. You can maintain the ability to compile and test code locally and then deploy to Databricks by upgrading these commands to use SparkSession.builder().getOrCreate().

Remove terminal script commands

Apache Spark requires programs to explicitly declare that they are complete by using commands such as sys.exit() or sc.stop(). Databricks automatically terminates and cleans up jobs as they reach completion, so these commands are not necessary and should be removed.

Databricks also automatically terminates and cleans up Structured Streaming workloads on run termination, so you can remove awaitTermination() and similar commands from Structured Streaming applications.

Trust Databricks to configure your cluster

Databricks configures all of the settings for the driver and executors in your compute cluster automatically to maximize resiliency and resource usage. Providing custom configurations for the executors or JVM can result in reduced performance. Databricks recommends only setting Spark configurations that are necessary for controlling type handling or functions so that logic remains consistent.

Run your workloads

Now that you’ve removed patterns, commands, and settings that might interfere with Databricks execution, you can run your workloads in a test environment and compare performance and results to your legacy infrastructure. While many of the skills your team might have developed to troubleshoot and improve performance for Apache Spark workloads can still be leveraged on Databricks, more often you can see greater gains from upgrading steps to use new features in Apache Spark, Delta Lake, or custom Databricks products.