Create a Databricks compatible JAR

This page describes how to create a JAR with Scala or Java code that is compatible with your Databricks workspace.

At a high level, your JAR must meet the following requirements for compatibility:

  • Your Java Development Kit (JDK) version matches the JDK version on your Databricks cluster or serverless compute.
  • For Scala, your version of Scala matches the Scala version on your Databricks cluster or serverless compute.
  • Databricks Connect is added as a dependency and matches the version running on your Databricks cluster or serverless compute, or your Spark dependencies are compatible with your Databricks environment.
  • The local project you are compiling is packaged as a single JAR and includes all dependencies that are not provided by your compute. Alternatively, you can add the dependencies to your environment or cluster.
  • The code in your JAR file correctly interacts with the Spark session or context.
  • For standard compute, all JARs used are added to the allowlist.
tip

To create a Java or Scala project that is fully configured to deploy and run a JAR on serverless compute, you can use Databricks Asset Bundles. For an example bundle configuration that uploads a JAR, see Bundle that uploads a JAR file to Unity Catalog. To create a Scala project using Databricks Asset Bundles, see Build a Scala JAR using Databricks Asset Bundles.

Beta

Using serverless compute for JAR tasks is in Beta.

Databricks Connect and Databricks Runtime versioning

When creating a JAR to run in Databricks, it is helpful to understand how you call Spark APIs and what version of the APIs you are calling.

Databricks Connect

Databricks Connect implements the Spark Connect architecture, which separates client and server components. This separation allows you to efficiently share clusters while fully enforcing Unity Catalog governance with measures such as row filters and column masks. However, Unity Catalog clusters in standard access mode have some limitations, for example, lack of support for APIs such as Spark Context and RDDs. Limitations are listed in Standard compute requirements and limitations.

Databricks Connect gives you access to all Spark functionality, including Spark Connect, and is included with standard and serverless compute. For these compute types, Databricks Connect is required because it provides all necessary Spark APIs.

Databricks Runtime

The Databricks Runtime runs on compute managed by Databricks. It is based on Spark but includes performance improvements and other enhancements for ease of use.

On serverless or standard compute, Databricks Connect provides APIs that call into the Databricks Runtime running on the compute. On dedicated compute, you compile against the Spark APIs, which are backed by the Databricks Runtime on the compute.

Find the correct versions for your compute

To compile a compatible JAR file, you must know the version of Databricks Connect and Databricks Runtime that your compute is running.

The following list describes how to find the correct versions for each compute type:

  • Serverless: Uses Databricks Connect. You must use serverless environment version 4 or above. To find the current Databricks Connect version for a serverless environment, see Serverless environment versions. Then find the JDK and Scala versions for that Databricks Connect version in the version support matrix.

  • Compute in standard mode: Uses Databricks Runtime and provides Databricks Connect to call APIs. To find the Databricks Runtime version, in the workspace, click Compute in the sidebar and select your compute. The Databricks Runtime version is displayed in the configuration details. The major and minor versions of Databricks Connect match the major and minor versions of the Databricks Runtime.

  • Compute in dedicated mode: Uses Databricks Runtime and allows you to compile against Spark APIs directly. To find the Databricks Runtime version, in the workspace, click Compute in the sidebar and select your compute. The Databricks Runtime version is displayed in the configuration details.

JDK and Scala versions

When you build a JAR, the Java Development Kit (JDK) and Scala versions that you use to compile your code must match the versions running on your compute.

For serverless or standard compute, use the Databricks Connect version to find the compatible JDK and Scala versions. See the version support matrix.

If you are using dedicated compute, you must match the JDK and Scala versions of the Databricks Runtime on the compute. The System environment section of the Databricks Runtime release notes for each version lists the correct Java and Scala versions. For example, for Databricks Runtime 17.3 LTS, see Databricks Runtime 17.3 LTS.
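For example, in an sbt build you can pin the Scala version and target the matching JDK release. The versions below are illustrative; use the versions listed in the version support matrix or the release notes for your compute:

Scala
scalaVersion := "2.13.16"

// Compile any Java sources against the JDK release used by your compute (for example, JDK 17).
javacOptions ++= Seq("--release", "17")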

note

Using a JDK or Scala version that doesn't match your compute's JDK or Scala versions may cause unexpected behavior or prevent your code from running.

Dependencies

You must set up your dependencies correctly in your build file.

Databricks Connect or Apache Spark

For standard or serverless compute, Databricks recommends adding a dependency on Databricks Connect instead of Spark to build JARs. Databricks Runtime is not identical to Spark, and includes performance and stability improvements. Databricks Connect provides the Spark APIs that are available in Databricks. To include Databricks Connect, add a dependency:

In the Maven pom.xml file:

<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>databricks-connect_2.13</artifactId>
  <version>17.0.2</version>
</dependency>
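
If you build with sbt, the equivalent dependency line in build.sbt is as follows (the version shown matches the Maven example above; use the version for your compute, as in the Scala walkthrough later on this page):

libraryDependencies += "com.databricks" %% "databricks-connect" % "17.0.2"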
note

The Databricks Connect version must match the version included in the Databricks Runtime of your cluster.

Databricks recommends depending on Databricks Connect. If you do not want to use Databricks Connect, compile against spark-sql-api. Add this specific Spark library to your dependencies, but do not include the library in your JAR. In the build file, configure the scope for the dependency as provided:

In the Maven pom.xml file:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql-api</artifactId>
  <version>4.0.1</version>
  <scope>provided</scope>
</dependency>
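
If you build with sbt, the equivalent provided-scope dependency in build.sbt is:

libraryDependencies += "org.apache.spark" %% "spark-sql-api" % "4.0.1" % Provided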

Spark dependencies

For standard compute and serverless, do not include any other Spark dependencies in your project. Using Databricks Connect provides all of the necessary Spark session APIs.

Classic compute and Databricks Runtime provided libraries

If you are running on classic compute (in either dedicated or standard mode), the Databricks Runtime includes many common libraries. Find the libraries and versions that are included in the System Environment section of the Databricks Runtime release notes for your Databricks Runtime version. For example, the Databricks Runtime 17.3 LTS System Environment section lists the versions of each library available in the Databricks Runtime.

To compile against one of these libraries, add it as a dependency with the provided option. For example, in Databricks Runtime 17.3 LTS, the protobuf-java library is provided, and you can compile against it with the following configuration:

In the Maven pom.xml:

<dependency>
  <groupId>com.google.protobuf</groupId>
  <artifactId>protobuf-java</artifactId>
  <version>3.25.5</version>
  <scope>provided</scope>
</dependency>
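
If you build with sbt, the equivalent line in build.sbt is the following (note the single % because protobuf-java is a Java library with no Scala version suffix):

libraryDependencies += "com.google.protobuf" % "protobuf-java" % "3.25.5" % Provided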

Serverless and non-provided libraries

Serverless provides a reduced set of dependencies by default, to reduce issues caused by conflicting libraries.

For libraries that aren't available on serverless compute or in the Databricks Runtime, you can include them yourself in your JAR. For example, to include circe-core in your build.sbt file, add the following line:

libraryDependencies += "io.circe" %% "circe-core" % "0.14.10"

To add libraries to the serverless environment, see Configure environment for non-notebook job tasks.

Package as a single JAR

Databricks recommends packaging your application and all dependencies into a single JAR file, also known as an über or fat JAR. For sbt, use sbt-assembly, and for Maven, use maven-shade-plugin. See the official Maven Shade Plugin and sbt-assembly documentation for details.
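If sbt assembly reports duplicate or conflicting files across your dependencies, you can add a merge strategy to build.sbt. The following is a minimal sketch, and the strategy choices are assumptions; adjust them for your dependencies:

Scala
// Minimal sketch of a merge strategy for duplicate files in the fat JAR.
// Discarding META-INF metadata and taking the first copy of other duplicates
// is a common starting point, but verify it is safe for your dependencies.
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", _*) => MergeStrategy.discard
  case _                        => MergeStrategy.first
}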

Alternatively, you can install dependencies as cluster-scoped libraries. See compute-scoped libraries for more information. If you install libraries on your cluster, mark those dependencies as provided in your build file so the libraries are not packaged into your JAR. For serverless, add them to the serverless environment that you use. See Configure environment for non-notebook job tasks.

note

For Scala JARs installed as libraries on Unity Catalog standard clusters, classes in the JAR libraries must be in a named package, such as com.databricks.MyClass, or errors will occur when importing the library.

Using the Spark session in your code

When you are running a JAR within a job, you must use the Spark session that is provided by Databricks for the job. The following code shows how to access the session from your code:

SparkSession spark = SparkSession.builder().getOrCreate();
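
In Scala with Databricks Connect (serverless or standard compute), the equivalent is DatabricksSession, as shown in the Scala walkthrough later on this page:

Scala
import com.databricks.connect.DatabricksSession
import org.apache.spark.sql.SparkSession

// Reuses the Spark session that Databricks provides for the job.
val spark: SparkSession = DatabricksSession.builder().getOrCreate()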

Ensure your JAR is allowlisted (standard compute)

For security reasons, standard access mode requires an administrator to add Maven coordinates and paths for JAR libraries to an allowlist. See Allowlist libraries and init scripts on compute with standard access mode (formerly shared access mode).

Recommendation: Use try-finally blocks for job cleanup

If you want code that reliably runs at the end of your job, for example, to clean up temporary files created during the job, use a try-finally block. Do not use a shutdown hook, because shutdown hooks do not run reliably in jobs.

Consider a JAR that consists of two parts:

  • jobBody() which contains the main part of the job.
  • jobCleanup(), which must run after jobBody(), whether that function succeeded or threw an exception.

For example, jobBody() creates tables and jobCleanup() drops those tables.

The safe way to ensure that the clean-up method is called is to put a try-finally block in the code:

Scala
try {
  jobBody()
} finally {
  jobCleanup()
}
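
As a sketch of the table example above, jobBody() and jobCleanup() might look like the following. The catalog, schema, and table names are placeholders, and spark is the job's Spark session:

Scala
// Hypothetical example: jobBody() creates a working table and jobCleanup() drops it,
// even if jobBody() throws an exception.
def jobBody(): Unit =
  spark.sql("CREATE TABLE IF NOT EXISTS main.default.tmp_job_results AS SELECT * FROM range(10)")

def jobCleanup(): Unit =
  spark.sql("DROP TABLE IF EXISTS main.default.tmp_job_results")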

You should not try to clean up using sys.addShutdownHook(jobCleanup) or the following code:

Scala
// Do NOT clean up with a shutdown hook like this. This will fail.
val cleanupThread = new Thread { override def run = jobCleanup() }
Runtime.getRuntime.addShutdownHook(cleanupThread)

Because of the way the lifetime of Spark containers is managed in Databricks, the shutdown hooks are not run reliably.

Reading job parameters

Parameters are passed to your JAR job as a JSON string array. To access these parameters, inspect the String array passed into your main function.
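
For example, a main method that echoes each parameter it receives (a minimal sketch; how you parse the values is up to your application):

Scala
object PrintJobParameters {
  def main(args: Array[String]): Unit = {
    // Each element of args is one job parameter, in the order configured on the job.
    args.zipWithIndex.foreach { case (value, index) =>
      println(s"Parameter $index: $value")
    }
  }
}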

For more details on parameters, see Parameterize jobs.

Configure serverless networking

If your job accesses private resources (databases, APIs, storage), configure networking with a Network Connectivity Configuration (NCC). See Serverless network security.

Build a JAR

The following steps take you through creating and compiling a simple JAR file using Scala or Java to work in Databricks.

Requirements

Your local development environment must have the following:

  • Java Development Kit (JDK) 17.
  • sbt (for Scala JARs).
  • Databricks CLI version 0.218.0 or above. To check your installed version of the Databricks CLI, run the command databricks -v. To install the Databricks CLI, see Install or update the Databricks CLI.
  • Databricks CLI authentication is configured with a DEFAULT profile. To configure authentication, see Configure access to your workspace.

Create a Scala JAR

  1. Run the following command to create a new Scala project:

    > sbt new scala/scala-seed.g8

    When prompted, enter a project name, for example, my-spark-app.

  2. Replace the contents of your build.sbt file with the following. Choose the Scala and Databricks Connect versions that are needed for your compute. See Dependencies.

    Scala
    scalaVersion := "2.13.16"
    libraryDependencies += "com.databricks" %% "databricks-connect" % "17.0.1"
    // other dependencies go here...

    // Forking is required so the JVM options below are applied; otherwise the code runs with the sbt process's options
    fork := true
    javaOptions += "--add-opens=java.base/java.nio=ALL-UNNAMED"
  3. Edit or create a project/assembly.sbt file, and add this line:

    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.3.1")
  4. Create your main class in src/main/scala/com/examples/SparkJar.scala:

    Scala
    package com.examples

    import com.databricks.connect.DatabricksSession
    import org.apache.spark.sql.SparkSession

    object SparkJar {
      def main(args: Array[String]): Unit = {
        val spark: SparkSession = DatabricksSession.builder().getOrCreate()

        // Prints the arguments to the class, which
        // are job parameters when run as a job:
        println(args.mkString(", "))

        // Shows using spark:
        println(spark.version)
        println(spark.range(10).limit(3).collect().mkString(" "))
      }
    }
  5. To build your JAR file, run the following command:

    Bash
    > sbt assembly

Create a Java JAR

  1. Create a folder for your JAR.

  2. In the folder, create a file named PrintArgs.java with the following contents:

    Java
    import java.util.Arrays;

    public class PrintArgs {
      public static void main(String[] args) {
        System.out.println(Arrays.toString(args));
      }
    }
  3. Compile the PrintArgs.java file, which creates the file PrintArgs.class:

    Bash
    javac PrintArgs.java
  4. (Optional) Run the compiled program:

    Bash
    java PrintArgs Hello World!

    # [Hello, World!]
  5. In the folder, create a pom.xml file, and add the following code to enable Maven shade.

    <build>
      <plugins>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-shade-plugin</artifactId>
          <version>3.6.0</version>
          <executions>
            <execution>
              <phase>package</phase>
              <goals><goal>shade</goal></goals>
            </execution>
          </executions>
        </plugin>
      </plugins>
    </build>
  6. In the JAR folder, create a folder named META-INF.

  7. In the META-INF folder, create a file named MANIFEST.MF with the following contents. Be sure to add a newline at the end of this file:

    Main-Class: PrintArgs
  8. From your JAR folder, create a JAR named PrintArgs.jar:

    Bash
    jar cvfm PrintArgs.jar META-INF/MANIFEST.MF *.class
  9. (Optional) To test it, run the JAR:

    Bash
    java -jar PrintArgs.jar Hello World!

    # [Hello, World!]
    note

    If you get the error no main manifest attribute, in PrintArgs.jar, be sure to add a newline to the end of the MANIFEST.MF file, and then try creating and running the JAR again.

Next steps