
Create a Databricks compatible JAR

A Java archive (JAR) packages Java or Scala code for deployment in Lakeflow Jobs. This article covers JAR compatibility requirements and project configuration for different compute types.

tip

For automated deployment and continuous integration workflows, use Databricks Asset Bundles to create a project from a template with pre-configured build and deployment settings. See Build a Scala JAR using Databricks Asset Bundles and Bundle that uploads a JAR file to Unity Catalog. This article describes the manual approach for understanding JAR requirements and custom configurations.

At a high level, your JAR must meet the following requirements for compatibility:

  • Match versions: Use the same Java Development Kit (JDK), Scala, and Spark versions as your compute
  • Provide dependencies: Include required libraries in your JAR or install them on your compute
  • Use the Databricks Spark session: Call SparkSession.builder().getOrCreate() to access the session
  • Allowlist your JAR (standard compute only): Add your JAR to the allowlist
Beta

Serverless Scala and Java jobs are in Beta. Use JAR tasks to deploy your JAR. If the preview isn't already enabled, see Manage Databricks previews.

Compute architecture

Serverless and standard compute use the Spark Connect architecture to isolate user code and enforce Unity Catalog governance. Databricks Connect provides access to the Spark Connect APIs. Serverless and standard compute don't support the SparkContext or RDD APIs. See serverless limitations and standard access mode limitations.

Dedicated compute uses the classic Spark architecture and provides access to all Spark APIs.

Find your JDK, Scala, and Spark versions

When you build a JAR, your JDK, Scala, and Spark versions must match the versions running on your compute. These three versions are interconnected: the Spark version determines the compatible Scala version, and both depend on a specific JDK version.

For serverless compute, follow these steps to find the correct versions:

  1. Use serverless environment version 4 or higher.
  2. Find the Databricks Connect version for your environment in the serverless environment versions table. The Databricks Connect version corresponds to your Spark version.
  3. Look up the matching JDK, Scala, and Spark versions in the version support matrix.
note

Using mismatched JDK, Scala, or Spark versions may cause unexpected behavior or prevent your code from running.
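
To confirm the versions before you build, you can run a quick check in a notebook attached to the target compute, where spark is the predefined session. This is a minimal sketch; the printed values are the versions your build file must match.

Scala
// Run on the target compute; the output shows the versions your JAR must match.
println(s"Spark version: ${spark.version}")
println(s"Scala version: ${scala.util.Properties.versionString}")
println(s"Java version:  ${System.getProperty("java.version")}")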

Project setup

Once you know your version requirements, configure your build files and package your JAR.

Set JDK and Scala versions

Configure your build file to use the correct JDK and Scala versions. The following examples show the versions for Databricks Runtime 17.3 LTS and serverless environment version 4-scala-preview.

In build.sbt:

Scala
scalaVersion := "2.13.16"

javacOptions ++= Seq("-source", "17", "-target", "17")
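
Optionally, you can also pin the JDK API level for Scala sources with the compiler's -release flag. This isn't required, but it keeps Scala and Java sources targeting the same JDK (a sketch for a Scala 2.13 toolchain):

Scala
// Optional: target the same JDK API level for Scala sources as for Java sources.
scalacOptions ++= Seq("-release", "17")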

Spark dependencies

Add a Spark dependency to access Spark APIs without packaging Spark in your JAR.

Use Databricks Connect

Add a dependency on Databricks Connect (recommended). The version must match the Databricks Connect version of your serverless environment. Mark it as provided because it's already included in the runtime. Don't include Apache Spark dependencies such as spark-core or other org.apache.spark artifacts in your build file; Databricks Connect provides all the necessary Spark APIs.

Maven pom.xml:

XML
<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>databricks-connect_2.13</artifactId>
  <version>17.0.2</version>
  <scope>provided</scope>
</dependency>

sbt build.sbt:

Scala
libraryDependencies += "com.databricks" %% "databricks-connect" % "17.0.2" % "provided"

Alternative: spark-sql-api

You can compile against spark-sql-api instead of Databricks Connect, but Databricks recommends using Databricks Connect because the Spark APIs running on serverless compute may differ slightly from open-source Spark. This library is included in the runtime, so mark it as provided.

Maven pom.xml:

XML
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql-api</artifactId>
  <version>4.0.1</version>
  <scope>provided</scope>
</dependency>

sbt build.sbt:

Scala
libraryDependencies += "org.apache.spark" %% "spark-sql-api" % "4.0.1" % "provided"

Application dependencies

Add your application's required libraries to your build file. How you manage these depends on your compute type.

Serverless compute provides Databricks Connect and a limited set of dependencies (see the release notes). Package all other libraries in your JAR using sbt-assembly or the Maven Shade Plugin (a complete build.sbt sketch follows the examples below), or add them to your serverless environment.

For example, to package a library in your JAR:

Maven pom.xml:

XML
<dependency>
  <groupId>io.circe</groupId>
  <artifactId>circe-core_2.13</artifactId>
  <version>0.14.10</version>
</dependency>

sbt build.sbt:

Scala
libraryDependencies += "io.circe" %% "circe-core" % "0.14.10"
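
Putting these settings together, the following is a minimal build.sbt sketch for a serverless JAR that packages circe while leaving Databricks Connect provided. The project name, main class, and sbt-assembly plugin version are assumptions; adjust them for your project.

Scala
// project/plugins.sbt (plugin version is an assumption; use a current sbt-assembly release)
// addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.2.0")

// build.sbt
name := "my-jar-job" // hypothetical project name
version := "0.1.0"
scalaVersion := "2.13.16"

javacOptions ++= Seq("-source", "17", "-target", "17")

libraryDependencies ++= Seq(
  // Provided by the runtime: compile against it, but don't package it.
  "com.databricks" %% "databricks-connect" % "17.0.2" % "provided",
  // Application dependency that gets packaged into the assembly JAR.
  "io.circe" %% "circe-core" % "0.14.10"
)

// Set the entry point and resolve duplicate files when building the assembly JAR.
assembly / mainClass := Some("com.example.Main") // hypothetical main class
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", _*) => MergeStrategy.discard
  case _                        => MergeStrategy.first
}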

Code requirements

When writing your JAR code, follow these patterns to ensure compatibility with Databricks jobs.

Use the Databricks Spark session

When running a JAR in a job, you must use the Spark session provided by Databricks. The following example shows how to access the session from your code:

Java
SparkSession spark = SparkSession.builder().getOrCreate();
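
For Scala, a comparable entry point might look like the following sketch (the object and query are illustrative):

Scala
import org.apache.spark.sql.SparkSession

object Main {
  def main(args: Array[String]): Unit = {
    // Reuse the session that Databricks provides instead of configuring a new one.
    val spark = SparkSession.builder().getOrCreate()
    spark.sql("SELECT 'hello' AS greeting").show()
  }
}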

Use try-finally blocks for job cleanup

If you want code that reliably runs at the end of your job, for example, to clean up temporary files, use a try-finally block. Don't use a shutdown hook, because these don't run reliably in jobs.

Consider a JAR that consists of two parts:

  • jobBody() which contains the main part of the job.
  • jobCleanup() which must run after jobBody(), whether that function succeeds or throws an exception.

For example, jobBody() creates tables and jobCleanup() drops those tables.

The safe way to ensure that the clean-up method is called is to put a try-finally block in the code:

Scala
try {
  jobBody()
} finally {
  jobCleanup()
}

Don't try to clean up using sys.addShutdownHook(jobCleanup) or the following code:

Scala
// Do NOT clean up with a shutdown hook like this. This will fail.
val cleanupThread = new Thread { override def run = jobCleanup() }
Runtime.getRuntime.addShutdownHook(cleanupThread)

Databricks manages Spark container lifetimes in a way that prevents shutdown hooks from running reliably.

Read job parameters

Databricks passes parameters to your JAR job as a JSON string array. To access these parameters, inspect the String array passed into your main function.
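
For example, a minimal main method that echoes its parameters might look like this sketch:

Scala
object Main {
  def main(args: Array[String]): Unit = {
    // Parameters configured on the JAR task arrive as ordinary command-line arguments.
    args.zipWithIndex.foreach { case (arg, i) =>
      println(s"Parameter $i: $arg")
    }
  }
}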

For more details on parameters, see Parameterize jobs.

Additional configuration

Depending on your compute type, you might need additional configuration:

  • Standard access mode: For security reasons, an administrator must add Maven coordinates and paths for JAR libraries to an allowlist.
  • Serverless compute: If your job accesses private resources (databases, APIs, storage), configure networking with a Network Connectivity Configuration (NCC). See Serverless network security.

Next steps