Create and run JARs on serverless compute
Databricks strongly recommends Declarative Automation Bundles instead of building and deploying JARs manually as described on this page. Declarative Automation Bundles makes it easy to create a project from a template that has the correct Scala, JDK, and Databricks Connect versions already configured for serverless, and also enables simple deployment of the JAR to the Databricks workspace. See Build a Scala JAR with Declarative Automation Bundles.
Serverless Scala and Java jobs are in Public Preview.
A Java archive (JAR) packages Java or Scala code into a single file. This page shows you how to create a JAR with Spark code and deploy it as a Lakeflow Job on serverless compute. You can use JAR tasks to deploy your JAR.
Requirements
To build a JAR, your local development environment must have the following installed:
- sbt 1.11.7 or higher for Scala JARs
- Maven 3.9.0 or higher for Java JARs
- JDK, Scala, and Databricks Connect versions that match your serverless environment. See Dependency versions.
Dependency versions
To run on serverless compute without failures, your JAR's Scala and JDK versions must exactly match the Scala and JDK versions of the serverless runtime. See Databricks Connect versions.
The example on this page uses serverless environment version 4, so this page creates a JAR that:
- Is compiled against Scala 2.13; every dependency uses the _2.13 suffix.
- Is compiled against JDK 17, class file version 61.
- Is compiled against Databricks Connect 17.3, the Spark API surface for serverless compute.
- Uses only public Spark APIs. It uses no RDDs and no Spark internals. See Limitations.
- Includes every dependency in the JAR or attached as a serverless environment library. See Managing dependencies.
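The JDK requirement in the list above can be checked against a compiled class before you deploy. The following is a minimal sketch, not part of any Databricks tooling: `ClassVersionCheck` and its method names are hypothetical. It reads the standard class file header (a 4-byte magic number, then the minor and major versions) and compares the major version with 61, the value this page gives for JDK 17.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not part of any Databricks tooling.
public class ClassVersionCheck {
    // JDK 17 emits class file major version 61 (Java release number + 44).
    static final int JDK17_MAJOR = 61;

    // Returns the major version stored in bytes 6-7 of a compiled class file.
    static int majorVersion(byte[] classBytes) {
        if (classBytes.length < 8 || ByteBuffer.wrap(classBytes).getInt(0) != 0xCAFEBABE) {
            throw new IllegalArgumentException("Not a Java class file");
        }
        return ByteBuffer.wrap(classBytes).getShort(6) & 0xFFFF;
    }

    // True when the class file targets exactly the JDK 17 class file version.
    static boolean targetsJdk17(int major) {
        return major == JDK17_MAJOR;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ClassVersionCheck.java <path-to-.class-file>");
            return;
        }
        int major = majorVersion(Files.readAllBytes(Path.of(args[0])));
        System.out.println("major version " + major
            + (targetsJdk17(major) ? " matches JDK 17" : " does not match JDK 17"));
    }
}
```

To try it, extract one class from your JAR and pass its path as the only argument. A major version above 61 suggests the class was compiled for a JDK newer than 17.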
Limitations
Serverless compute uses Spark Connect. Your JAR runs against a thin client library that exposes the public Spark APIs, while the Spark engine itself runs server-side. Code that bypasses the public API can't benefit from Catalyst optimization or Photon acceleration, even on classic compute. RDD-based and internals-dependent code is generally slower than the equivalent DataFrame or SQL code.
The following aren't available:
- RDD API (org.apache.spark.rdd.*) and SparkContext/JavaSparkContext. Use SparkSession.builder().getOrCreate() and DataFrame/Dataset operations instead.
- Spark internal APIs (org.apache.spark.sql.catalyst.*, org.apache.spark.util.*, org.apache.spark.sql.util.*, org.apache.spark.sql.internal.*). Code that imports these APIs fails with NoClassDefFoundError. Refactor to the public Spark API. If a third-party library uses internals, check whether it publishes a Spark Connect-compatible release.
- Native libraries (.so, .dll, JNI). Serverless compute does not permit writing native libraries to the file system. Libraries that unpack native binaries at startup fail with UnsatisfiedLinkError. Init scripts are not a workaround. Use a Java equivalent if one is available.
If your workload requires any of the above, run it on standard or dedicated compute instead.
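One way to catch these limitations before deployment is to scan the built JAR. The sketch below is a heuristic, not a Databricks tool: `ServerlessJarScan` is a hypothetical name, and the prefix list is an assumption derived from the limitations above. It looks for the slash-separated form of the unavailable packages inside each class file's bytes and flags entries that look like native libraries. A plain byte search can produce false positives, so treat findings as pointers for manual review.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical helper for illustration; a heuristic, not an official validator.
public class ServerlessJarScan {
    // Class files store referenced type names with '/' separators.
    // This list is an assumption based on the limitations described on this page.
    static final List<String> UNAVAILABLE = List.of(
        "org/apache/spark/rdd/",
        "org/apache/spark/SparkContext",
        "org/apache/spark/api/java/JavaSparkContext",
        "org/apache/spark/sql/catalyst/",
        "org/apache/spark/util/",
        "org/apache/spark/sql/util/",
        "org/apache/spark/sql/internal/");

    // Returns one finding per problem detected in a single JAR entry.
    static List<String> findings(String entryName, byte[] content) {
        List<String> out = new ArrayList<>();
        String lower = entryName.toLowerCase();
        if (lower.endsWith(".so") || lower.endsWith(".dll")) {
            out.add(entryName + ": native library");
        }
        if (lower.endsWith(".class")) {
            // ISO-8859-1 maps each byte to one char, so ASCII names stay searchable.
            String text = new String(content, StandardCharsets.ISO_8859_1);
            for (String prefix : UNAVAILABLE) {
                if (text.contains(prefix)) {
                    out.add(entryName + ": references " + prefix);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ServerlessJarScan.java <path-to-jar>");
            return;
        }
        try (JarFile jar = new JarFile(args[0])) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                JarEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;
                }
                try (InputStream in = jar.getInputStream(entry)) {
                    findings(entry.getName(), in.readAllBytes()).forEach(System.out::println);
                }
            }
        }
    }
}
```

Running it against a JAR prints nothing when no finding is detected. Findings inside third-party libraries are a hint to look for a Spark Connect-compatible release of that library.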
Step 1: Build a JAR
Follow the Scala steps or the Java steps, depending on your project's language.

For Scala:
- Run the following command to create a Scala project:
  Bash
      sbt new scala/scala-seed.g8
  When prompted, enter a project name, for example, my-spark-app.
- Next, delete the seed's stub files and create the directory for your source:
  Bash
      cd my-spark-app
      rm src/main/scala/example/Hello.scala
      rm src/test/scala/example/HelloSpec.scala
      rm project/Dependencies.scala
      mkdir -p src/main/scala/com/examples
- Replace the contents of your build.sbt file with the following:
  Scala
      name := "my-spark-app"
      // Set the dependency versions
      scalaVersion := "2.13.16"
      javacOptions ++= Seq("--release", "17")
      scalacOptions ++= Seq("-release", "17")
      libraryDependencies += "com.databricks" %% "databricks-connect" % "17.3.2" % "provided"
      // Your other dependencies go here. Use %% for Scala libraries so sbt picks the _2.13 artifact.
      // Fork a new JVM on run so our javaOptions are applied.
      fork := true
      javaOptions += "--add-opens=java.base/java.nio=ALL-UNNAMED"
- Edit or create a project/plugins.sbt file, and add this line:
  Scala
      addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.3.1")
- Create your main class in src/main/scala/com/examples/SparkJar.scala:
  Scala
      package com.examples
      import org.apache.spark.sql.SparkSession
      object SparkJar {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().getOrCreate()
          // Prints the arguments to the class, which
          // are job parameters when run as a job:
          println(args.mkString(", "))
          // Shows using spark:
          println(spark.version)
          println(spark.range(10).limit(3).collect().mkString(" "))
        }
      }
- To build your JAR file, run the following command:
  Bash
      sbt assembly
  The compiled JAR is created in the target/ folder as my-spark-app-assembly-0.1.0-SNAPSHOT.jar.
For Java:
- Run the following commands to create a Maven project structure:
  Bash
      mkdir -p my-spark-app/src/main/java/com/examples
      cd my-spark-app
- Create a pom.xml file in the project root with the following contents:
  XML
      <?xml version="1.0" encoding="UTF-8"?>
      <project xmlns="http://maven.apache.org/POM/4.0.0"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
               http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
        <groupId>com.examples</groupId>
        <artifactId>my-spark-app</artifactId>
        <version>1.0-SNAPSHOT</version>
        <properties>
          <maven.compiler.release>17</maven.compiler.release>
          <scala.binary.version>2.13</scala.binary.version>
          <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        </properties>
        <dependencies>
          <!-- Included on serverless compute. -->
          <dependency>
            <groupId>com.databricks</groupId>
            <artifactId>databricks-connect_${scala.binary.version}</artifactId>
            <version>17.3.2</version>
            <scope>provided</scope>
          </dependency>
        </dependencies>
        <build>
          <plugins>
            <!-- Maven Shade Plugin - Creates a fat JAR with all non-provided dependencies. -->
            <plugin>
              <groupId>org.apache.maven.plugins</groupId>
              <artifactId>maven-shade-plugin</artifactId>
              <version>3.6.1</version>
              <executions>
                <execution>
                  <phase>package</phase>
                  <goals>
                    <goal>shade</goal>
                  </goals>
                  <configuration>
                    <transformers>
                      <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                        <mainClass>com.examples.SparkJar</mainClass>
                      </transformer>
                    </transformers>
                  </configuration>
                </execution>
              </executions>
            </plugin>
          </plugins>
        </build>
      </project>
- Create your main class in src/main/java/com/examples/SparkJar.java:
  Java
      package com.examples;
      import org.apache.spark.sql.SparkSession;
      import java.util.stream.Collectors;
      public class SparkJar {
          public static void main(String[] args) {
              SparkSession spark = SparkSession.builder().getOrCreate();
              // Prints the arguments to the class, which
              // are job parameters when run as a job:
              System.out.println(String.join(", ", args));
              // Shows using spark:
              System.out.println(spark.version());
              System.out.println(
                  spark.range(10).limit(3).collectAsList().stream()
                      .map(Object::toString)
                      .collect(Collectors.joining(" "))
              );
          }
      }
- To build your JAR file, run the following command:
  Bash
      mvn clean package
  The compiled JAR is created in the target/ folder as my-spark-app-1.0-SNAPSHOT.jar.
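After the Maven build, it can help to confirm that the JAR records the expected main class and that the provided Databricks Connect dependency was left out. The following is a minimal sketch using only the standard java.util.jar API. `BuiltJarCheck` is a hypothetical name, and treating any entry under org/apache/spark/ as a sign of bundled Spark classes is an assumption for illustration.

```java
import java.util.List;
import java.util.jar.Attributes;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.jar.Manifest;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of Maven or Databricks tooling.
public class BuiltJarCheck {
    // Returns the Main-Class recorded in the manifest, or null if none is set.
    static String mainClass(Manifest manifest) {
        return manifest.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
    }

    // Assumption: entries under org/apache/spark/ indicate Spark classes were bundled.
    static boolean bundlesSpark(List<String> entryNames) {
        return entryNames.stream().anyMatch(name -> name.startsWith("org/apache/spark/"));
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java BuiltJarCheck.java <path-to-jar>");
            return;
        }
        try (JarFile jar = new JarFile(args[0])) {
            Manifest manifest = jar.getManifest(); // null when the JAR has no manifest
            System.out.println("Main-Class: " + (manifest == null ? null : mainClass(manifest)));
            List<String> names = jar.stream().map(JarEntry::getName).collect(Collectors.toList());
            System.out.println("Bundles Spark classes: " + bundlesSpark(names));
        }
    }
}
```

For the example project on this page, you would expect the main class to be com.examples.SparkJar, because the pom.xml sets it through the shade plugin's manifest transformer.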
Managing dependencies
To make a library available to your JAR on serverless compute:
- Use a provided library: Serverless compute includes Databricks Connect and a curated set of common libraries. If your version is compatible, declare it provided in your build and don't include it in your JAR.
- Attach as an environment library: Add a library to your serverless environment if it isn't already provided. Use this for runtime-only libraries you don't want to include.
- Connect to an external database: For JDBC sources, use a JDBC connection instead of including a driver. JDBC connections are Unity Catalog-managed, so credentials, lineage, and governance are handled for you.
Provided libraries
The following libraries are required dependencies and are available by default on serverless compute. Declare them provided in your build. Bundling your own versions of these libraries triggers a NoSuchMethodError at runtime.
The library versions listed below are for serverless environment version 4. For installed libraries for other environment versions, see the serverless environment version notes reference.
- com.databricks:databricks-connect_2.13, version 17.3.2
- org.scala-lang:scala-library_2.13, version 2.13.16
- org.scala-lang:scala-reflect_2.13, version 2.13.16
- org.slf4j:slf4j-api, version 2.0.10
- org.apache.logging.log4j:log4j-api, version 2.20.0
- org.apache.logging.log4j:log4j-core, version 2.20.0
- org.apache.httpcomponents:httpclient, version 4.5.14
- org.apache.httpcomponents:httpcore, version 4.4.16
- com.fasterxml.jackson.core:jackson-databind, version 2.15.2
- com.fasterxml.jackson.core:jackson-core, version 2.15.2
- com.fasterxml.jackson.core:jackson-annotations, version 2.15.2
- com.fasterxml.jackson.datatype:jackson-datatype-jsr310, version 2.15.2
- com.google.guava:guava, version 32.0.1-jre
- commons-io:commons-io, version 2.14.0
- org.json4s:json4s-jackson_2.13, version 4.0.7
- org.apache.commons:commons-lang3, version 3.14.0
- org.apache.commons:commons-configuration2, version 2.11.0
- org.apache.commons:commons-text, version 1.12.0
- com.databricks:databricks-sdk-java, version 0.52.0
- com.databricks:databricks-dbutils-scala_2.13, version 0.1.4
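To see how the provided list might be used during a dependency review, the sketch below compares a dependency coordinate against a small subset of the libraries listed above for environment version 4. `ProvidedLibraryCheck`, its `Advice` values, and the sample coordinate `org.example:my-lib:1.0.0` are hypothetical. The version map is copied from this page and is intentionally incomplete.

```java
import java.util.Map;

// Hypothetical helper for illustration; the map is a subset of the list on this page.
public class ProvidedLibraryCheck {
    enum Advice { DECLARE_PROVIDED, VERSION_DIFFERS, NOT_LISTED }

    // group:artifact -> version provided by serverless environment version 4.
    static final Map<String, String> PROVIDED = Map.of(
        "com.google.guava:guava", "32.0.1-jre",
        "com.fasterxml.jackson.core:jackson-databind", "2.15.2",
        "org.slf4j:slf4j-api", "2.0.10",
        "org.apache.commons:commons-lang3", "3.14.0");

    // Classifies a coordinate written as group:artifact:version.
    static Advice advise(String coordinate) {
        String[] parts = coordinate.split(":");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected group:artifact:version, got " + coordinate);
        }
        String providedVersion = PROVIDED.get(parts[0] + ":" + parts[1]);
        if (providedVersion == null) {
            return Advice.NOT_LISTED; // include it in the JAR or attach it to the environment
        }
        return providedVersion.equals(parts[2]) ? Advice.DECLARE_PROVIDED : Advice.VERSION_DIFFERS;
    }

    public static void main(String[] args) {
        String[] samples = {
            "com.google.guava:guava:32.0.1-jre",
            "com.fasterxml.jackson.core:jackson-databind:2.17.0",
            "org.example:my-lib:1.0.0" // hypothetical coordinate
        };
        for (String coordinate : samples) {
            System.out.println(coordinate + " -> " + advise(coordinate));
        }
    }
}
```

A VERSION_DIFFERS result marks a library that you would compile against at a different version than the one serverless supplies, so it is worth aligning your build with the provided version rather than bundling your own.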
Step 2: Create a job to run the JAR
- In your workspace, click Jobs & Pipelines in the sidebar.
- Click Create, then Job.
- Click the JAR tile to configure the first task. If the JAR tile is not available, click Add another task type and search for JAR.
- Optionally, replace the name of the job, which defaults to New Job <date-time>, with your job name.
- In Task name, enter a name for the task, for example JAR_example.
- If necessary, select JAR from the Type drop-down menu.
- For Main class, enter the package and class of your JAR. If you followed the example earlier, enter com.examples.SparkJar.
- For Compute, select Serverless.
- Configure the serverless environment:
  - Select an environment, then click Edit to configure it.
  - Select 4 or higher for the Environment version.
  - Add your JAR file by dragging and dropping it into the file selector, or browse to select it from a Unity Catalog volume or workspace location.
- For Parameters, for this example, enter ["Hello", "World!"].
- Click Create task.
Step 3: Run the job and view the job run details
Run the job. To view details for the run, click View run in the Triggered run pop-up, or click the link in the Start time column for the run in the job runs view.
When the run completes, the output appears in the Output pane, including the arguments you passed to the task.
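As a rough local stand-in for what the first output line looks like, the sketch below applies the same String.join call used in the example main class to the parameters from Step 2. It assumes each element of the Parameters array arrives as one element of args. `JobArgsDemo` is a hypothetical class that doesn't use Spark, so it shows only the arguments line, not the Spark version or the range output.

```java
// Hypothetical stand-in for illustration; it does not start a Spark session.
public class JobArgsDemo {
    // Mirrors the first println in the example main class.
    static String argumentsLine(String[] args) {
        return String.join(", ", args);
    }

    public static void main(String[] args) {
        // Stand-in for the task parameters ["Hello", "World!"] from Step 2.
        String[] jobParameters = {"Hello", "World!"};
        System.out.println(argumentsLine(jobParameters)); // prints "Hello, World!"
    }
}
```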
Troubleshooting
The following table provides troubleshooting information for common exceptions.
| Exception | Cause | Fix |
|---|---|---|
| | JAR compiled against Scala 2.12; serverless runs Scala 2.13 | Recompile with |
| | Scala 2.12 vs 2.13 mismatch | Recompile with |
| | Compiled with JDK 18 or higher; serverless runs JDK 17 | Use |
| | Spark internals or RDD API were used. These are not available on serverless. | Use the public Spark API (DataFrame/Dataset/SQL). See limitations on serverless. |
| | JDBC driver not on classpath | Use a JDBC connection for the external database. |
| | The library is not on the serverless classpath. | Add it to your JAR, or provide it as an additional JAR using the serverless environment. |
| | Native library included in JAR | Native libraries are not supported on serverless. Use a pure-Java equivalent, or run on classic compute. |
| | Your included version conflicts with the version provided by serverless. | Use the provided version. Mark it |
Next steps
- To learn more about JAR tasks, see JAR task for jobs.
- To learn more about creating a compatible JAR, see Create a Databricks compatible JAR.
- To learn more about creating and running jobs, see Lakeflow Jobs.