Use a JAR in a Databricks job

The Java archive (JAR) file format is based on the ZIP file format and is used to aggregate many Java or Scala files into one. Using the JAR task, you can ensure fast and reliable installation of Java or Scala code in your Databricks jobs. This article provides an example of creating a JAR and a job that runs the application packaged in the JAR. In this example, you will:

  • Create the JAR project defining an example application.

  • Bundle the example files into a JAR.

  • Create a job to run the JAR.

  • Run the job and view the results.

Before you begin

You need the following to complete this example:

  • For Java JARs, the Java Development Kit (JDK).

  • For Scala JARs, the JDK and sbt.

Step 1: Create a local directory for the example

Create a local directory to hold the example code and generated artifacts, for example, databricks_jar_test.

Step 2: Create the JAR

Complete the following instructions to use Java or Scala to create the JAR.

Create a Java JAR

  1. From the databricks_jar_test folder, create a file named PrintArgs.java with the following contents:

    import java.util.Arrays;
    
    public class PrintArgs {
      public static void main(String[] args) {
        System.out.println(Arrays.toString(args));
      }
    }
    
  2. Compile the PrintArgs.java file, which creates the file PrintArgs.class:

    javac PrintArgs.java
    
  3. (Optional) Run the compiled program:

    java PrintArgs Hello World!
    
    # [Hello, World!]
    
  4. In the same folder as the PrintArgs.java and PrintArgs.class files, create a folder named META-INF.

  5. In the META-INF folder, create a file named MANIFEST.MF with the following contents. Be sure to add a newline at the end of this file:

    Main-Class: PrintArgs
    
  6. From the root of the databricks_jar_test folder, create a JAR named PrintArgs.jar:

    jar cvfm PrintArgs.jar META-INF/MANIFEST.MF *.class
    
  7. (Optional) To test it, from the root of the databricks_jar_test folder, run the JAR:

    java -jar PrintArgs.jar Hello World!
    
    # [Hello, World!]
    

    Note

    If you get the error no main manifest attribute, in PrintArgs.jar, be sure to add a newline to the end of the MANIFEST.MF file, and then try creating and running the JAR again.

  8. Upload PrintArgs.jar to a volume. See Upload files to a Unity Catalog volume.
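
Before uploading, you can also confirm programmatically that the JAR declares a main class. The following is a minimal sketch rather than part of the original example: the class name ManifestCheck and the method mainClassOf are hypothetical, and the code relies only on the standard java.util.jar API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

// Hypothetical helper: reports the Main-Class entry of a JAR's manifest.
final class ManifestCheck {
  // Returns the Main-Class value, or null if the JAR has no manifest
  // or the manifest has no Main-Class attribute.
  static String mainClassOf(String jarPath) {
    try (JarFile jar = new JarFile(jarPath)) {
      Manifest manifest = jar.getManifest();
      return manifest == null
          ? null
          : manifest.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    if (args.length != 1) {
      System.err.println("Usage: java ManifestCheck <path-to-jar>");
      return;
    }
    String mainClass = mainClassOf(args[0]);
    System.out.println(mainClass == null
        ? "No Main-Class attribute found in " + args[0]
        : "Main-Class: " + mainClass);
  }
}
```

For the JAR built in this example, running it with PrintArgs.jar as the argument should report PrintArgs as the main class. A missing attribute corresponds to the no main manifest attribute error described in the note above.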

Create a Scala JAR

  1. From the databricks_jar_test folder, create a file named build.sbt with the following contents:

    ThisBuild / scalaVersion := "2.12.14"
    ThisBuild / organization := "com.example"
    
    lazy val PrintArgs = (project in file("."))
      .settings(
        name := "PrintArgs"
      )
    
  2. From the databricks_jar_test folder, create the folder structure src/main/scala/example.

  3. In the example folder, create a file named PrintArgs.scala with the following contents:

    package example
    
    object PrintArgs {
      def main(args: Array[String]): Unit = {
        println(args.mkString(", "))
      }
    }
    
  4. Compile the program:

    sbt compile
    
  5. (Optional) Run the compiled program:

    sbt 'run Hello World!'
    
    # Hello, World!
    
  6. In the databricks_jar_test/project folder, create a file named assembly.sbt with the following contents:

    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.0.0")
    
  7. From the root of the databricks_jar_test folder, run the assembly command, which generates a JAR under the target folder:

    sbt assembly
    
  8. (Optional) To test it, from the root of the databricks_jar_test folder, run the JAR:

    java -jar target/scala-2.12/PrintArgs-assembly-0.1.0-SNAPSHOT.jar Hello World!
    
    # Hello, World!
    
  9. Upload PrintArgs-assembly-0.1.0-SNAPSHOT.jar to a volume. See Upload files to a Unity Catalog volume.

Step 3: Create a Databricks job to run the JAR

  1. Go to your Databricks landing page and do one of the following:

    • In the sidebar, click Workflows and click Create Job.

    • In the sidebar, click New and select Job from the menu.

  2. In the task dialog box that appears on the Tasks tab, replace Add a name for your job… with your job name, for example JAR example.

  3. For Task name, enter a name for the task, for example java_jar_task for Java, or scala_jar_task for Scala.

  4. For Type, select JAR.

  5. For Main class, for this example, enter PrintArgs for Java, or example.PrintArgs for Scala.

  6. For Cluster, select a compatible cluster. See Java and Scala library support.

  7. For Dependent libraries, click + Add.

  8. In the Add dependent library dialog, with Volumes selected, enter the location where you uploaded the JAR (PrintArgs.jar or PrintArgs-assembly-0.1.0-SNAPSHOT.jar) in the previous step into Volumes File Path, or filter or browse to find the JAR. Select it.

  9. Click Add.

  10. For Parameters, for this example, enter ["Hello", "World!"].

  11. Click Add.

Step 4: Run the job and view the job run details

Click Run Now to run the workflow. To view details for the run, click View run in the Triggered run pop-up or click the link in the Start time column for the run in the job runs view.

When the run completes, the output displays in the Output panel, including the arguments passed to the task.

Output size limits for JAR jobs

Job output, such as log output emitted to stdout, is subject to a 20 MB size limit. If the total output exceeds this limit, the run is canceled and marked as failed.

To avoid encountering this limit, you can prevent stdout from being returned from the driver to Databricks by setting the spark.databricks.driver.disableScalaOutput Spark configuration to true. By default, the flag value is false. The flag controls cell output for Scala JAR jobs and Scala notebooks. If the flag is enabled, Spark does not return job execution results to the client. The flag does not affect the data that is written in the cluster’s log files. Databricks recommends setting this flag only for job clusters for JAR jobs because it disables notebook results.

Recommendation: Use the shared SparkContext

Because Databricks is a managed service, some code changes might be necessary to ensure that your Apache Spark jobs run correctly. JAR job programs must use the shared SparkContext API to get the SparkContext. Because Databricks initializes the SparkContext, programs that invoke new SparkContext() will fail. To get the SparkContext, use only the shared SparkContext created by Databricks:

import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
val goodSparkContext = SparkContext.getOrCreate()
val goodSparkSession = SparkSession.builder().getOrCreate()

There are also several methods you should avoid when using the shared SparkContext.

  • Do not call SparkContext.stop().

  • Do not call System.exit(0) or sc.stop() at the end of your main program. This can cause undefined behavior.

Recommendation: Use try-finally blocks for job clean up

Consider a JAR that consists of two parts:

  • jobBody() which contains the main part of the job.

  • jobCleanup() which has to be executed after jobBody(), whether that function succeeded or threw an exception.

For example, jobBody() creates tables and jobCleanup() drops those tables.

The safe way to ensure that the clean-up method is called is to put a try-finally block in the code:

try {
  jobBody()
} finally {
  jobCleanup()
}

You should not try to clean up using sys.addShutdownHook(jobCleanup) or the following code:

val cleanupThread = new Thread { override def run = jobCleanup() }
Runtime.getRuntime.addShutdownHook(cleanupThread)

Because of the way the lifetime of Spark containers is managed in Databricks, the shutdown hooks are not run reliably.
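
For Java JARs, the same try-finally pattern applies. The following is a minimal sketch: the class CleanupExample, the fail flag, and the events list are hypothetical and exist only to make the control flow visible. jobBody() and jobCleanup() are stand-ins for your own logic.

```java
import java.util.ArrayList;
import java.util.List;

final class CleanupExample {
  // Records the order of calls so the control flow can be inspected.
  static final List<String> events = new ArrayList<>();

  // Hypothetical stand-in for the main part of the job; fails on request.
  static void jobBody(boolean fail) {
    events.add("jobBody");
    if (fail) {
      throw new IllegalStateException("jobBody failed");
    }
  }

  // Hypothetical stand-in for clean-up work, such as dropping tables.
  static void jobCleanup() {
    events.add("jobCleanup");
  }

  static void runJob(boolean fail) {
    try {
      jobBody(fail);
    } finally {
      // Runs whether jobBody returned normally or threw an exception.
      jobCleanup();
    }
  }

  public static void main(String[] args) {
    runJob(false);
    try {
      runJob(true);
    } catch (IllegalStateException e) {
      events.add("caught: " + e.getMessage());
    }
    // Prints [jobBody, jobCleanup, jobBody, jobCleanup, caught: jobBody failed]
    System.out.println(events);
  }
}
```

Note that the finally block does not swallow the exception: after jobCleanup() runs, the failure from jobBody() still propagates to the caller.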

Configuring JAR job parameters

You pass parameters to JAR jobs with a JSON string array. See the spark_jar_task object in the request body passed to the Create a new job operation (POST /jobs/create) in the Jobs API. To access these parameters, inspect the String array passed into your main function.
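
As an illustration, each element of the JSON array arrives as one element of the String array, in order. The following is a minimal sketch of reading positional parameters with fallbacks. The class JobParams, the helper argOrDefault, and the default values are hypothetical.

```java
final class JobParams {
  // Returns the parameter at the given position, or the fallback if the
  // job was started with fewer parameters.
  static String argOrDefault(String[] args, int index, String fallback) {
    return index < args.length ? args[index] : fallback;
  }

  public static void main(String[] args) {
    // With Parameters set to ["Hello", "World!"], args[0] is "Hello"
    // and args[1] is "World!".
    String greeting = argOrDefault(args, 0, "Hello");
    String target = argOrDefault(args, 1, "Databricks");
    System.out.println(greeting + ", " + target);
  }
}
```

Because every parameter is passed as a string, numeric or boolean values need to be parsed explicitly, for example with Integer.parseInt.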

Manage library dependencies

The Spark driver has certain library dependencies that cannot be overridden. If your job adds conflicting libraries, the Spark driver library dependencies take precedence.

To get the full list of the driver library dependencies, run the following command in a notebook attached to a cluster configured with the same Spark version (or the cluster with the driver you want to examine):

%sh
ls /databricks/jars
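
If you suspect that a class is being loaded from a driver-provided library instead of your own JAR, one way to check is to ask the JVM where the class came from. The following is a minimal sketch using only standard Java APIs. The class WhichJar and the method locationOf are hypothetical, and the output depends on the cluster's class loaders, so treat the result as a diagnostic hint rather than a guarantee.

```java
import java.security.CodeSource;

final class WhichJar {
  // Returns the location (JAR or directory) a class was loaded from, or a
  // placeholder when the JVM reports none, as it does for core JDK classes.
  static String locationOf(Class<?> cls) {
    CodeSource source = cls.getProtectionDomain().getCodeSource();
    return source == null || source.getLocation() == null
        ? "(no code source reported)"
        : source.getLocation().toString();
  }

  public static void main(String[] args) throws ClassNotFoundException {
    // Pass a fully qualified class name; defaults to this class.
    String name = args.length > 0 ? args[0] : WhichJar.class.getName();
    System.out.println(name + " <- " + locationOf(Class.forName(name)));
  }
}
```

Calling locationOf from inside a job with a class that exists both in your JAR and under /databricks/jars shows which copy was actually loaded.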

When you define library dependencies for JARs, Databricks recommends listing Spark and Hadoop as provided dependencies. In Maven, add Spark and Hadoop as provided dependencies:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.3.0</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>1.2.1</version>
  <scope>provided</scope>
</dependency>

In sbt, add Spark and Hadoop as provided dependencies:

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0" % "provided"
libraryDependencies += "org.apache.hadoop" % "hadoop-core" % "1.2.1" % "provided"

Tip

Specify the correct Scala version for your dependencies based on the version you are running.

Next steps

To learn more about creating and running Databricks jobs, see Create and run Databricks Jobs.