Jobs

Jobs are a way of running a notebook or JAR either immediately or on a recurring schedule. Job outcomes are visible in the UI, from the REST API, and through email alerts.

Viewing Jobs

To get to the jobs page, select the Jobs icon from the menu on the left-hand side.

On the job list page, you can filter jobs by keyword to show only jobs of interest. You can also click any column header to sort the list of jobs (ascending or descending) by that column. By default, the page is sorted by job name in ascending order. The sort order is indicated by the arrow in the column header: an up arrow means ascending and a down arrow means descending.

Job List

Creating a Job

To create a new job, start by clicking Create Job in the upper left-hand corner.

Job Setup

Creating a job requires some configuration:

  • The notebook or JAR you would like to run.

Note

There are some significant differences between running notebook and JAR jobs. See Tips for Running Jar Jobs for more information.

  • The dependent Libraries for this job
    • These will be automatically attached to the cluster on launch
  • The cluster this job will run on: you can select either a cluster that is already running or a new cluster that will launch when the job runs.

Note

There is a distinct trade-off between running on an existing cluster and running on a new cluster. We recommend running production-level jobs, or jobs that are important to complete, on a fresh cluster. Using existing clusters is not recommended for production jobs; it works best for tasks like updating dashboards at regular intervals. The same choice is also exposed through the Jobs API, as sketched below.
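For reference, the Jobs Create API expresses this choice with two alternative fields: a job either references a running cluster by its ID or describes a new cluster to launch for each run. The snippet below is a minimal, illustrative sketch; the cluster ID, Spark version, and node type values are placeholders.

# Two ways to specify the cluster for a job in the Jobs Create API.
# All values below are placeholders.

# Option 1: run on a cluster that is already running.
run_on_existing_cluster = {"existing_cluster_id": "1017-abcdef-example"}

# Option 2: launch a fresh cluster for each run (recommended for production jobs).
run_on_new_cluster = {
    "new_cluster": {
        "spark_version": "1.6.x",      # placeholder; use a version key available in your deployment
        "node_type_id": "r3.xlarge",   # placeholder node type
        "num_workers": 4,
    }
}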

Running a Job

Running a job is simple. Once you’ve created a job and are on the job detail page, select Run Now and the job will execute immediately. Alternatively, you can configure the job to run on a schedule.

Tip

Once you’ve finished configuring your job, click Run Now to do a test run of your notebook or JAR. If your notebook fails, you can edit it and the job will automatically run the new version of the notebook.

Note

The Databricks job scheduler, like the Spark batch interface, is not intended for low latency jobs. Due to network or cloud issues, job runs may occasionally be delayed up to several minutes. In these situations, scheduled jobs will fire immediately upon service availability.
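Schedules can also be attached to a job through the REST API. The sketch below is illustrative only: the job name, cron expression, and timezone are placeholders, and the field names should be confirmed against the Jobs API reference for your deployment.

# Illustrative sketch: attaching a schedule to a job's settings via the Jobs API.
# The cron expression and timezone below are placeholders.
scheduled_job_settings = {
    "name": "nightly-dashboard-refresh",
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",   # every day at 2:00 AM
        "timezone_id": "America/Los_Angeles",
    },
}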

Viewing Old Job Runs

From your scheduled job page, you can access the logs from different runs of your job. Select a run from the job detail page and you’ll be able to see the relevant details and job output. Databricks maintains a history of your job runs for up to 60 days. If you need to preserve job runs, we recommend that you save job run results through the UI before they expire. For more information see Exporting Job Run Results.

Job Runs List

From there you can view the standard error, standard output, and Spark UI logs for your job.

Job Run Log

Exporting Job Run Results

It is possible to persist old job runs by exporting their results. For notebook job runs, you can export a rendered notebook, which can later be imported back into your Databricks workspace. For more information see Importing Notebooks.

Export Notebook Run

Similarly, you can also manually export the logs for your job run. If you’d like to automate this process, you can set up your job so that it automatically delivers logs to either DBFS or S3 through the Jobs API. For more information see the fields NewCluster and ClusterLogConf in the Jobs Create API call.
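As a rough sketch of what automated log delivery can look like, the example below creates a job whose new cluster writes logs to a DBFS location via cluster_log_conf. The hostname, credentials, notebook path, and cluster settings are placeholders; consult the Jobs API documentation for the full set of fields in NewCluster and ClusterLogConf.

import requests

# Illustrative payload: a job whose new cluster delivers its logs to DBFS.
# All values below are placeholders.
payload = {
    "name": "nightly-etl-with-log-delivery",
    "new_cluster": {
        "spark_version": "1.6.x",        # placeholder; use a version key available in your deployment
        "node_type_id": "r3.xlarge",     # placeholder node type
        "num_workers": 4,
        "cluster_log_conf": {
            "dbfs": {"destination": "dbfs:/cluster-logs/nightly-etl"}
        },
    },
    "notebook_task": {"notebook_path": "/Users/someone@example.com/NightlyETL"},
}

response = requests.post(
    "https://<your-databricks-host>/api/2.0/jobs/create",   # placeholder hostname
    auth=("<username>", "<password>"),                       # placeholder credentials
    json=payload,
)
response.raise_for_status()
print(response.json())   # for example: {"job_id": 123}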

Tips for Running Jar Jobs

Because Databricks is a managed service, some code changes may be necessary to ensure that your Apache Spark jobs run correctly. JAR job programs must use the shared Spark Context API to get the Spark Context (details below). Programs that invoke new SparkContext() will fail inside Databricks because Databricks has already initialized the Spark Context.

Use the Shared Spark Context API

To get the Spark Context, use only the shared Spark Context created by Databricks. You can access it via the API below.

import org.apache.spark.SparkContext
val goodSparkContext = SparkContext.getOrCreate()

Warning

You should not manually create a Spark Context using the constructor like this

import org.apache.spark.{SparkConf, SparkContext}
val badSparkContext = new SparkContext(new SparkConf().setAppName("My Spark Job").setMaster("local"))

In addition, you should not stop the Spark Context inside your JAR.

val dontStopTheSparkContext = SparkContext.getOrCreate()
dontStopTheSparkContext.stop()

Do NOT call System.exit(0) or sc.stop() at the end of your main program. This can cause undefined behavior.

Editing a Job

You can edit any job that you’ve created by navigating to it from the jobs list page.

Deleting a Job

You can delete a job from the job list page by clicking the blue “x” for that job.

Jobs Settings and Advanced Usage

Library Dependencies

The Databricks Spark driver has certain library dependencies that cannot be overridden. These libraries take priority over any of your own libraries that conflict with them.

To get the full list of driver library dependencies, run the following command inside a notebook attached to a cluster of the same Spark version (or the cluster with the driver you want to examine).
%sh ls /databricks/jars

Tips on Dealing with Library Dependencies

A good rule of thumb when dealing with library dependencies while creating JARs for jobs is to list Spark and Hadoop as provided dependencies. In Maven, add Spark and/or Hadoop as provided dependencies as shown below.

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.5.0</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>1.2.1</version>
  <scope>provided</scope>
</dependency>

In sbt, add Spark and/or Hadoop as provided dependencies as shown below.

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0" % "provided"
libraryDependencies += "org.apache.spark" %% "hadoop-core" % "1.2.1" % "provided"

Tip

Be sure to specify the correct Scala version for your dependencies based on the version you are running.

Advanced Options

There are optional settings that you can specify when running your job. These include the following (a sketch of the corresponding Jobs API fields follows the list):

  • Alerts: Set up email alerts for your job to notify users in case of failure, success, or timeout.
  • Timeout: Configure the maximum completion time for a job. If the job does not complete in this time, Databricks sets its status to “Timed Out”.
  • Retries: Set a policy so that failed runs will be automatically retried.
Retry Policy

New in version 2.34.

  • Maximum concurrent runs: Configure the maximum number of runs that can execute in parallel. When starting a new run, Databricks skips the run if the job has already reached its maximum number of active runs. Set this value higher than the default of 1 if you want to execute multiple runs of the same job concurrently. This is useful, for example, if you trigger your job on a frequent schedule and want consecutive runs to be able to overlap, or if you want to trigger multiple runs that differ in their input parameters.
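The sketch below shows how these options roughly map onto Jobs API job settings. The values are placeholders, and the exact field names should be confirmed against the Jobs API reference for your deployment.

# Illustrative sketch: advanced options expressed as Jobs API job settings.
# All values below are placeholders.
advanced_settings = {
    "email_notifications": {
        "on_success": ["team@example.com"],
        "on_failure": ["oncall@example.com"],
    },
    "timeout_seconds": 3600,              # mark the run "Timed Out" after one hour
    "max_retries": 2,                     # retry failed runs up to two times
    "min_retry_interval_millis": 60000,   # wait at least one minute between retries
    "max_concurrent_runs": 3,             # allow up to three overlapping runs
}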

Third-Party Integrations

Apache Airflow (incubating)

Apache Airflow (incubating), a project started at Airbnb, is a popular solution for managing and scheduling complex dependencies in your data pipelines. In addition to its DAG scheduling framework, Airflow provides tight integration with Databricks. With this integration, you can take advantage of the complex scheduling features of Airflow without losing the optimized Spark engine offered by Databricks. This user guide describes the integration in more detail.

For more general information about Airflow itself, take a look at their docs.

Installation

The integrations between Airflow and Databricks have been contributed upstream to the open-source Airflow project in the master branch. However, the integrations will not be cut into a release branch until Airflow 1.9.0 is released. Until then, Databricks will maintain a fork of the Airflow project with the Databricks integrations applied.

The naming scheme for the version/tag name is the Airflow version appended with the Databricks version. For example, Airflow 1.8.1 with Databricks integration is under the tag 1.8.1-db1.

To install Airflow locally with Databricks integration, simply run

pip install "git+git://github.com/databricks/incubator-airflow.git@1.8.1-db1#egg=apache-airflow[databricks]"

For other extras (for example celery, s3, and password), install them like this

pip install "git+git://github.com/databricks/incubator-airflow.git@1.8.1-db1#egg=apache-airflow[databricks, celery, s3, password]"

Releases

There is currently only one Databricks-specific release of Airflow. This release includes the following commits.

DatabricksSubmitRunOperator

Airflow represents data pipelines as directed acyclic graphs (DAGs) of operations, where an edge represents a logical dependency between operations. With the Databricks integration, you can use the DatabricksSubmitRunOperator as a node in your DAG of computations. This operator matches our Runs Submit API endpoint and allows you to programmatically run notebooks and JARs uploaded to S3/DBFS. For example usage of this operator, look at the file example_databricks_operator.py on GitHub.

More documentation on this operator can be found here.
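As a hedged illustration of what such a DAG might look like, the sketch below runs a notebook once a day on a new cluster. The notebook path and cluster settings are placeholders, and the import path assumes the 1.8.1-db1 fork described above; see example_databricks_operator.py for the canonical example.

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.databricks_operator import DatabricksSubmitRunOperator

# A minimal DAG with a single task that submits a notebook run to Databricks.
dag = DAG(
    dag_id='example_databricks_notebook',
    start_date=datetime(2017, 1, 1),
    schedule_interval='@daily',
)

notebook_run = DatabricksSubmitRunOperator(
    task_id='run_notebook',
    dag=dag,
    # The json argument mirrors the Runs Submit API payload; any field accepted
    # by that endpoint can be passed through here.
    json={
        'new_cluster': {
            'spark_version': '1.6.x',      # placeholder Spark version key
            'node_type_id': 'r3.xlarge',   # placeholder node type
            'num_workers': 2,
        },
        'notebook_task': {
            'notebook_path': '/Users/someone@example.com/MyNotebook',
        },
    },
)

Because the operator defaults to a databricks_conn_id of databricks_default, no credentials appear in the DAG itself; they are read from the Airflow connection described in the next section.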

Configuring the Databricks Connection

To use the DatabricksSubmitRunOperator, you must provide credentials in the appropriate Airflow connection. By default, if you do not specify the databricks_conn_id parameter to the DatabricksSubmitRunOperator, the operator looks for credentials in the connection with the ID databricks_default. Airflow connections can easily be configured through the Airflow web UI as instructed here. For the Databricks connection, set the Host field to the hostname of your Databricks deployment, the Login field to your Databricks username, and the Password field to your Databricks password. A programmatic alternative to the web UI is sketched below the screenshot.

Airflow Connection Configuration
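If you prefer to create the connection programmatically rather than through the web UI, one possible approach is to insert a Connection row into the Airflow metadata database. This is a hedged sketch, not part of the official Databricks integration: it assumes the metadata database has already been initialized (for example with airflow initdb), and the host, login, and password values are placeholders.

from airflow import settings
from airflow.models import Connection

# Add a connection with the default ID used by DatabricksSubmitRunOperator.
# All values below are placeholders.
databricks_conn = Connection(
    conn_id='databricks_default',
    host='<your-databricks-host>',       # hostname of your Databricks deployment
    login='<databricks-username>',       # your Databricks username
    password='<databricks-password>',    # your Databricks password
)

session = settings.Session()
session.add(databricks_conn)
session.commit()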