Jobs

Jobs are a way of running a notebook or JAR either immediately or on a scheduled basis. Job outcomes are visible in the UI, by querying the Jobs API, and through email alerts.

View jobs

To get to the Jobs page, click the Jobs icon in the sidebar.

In the Jobs list, you can filter jobs:

  • Using keywords.
  • Selecting only jobs you own or jobs you have access to.

You can also click any column header to sort the list of jobs (ascending or descending) by that column. By default, the page is sorted by job name in ascending order.

Job List

Create a job

To create a new job, click Create Job in the upper left-hand corner. There is a limit of 1000 jobs created through the UI or the Create API endpoint.

Job Conf

Creating a job requires some configuration:

  • The notebook or JAR to run.

    Note

    There are some significant differences between running notebook and JAR jobs. See Tips for running JAR jobs for more information.

  • The dependent libraries for the job. These are automatically attached to the cluster on launch.

  • The cluster the job will run on: either a cluster that is currently running or a new cluster that will launch when the job runs.

    Note

    There is a tradeoff between running on an existing cluster and a new cluster. We recommend running on a new cluster for production-level jobs or jobs that must complete reliably. Existing clusters work best for tasks such as updating dashboards at regular intervals.

  • Optional spark-submit parameters. Click Configure spark-submit to open the Set Parameters dialog, where you can enter spark-submit parameters as a JSON array.

    Set Parameters
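    For example, the parameter array for a spark-submit job might look like the following; the class name and DBFS path are purely illustrative:

    ["--class", "org.apache.spark.examples.SparkPi", "dbfs:/FileStore/jars/sparkpi.jar", "10"]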

Run a job

Once you’ve created a job, click Run Now on the job detail page to execute it immediately. Alternatively, you can schedule the job to run at specified times.

Tip

Click Run Now to do a test run of your notebook or JAR when you’ve finished configuring your job. If your notebook fails, you can edit it and the job will automatically run the new version of the notebook.

Note

The Databricks job scheduler, like the Spark batch interface, is not intended for low latency jobs. Due to network or cloud issues, job runs may occasionally be delayed up to several minutes. In these situations, scheduled jobs will fire immediately upon service availability.

Run a notebook job with different parameters

Instead of Run Now, you can also click Run Now with Different Parameters to trigger the notebook job with a set of parameters different from the Job parameters.

Job Conf Notebook Params

Run Now With Different Params

Note

The provided parameters are merged with the default parameters for the triggered run. If you delete keys, the default values from base_parameters are used.
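
For reference, you can trigger the same behavior through the Jobs API. The following is a minimal sketch, assuming a hypothetical job ID, deployment URL, and credentials; the notebook_params passed to the run-now call override the matching keys in the job's base_parameters.

import requests

# Sketch only: replace the deployment URL, credentials, and job_id with your own values.
resp = requests.post(
    "https://<your-deployment>.cloud.databricks.com/api/2.0/jobs/run-now",
    auth=("user@example.com", "password"),
    json={
        "job_id": 42,
        # Keys provided here override the job's base_parameters;
        # keys you omit fall back to the defaults defined on the job.
        "notebook_params": {"date": "2017-01-01"},
    },
)
resp.raise_for_status()
print(resp.json())  # contains the run_id of the triggered run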

View old job runs

From your job's page, you can access the logs from previous runs of the job. Select a run from the job detail page to see its details and output. Databricks maintains a history of your job runs for up to 60 days. If you need to preserve job runs, we recommend that you export job run results through the UI before they expire. For more information, see Export job run results.

Job Runs List

You can then view the standard error, standard output, and Spark UI logs for your job run.

Job Run Log

Export job run results

You can persist old job runs by exporting their results. For notebook job runs, you can export a rendered notebook that can later be imported back into your Databricks workspace. For more information, see Importing Notebooks.

Export Notebook Run

You can also manually export the logs for your job run. If you'd like to automate this process, you can set up your job to deliver logs to S3 or DBFS automatically through the Jobs API. For more information, see the NewCluster and ClusterLogConf fields in the Jobs Create API call.
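
As a rough sketch (the job name, notebook path, and cluster settings below are illustrative), a Create call that delivers cluster logs to DBFS might look like this:

import requests

create_payload = {
    "name": "nightly-model-training",  # hypothetical job name
    "new_cluster": {
        "spark_version": "2.1.0-db3-scala2.11",  # illustrative cluster settings
        "node_type_id": "r3.xlarge",
        "num_workers": 4,
        # ClusterLogConf: driver and executor logs are delivered to this location.
        "cluster_log_conf": {"dbfs": {"destination": "dbfs:/cluster-logs/nightly-model-training"}},
    },
    "notebook_task": {"notebook_path": "/Users/someone@example.com/TrainModel"},
}

resp = requests.post(
    "https://<your-deployment>.cloud.databricks.com/api/2.0/jobs/create",
    auth=("user@example.com", "password"),
    json=create_payload,
)
resp.raise_for_status()
print(resp.json())  # contains the job_id of the new job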

Tips for running JAR jobs

Because Databricks is a managed service, some code changes may be necessary to ensure that your Apache Spark jobs run correctly. JAR job programs must use the shared SparkContext API to get the SparkContext. Because Databricks initializes the SparkContext, programs that invoke new SparkContext() will fail.

The shared SparkContext API

To get the SparkContext, use only the shared SparkContext created by Databricks:

import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

val goodSparkContext = SparkContext.getOrCreate()
val goodSparkSession = SparkSession.builder().getOrCreate()

Warning

There are several methods you must avoid when using the shared SparkContext.

  • Do not manually create a SparkContext using the constructor:

    import org.apache.spark.{SparkConf, SparkContext}
    val badSparkContext = new SparkContext(new SparkConf().setAppName("My Spark Job").setMaster("local"))
    
  • Do not stop SparkContext inside your JAR:

    val dontStopTheSparkContext = SparkContext.getOrCreate()
    dontStopTheSparkContext.stop()
    
  • Do not call System.exit(0) or sc.stop() at the end of your main program. This can cause undefined behavior.

Parameterizing JAR jobs

JAR jobs are parameterized with an array of strings. In the UI, you enter parameters in the Arguments text box, and they are split into an array by applying POSIX shell parsing rules; for more information, see the shlex documentation. In the API, parameters are passed as a standard JSON array; for more information, see SparkJarTask.

To access these parameters, inspect the String array passed into your main function.
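
Because the Arguments text box is split with POSIX shell parsing rules, you can preview how a given string will be split by using Python's shlex module, which implements the same rules. The argument string below is purely hypothetical:

import shlex

# What you might type into the Arguments text box (hypothetical example).
arguments_box = '--input "s3a://my-bucket/path with spaces" --mode full'

print(shlex.split(arguments_box))
# ['--input', 's3a://my-bucket/path with spaces', '--mode', 'full']
# These strings are the array your JAR receives in its main(String[] args).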

Edit a job

You edit a job by navigating to it from the Jobs list page.

Delete a job

You delete jobs in the Jobs list page by clicking the blue x in the job row.

Jobs settings and advanced usage

Library dependencies

The Databricks Spark driver has certain library dependencies that cannot be overridden. These libraries take priority over any of your own libraries that conflict with them.

To get the full list of the driver library dependencies, run the following command inside a notebook attached to a cluster of the same Spark version (or the cluster with the driver you want to examine).
%sh ls /databricks/jars

Manage library dependencies

A good rule of thumb when dealing with library dependencies while creating JARs for jobs is to list Spark and Hadoop as provided dependencies. In Maven, add Spark and/or Hadoop as provided dependencies as shown below.

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.5.0</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>1.2.1</version>
  <scope>provided</scope>
</dependency>

In sbt, add Spark and/or Hadoop as provided dependencies as shown below.

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0" % "provided"
libraryDependencies += "org.apache.hadoop" %% "hadoop-core" % "1.2.1" % "provided"

Tip

Specify the correct Scala version for your dependencies based on the version you are running.

Job access control

Note

Job access control is available only in the Databricks Operational Security Package.

Job access controls enable job owners and administrators to grant fine-grained permissions on their jobs. With job access controls, job owners can choose which other users or groups can view the results of the job. Owners can also choose who can manage runs of their job (that is, invoke Run Now and cancel runs).

There are five permission levels for jobs: No Permissions, Can View, Can Manage Run, Is Owner, and Can Manage. The Can Manage permission is reserved for administrators.

| Ability                                   | No Permissions | Can View | Can Manage Run | Is Owner | Can Manage (admin) |
|-------------------------------------------|----------------|----------|----------------|----------|--------------------|
| View job details and settings             | x              | x        | x              | x        | x                  |
| View results, Spark UI, logs of a job run |                | x        | x              | x        | x                  |
| Run now                                   |                |          | x              | x        | x                  |
| Cancel run                                |                |          | x              | x        | x                  |
| Edit job settings                         |                |          |                | x        | x                  |
| Modify permissions                        |                |          |                | x        | x                  |
| Delete job                                |                |          |                | x        | x                  |
| Change owner                              |                |          |                |          | x                  |

See Jobs Access Control for more details.

Advanced options

There are optional settings that you can specify when running a job; a sketch of the corresponding Jobs API fields follows the list. These include:

  • Alerts: Set up email alerts for your job to notify users in case of failure, success, or timeout.

  • Timeout: Configure the maximum completion time for a job. If the job does not complete in this time, Databricks sets its status to “Timed Out”.

  • Retries: Set a policy so that failed runs are automatically retried.

    ../_images/retry-policy.png
  • Maximum concurrent runs: Configure the maximum number of runs that can execute in parallel. When a new run starts, Databricks skips it if the job has already reached its maximum number of active runs. Set this value higher than the default of 1 if you want to execute multiple runs of the same job concurrently. This is useful, for example, if you trigger your job on a frequent schedule and want consecutive runs to overlap, or if you want to trigger multiple runs that differ by their input parameters.
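
As a rough sketch, these advanced options map onto fields of the Jobs API Create request; the job name, cluster ID, notebook path, and email addresses below are illustrative:

# Sketch of a Create request body exercising the advanced options (all values are illustrative).
job_settings = {
    "name": "hourly-dashboard-refresh",
    "existing_cluster_id": "1201-abcdef-example",  # hypothetical cluster ID
    "notebook_task": {"notebook_path": "/Users/someone@example.com/Dashboard"},
    "email_notifications": {  # Alerts
        "on_success": ["team@example.com"],
        "on_failure": ["oncall@example.com"],
    },
    "timeout_seconds": 3600,  # Timeout: mark the run as Timed Out after one hour
    "max_retries": 3,  # Retries
    "min_retry_interval_millis": 60000,
    "retry_on_timeout": False,
    "max_concurrent_runs": 2,  # Maximum concurrent runs
}
# POST this dictionary to /api/2.0/jobs/create, as in the log delivery example above.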

Job alerts

You can set email alerts for job runs. On the Jobs page, click the arrow next to Advanced and click Edit next to Alerts. You can send alerts upon job start, job success, and job failure (including skipped jobs), providing multiple comma-separated email addresses for each alert type. You can also opt out of alerts for skipped job runs.

../_images/job-alerts.png

You can integrate these email alerts with your favorite notification tools.

Apache Airflow (incubating)

Apache Airflow (incubating), a project started at Airbnb, is a popular solution for managing and scheduling complex dependencies in your data pipelines. In addition to its DAG scheduling framework, Airflow provides a tight integration with Databricks. With this integration, you can take advantage of Airflow's complex scheduling features without losing the optimized Spark engine offered by Databricks. This guide describes the integration in more detail.

For more general information about Airflow itself, take a look at the Apache Airflow (incubating) documentation.

The integration between Airflow and Databricks is available in Airflow version 1.9.0. To install Airflow locally with the Databricks integration, run:

pip install "apache-airflow[databricks]"

To install extras (for example celery, s3, and password), run:

pip install "apache-airflow[databricks, celery, s3, password]"

Airflow represents data pipelines as directed acyclic graphs (DAGs) of operations, where an edge represents a logical dependency between operations. With the Databricks integration, you can use the DatabricksSubmitRunOperator as a node in your DAG of computations. This operator matches the Runs Submit API endpoint and allows you to programmatically run notebooks and JARs uploaded to S3 or DBFS. For example usage of this operator, see the file example_databricks_operator.py.

For more documentation on this operator, see the API documentation.

To use the DatabricksSubmitRunOperator, you must provide credentials in the appropriate Airflow connection. By default, if you do not specify the databricks_conn_id parameter to the DatabricksSubmitRunOperator, the operator looks for credentials in the connection with the ID databricks_default.

You can configure Airflow connections through the Airflow web UI as instructed in Connections. For the Databricks connection, set the Host field to the hostname of your Databricks deployment, the Login field to your Databricks username, and the Password field to your Databricks password.

../_images/airflow-connection-configuration.png
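
Putting the pieces together, a minimal DAG using the operator might look like the sketch below. It assumes the databricks_default connection is configured as described above; the notebook path and cluster settings are illustrative.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.databricks_operator import DatabricksSubmitRunOperator

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2017, 1, 1),
    "retries": 1,
    "retry_delay": timedelta(minutes=2),
}

dag = DAG("databricks_notebook_example", default_args=default_args, schedule_interval="@daily")

# The operator takes the same shape of payload as the Runs Submit API endpoint:
# a cluster specification plus a notebook (or JAR) task.
notebook_run = DatabricksSubmitRunOperator(
    task_id="run_notebook",
    databricks_conn_id="databricks_default",  # credentials are read from this Airflow connection
    new_cluster={
        "spark_version": "2.1.0-db3-scala2.11",  # illustrative cluster settings
        "node_type_id": "r3.xlarge",
        "num_workers": 2,
    },
    notebook_task={"notebook_path": "/Users/someone@example.com/MyNotebook"},  # hypothetical path
    dag=dag,
)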