To get to the jobs page, select the Jobs icon from the menu on the left-hand side.
On the job list page, you can filter jobs by keyword to show the jobs of interest. You can also click any column header to sort the list of jobs by that column, in either ascending or descending order. By default, the page is sorted by job name in ascending order. The arrow in the header indicates the order: an up arrow means ascending and a down arrow means descending.
Creating a Job¶
To create a new job, start by clicking the button at the upper left-hand corner of the page.
Creating a job requires some configuration:
- The notebook or jar you would like to run.
There are some significant differences between running notebook jobs and jar jobs. Please see the Tips for Running Jar Jobs for more information.
- The dependent libraries for this job
- These will be automatically attached to the cluster on launch
- The cluster this job will run on: you can select either a cluster that is already running or a new cluster that launches when the job runs.
There is a distinct trade-off between running on an existing cluster and running on a new cluster. We recommend a fresh cluster for production-level jobs or for jobs that are important to complete. Existing clusters are not recommended for production jobs; they work best for tasks such as updating dashboards at regular intervals.
Running a Job¶
Running a job is simple. Once you have created a job and are on the job detail page, select Run Now and the job will run immediately. Alternatively, you can schedule the job to run at set times.
Click Run Now to do a test run of your notebook or JAR once you have finished configuring your job. If your notebook fails, you can edit it and the job will automatically run the new version of the notebook.
The Databricks job scheduler, like the Spark batch interface, is not intended for low-latency jobs. Due to network or cloud issues, job runs may occasionally be delayed by up to several minutes. In these situations, scheduled jobs will fire as soon as the service becomes available.
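Run Now can also be triggered outside the UI. The following is a minimal sketch, assuming version 2.0 of the Databricks REST API, its jobs/run-now endpoint, and basic authentication with a username and password. The hostname, job id, and credentials are hypothetical placeholders. The function only builds the request; nothing is sent until you pass it to urlopen.

```python
import base64
import json
import urllib.request


def build_run_now_request(host, job_id, username, password):
    """Build (but do not send) a POST request that triggers one run of a job."""
    # Assumed endpoint: /api/2.0/jobs/run-now on your Databricks deployment.
    url = "https://{}/api/2.0/jobs/run-now".format(host)
    body = json.dumps({"job_id": job_id}).encode("utf-8")
    credentials = "{}:{}".format(username, password).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Basic " + base64.b64encode(credentials).decode("ascii"),
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    # Hypothetical deployment, job id, and credentials, for illustration only.
    request = build_run_now_request(
        "example.cloud.databricks.com", 42, "user@example.com", "secret"
    )
    print(request.get_method(), request.full_url)
    # To actually trigger the run: urllib.request.urlopen(request)
```

Because the scheduler is not designed for low latency, a caller of such a request should not assume the run starts the instant the request returns.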
Viewing Old Job Runs¶
From your scheduled job page, you can access the logs from different runs of your job. Select the run from the job detail page and you’ll be able to see the relevant details and job output. Databricks maintains a history of your job runs for up to 60 days. If you need to preserve job runs, we recommend that you export the job run results through the UI before they expire. For more information see Exporting Job Run Results.
From the run page, you can view the standard error, the standard output, and the Spark UI logs for your job.
Exporting Job Run Results¶
You can persist old job runs by exporting their results. For notebook job runs, you can export a rendered notebook that can later be imported back into your Databricks workspace. For more information see Importing Notebooks.
Similarly, you can also manually export the logs for your job run. If you’d like to automate this process, you can set up your job so that it automatically delivers logs to either DBFS or S3 through the jobs API. For more information see the fields NewCluster and ClusterLogConf in the jobs Create API call.
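As a rough illustration of automated log delivery, the sketch below builds a request body for the jobs Create API call in which the new cluster carries a cluster_log_conf pointing at DBFS. It assumes the field names of version 2.0 of the jobs API. The job name, notebook path, Spark version, node type, and log destination are hypothetical values that you would replace with your own.

```python
import json


def build_create_job_payload(name, notebook_path, spark_version, node_type_id,
                             num_workers, log_destination):
    """Return a jobs Create request body that delivers cluster logs to DBFS."""
    return {
        "name": name,
        "new_cluster": {
            "spark_version": spark_version,
            "node_type_id": node_type_id,
            "num_workers": num_workers,
            # ClusterLogConf: driver and executor logs are delivered here.
            "cluster_log_conf": {"dbfs": {"destination": log_destination}},
        },
        "notebook_task": {"notebook_path": notebook_path},
    }


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    payload = build_create_job_payload(
        name="nightly-report",
        notebook_path="/Users/user@example.com/report",
        spark_version="<spark-version>",
        node_type_id="<node-type>",
        num_workers=2,
        log_destination="dbfs:/cluster-logs/nightly-report",
    )
    print(json.dumps(payload, indent=2))
```

Keeping the payload in a small function makes it easy to reuse the same log destination convention across several jobs.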
Tips for Running Jar Jobs¶
Because Databricks is a managed service, some code changes may be necessary to ensure that your Apache Spark jobs run correctly. Jar job programs must use the shared Spark Context API to get the Spark Context (details below). Programs that invoke new SparkContext() will fail inside Databricks because Databricks has already initialized the Spark Context.
Editing a Job¶
You can edit any job that you’ve created by navigating to it from the jobs list page.
Deleting a Job¶
You can delete a job from the job list page by clicking the blue “x” for that job.
Jobs Settings and Advanced Usage¶
The Databricks Spark driver has certain library dependencies that cannot be overridden. These libraries take priority over any of your own libraries that conflict with them.
To get the full list of the driver library dependencies, run the following command inside a notebook attached to a cluster of the same Spark version (or the cluster with the driver you want to examine).
%sh ls /databricks/jars
Tips on Dealing with Library Dependencies¶
A good rule of thumb when creating JARs for jobs is to list Spark and Hadoop as provided dependencies. In Maven, add Spark and/or Hadoop as provided dependencies as shown below.
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.5.0</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>1.2.1</version>
  <scope>provided</scope>
</dependency>
In sbt, add Spark and/or Hadoop as provided dependencies as shown below.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0" % "provided"
libraryDependencies += "org.apache.hadoop" % "hadoop-core" % "1.2.1" % "provided"
Please be sure to specify the correct Scala version for your dependencies based on the version you are running.
There are optional settings that you may specify when you’re running your job. These include:
- Alerts: Set up email alerts for your job to notify users in case of failure, success, or timeout.
- Timeout: Configure the maximum completion time for a job. If the job does not complete in this time, Databricks sets its status to “Timed Out”.
- Retries: Set a policy so that failed runs will be automatically retried.
New in version 2.34.
- Maximum concurrent runs: Configure the maximum number of runs that can execute in parallel. Upon starting a new run, Databricks will skip the run if the job has already reached its maximum number of active runs. Set this value higher than the default of 1 if you want to be able to execute multiple runs of the same job concurrently. This is useful, for example, if you trigger your job on a frequent schedule and want to allow consecutive runs to overlap, or if you want to trigger multiple runs that differ by their input parameters.
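The optional settings above can be sketched as fields of a job definition. This example assumes the field names used by version 2.0 of the jobs API (email_notifications, timeout_seconds, max_retries, max_concurrent_runs). The helper names and the email address are hypothetical. The second function is only a simplified model of the skip rule described above, not Databricks' actual scheduler code.

```python
import json


def build_job_settings(alert_emails, timeout_seconds, max_retries,
                       max_concurrent_runs=1):
    """Return the optional-settings portion of a job definition."""
    if max_concurrent_runs < 1:
        raise ValueError("max_concurrent_runs must be at least 1")
    return {
        # Alerts: notify these addresses when a run succeeds or fails.
        "email_notifications": {
            "on_success": list(alert_emails),
            "on_failure": list(alert_emails),
        },
        # Timeout: maximum completion time for a run, in seconds.
        "timeout_seconds": timeout_seconds,
        # Retries: how many times a failed run is retried.
        "max_retries": max_retries,
        "max_concurrent_runs": max_concurrent_runs,
    }


def would_skip_new_run(active_runs, max_concurrent_runs):
    """Simplified model: a new run is skipped once the limit is reached."""
    return active_runs >= max_concurrent_runs


if __name__ == "__main__":
    settings = build_job_settings(["oncall@example.com"], 3600, 2, 3)
    print(json.dumps(settings, indent=2))
```

With the default of 1, a second trigger that arrives while a run is still active would be skipped under this model; raising the limit to 3 lets up to three runs overlap.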
Apache Airflow (incubating)¶
Apache Airflow (incubating), a project started at Airbnb, is a popular solution for managing and scheduling complex dependencies in your data pipelines. In addition to its DAG scheduling framework, Airflow integrates tightly with Databricks. With these integrations, you can take advantage of the complex scheduling features of Airflow without losing the optimized Spark engine offered by Databricks. This user guide describes the integrations in more detail.
For more general information about Airflow itself, take a look at the Airflow documentation.
The integrations between Airflow and Databricks have been contributed upstream to the open-source Airflow project in the master branch. However, the integrations will not be cut into a release branch until Airflow 1.9.0 is released. Until then, Databricks will maintain a fork of the Airflow project with the Databricks integrations applied.
The naming scheme for the version/tag name is the Airflow version appended with the Databricks version. For example, Airflow 1.8.1 with Databricks integration is under the tag 1.8.1-db1.
To install Airflow locally with Databricks integration, simply run
pip install "git+git://email@example.com#egg=apache-airflow[databricks]"
For other extras (for example password), install them like this
pip install "git+git://firstname.lastname@example.org#egg=apache-airflow[databricks, celery, s3, password]"
There is currently only one Databricks-specific release of Airflow. This release includes the following commits.
Airflow represents data pipelines as directed acyclic graphs (DAGs) of operations, where an edge represents a logical dependency between operations. With the Databricks integration, you can use the DatabricksSubmitRunOperator as a node in your DAG of computations. This operator matches our Runs Submit API endpoint and allows you to programmatically run notebooks and jars uploaded to S3/DBFS. For example usage of this operator, look at the file example_databricks_operator.py on GitHub.
More documentation on this operator can be found here.
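To give a feel for the operator, here is a minimal sketch of a DAG with a single DatabricksSubmitRunOperator task. It assumes the operator is importable from airflow.contrib.operators.databricks_operator (its location in the Databricks fork and in Airflow 1.9) and that it accepts a json argument shaped like a Runs Submit request body. The DAG id, task id, notebook path, and cluster values are hypothetical. Airflow is imported lazily inside build_dag so that the payload helper can be used on its own.

```python
from datetime import datetime


def notebook_run_payload(notebook_path, spark_version, node_type_id, num_workers):
    """Runs Submit request body: a new cluster plus a notebook task."""
    return {
        "new_cluster": {
            "spark_version": spark_version,
            "node_type_id": node_type_id,
            "num_workers": num_workers,
        },
        "notebook_task": {"notebook_path": notebook_path},
    }


def build_dag():
    """Create a daily DAG with one Databricks notebook run (requires Airflow)."""
    # Assumed import paths; they may differ in other Airflow versions.
    from airflow import DAG
    from airflow.contrib.operators.databricks_operator import (
        DatabricksSubmitRunOperator,
    )

    dag = DAG(
        dag_id="example_databricks_notebook",  # hypothetical DAG id
        start_date=datetime(2017, 7, 1),
        schedule_interval="@daily",
    )
    DatabricksSubmitRunOperator(
        task_id="run_notebook",  # hypothetical task id
        json=notebook_run_payload(
            "/Users/user@example.com/report",  # hypothetical notebook path
            "<spark-version>",
            "<node-type>",
            2,
        ),
        dag=dag,
    )
    return dag
```

In an actual DAG file you would assign the result at module level, for example dag = build_dag(), because Airflow discovers DAG objects in a module's global namespace.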
Configuring the Databricks Connection¶
To use the DatabricksSubmitRunOperator, you must provide credentials in the appropriate Airflow Connection.
By default, if you do not specify the databricks_conn_id parameter to the DatabricksSubmitRunOperator, the operator will try to find credentials in the connection with the id equal to
Airflow connections can easily be configured through the Airflow web UI as instructed here.
For the Databricks connection, set the Host field to the hostname of your Databricks deployment, the Login field to your Databricks username, and the Password field to your Databricks password.