To get to the jobs page, click the Jobs icon in the sidebar.
In the Jobs list, you can filter jobs:
- Using keywords.
- Selecting only jobs you own or jobs you have access to.
You can also click any column header to sort the list of jobs (ascending or descending) by that column. By default, the page is sorted by job name in ascending order.
To create a new job, click Create Job in the upper-left corner. A workspace is limited to 1000 jobs created through the UI or through the Create API endpoint.
Creating a job requires some configuration:
- The notebook or JAR to run. There are significant differences between running notebook and JAR jobs; see Tips for running JAR jobs for more information.
- The dependent libraries for the job. These are automatically attached to the cluster on launch.
- The cluster the job will run on. You can select either a cluster that is already running or a new cluster that launches when the job runs. There is a tradeoff between the two: we recommend a new cluster for production jobs or jobs that must complete, while existing clusters work well for tasks such as updating dashboards at regular intervals.
- Optional spark-submit parameters. Click Configure spark-submit to open the Set Parameters dialog, where you can enter spark-submit parameters as a JSON array.
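For example, a spark-submit parameter array might look like the following sketch (the class name, configuration value, and JAR path are hypothetical):

```json
["--class", "org.example.MyJob", "--conf", "spark.executor.memory=4g", "dbfs:/jars/my-job.jar"]
```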
Once you’ve created a job, click Run Now on the job detail page to execute it immediately, or schedule the job to run at regular intervals. Run Now is also a convenient way to do a test run after you finish configuring the job: if your notebook fails, you can edit it, and the job automatically runs the new version of the notebook.
The Databricks job scheduler, like the Spark batch interface, is not intended for low latency jobs. Due to network or cloud issues, job runs may occasionally be delayed up to several minutes. In these situations, scheduled jobs will fire immediately upon service availability.
Instead of Run Now, you can click Run Now with Different Parameters to trigger the notebook job with a set of parameters different from the job parameters. The provided parameters are merged with the default parameters for the triggered run; if you delete keys, the default value in base_parameters is used.
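This merge behavior can be illustrated with a short sketch (the parameter names and values are hypothetical, not Databricks code):

```python
# Defaults defined on the job (base_parameters).
base_parameters = {"date": "2020-01-01", "env": "staging"}

# Parameters supplied via Run Now with Different Parameters.
run_parameters = {"date": "2020-06-15"}

# Provided keys override the defaults; omitted keys keep their default value.
effective = {**base_parameters, **run_parameters}
print(effective)  # {'date': '2020-06-15', 'env': 'staging'}
```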
From your scheduled job page, you can access the logs from different runs of your job. Select a run from the job detail page to see its details and output. Databricks maintains a history of your job runs for up to 60 days. If you need to preserve job runs, we recommend exporting job run results through the UI before they expire. For more information, see Export job run results.
You can then view the standard error, standard output, and Spark UI logs for your job.
Export job run results¶
It is possible to persist old job runs by exporting their results. For notebook job runs, you can export a rendered notebook that can later be imported back into your Databricks workspace. For more information, see Importing Notebooks.
Similarly, you can manually export the logs for your job run. To automate this process, you can configure your job so that it automatically delivers logs to S3 or DBFS through the Jobs API. For more information, see the NewCluster and ClusterLogConf fields in the Jobs API Create call.
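As a sketch, the new_cluster section of a Create request can include a cluster_log_conf pointing at a DBFS destination (the job name, node type, and paths below are illustrative, not required values):

```json
{
  "name": "nightly-etl",
  "new_cluster": {
    "spark_version": "5.3.x-scala2.11",
    "node_type_id": "i3.xlarge",
    "num_workers": 2,
    "cluster_log_conf": {
      "dbfs": { "destination": "dbfs:/cluster-logs/nightly-etl" }
    }
  }
}
```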
Because Databricks is a managed service, some code changes may be necessary to ensure that your Apache Spark jobs run correctly. JAR job programs must use the shared SparkContext API to get the SparkContext. Because Databricks initializes the SparkContext, programs that invoke new SparkContext() will fail.
Parameterizing JAR jobs¶
JAR jobs are parameterized with an array of strings. In the UI, parameters are entered in the Arguments text box and split into an array by applying POSIX shell parsing rules; for more information, see the shlex documentation. In the API, parameters are passed as a standard JSON array; for more information, see SparkJarTask.
To access these parameters, inspect the String array passed into your main function.
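Put together, a minimal JAR job entry point might look like the following sketch (the object name, argument handling, and input path are hypothetical). Note that it obtains the shared SparkContext with SparkContext.getOrCreate rather than constructing one:

```scala
import org.apache.spark.SparkContext

object ExampleJob {
  def main(args: Array[String]): Unit = {
    // Parameters from the Arguments text box (or the API) arrive here.
    val inputPath = if (args.nonEmpty) args(0) else "dbfs:/tmp/input"

    // Use the SparkContext Databricks has already initialized;
    // calling new SparkContext() in a JAR job fails.
    val sc = SparkContext.getOrCreate()
    println(s"Lines in $inputPath: " + sc.textFile(inputPath).count())
  }
}
```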
The Databricks Spark driver has certain library dependencies that cannot be overridden. These libraries take priority over any of your own libraries that conflict with them.
To get the full list of the driver library dependencies, run the following command inside a notebook attached to a cluster of the same Spark version (or the cluster with the driver you want to examine).
%sh ls /databricks/jars
Manage library dependencies¶
A good rule of thumb when dealing with library dependencies while creating JARs for jobs is to list Spark and Hadoop as
provided dependencies. On Maven, add Spark and/or Hadoop as provided dependencies as shown below.
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.5.0</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>1.2.1</version>
  <scope>provided</scope>
</dependency>
In sbt, add Spark and/or Hadoop as provided dependencies as shown below.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0" % "provided"
libraryDependencies += "org.apache.hadoop" % "hadoop-core" % "1.2.1" % "provided"
Specify the correct Scala version for your dependencies based on the version you are running.
Job access control is available only in the Databricks Operational Security Package.
Job access controls enable job owners and administrators to grant fine-grained permissions on their jobs. With job access controls, job owners can choose which other users or groups can view the results of the job. Owners can also choose who can manage runs of their job (that is, invoke Run Now and Cancel).
There are 5 different permission levels for jobs: No Permissions, Can View, Can Manage Run, Is Owner, and Can Manage. Note that the Can Manage permission is reserved for administrators.
| Ability | No Permissions | Can View | Can Manage Run | Is Owner | Can Manage (admin) |
|---|---|---|---|---|---|
| View job details and settings | x | x | x | x | x |
| View results, Spark UI, logs of a job run | | x | x | x | x |
| Edit job settings | | | | x | x |
See Jobs Access Control for more details.
There are optional settings that you can specify when running a job. These include:
Alerts: Set up email alerts for your job to notify users in case of failure, success, or timeout.
Timeout: Configure the maximum completion time for a job. If the job does not complete in this time, Databricks sets its status to “Timed Out”.
Retries: Set a policy so that failed runs are automatically retried.
Maximum concurrent runs: Configure the maximum number of runs which you can execute in parallel. Upon starting a new run, Databricks skips the run if the job has already reached its maximum number of active runs. Set this value higher than the default of 1 if you want to be able to execute multiple runs of the same job concurrently. This is useful for example if you trigger your job on a frequent schedule and want to allow consecutive runs to overlap with each other, or if you want to trigger multiple runs which differ by their input parameters.
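In the Jobs API, these optional settings correspond to fields in the job settings; a sketch follows (the values and email addresses are illustrative):

```json
{
  "timeout_seconds": 3600,
  "max_retries": 3,
  "max_concurrent_runs": 2,
  "email_notifications": {
    "on_success": ["team@example.com"],
    "on_failure": ["oncall@example.com"]
  }
}
```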
You can set email alerts for job runs. On the Jobs page, click the arrow next to Advanced and click Edit next to Alerts. You can send alerts upon job start, job success, and job failure (including skipped jobs), providing multiple comma-separated email addresses for each alert type. You can also opt out of alerts for skipped job runs.
You can integrate these email alerts with your favorite notification tools.
Apache Airflow (incubating), a project started at Airbnb, is a popular solution for managing and scheduling complex dependencies in your data pipelines. In addition to its DAG scheduling framework, Airflow provides tight integration with Databricks. With this integration, you can take advantage of the complex scheduling features of Airflow without losing the optimized Spark engine offered by Databricks. This user guide describes the integration in more detail.
For more general information about Airflow itself, take a look at the Apache Airflow (incubating) documentation.
The integration between Airflow and Databricks is available in Airflow version 1.9.0. To install Airflow locally with the Databricks integration, run:
pip install "apache-airflow[databricks]"
To install extras (for example, celery, s3, and password), run:
pip install "apache-airflow[databricks, celery, s3, password]"
Airflow represents data pipelines as directed acyclic graphs (DAGs) of operations where an edge represents a logical dependency between operations. With the Databricks integration, you can use the DatabricksSubmitRunOperator as a node in your DAG of computations. This operator matches our Runs Submit API endpoint and allows you to programmatically run notebooks and JARs uploaded to S3 or DBFS. For example usage of this operator, look at the file example_databricks_operator.py.
For more documentation on this operator, see the API documentation.
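A minimal DAG using this operator might look like the following sketch (the DAG ID, schedule, notebook path, and cluster settings are hypothetical; see example_databricks_operator.py for a complete example):

```python
# Sketch of a DAG that runs a Databricks notebook once a day.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.databricks_operator import DatabricksSubmitRunOperator

dag = DAG(
    dag_id="example_databricks_dag",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)

# The json argument mirrors the body of the Runs Submit API endpoint.
DatabricksSubmitRunOperator(
    task_id="run_notebook",
    dag=dag,
    json={
        "new_cluster": {
            "spark_version": "4.0.x-scala2.11",
            "node_type_id": "i3.xlarge",
            "num_workers": 2,
        },
        "notebook_task": {"notebook_path": "/Users/someone@example.com/my-notebook"},
    },
)
```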
To use the DatabricksSubmitRunOperator, you must provide credentials in the appropriate Airflow connection. By default, if you do not specify the databricks_conn_id parameter to the DatabricksSubmitRunOperator, the operator tries to find credentials in the connection with the ID databricks_default.
You can configure Airflow connections through the Airflow web UI as instructed in Connections. For the Databricks connection, set the Host field to the hostname of your Databricks deployment, the Login field to your Databricks username, and the Password field to your Databricks password.