To get to the jobs page, select the Jobs icon from the menu on the left hand side.
Creating a Job¶
To create a new job, start by clicking the button in the upper left-hand corner.
Creating a job requires some configuration:
- The notebook or jar you would like to run.
There are some significant differences between running notebook jobs and JAR jobs. See Tips for Running Jar Jobs for more information.
- The dependent libraries for this job. These are automatically attached to the cluster on launch.
- The cluster this job will run on: you can select either a cluster that is already running or a new cluster that launches when the job runs.
There is a distinct trade-off between running on an existing cluster and running on a new cluster. We recommend a new cluster for production-level jobs or jobs that are important to complete. Existing clusters are not recommended for production jobs and work best for tasks such as updating dashboards at regular intervals.
Running a Job¶
Running a job is simple. Once you’ve created a job and are on the job detail page, select Run Now and the job will execute immediately. Alternatively, you can set a schedule for the job.
Click on Run Now to do a test run of your notebook or JAR now that you’ve finished configuring your job. If your notebook fails, you can just edit it and the job will automatically run the new version of the notebook.
The Databricks job scheduler, like the Spark batch interface, is not intended for low latency jobs. Due to network or cloud issues, job runs may occasionally be delayed up to several minutes. In these situations, scheduled jobs will fire immediately upon service availability.
Viewing Old Job Runs¶
From your scheduled job page, you can access the logs from different runs of your job. Select the run from the job detail page to see the relevant details and job output. Databricks maintains a history of your job runs for up to 60 days. If you need to preserve job runs, we recommend that you save the run results through the UI before they expire. For more information, see Exporting Job Run Results.
Then you can view the standard error, standard output, as well as the Spark UI logs for your job.
Exporting Job Run Results¶
It is possible to persist old job runs by exporting their results. For notebook job runs, you can export a rendered notebook which can be later imported back to your Databricks workspace. For more information see Importing Notebooks.
Similarly, you can also manually export the logs for your job run. If you’d like to automate this process, you can set up your job so that it automatically delivers logs to either DBFS or S3 through the jobs API. For more information see the fields NewCluster and ClusterLogConf in the jobs Create API call.
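As a rough illustration of the automated log delivery mentioned above, the sketch below builds a request body for the jobs Create call with a `cluster_log_conf` entry nested under `new_cluster`. The field names reflect our understanding of the jobs API and should be checked against the API reference for your deployment. The job name, notebook path, log destination, and the placeholder cluster values are all hypothetical. The code only constructs and prints the JSON body; it does not contact a workspace.

```python
import json


def build_job_with_log_delivery(name, notebook_path, log_destination):
    """Return a request body for the jobs Create call with log delivery enabled."""
    return {
        "name": name,
        "new_cluster": {
            # Placeholder cluster settings (assumptions); replace with values
            # that are valid in your own workspace.
            "spark_version": "<spark-version>",
            "node_type_id": "<node-type>",
            "num_workers": 2,
            # ClusterLogConf: deliver cluster logs to a DBFS destination.
            "cluster_log_conf": {"dbfs": {"destination": log_destination}},
        },
        "notebook_task": {"notebook_path": notebook_path},
    }


# Hypothetical job name and paths, used only for illustration.
payload = build_job_with_log_delivery(
    name="nightly-report",
    notebook_path="/Users/someone@example.com/report",
    log_destination="dbfs:/cluster-logs/nightly-report",
)
print(json.dumps(payload, indent=2))
```

The printed JSON could then be sent as the body of the Create call with whatever HTTP client and authentication your workspace uses.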
Tips for Running Jar Jobs¶
Because Databricks is a managed service, some code changes may be necessary to ensure that your Apache Spark jobs run correctly. JAR job programs must use the shared SparkContext API to get the SparkContext, for example by calling SparkContext.getOrCreate(). Programs that invoke new SparkContext() will fail inside Databricks because Databricks has already initialized the SparkContext.
Editing a Job¶
You can edit any job that you’ve created by navigating to it from the jobs list page.
Deleting a Job¶
A user can delete jobs from the job list page by clicking on the blue “x” for a given job.
Jobs Settings and Advanced Usage¶
The Databricks Spark driver has certain library dependencies that cannot be overridden. These libraries take priority over any of your own libraries that conflict with them.
To get the full list of the driver library dependencies, run the following command inside a notebook attached to a cluster of the same Spark version (or the cluster with the driver you want to examine).
%sh ls /databricks/jars
Tips on Dealing with Library Dependencies¶
A good rule of thumb when creating JARs for jobs is to list Spark and Hadoop as provided dependencies. In Maven, add Spark and/or Hadoop as provided dependencies as shown below.
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.5.0</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>1.2.1</version>
  <scope>provided</scope>
</dependency>
In sbt, add Spark and/or Hadoop as provided dependencies as shown below.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0" % "provided"
libraryDependencies += "org.apache.hadoop" % "hadoop-core" % "1.2.1" % "provided"
Be sure to specify the correct Scala version for your dependencies based on the version you are running.
There are optional settings that you may specify when you’re running your job. These include:
- Alerts: Set up email alerts for your job to notify users in case of failure, success, or timeout.
- Timeout: Configure the maximum completion time for a job. If the job does not complete in this time, Databricks sets its status to “Timed Out”.
- Retries: Set a policy so that failed runs will be automatically retried.
- Maximum concurrent runs (new in version 2.34): Configure the maximum number of runs that can execute in parallel. When a new run is triggered, Databricks skips it if the job has already reached its maximum number of active runs. Set this value higher than the default of 1 if you want to execute multiple runs of the same job concurrently. This is useful, for example, if you trigger your job on a frequent schedule and want to allow consecutive runs to overlap, or if you want to trigger multiple runs that differ in their input parameters.
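The optional settings above can also be written as part of a job definition submitted through the jobs API. The sketch below is a minimal illustration that assumes the field names `email_notifications`, `timeout_seconds`, `max_retries`, and `max_concurrent_runs`; confirm them against the API reference before relying on them. Only failure and success alerts are shown. The helper `should_skip_run` is hypothetical: it restates the skip rule described in the text and is not the scheduler's actual implementation.

```python
def build_job_settings(emails, timeout_seconds, max_retries, max_concurrent_runs=1):
    """Return the optional-settings portion of a job definition."""
    return {
        # Alerts: notify these addresses on failure and on success.
        "email_notifications": {
            "on_failure": list(emails),
            "on_success": list(emails),
        },
        # Timeout: maximum completion time for a run, in seconds.
        "timeout_seconds": timeout_seconds,
        # Retries: how many times a failed run is retried.
        "max_retries": max_retries,
        # Maximum concurrent runs: defaults to 1, as stated in the text.
        "max_concurrent_runs": max_concurrent_runs,
    }


def should_skip_run(active_runs, max_concurrent_runs):
    """Restate the documented rule: skip a new run once the limit is reached."""
    return active_runs >= max_concurrent_runs


# Hypothetical values, used only for illustration.
settings = build_job_settings(
    emails=["oncall@example.com"],
    timeout_seconds=3600,
    max_retries=2,
    max_concurrent_runs=3,
)
```

With the default limit of 1, a second run triggered while the first is still active would be skipped. Raising the limit to 3 lets up to three runs overlap.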