Use Databricks compute with your jobs
When you run a Databricks job, the tasks configured as part of the job run on Databricks compute: either a cluster or a SQL warehouse, depending on the task type. Selecting the compute type and configuration options is important when you operationalize a job. This article provides a guide to using Databricks compute resources to run your jobs.
Choose the correct cluster type for your job
New Job Clusters are dedicated clusters created for a job or task run. A job cluster can be shared by multiple tasks in a job or scoped to a single task.

A shared job cluster is created and started when the first task that uses it starts and terminates after the last task that uses it completes. The cluster does not terminate when idle; it terminates only after all tasks that use it have completed. If a shared job cluster fails or is terminated before all tasks have finished, a new cluster is created.

A cluster scoped to a single task is created and started when the task starts and terminates when the task completes.

In production, Databricks recommends using new shared or task-scoped clusters so that each job or task runs in a fully isolated environment.
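To make the two scopes concrete, the following is a minimal sketch of a Jobs API 2.1 create request that defines one shared job cluster (referenced by tasks through job_cluster_key) and one task-scoped cluster. The workspace host and token environment variables, notebook paths, node type, and Spark version are illustrative assumptions, not values from this article.

```python
import os
import requests

job_spec = {
    "name": "example-etl-job",
    # A shared job cluster: starts with the first task that references it
    # and terminates after the last such task completes.
    "job_clusters": [
        {
            "job_cluster_key": "shared_cluster",
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",  # example version
                "node_type_id": "i3.xlarge",          # example node type
                "num_workers": 2,
            },
        }
    ],
    "tasks": [
        {
            "task_key": "ingest",
            "job_cluster_key": "shared_cluster",  # runs on the shared cluster
            "notebook_task": {"notebook_path": "/Jobs/ingest"},
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],
            "job_cluster_key": "shared_cluster",  # reuses the same cluster
            "notebook_task": {"notebook_path": "/Jobs/transform"},
        },
        {
            "task_key": "report",
            "depends_on": [{"task_key": "transform"}],
            # A cluster scoped to this task only: created when the task
            # starts and terminated when the task completes.
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 1,
            },
            "notebook_task": {"notebook_path": "/Jobs/report"},
        },
    ],
}

resp = requests.post(
    f"{os.environ['DATABRICKS_HOST']}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json()["job_id"])
```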
When you run a task on a new cluster, the task is treated as a data engineering (task) workload, subject to task workload pricing. When you run a task on an existing all-purpose cluster, the task is treated as a data analytics (all-purpose) workload, subject to all-purpose workload pricing.
If you select a terminated existing cluster and the job owner has Can Restart permission, Databricks starts the cluster when the job is scheduled to run.
Existing all-purpose clusters work best for tasks such as updating dashboards at regular intervals.
Use a pool to reduce cluster start times
To decrease new job cluster start time, create a pool and configure the job’s cluster to use the pool.
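The sketch below shows this two-step flow: create a pool with the Instance Pools API, then reference it from a job cluster spec through instance_pool_id so new job clusters draw warm nodes from the pool instead of provisioning fresh instances. The host and token environment variables, pool name, node type, Spark version, and sizing values are illustrative assumptions.

```python
import os
import requests

HOST = os.environ["DATABRICKS_HOST"]
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

# 1. Create a pool that keeps a few idle instances ready to be claimed.
pool = requests.post(
    f"{HOST}/api/2.0/instance-pools/create",
    headers=HEADERS,
    json={
        "instance_pool_name": "jobs-pool",   # example name
        "node_type_id": "i3.xlarge",         # example node type
        "min_idle_instances": 2,             # instances kept warm
        "idle_instance_autotermination_minutes": 60,
    },
)
pool.raise_for_status()
pool_id = pool.json()["instance_pool_id"]

# 2. Point the job's cluster at the pool. Clusters created from this spec
# acquire nodes from the pool, reducing start time.
job_cluster = {
    "job_cluster_key": "pooled_cluster",
    "new_cluster": {
        "spark_version": "13.3.x-scala2.12",  # example version
        "instance_pool_id": pool_id,
        "num_workers": 2,
    },
}
```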
Automatic availability zones
To take advantage of automatic availability zones (Auto-AZ), enable it with the Clusters API by setting aws_attributes.zone_id = "auto". See Availability zones.
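As a minimal sketch, a cluster create request with Auto-AZ enabled might look like the following; the host and token environment variables, cluster name, node type, and Spark version are illustrative assumptions, and only the aws_attributes setting comes from this article.

```python
import os
import requests

resp = requests.post(
    f"{os.environ['DATABRICKS_HOST']}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={
        "cluster_name": "auto-az-cluster",    # example name
        "spark_version": "13.3.x-scala2.12",  # example version
        "node_type_id": "i3.xlarge",          # example node type
        "num_workers": 2,
        # Let Databricks choose the availability zone automatically.
        "aws_attributes": {"zone_id": "auto"},
    },
)
resp.raise_for_status()
print(resp.json()["cluster_id"])
```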