Use Databricks compute with your jobs
When you run a Databricks job, the tasks configured as part of the job run on Databricks compute: either a cluster or a SQL warehouse, depending on the task type. Selecting the compute type and configuration options is important when you operationalize a job. This article provides a guide to using Databricks compute resources to run your jobs.
Choose the correct cluster type for your job
New Job Clusters are dedicated clusters created for a job or task run. A job cluster can be shared by multiple tasks in a job or scoped to a single task.

A shared job cluster is created and started when the first task that uses it starts and terminates after the last task that uses it completes. The cluster does not terminate when idle; it terminates only after all tasks that use it have completed. If a shared job cluster fails or is terminated before all tasks have finished, a new cluster is created.

A cluster scoped to a single task is created and started when the task starts and terminates when the task completes.

In production, Databricks recommends using new shared or task-scoped clusters so that each job or task runs in a fully isolated environment.
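To make the two scopes concrete, the following is a minimal sketch of a Jobs API 2.1 create request that defines one shared job cluster (referenced by tasks through job_cluster_key) and one task-scoped cluster. The workspace host and token environment variables, notebook paths, node type, and Spark version are illustrative assumptions, not values from this article.

```python
import os
import requests

job_spec = {
    "name": "example-etl-job",
    # A shared job cluster: starts with the first task that references it
    # and terminates after the last such task completes.
    "job_clusters": [
        {
            "job_cluster_key": "shared_cluster",
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",  # example version
                "node_type_id": "i3.xlarge",          # example node type
                "num_workers": 2,
            },
        }
    ],
    "tasks": [
        {
            "task_key": "ingest",
            "job_cluster_key": "shared_cluster",  # runs on the shared cluster
            "notebook_task": {"notebook_path": "/Jobs/ingest"},
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],
            "job_cluster_key": "shared_cluster",  # reuses the same cluster
            "notebook_task": {"notebook_path": "/Jobs/transform"},
        },
        {
            "task_key": "report",
            "depends_on": [{"task_key": "transform"}],
            # A cluster scoped to this task only: created when the task
            # starts and terminated when the task completes.
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 1,
            },
            "notebook_task": {"notebook_path": "/Jobs/report"},
        },
    ],
}

resp = requests.post(
    f"{os.environ['DATABRICKS_HOST']}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json()["job_id"])
```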
When you run a task on a new cluster, the task is treated as a data engineering (task) workload, subject to task workload pricing. When you run a task on an existing all-purpose cluster, the task is treated as a data analytics (all-purpose) workload, subject to all-purpose workload pricing.
If you select a terminated existing cluster and the job owner has Can Restart permission, Databricks starts the cluster when the job is scheduled to run.
Existing all-purpose clusters work best for tasks such as updating dashboards at regular intervals.
Use a pool to reduce cluster start times
To decrease new job cluster start time, create a pool and configure the job’s cluster to use the pool.
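The sketch below shows this two-step flow: create a pool with the Instance Pools API, then reference it from a job cluster spec through instance_pool_id so new job clusters draw warm nodes from the pool instead of provisioning fresh instances. The host and token environment variables, pool name, node type, Spark version, and sizing values are illustrative assumptions.

```python
import os
import requests

HOST = os.environ["DATABRICKS_HOST"]
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

# 1. Create a pool that keeps a few idle instances ready to be claimed.
pool = requests.post(
    f"{HOST}/api/2.0/instance-pools/create",
    headers=HEADERS,
    json={
        "instance_pool_name": "jobs-pool",   # example name
        "node_type_id": "i3.xlarge",         # example node type
        "min_idle_instances": 2,             # instances kept warm
        "idle_instance_autotermination_minutes": 60,
    },
)
pool.raise_for_status()
pool_id = pool.json()["instance_pool_id"]

# 2. Point the job's cluster at the pool. Clusters created from this spec
# acquire nodes from the pool, reducing start time.
job_cluster = {
    "job_cluster_key": "pooled_cluster",
    "new_cluster": {
        "spark_version": "13.3.x-scala2.12",  # example version
        "instance_pool_id": pool_id,
        "num_workers": 2,
    },
}
```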
Automatic availability zones
To take advantage of automatic availability zones (Auto-AZ), enable it with the Clusters API by setting aws_attributes.zone_id = "auto". See Availability zones.
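As a minimal sketch, a cluster create request with Auto-AZ enabled might look like the following; the host and token environment variables, cluster name, node type, and Spark version are illustrative assumptions, and only the aws_attributes setting comes from this article.

```python
import os
import requests

resp = requests.post(
    f"{os.environ['DATABRICKS_HOST']}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={
        "cluster_name": "auto-az-cluster",    # example name
        "spark_version": "13.3.x-scala2.12",  # example version
        "node_type_id": "i3.xlarge",          # example node type
        "num_workers": 2,
        # Let Databricks choose the availability zone automatically.
        "aws_attributes": {"zone_id": "auto"},
    },
)
resp.raise_for_status()
print(resp.json()["cluster_id"])
```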