Clusters provide the computation resources and configurations that run your notebooks and jobs. Clusters run on instances provisioned by your cloud provider on demand. The Databricks platform provides an efficient and cost-effective way to manage your analytics infrastructure. This article shows how to address the following challenges when creating new clusters or scaling up existing clusters:
- The execution time of your Databricks job might be shorter than the time to provision instances and start a new cluster.
- When autoscaling is enabled on a cluster, it takes time for the cloud provider to provision new instances. This can negatively impact jobs with strict performance requirements or varying workloads.
Databricks pools reduce cluster start and scale-up times by maintaining a set of available, ready-to-use instances.
You can use a different pool for the driver node and worker nodes.
As shown in the following diagram, when a cluster attached to a pool needs an instance, it first attempts to allocate one of the pool’s available instances. If the pool has no available instances, it expands by allocating a new instance from the cloud provider to accommodate the cluster’s request. When a cluster releases an instance, the instance returns to the pool and is free for use by another cluster. Only clusters attached to a pool can use that pool’s available instances.
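The allocation flow described above can be sketched as a toy model. This is only an illustration of the described behavior, not Databricks code; the `Pool` class and its attribute names are hypothetical.

```python
class Pool:
    """Toy model of a pool that tracks instance counts only (hypothetical)."""

    def __init__(self, idle=0):
        self.idle = idle        # ready-to-use instances held by the pool
        self.in_use = 0         # instances currently held by attached clusters
        self.provisioned = 0    # instances requested from the cloud provider

    def acquire(self):
        """Give an instance to an attached cluster and report its source."""
        if self.idle > 0:
            # An available instance exists, so the cluster gets it immediately.
            self.idle -= 1
            source = "pool"
        else:
            # No available instance: the pool expands via the cloud provider.
            self.provisioned += 1
            source = "cloud"
        self.in_use += 1
        return source

    def release(self):
        """Return an instance from a cluster to the pool's idle set."""
        if self.in_use == 0:
            raise ValueError("no instance is in use")
        self.in_use -= 1
        self.idle += 1


pool = Pool(idle=1)
first = pool.acquire()    # served from the pool's idle instance
second = pool.acquire()   # pool is empty, so a new instance is provisioned
pool.release()            # one cluster releases its instance
third = pool.acquire()    # the released instance is reused
print(first, second, third)
```

In this model only the second request waits on the cloud provider, which is the delay that pools are meant to avoid.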
This article discusses the following best practices to ensure the best performance at the lowest cost when you use pools:
- Create pools using instance types and Databricks runtimes based on target workloads.
- When possible, populate pools with spot instances to reduce costs.
- Populate pools with on-demand instances for jobs with short execution times and strict execution time requirements.
- Use pool tags and cluster tags to manage billing.
- Use pool configuration options to minimize cost.
- Pre-populate pools to make sure instances are available when clusters need them.
If your driver node and worker nodes have different requirements, create a different pool for each.
You can minimize instance acquisition time by creating a pool for each instance type and Databricks runtime your organization commonly uses. For example, if most data engineering clusters use instance type A, data science clusters use instance type B, and analytics clusters use instance type C, create a separate pool for each instance type.
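As a rough sketch of the one-pool-per-workload idea, the following builds one pool definition per instance type. The field names (`instance_pool_name`, `node_type_id`, `preloaded_spark_versions`) are assumed to match the Instance Pools REST API; verify them against the API reference before use. The workload names, instance type strings, and runtime string are placeholders, not real identifiers.

```python
import json

# Placeholder mapping that mirrors the example in the text (types A, B, C).
WORKLOAD_NODE_TYPES = {
    "data-engineering": "instance-type-A",
    "data-science": "instance-type-B",
    "analytics": "instance-type-C",
}


def build_pool_specs(node_types, runtime_version):
    """Build one pool definition per workload and instance type."""
    return [
        {
            # Field names are assumptions about the pool creation payload.
            "instance_pool_name": f"{workload}-pool",
            "node_type_id": node_type,
            "preloaded_spark_versions": [runtime_version],
        }
        for workload, node_type in node_types.items()
    ]


specs = build_pool_specs(WORKLOAD_NODE_TYPES, "example-runtime-version")
print(json.dumps(specs, indent=2))
```

Each definition could then be submitted through whichever interface you use to create pools (UI, CLI, or REST API).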
Configure instance pools to use on-demand instances for jobs with short execution times and strict execution time requirements. Use on-demand instances to prevent acquired instances from being lost to a higher bidder on the spot market.
Configure pools to use spot instances for clusters that support interactive development or jobs that prioritize cost savings over reliability.
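The choice between on-demand and spot instances can be expressed as a small helper. This is a minimal sketch: the `aws_attributes.availability` field and its `ON_DEMAND` and `SPOT` values are assumptions based on the AWS flavor of the pool configuration, and other cloud providers use different attribute names. The function names and pool names are hypothetical.

```python
def choose_availability(strict_execution_time):
    """Pick an availability setting following the guidance in the text."""
    # Strict timing requirements: avoid losing instances on the spot market.
    # Otherwise: prefer spot instances to reduce cost.
    return "ON_DEMAND" if strict_execution_time else "SPOT"


def with_availability(pool_spec, strict_execution_time):
    """Return a copy of pool_spec with the availability attribute set."""
    spec = dict(pool_spec)
    # Assumed AWS-specific field; not confirmed by the text above.
    spec["aws_attributes"] = {
        "availability": choose_availability(strict_execution_time)
    }
    return spec


short_strict_jobs = with_availability({"instance_pool_name": "short-strict-jobs"}, True)
interactive_dev = with_availability({"instance_pool_name": "interactive-dev"}, False)
print(short_strict_jobs)
print(interactive_dev)
```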
Tagging pools to the correct cost center allows you to manage cost and usage chargeback. You can use multiple custom tags to associate multiple cost centers to a pool. However, it’s important to understand how tags are propagated when a cluster is created from pools. As shown in the following diagram, tags from the pools propagate to the underlying cloud provider instances, but the cluster’s tags do not. Apply all custom tags required for managing chargeback of the cloud provider compute cost to the pool.
Pool tags and cluster tags both propagate to Databricks billing. You can use the combination of cluster and pool tags to manage chargeback of Databricks Units.
To learn more, see Monitor usage using cluster and pool tags.
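The propagation rules described above can be summarized in a short sketch. The function name and tag keys are hypothetical. The text does not say what happens when a pool and a cluster define the same tag key, so this sketch lets the cluster tag win purely for illustration.

```python
def propagate_tags(pool_tags, cluster_tags):
    """Return (cloud_instance_tags, billing_tags) per the rules in the text."""
    # Only pool tags reach the underlying cloud provider instances.
    cloud_instance_tags = dict(pool_tags)
    # Both pool and cluster tags reach Databricks billing.
    # Assumption: on a key collision the cluster tag overrides the pool tag.
    billing_tags = {**pool_tags, **cluster_tags}
    return cloud_instance_tags, billing_tags


pool_tags = {"cost-center": "platform"}
cluster_tags = {"team": "analytics"}
cloud_tags, billing_tags = propagate_tags(pool_tags, cluster_tags)
print(cloud_tags)
print(billing_tags)
```

Because `team` never reaches the cloud provider in this model, any tag needed for cloud compute chargeback has to be set on the pool.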
You can use the following configuration options to help control the cost of pools:
- Set the Min Idle instances to 0 to avoid paying for running instances that aren’t doing work. The tradeoff is a possible increase in time when a cluster needs to acquire a new instance.
- Set the Idle Instance Auto Termination time to provide a buffer between when the instance is released from the cluster and when it’s dropped from the pool. Set this to a period that allows you to minimize cost while ensuring the availability of instances for scheduled jobs. For example, job A is scheduled to run at 8:00 AM and takes 40 minutes to complete, so it finishes at 8:40 AM. Job B is scheduled to run at 9:00 AM and takes 30 minutes to complete. Set the Idle Instance Auto Termination value to 20 minutes to cover the gap between 8:40 AM and 9:00 AM, so that instances returned to the pool when job A completes are still available when job B starts. Unless they are claimed by another cluster, those instances are terminated 20 minutes after job B ends.
- Set the Max Capacity based on anticipated usage. This sets the ceiling for the maximum number of used and idle instances in the pool. If a job or cluster requests an instance from a pool at its maximum capacity, the request fails, and the cluster doesn’t acquire more instances. Therefore, Databricks recommends that you set the maximum capacity only if there is a strict instance quota or budget constraint.
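The arithmetic behind the Idle Instance Auto Termination example can be sketched as follows. The helper name is hypothetical, and the model is simplified: it assumes jobs run back to back on the same pool and release all instances when they finish.

```python
from datetime import datetime, timedelta


def minimum_idle_timeout(jobs):
    """Return the longest gap between one job ending and the next starting.

    jobs is a list of (start, duration) tuples. An idle timeout at least this
    long keeps released instances in the pool until the next job starts.
    """
    ordered = sorted(jobs, key=lambda job: job[0])
    gaps = []
    for (start, duration), (next_start, _) in zip(ordered, ordered[1:]):
        gap = next_start - (start + duration)
        # Overlapping jobs leave no idle gap, so clamp negatives to zero.
        gaps.append(max(gap, timedelta(0)))
    return max(gaps, default=timedelta(0))


day = datetime(2024, 1, 1)
job_a = (day.replace(hour=8), timedelta(minutes=40))  # 8:00 AM, 40 minutes
job_b = (day.replace(hour=9), timedelta(minutes=30))  # 9:00 AM, 30 minutes
timeout = minimum_idle_timeout([job_a, job_b])
print(timeout)  # 0:20:00
```

Job A ends at 8:40 AM and job B starts at 9:00 AM, so the result matches the 20-minute value used in the example.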
To benefit fully from pools, you can pre-populate newly created pools. Set the Min Idle instances greater than zero in the pool configuration. Alternatively, if you’re following the recommendation to set this value to zero, use a starter job to ensure that newly created pools have available instances for clusters to access.
With the starter job approach, schedule a job with flexible execution time requirements to run before jobs with stricter performance requirements or before users start using interactive clusters. After the job finishes, the instances used for the job are released back to the pool. Set the Min Idle instances setting to 0 and set the Idle Instance Auto Termination time high enough to ensure that idle instances remain available for subsequent jobs.
Using a starter job allows the pool instances to spin up, populate the pool, and remain available for downstream job or interactive clusters.
Learn more about Databricks pools.