Clusters are a core component of the Databricks unified analytics platform, providing the computation resources and configurations that run your notebooks and jobs. Databricks clusters run on instances provisioned by your cloud provider, allowing you to spin clusters up and bring them down based on demand. The Databricks platform provides an efficient and cost-effective way to manage your analytics infrastructure but can present challenges when creating new clusters or scaling up existing clusters:
- The execution time of your Databricks job might be shorter than the time to provision instances and start a new cluster.
- Waiting for the cloud provider to provision instances can impose a performance penalty when an auto-scaling cluster needs to scale up.
The additional time to create or scale clusters can impact jobs that have strict performance requirements or varying workloads. Fortunately, you can use Databricks pools to address these challenges. Pools reduce cluster start and scale-up times by maintaining a set of available, ready-to-use instances.
For an introduction to pools and configuration recommendations, view the Databricks pools video:
As shown in the following diagram, when a cluster attached to a pool needs an instance, it first attempts to allocate one of the pool’s available instances. If the pool has no available instances, it expands by allocating a new instance from the cloud provider to accommodate the cluster’s request. When a cluster releases an instance, the instance returns to the pool and is free for use by another cluster. Only clusters attached to a pool can use that pool’s available instances.
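The allocation behavior described above can be illustrated with a toy model. This is only a sketch of the logic, not Databricks code: the `Pool` class and its attribute names are hypothetical, and real pools also enforce capacity limits and idle timeouts that are omitted here.

```python
class Pool:
    """Toy model of a pool: idle instances are reused before new ones are provisioned."""

    def __init__(self, idle=0):
        self.idle = idle          # ready-to-use instances held by the pool
        self.in_use = 0           # instances currently attached to clusters
        self.provisioned = 0      # instances requested from the cloud provider

    def acquire(self):
        """Hand an instance to a cluster and report where it came from."""
        if self.idle > 0:
            self.idle -= 1
            source = "pool"       # fast path: an idle instance was available
        else:
            self.provisioned += 1
            source = "cloud"      # slow path: the pool expands via the cloud provider
        self.in_use += 1
        return source

    def release(self):
        """A cluster returns an instance, which becomes idle in the pool."""
        if self.in_use == 0:
            raise ValueError("no instance is in use")
        self.in_use -= 1
        self.idle += 1


pool = Pool(idle=1)
sources = [pool.acquire(), pool.acquire()]  # the second request exceeds the idle supply
pool.release()                              # one cluster gives an instance back
sources.append(pool.acquire())              # the released instance is reused
print(sources)
```

In this model, only the second request pays the cloud provisioning cost; the first and third are served from idle instances.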
This article discusses the following best practices to ensure the best performance at the lowest cost when you use Databricks pools:
- Create pools using instance types and Databricks runtimes based on target workloads.
- When possible, populate pools with spot instances to reduce costs.
- Populate pools with on-demand instances for jobs with short execution times and strict execution time requirements.
- Use pool and cluster tags to manage billing.
- Use pool configuration options to minimize cost.
- Pre-populate pools to make sure instances are available when clusters need them.
Because pools support only homogeneous instance types (that is, the driver and worker instance types must be the same), you can minimize instance acquisition time by creating a pool for each common instance type and Databricks runtime. For example, if most of the data engineering clusters use instance type A, data science clusters use instance type B, and analytics clusters use instance type C, create three pools: one with instance type A, one with instance type B, and one with instance type C.
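As a rough sketch, the three-pool example could be expressed as one pool specification per workload. The field names below (`instance_pool_name`, `node_type_id`, `preloaded_spark_versions`) are modeled on the Instance Pools REST API and should be checked against the current API reference; the pool names, instance type labels, and runtime string are placeholders, not real values.

```python
import json


def pool_spec(name, node_type_id, spark_version):
    """Build a pool specification for one instance type and one runtime."""
    return {
        "instance_pool_name": name,
        "node_type_id": node_type_id,                 # one instance type per pool
        "preloaded_spark_versions": [spark_version],  # runtime preloaded on idle instances
    }


# Placeholder mapping of workload to its most common instance type.
workloads = {
    "data-engineering": "instance-type-A",
    "data-science": "instance-type-B",
    "analytics": "instance-type-C",
}

specs = [
    pool_spec(f"{team}-pool", node_type, "example-runtime-version")
    for team, node_type in workloads.items()
]
print(json.dumps(specs, indent=2))
```

Each specification would then be submitted separately, so that clusters for a given workload attach to the pool that matches their instance type.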
Configure pools to use on-demand instances for jobs with short execution times and strict execution time requirements. On-demand instances will prevent acquired instances from being lost to a higher bidder on the spot market.
Configure pools to use spot instances for clusters that support interactive development or jobs that prioritize cost savings over reliability.
Tag pools with the right cost center to best manage cost and usage chargeback. Use multiple custom tags if you need to associate multiple cost centers with a pool. However, it’s important to understand how tags are propagated when a cluster is created from a pool. As shown in the following diagram, only the pool’s tags propagate to the underlying cloud provider instances. The cluster’s tags are not propagated to the instances. This means you should apply all custom tags required for managing chargeback of the cloud provider compute cost to the pool.
Pool tags and cluster tags both propagate to Databricks billing, so it is still a best practice to apply cluster tags for managing chargeback of Databricks Units.
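The propagation rules can be summarized in a few lines. This is an illustrative model only: the function name and tag keys are hypothetical, and how Databricks resolves a key that appears in both the pool and the cluster is not covered here, so the example deliberately uses distinct keys.

```python
def propagate_tags(pool_tags, cluster_tags):
    """Return (instance_tags, billing_tags) for a cluster created from a pool."""
    # Only pool tags reach the cloud provider instances.
    instance_tags = dict(pool_tags)
    # Both pool and cluster tags reach Databricks billing.
    # Assumption: keys do not collide; collision handling is not modeled.
    billing_tags = {**pool_tags, **cluster_tags}
    return instance_tags, billing_tags


pool_tags = {"cost-center": "finance"}
cluster_tags = {"project": "forecasting"}

instance_tags, billing_tags = propagate_tags(pool_tags, cluster_tags)
print(instance_tags)
print(billing_tags)
```

Here the `project` tag is visible in Databricks billing but not on the cloud instances, which is why chargeback tags for cloud compute cost belong on the pool.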
To learn more, see Monitor usage using cluster and pool tags.
You can use the following configuration options to help control the cost of pools:
- Set the Min Idle instances to 0 to avoid paying for running instances that aren’t doing work. The tradeoff is a possible increase in time when a cluster needs to acquire a new instance.
- Set the Idle Instance Auto Termination time to provide a buffer between when the instance is released from the cluster and when it’s dropped from the pool. Set this to a period that allows you to minimize cost while ensuring the availability of instances for scheduled jobs. For example, job A is scheduled to run at 8:00 AM and takes 40 minutes to complete, so it finishes at 8:40 AM. Job B is scheduled to run at 9:00 AM and takes 30 minutes to complete. Set the Idle Instance Auto Termination value to 20 minutes, the gap between the end of job A and the start of job B, to ensure that instances returned to the pool when job A completes are available when job B starts. Unless they are claimed by another cluster, those instances are terminated 20 minutes after job B ends.
- Set the Max Capacity based on anticipated usage. This sets the ceiling for the maximum number of used and idle instances in the pool. If a job or cluster requests an instance from a pool at its maximum capacity, the request will fail, and the cluster will not acquire more instances. Therefore, Databricks recommends that you set the maximum capacity only if there is a strict instance quota or budget constraint.
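The Idle Instance Auto Termination example above reduces to simple schedule arithmetic. The following sketch computes the shortest idle timeout that keeps instances in the pool between two scheduled jobs; the function name is hypothetical and the calendar date is arbitrary.

```python
import math
from datetime import datetime, timedelta


def min_idle_timeout_minutes(prev_start, prev_runtime_minutes, next_start):
    """Smallest idle timeout (whole minutes) that bridges two scheduled jobs."""
    prev_end = prev_start + timedelta(minutes=prev_runtime_minutes)
    gap_seconds = (next_start - prev_end).total_seconds()
    # Round up so the instances are still idle when the next job starts;
    # a non-positive gap means the jobs overlap and no buffer is needed.
    return max(0, math.ceil(gap_seconds / 60))


job_a_start = datetime(2024, 1, 1, 8, 0)   # job A starts at 8:00 AM, runs 40 minutes
job_b_start = datetime(2024, 1, 1, 9, 0)   # job B starts at 9:00 AM

timeout = min_idle_timeout_minutes(job_a_start, 40, job_b_start)
print(timeout)  # 20
```

In practice, job runtimes vary, so a small margin on top of the computed value may be worth the extra idle cost.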
To fully benefit from Databricks pools, you should make sure that any newly created pools are pre-populated. You can do this by setting the Min Idle instances greater than zero in the pool configuration. Alternatively, if you’re following the recommendation to set this value to zero, you can use a starter job to ensure that newly created pools have available instances for clusters to access.
With the starter job approach, schedule a job with flexible execution time requirements to run before jobs with stricter performance requirements or before users start using interactive clusters. After the job finishes, the instances used for the job are released back to the pool. The pool should be configured with a Min Idle instance setting of 0 and an Idle Instance Auto Termination time that is high enough to ensure that subsequently scheduled jobs or interactive clusters benefit from pulling available instances from the pool rather than the cloud provider.
The starter job gives the pool instances time to spin up, populate the pool, and become available for downstream jobs or interactive clusters.
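One way to reason about starter job timing is to compute the window in which it can start. This is a sketch under stated assumptions: the function name and the sample values are hypothetical, and the starter job's runtime is assumed to include the time needed to provision its instances from the cloud provider.

```python
from datetime import datetime, timedelta


def starter_job_window(first_job_start, starter_runtime_minutes, idle_timeout_minutes):
    """Return (earliest, latest) start times for a starter job.

    The starter job must finish before the first real job starts, but not so
    early that the idle timeout removes its instances from the pool.
    """
    runtime = timedelta(minutes=starter_runtime_minutes)
    idle_timeout = timedelta(minutes=idle_timeout_minutes)
    latest = first_job_start - runtime   # finishes exactly when the first job starts
    earliest = latest - idle_timeout     # instances idle for the full timeout
    return earliest, latest


first_job_start = datetime(2024, 1, 1, 9, 0)  # arbitrary date, 9:00 AM
earliest, latest = starter_job_window(
    first_job_start, starter_runtime_minutes=15, idle_timeout_minutes=30
)
print(earliest.time(), latest.time())
```

With a 15-minute starter job and a 30-minute idle timeout, any start time between 8:15 AM and 8:45 AM leaves warm instances in the pool at 9:00 AM.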
Learn more about Databricks pools.