Best practices: pools

This article explains what pools are, and how you can best configure them. For information on creating a pool, see Create a pool.

What are Databricks pools?

A Databricks pool is a set of idle, ready-to-use instances. When cluster nodes are created from these idle instances, cluster start and auto-scaling times are reduced. If the pool has no idle instances, it expands by allocating a new instance from the instance provider to accommodate the cluster’s request. When a cluster releases an instance, the instance returns to the pool and is free for another cluster to use. Only clusters attached to a pool can use that pool’s idle instances.

You can specify a different pool for the driver node and worker nodes, or use the same pool for both.

Databricks does not charge DBUs while instances are idle in the pool. Instance provider billing does apply. See pricing.

You can manage pools using the UI, the Instance Pools CLI (legacy), or by calling the Instance Pools API.

Pool recommendations

The Databricks platform provides an efficient and cost-effective way to manage your analytics infrastructure. Databricks recommends the following best practices when you use pools:

  • Create pools using instance types and Databricks runtimes based on target workloads.

  • When possible, populate pools with spot instances to reduce costs.

  • Populate pools with on-demand instances for jobs with short execution times and strict execution time requirements.

  • Use pool tags and cluster tags to manage billing.

  • Use pool configuration options to minimize cost.

  • Pre-populate pools to make sure instances are available when clusters need them.

Create pools based on workloads

If your driver node and worker nodes have different requirements, create a different pool for each.
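As a rough illustration of attaching a cluster to separate driver and worker pools, the sketch below builds a cluster specification as a plain dictionary. It assumes the Clusters API field names `instance_pool_id` and `driver_instance_pool_id`; the helper name, pool IDs, and runtime string are hypothetical placeholders, and nothing is sent to a workspace.

```python
def cluster_spec(name, spark_version, worker_pool_id,
                 driver_pool_id=None, num_workers=2):
    """Build a cluster spec dict that draws its nodes from pools.

    Field names are assumed from the Clusters API; check them against
    the API reference for your workspace before use.
    """
    spec = {
        "cluster_name": name,
        "spark_version": spark_version,      # placeholder runtime string
        "num_workers": num_workers,
        "instance_pool_id": worker_pool_id,  # pool for worker nodes
    }
    # Only set a separate driver pool when the driver has different needs.
    if driver_pool_id is not None:
        spec["driver_instance_pool_id"] = driver_pool_id
    return spec


# Hypothetical pool IDs: a small-instance pool for the driver and a
# larger-instance pool for the workers.
spec = cluster_spec("nightly-etl", "<runtime-version>",
                    worker_pool_id="pool-workers",
                    driver_pool_id="pool-driver")
```

Leaving `driver_pool_id` unset in this sketch corresponds to using the same pool for both node types.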

You can minimize instance acquisition time by creating a pool for each instance type and Databricks runtime your organization commonly uses. For example, if most data engineering clusters use instance type A, data science clusters use instance type B, and analytics clusters use instance type C, create one pool for each instance type.

Configure pools to use on-demand instances for jobs with short execution times and strict execution time requirements. Use on-demand instances to prevent acquired instances from being lost to a higher bidder on the spot market.

Configure pools to use spot instances for clusters that support interactive development or jobs that prioritize cost savings over reliability.
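The choice between on-demand and spot instances can be sketched as a small helper that builds the request body for creating a pool. This is a minimal sketch, not a definitive configuration: it assumes an AWS workspace and the Instance Pools API field names `instance_pool_name`, `node_type_id`, `preloaded_spark_versions`, and `aws_attributes.availability`; other clouds use a different attributes block. The function name, pool names, and the angle-bracket values are hypothetical placeholders.

```python
def pool_request(name, node_type_id, runtime, strict_timing):
    """Build a pool-creation request body as a plain dict.

    strict_timing=True models short jobs with strict execution time
    requirements (on-demand); False models interactive development or
    cost-sensitive jobs (spot). Field names are assumed from the
    Instance Pools API on AWS.
    """
    availability = "ON_DEMAND" if strict_timing else "SPOT"
    return {
        "instance_pool_name": name,
        "node_type_id": node_type_id,
        "preloaded_spark_versions": [runtime],
        "aws_attributes": {"availability": availability},
    }


# One pool per commonly used instance type and runtime (placeholders).
pools = [
    pool_request("de-sla-jobs", "<instance-type-A>", "<runtime>", True),
    pool_request("ds-interactive", "<instance-type-B>", "<runtime>", False),
]
```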

Tag pools to manage cost and billing

Tagging pools with the correct cost center allows you to manage cost and usage chargeback. You can use multiple custom tags to associate multiple cost centers with a pool. However, it’s important to understand how tags are propagated when a cluster is created from a pool. Pool tags propagate to the underlying cloud provider instances, but cluster tags do not. Therefore, apply to the pool all custom tags required for managing chargeback of the cloud provider compute cost.

Pool tags and cluster tags both propagate to Databricks billing. You can use the combination of cluster and pool tags to manage chargeback of Databricks Units.
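The propagation rules above can be modeled in a few lines. The sketch below is only a model of the two rules described in this section, not Databricks code: the function names and tag values are hypothetical, and rejecting overlapping keys is a simplifying assumption because this article does not say how conflicting keys are resolved.

```python
def cloud_instance_tags(pool_tags, cluster_tags):
    """Tags that reach the cloud provider instances: pool tags only."""
    return dict(pool_tags)


def dbu_billing_tags(pool_tags, cluster_tags):
    """Tags that reach Databricks billing: pool tags plus cluster tags."""
    overlap = set(pool_tags) & set(cluster_tags)
    if overlap:
        # Assumption for this sketch: treat duplicate keys as an error.
        raise ValueError(f"duplicate tag keys: {sorted(overlap)}")
    return {**pool_tags, **cluster_tags}


pool_tags = {"cost_center": "analytics"}
cluster_tags = {"project": "churn-model"}

on_instances = cloud_instance_tags(pool_tags, cluster_tags)
on_dbu_usage = dbu_billing_tags(pool_tags, cluster_tags)
```

In this model, a cost center that must appear on the cloud provider bill has to be in `pool_tags`, because `cluster_tags` never reach the instances.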

To learn more, see Monitor usage using tags.

Configure pools to control cost

You can use the following configuration options to help control the cost of pools:

  • Set the Min Idle instances to 0 to avoid paying for running instances that aren’t doing work. The tradeoff is that a cluster may take longer to start or scale up when the pool has to acquire a new instance from the instance provider.

  • Set the Idle Instance Auto Termination time to provide a buffer between when the instance is released from the cluster and when it’s dropped from the pool. Set this to a period that allows you to minimize cost while ensuring the availability of instances for scheduled jobs. For example, job A is scheduled to run at 8:00 AM and takes 40 minutes to complete. Job B is scheduled to run at 9:00 AM and takes 30 minutes to complete. Job A finishes at 8:40 AM, 20 minutes before job B starts, so set the Idle Instance Auto Termination value to 20 minutes to ensure that instances returned to the pool when job A completes are still available when job B starts. Unless they are claimed by another cluster, those instances are terminated 20 minutes after job B ends.

  • Set the Max Capacity based on anticipated usage. This sets the ceiling for the maximum number of used and idle instances in the pool. If a job or cluster requests an instance from a pool at its maximum capacity, the request fails, and the cluster doesn’t acquire more instances. Therefore, Databricks recommends that you set the maximum capacity only if there is a strict instance quota or budget constraint.
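The arithmetic behind the job A and job B example can be written out as a short calculation. This is a minimal sketch using only the Python standard library; the function name is hypothetical, and the date is arbitrary because only the times of day matter here.

```python
import math
from datetime import datetime, timedelta


def idle_timeout_minutes(first_start, first_duration_min, next_start):
    """Minutes an instance must stay idle in the pool so that instances
    released by the first job are still there when the next job starts."""
    first_end = first_start + timedelta(minutes=first_duration_min)
    gap = next_start - first_end
    # Round up to whole minutes; never return a negative timeout.
    return max(0, math.ceil(gap.total_seconds() / 60))


job_a_start = datetime(2024, 1, 1, 8, 0)   # job A starts at 8:00 AM
job_b_start = datetime(2024, 1, 1, 9, 0)   # job B starts at 9:00 AM

# Job A runs for 40 minutes, so it ends at 8:40 AM.
timeout = idle_timeout_minutes(job_a_start, 40, job_b_start)
print(timeout)  # 20
```

In practice you might add a few minutes of margin to the result if job A's run time varies from day to day.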

Pre-populate pools

To benefit fully from pools, you can pre-populate newly created pools. Set Min Idle instances to a value greater than zero in the pool configuration. Alternatively, if you’re following the recommendation to set this value to zero, use a starter job to ensure that newly created pools have available instances for clusters to access.

With the starter job approach, schedule a job with flexible execution time requirements to run before jobs with stricter performance requirements or before users start using interactive clusters. After the job finishes, the instances used for the job are released back to the pool. Set the Min Idle instances setting to 0 and set the Idle Instance Auto Termination time high enough to ensure that idle instances remain available for subsequent jobs.

Using a starter job allows the pool instances to spin up, populate the pool, and remain available for downstream jobs or interactive clusters.