Compute configuration best practices

Databricks provides a number of options when you create and configure compute to help you get the best performance at the lowest cost. This flexibility, however, can create challenges when you’re trying to determine optimal configurations for your workloads. Carefully considering how users will utilize compute will help guide configuration options when you create new compute or configure existing compute. Some of the things to consider when determining configuration options are:

  • What type of user will be using the compute? A data scientist may be running different job types with different requirements than a data engineer or data analyst.

  • What types of workloads will users run on the compute? For example, batch extract, transform, and load (ETL) jobs will likely have different requirements than analytical workloads.

  • What level of service level agreement (SLA) do you need to meet?

  • What budget constraints do you have?

This article provides compute configuration recommendations for different scenarios based on these considerations. This article also discusses specific features of Databricks compute and the considerations to keep in mind for those features.

Your configuration decisions will require a tradeoff between cost and performance. The primary cost of a compute includes the Databricks Units (DBUs) consumed by the compute and the cost of the underlying resources needed to run the compute. What may not be obvious are the secondary costs such as the cost to your business of not meeting an SLA, decreased employee efficiency, or possible waste of resources because of poor controls.

Compute features

Before discussing more detailed compute configuration scenarios, it’s important to understand some features of Databricks compute and how best to use those features.

All-purpose compute and job compute

When you create a compute you select a compute type: an all-purpose compute or a job compute. All-purpose compute can be shared by multiple users and are best for performing ad-hoc analysis, data exploration, or development. Once you’ve completed implementing your processing and are ready to operationalize your code, switch to running it on a job compute. Job compute terminate when your job ends, reducing resource usage and cost.

Single-node and multi-node compute

At the top of the create compute UI, you can select whether you want your compute to be Multi Node or Single Node.

Single-node compute is intended for jobs that use small amounts of data or non-distributed workloads such as single-node machine learning libraries. Multi-node compute is for larger jobs with distributed workloads.

On-demand and spot instances

Amazon Web Services has two tiers of EC2 instances: on-demand and spot. For on-demand instances, you pay for compute capacity by the second with no long-term commitments. Spot instances allow you to use spare Amazon EC2 computing capacity and choose the maximum price you are willing to pay. Spot pricing changes in real-time based on the supply and demand on AWS compute capacity. If the current spot market price is above the max spot price, the spot instances are terminated. Since spot instances are often available at a discount compared to on-demand pricing you can significantly reduce the cost of running your applications, grow your application’s compute capacity, and increase throughput.

Databricks supports creating compute using a combination of on-demand and spot instances with a custom spot price, allowing you to tailor your compute according to your use cases. For example, this image illustrates a configuration that specifies that the driver node and four worker nodes should be launched as on-demand instances and the remaining four workers should be launched as spot instances where the maximum spot price is 100% of the on-demand price.

Configure on-demand and spot instances
Max spot price

Databricks recommends launching the compute so that the Spark driver is on an on-demand instance, which allows saving the state of the compute even after losing spot instance nodes. If you choose to use all spot instances including the driver, any cached data or tables are deleted if you lose the driver instance due to changes in the spot market.

Another important setting is Spot fall back to On-demand. If you are running a hybrid compute (that is, a mix of on-demand and spot instances), and if spot instance acquisition fails or you lose the spot instances, Databricks falls back to using on-demand instances and provides you with the desired capacity. Without this option you will lose the capacity supplied by the spot instances for the compute, causing delay or failure of your workload. Databricks recommends setting the mix of on-demand and spot instances in your compute based on the criticality of jobs, tolerance to delays and failures due to loss of instances, and cost sensitivity for each type of use case.

Tip

You can use the Amazon Spot Instance Advisor to determine a suitable price for your instance type and region.

Protect spot instances from preemption with decommissioning

While spot instances can save you money, they can be preempted by cloud provider scheduling mechanisms. The preemption of spot instances can cause issues with running jobs like shuffle fetch failures, shuffle data loss, RDD data loss, and job failure.

To help address these issues, you can enable decommissioning on your compute. For more information, see Decommission spot instances.

Autoscaling

Note

Compute auto-scaling has limitations scaling down cluster size for Structured Streaming workloads. Databricks recommends using Delta Live Tables with Enhanced Autoscaling for streaming workloads. See What is Enhanced Autoscaling?.

Autoscaling allows compute to resize automatically based on workloads. Autoscaling can benefit many use cases and scenarios from both a cost and performance perspective, but it can be challenging to understand when and how to use autoscaling. The following are some considerations for determining whether to use autoscaling and how to get the most benefit:

  • Autoscaling typically reduces costs compared to a fixed-size compute.

  • Autoscaling workloads can run faster compared to an under-provisioned fixed-size compute.

  • Some workloads are not compatible with autoscaling compute, including spark-submit jobs and some Python packages.

  • With single-user all-purpose compute, users may find autoscaling is slowing down their development or analysis when the minimum number of workers is set too low. This is because the commands or queries they’re running are often several minutes apart, time in which the compute is idle and may scale down to save on costs. When the next command is executed, the compute manager will attempt to scale up, taking a few minutes while retrieving instances from the cloud provider. During this time, jobs might run with insufficient resources, slowing the time to retrieve results. While increasing the minimum number of workers helps, it also increases cost. This is another example where cost and performance need to be balanced.

  • If disk caching is being used, it’s important to remember that any cached data on a node is lost if that node is terminated. If retaining cached data is important for your workload, consider using a fixed-size compute.

  • If you have a job compute running an ETL workload, you can sometimes size your compute appropriately when tuning if you know your job is unlikely to change. However, autoscaling gives you flexibility if your data sizes increase. It’s also worth noting that optimized autoscaling can reduce expense with long-running jobs if there are long periods when the compute is underutilized or waiting on results from another process. Once again, though, your job may experience minor delays as the compute attempts to scale up appropriately. If you have tight SLAs for a job, a fixed-sized compute may be a better choice or consider using a Databricks pool to reduce compute start times.

Databricks also supports autoscaling local storage. With autoscaling local storage, Databricks monitors the amount of free disk space available on your compute’s Spark workers. If a worker begins to run low on disk, Databricks automatically attaches a new managed volume to the worker before it runs out of disk space.

Pools

Pool configuration reference reduce compute start and scale-up times by maintaining a set of available, ready-to-use instances. Databricks recommends taking advantage of pools to improve processing time while minimizing cost.

Databricks Runtime versions

Databricks recommends using the latest Databricks Runtime version for all-purpose compute. Using the most current version will ensure you have the latest optimizations and most up-to-date compatibility between your code and preloaded packages.

For job compute running operational workloads, consider using the Long Term Support (LTS) Databricks Runtime version. Using the LTS version will ensure you don’t run into compatibility issues and can thoroughly test your workload before upgrading. If you have an advanced use case around machine learning, consider the specialized Databricks Runtime version.

Compute policies

Databricks compute policies allow administrators to enforce controls over the creation and configuration of compute. Databricks recommends using compute policies to help apply the recommendations discussed in this guide.

Automatic termination

Many users won’t think to terminate their compute when they’re finished using them. Fortunately, compute automatically terminate after a set period, with a default of 120 minutes.

Administrators can change this default setting when creating compute policies. Decreasing this setting can lower cost by reducing the time that compute is idle. It’s important to remember that when a compute is terminated all state is lost, including all variables, temp tables, caches, functions, objects, and so forth. All of this state will need to be restored when the compute starts again. If a developer steps out for a 30-minute lunch break, it would be wasteful to spend that same amount of time to get a notebook back to the same state as before.

Important

Idle compute continue to accumulate DBU and cloud instance charges during the inactivity period before termination.

Garbage collection

While it may be less obvious than other considerations discussed in this article, paying attention to garbage collection can help optimize job performance on your compute. Providing a large amount of RAM can help jobs perform more efficiently but can also lead to delays during garbage collection.

To minimize the impact of long garbage collection sweeps, avoid deploying compute with large amounts of RAM configured for each instance. Having more RAM allocated to the executor will lead to longer garbage collection times. Instead, configure instances with smaller RAM sizes, and deploy more instances if you need more memory for your jobs. However, there are cases where fewer nodes with more RAM are recommended, for example, workloads that require a lot of shuffles, as discussed in compute sizing considerations.

Compute access control

You can configure two types of compute permissions:

  • The Unrestricted cluster creation permission controls the ability of users to create compute.

  • Compute-level permissions control the ability to use and modify a specific compute.

To learn more about configuring compute permissions, see compute access control.

You can create a compute if you have either compute create permissions or access to a compute policy, which allows you to create any compute within the policy’s specifications. The compute creator is the owner and has CAN MANAGE permissions, which will enable them to share it with any other user within the constraints of the data access permissions of the compute.

Understanding compute permissions and compute policies are important when deciding on compute configurations for common scenarios.

Compute tags

Compute tags allow you to easily monitor the cost of cloud resources used by different groups in your organization. You can specify tags as key-value strings when creating a compute, and Databricks applies these tags to cloud resources. See Tag enforcement with policies to learn more about tag enforcement.

Compute sizing considerations

Databricks runs one executor per worker node. Therefore the terms executor and worker are used interchangeably in the context of the Databricks architecture. People often think of compute size in terms of the number of workers, but there are other important factors to consider:

  • Total executor cores (compute): The total number of cores across all executors. This determines the maximum parallelism of a compute.

  • Total executor memory: The total amount of RAM across all executors. This determines how much data can be stored in memory before spilling it to disk.

  • Executor local storage: The type and amount of local disk storage. Local disk is primarily used in the case of spills during shuffles and caching.

Additional considerations include worker instance type and size, which also influence the factors above. When sizing your compute, consider:

  • How much data will your workload consume?

  • What’s the computational complexity of your workload?

  • Where are you reading data from?

  • How is the data partitioned in external storage?

  • How much parallelism do you need?

Answering these questions will help you determine optimal compute configurations based on workloads. For simple ETL style workloads that use narrow transformations only (transformations where each input partition will contribute to only one output partition), focus on a compute-optimized configuration. If you expect a lot of shuffles, then the amount of memory is important, as well as storage to account for data spills. Fewer large instances can reduce network I/O when transferring data between machines during shuffle-heavy workloads.

There’s a balancing act between the number of workers and the size of worker instance types. A compute with two workers, each with 40 cores and 100 GB of RAM, has the same compute and memory as an eight worker compute with 10 cores and 25 GB of RAM.

If you expect many re-reads of the same data, then your workloads may benefit from caching. Consider a storage optimized configuration with disk cache.

Compute sizing examples

The following examples show compute recommendations based on specific types of workloads. These examples also include configurations to avoid and why those configurations are not suitable for the workload types.

Data analysis

Data analysts typically perform processing requiring data from multiple partitions, leading to many shuffle operations. A compute with a smaller number of nodes can reduce the network and disk I/O needed to perform these shuffles. Compute A in the following diagram is likely the best choice, particularly for compute supporting a single analyst.

Cluster D will likely provide the worst performance since a larger number of nodes with less memory and storage will require more shuffling of data to complete the processing.

Data analysis compute sizing

Analytical workloads will likely require reading the same data repeatedly, so recommended worker types are storage optimized with disk cache enabled.

Additional features recommended for analytical workloads include:

  • Enable auto termination to ensure compute is terminated after a period of inactivity.

  • Consider enabling autoscaling based on the analyst’s typical workload.

  • Consider using pools, which will allow restricting compute to pre-approved instance types and ensure consistent compute configurations.

Features that are probably not useful:

  • Storage autoscaling, since this user will probably not produce a lot of data.

  • Shared compute, since this compute is for a single user.

Basic batch ETL

Simple batch ETL jobs that don’t require wide transformations, such as joins or aggregations, typically benefit from compute that are compute-optimized. For these types of workloads, any of the compute in the following diagram are likely acceptable.

Basic batch ETL compute sizing

Compute-optimized worker types are recommended; these will be cheaper, and these workloads will likely not require significant memory or storage.

Using a pool might provide a benefit for compute supporting simple ETL jobs by decreasing compute launch times and reducing total runtime when running job pipelines. However, since these types of workloads typically run as scheduled jobs where the compute runs only long enough to complete the job, using a pool might not be your best option.

The following features probably aren’t useful:

  • Disk caching, since re-reading data is not expected.

  • Auto termination probably isn’t required since these are likely scheduled jobs.

  • Autoscaling is not recommended since compute and storage should be pre-configured for the use case.

  • Shared compute is intended for multi-users and won’t benefit a compute running a single job.

Complex batch ETL

More complex ETL jobs, such as processing that requires unions and joins across multiple tables, will probably work best when you can minimize the amount of data shuffled. Since reducing the number of workers in a compute will help minimize shuffles, you should consider a smaller compute like Cluster A in the following diagram over a larger compute like Cluster D.

Complex ETL compute sizing

Complex transformations can be compute-intensive, so for some workloads reaching an optimal number of cores may require adding additional nodes to the compute.

Like simple ETL jobs, compute-optimized worker types are recommended; these will be cheaper, and these workloads will likely not require significant memory or storage. Also, like simple ETL jobs, the main compute feature to consider is pools to decrease compute launch times and reduce total runtime when running job pipelines.

The following features probably aren’t useful:

  • Disk caching, since re-reading data is not expected.

  • Auto termination probably isn’t required since these are likely scheduled jobs.

  • Autoscaling is not recommended since compute and storage should be pre-configured for the use case.

  • Shared compute is intended for multi-users and won’t benefit a compute running a single job.

Training machine learning models

Since initial iterations of training a machine learning model are often experimental, a smaller compute such as Cluster A is a good choice. A smaller compute will also reduce the impact of shuffles.

If stability is a concern, or for more advanced stages, a larger compute such as Cluster B or C may be a good choice.

A large compute such as Cluster D is not recommended due to the overhead of shuffling data between nodes.

Machine learning compute sizing

Recommended worker types are storage optimized with disk caching enabled to account for repeated reads of the same data and to enable caching of training data. If the compute and storage options provided by storage optimized nodes are not sufficient, consider GPU optimized nodes. A possible downside is the lack of disk caching support with these nodes.

Additional features recommended for machine learning workloads include:

  • Enable auto termination to ensure compute is terminated after a period of inactivity.

  • Use pools, which will allow restricting compute to pre-approved instance types and ensure consistent compute configurations.

Features that are probably not useful:

  • Autoscaling, since cached data can be lost when nodes are removed as a compute scales down. Additionally, typical machine learning jobs will often consume all available nodes, in which case autoscaling will provide no benefit.

  • Storage autoscaling, since this user will probably not produce a lot of data.

  • Shared compute, since this compute is for a single user.

Common scenarios

The following sections provide additional recommendations for configuring compute for common compute usage patterns:

  • Multiple users running data analysis and ad-hoc processing.

  • Specialized use cases like machine learning.

  • Support scheduled batch jobs.

Multi-user compute

Scenario

You need to provide multiple users access to data for running data analysis and ad-hoc queries. Compute usage might fluctuate over time, and most jobs are not very resource-intensive. The users mostly require read-only access to the data and want to perform analyses or create dashboards through a simple user interface.

The recommended approach for compute provisioning is a hybrid approach for node provisioning in the compute along with autoscaling. A hybrid approach involves defining the number of on-demand instances and spot instances for the compute and enabling autoscaling between the minimum and the maximum number of instances.

Multi-user scenario

This compute is always available and shared by the users belonging to a group by default. Enabling autoscaling allows the compute to scale up and down depending upon the load.

Users do not have access to start/stop the compute, but the initial on-demand instances are immediately available to respond to user queries. If the user query requires more capacity, autoscaling automatically provisions more nodes (mostly Spot instances) to accommodate the workload.

Databricks has other features to further improve multi-tenancy use cases:

This approach keeps the overall cost down by:

  • Using a shared compute model.

  • Using a mix of on-demand and spot instances.

  • Using autoscaling to avoid paying for underutilized compute.

Specialized workloads

Scenario

You need to provide compute for specialized use cases or teams within your organization, for example, data scientists running complex data exploration and machine learning algorithms. A typical pattern is that a user needs a compute for a short period to run their analysis.

The best approach for this kind of workload is to create compute policies with pre-defined configurations for default, fixed, and settings ranges. These settings might include the number of instances, instance types, spot versus on-demand instances, roles, libraries to be installed, and so forth. Using compute policies allows users with more advanced requirements to quickly spin up compute that they can configure as needed for their use case and enforce cost and compliance with policies.

Specialized workloads

This approach provides more control to users while maintaining the ability to keep cost under control by pre-defining compute configurations. This also allows you to configure compute for different groups of users with permissions to access different data sets.

One downside to this approach is that users have to work with administrators for any changes to compute, such as configuration, installed libraries, and so forth.

Batch workloads

Scenario

You need to provide compute for scheduled batch jobs, such as production ETL jobs that perform data preparation. The suggested best practice is to launch a new compute for each job run. Running each job on a new compute helps avoid failures and missed SLAs caused by other workloads running on a shared compute. Depending on the level of criticality for the job, you could use all on-demand instances to meet SLAs or balance between spot and on-demand instances for cost savings.

Scheduled batch workloads