This article describes suggested best practices under various scenarios for Databricks cluster usage and allocation on AWS cloud infrastructure. The practices suggested here balance usability and cost management.
This section introduces cluster usage terminology and Databricks features for managing cluster usage.
On most cloud providers, one instance running for one hour counts as one instance hour.
Databricks charges for usage based on Databricks Unit (DBU), a unit of processing capability per hour.
One DBU is the processing capability of one r3.xlarge (memory optimized) or one c3.2xlarge (compute optimized) AWS instance running for one hour.
More details can be found at Databricks Pricing.
For example, if you create a Databricks cluster with one driver node and 3 worker nodes of type r3.xlarge and run the cluster for 2 hours, you compute the cost as follows:
- Total instance hours = total number of nodes (1 + 3) * number of hours (2) = 8
- Total charge = AWS cost for 8 instance hours of r3.xlarge + cost of 8 DBUs
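As a sanity check, the arithmetic above can be scripted. This is a minimal sketch; the per-instance-hour and per-DBU prices below are placeholders, not actual AWS or Databricks rates.

```python
def total_charge(num_drivers, num_workers, hours, aws_price_per_hour, dbu_price):
    """Return (instance_hours, total_cost) for a uniform cluster where
    each instance hour consumes one DBU (as with r3.xlarge)."""
    instance_hours = (num_drivers + num_workers) * hours
    return instance_hours, instance_hours * (aws_price_per_hour + dbu_price)

# 1 driver + 3 workers running for 2 hours = 8 instance hours.
# Prices here are hypothetical; look up current AWS and Databricks rates.
hours_used, cost = total_charge(1, 3, 2, aws_price_per_hour=0.333, dbu_price=0.40)
print(hours_used)  # 8
```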
Databricks bills by the second and charges fractional DBU costs.
Amazon offers two tiers of EC2 instances: on-demand and spot. For on-demand instances you pay for compute capacity by the second with no long-term commitments. Spot instances let you use spare Amazon EC2 computing capacity and choose the maximum price you are willing to pay. Spot pricing changes in real time based on supply and demand for AWS compute capacity. If the current spot market price rises above your maximum spot price, the spot instances are terminated. Because spot instances are often available at a discount compared to on-demand pricing, for the same budget you can significantly reduce the cost of running your applications, grow your application's compute capacity, and increase throughput.
Databricks supports creating clusters using a combination of on-demand and spot instances (with custom spot price) allowing you to tailor your cluster according to your use cases.
For example, you can configure a cluster so that the driver node and 4 worker nodes launch as on-demand instances and the remaining 4 workers launch as spot instances (with a maximum spot price of 100% of the on-demand price).
We recommend launching the cluster so that the Spark driver runs on an on-demand instance, which preserves the state of the cluster even if spot instance nodes are lost. If you choose to use all spot instances (including the driver), any cached data or tables are deleted when you lose the driver instance due to changes in the spot market.
Another important setting is the Spot fall back to On-demand option. If you are running a hybrid cluster (that is, a mix of on-demand and spot instances) and spot instance acquisition fails or you lose the spot instances, Databricks falls back to on-demand instances to provide the desired capacity. Without this option, you lose the capacity provided by the spot instances, causing delays or failures in your workload. We recommend setting the mix of on-demand and spot instances in your cluster based on the criticality of each use case, its tolerance to delays and failures from instance loss, and its cost sensitivity.
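To make the hybrid setup concrete, here is a sketch of a cluster spec in the shape of a Databricks Clusters API request body: driver plus first 4 workers on-demand, remaining 4 workers on spot, with fallback to on-demand. The cluster name and Spark version are hypothetical, and the `aws_attributes` field names should be checked against the API version in your workspace.

```python
# Sketch of a Clusters API-style request body for a hybrid cluster.
hybrid_cluster = {
    "cluster_name": "shared-analytics",      # hypothetical name
    "spark_version": "7.3.x-scala2.12",      # hypothetical version
    "node_type_id": "r3.xlarge",
    "num_workers": 8,
    "aws_attributes": {
        # first_on_demand counts the driver, so 5 => driver + 4 workers on-demand
        "first_on_demand": 5,
        # fall back to on-demand if spot capacity is unavailable
        "availability": "SPOT_WITH_FALLBACK",
        # maximum spot price as a percentage of the on-demand price
        "spot_bid_price_percent": 100,
    },
}
```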
You can use the Amazon Spot Instance Advisor to determine a suitable price for your instance type and region.
Autoscaling automatically adds and removes worker nodes in response to changing workloads to optimize resource usage. With autoscaling enabled, Databricks automatically chooses the appropriate number of workers required to run your Spark job. Autoscaling makes it easier to achieve high cluster utilization because you do not need to provision the cluster to exactly match workloads. This offers two advantages:
- Workloads can run faster than on a constant-sized, under-provisioned cluster.
- You can reduce overall costs compared to a statically sized cluster.
When you enable autoscaling, you can specify the minimum and maximum number of nodes in the cluster.
For example, a typical configuration enables autoscaling with the cluster varying from 8 to 20 nodes, of which 5 (including the driver node) are on-demand instances and the rest spot instances.
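In API terms, the 8-20 node range maps to an `autoscale` block on the cluster spec. A sketch, under the assumption that `min_workers`/`max_workers` count workers only (the driver is additional); the name and Spark version are hypothetical.

```python
# Sketch of a Clusters API-style request body with autoscaling enabled.
autoscaling_cluster = {
    "cluster_name": "shared-autoscaling",    # hypothetical name
    "spark_version": "7.3.x-scala2.12",      # hypothetical version
    "node_type_id": "r3.xlarge",
    # scale between 8 and 20 workers depending on load
    "autoscale": {"min_workers": 8, "max_workers": 20},
    "aws_attributes": {
        "first_on_demand": 5,                # driver + 4 workers on-demand
        "availability": "SPOT_WITH_FALLBACK",
    },
}
```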
The Start cluster feature allows restarting previously terminated clusters while retaining their original configuration (cluster ID, number of instances, type of instances, spot versus on-demand mix, instance profile, libraries to be installed, and so on). You can restart a cluster:
- In the terminated cluster list
- In the cluster detail page
Auto termination lets you define an idle condition (say, idle for three hours) that terminates a cluster automatically for cost savings.
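The idle condition itself is a single field on the cluster spec. A sketch, assuming the `autotermination_minutes` field of the Clusters API (180 minutes matches the three-hour example above); the cluster name is hypothetical.

```python
# Sketch of a cluster spec fragment with auto termination enabled.
exploration_cluster = {
    "cluster_name": "ds-exploration",    # hypothetical name
    "node_type_id": "r3.xlarge",
    "num_workers": 4,
    # terminate automatically after 3 idle hours (180 minutes)
    "autotermination_minutes": 180,
}
```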
This section delves deeper into specific scenarios and establishes the best approach to finding a balance between usability and cost effectiveness.
This table lists the most common scenarios for cluster configuration within Databricks. Depending on your use case and who uses Databricks, your configuration may vary slightly. In general, data scientists tend to be more comfortable managing their own clusters than data analysts. If admins prefer to impose stricter limits on cluster size, they can choose to provision clusters for data scientists. Furthermore, data engineers are most likely to launch jobs via the REST APIs or through the UI. If they launch jobs through the UI, we recommend using the New Cluster option, especially for production-level jobs.
| Scenario | Use Case | Cluster Recommendation | Cluster Managed By | Intended Audience |
| --- | --- | --- | --- | --- |
| Scenario 1 | Generic usage across organization | Shared autoscaling cluster | Admin | Data analyst |
| Scenario 2 | Ad-hoc usage with specific use cases | Individual clusters | Admin or user | Data scientist |
| Scenario 3 | Scheduled batch workload | Launch new cluster via job | User | Data engineer |
| Scenario 4 | Cloning clusters | Launch new cluster via job | Admin | Administrator |
Scenario 1: Generic usage across the organization (data analysts running ad-hoc queries)
Suppose you need to provide a large group of users access to data for running ad-hoc queries (mostly SQL based). Cluster usage varies a lot from day to day, and very few jobs are highly intensive. The users mostly have read-only access to the data and most likely want to create dashboards from different datasets through an easy and intuitive notebook interface.
The best approach for cluster provisioning under these circumstances is a hybrid approach to node provisioning combined with autoscaling. A hybrid approach involves defining the number of on-demand and spot instances that make up the cluster and then enabling autoscaling between a minimum and maximum number of instances. This cluster is always on (24/7), shared by default among the users belonging to the group, and scales up and down depending on load. The users do not have access to start or stop the cluster.
The initial on-demand instances are around to immediately respond to user queries for better usability. If the user query requires more capacity, autoscaling kicks in and provisions more nodes (mostly spot instances) to deal with the workload.
Databricks has other features to further improve this use case around multi-tenancy:
- Handling large queries in interactive workflows - Automatically manages queries that will never finish.
- Task preemption for high concurrency - Improves how long running jobs and shorter jobs work together.
- Autoscaling local storage - Prevents running out of storage space (shuffle) in a multi-tenant environment.
This approach keeps the overall cost down by:
- Using a shared cluster model.
- Using a mix of on-demand and spot instances.
- Using autoscaling instead of a fixed-size cluster, to avoid paying for underutilized cluster time.
Scenario 2: Specialized use case or user groups within the organization (data scientists running explorations)
This scenario is for specialized use cases and groups within the organization, for example, data scientists running intensive explorations and machine learning algorithms that require special libraries installed on the cluster. A typical user runs intensive data operations for a short period of time and then wants to tear down the cluster.
The best approach for this kind of workload is to have the Databricks admin create a cluster with a pre-defined configuration (number of instances, type of instances, spot versus on-demand mix, instance profile, libraries to be installed, and so on) while allowing users to start and stop the cluster using the Start Cluster feature. For additional cost savings, the administrator can also enable the Auto Termination feature for these clusters based on an idle condition (for example, terminate the cluster if it is idle for more than an hour).
This approach gives users more control over spinning up clusters while keeping cost under control through pre-defined configurations. For example, the administrator could decide to use a 100% spot instance cluster with auto termination enabled for data exploration use cases. This approach also allows the admin to configure different clusters for different groups of users, with separate access permissions to different sets of data using instance profiles and AWS keys.
One downside to this approach is that users must involve the administrator for any changes to configuration, libraries, and so on, to the clusters.
Scenario 3: Scheduled batch workloads (data engineers running jobs)
This scenario involves running batch job JARs and notebooks on a regular cadence through the Databricks platform.
The suggested best practice is to launch a new cluster for each run of critical jobs. This helps avoid any issues (failures, missing SLA, and so on) due to an existing workload (noisy neighbor) on a shared cluster. Depending on the level of criticality for the job you could go full on-demand (to meet SLAs) or even balance between spot and on-demand instances (with Spot fall back to On-demand option enabled for the cluster) for some cost savings.
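The "new cluster per run" pattern can be expressed as a job spec that embeds a `new_cluster` block rather than pointing at an existing cluster. A sketch in the shape of a Databricks Jobs API request body; the job name, notebook path, and schedule are hypothetical, and a fully on-demand mix is chosen here to protect the SLA.

```python
# Sketch of a Jobs API-style request body that launches a new cluster per run.
job_spec = {
    "name": "nightly-etl",                       # hypothetical job name
    "new_cluster": {
        "spark_version": "7.3.x-scala2.12",      # hypothetical version
        "node_type_id": "r3.xlarge",
        "num_workers": 8,
        "aws_attributes": {
            # fully on-demand (driver + 8 workers) for a critical job
            "first_on_demand": 9,
            "availability": "SPOT_WITH_FALLBACK",
        },
    },
    "notebook_task": {"notebook_path": "/Jobs/nightly_etl"},  # hypothetical path
    # run nightly at 02:00 UTC (Quartz cron syntax)
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
}
```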
For non-critical jobs, for example dashboard-type jobs, you could use a shared cluster instead of provisioning a new cluster for each run.
Scenario 4: Cloning clusters
The ability to clone clusters provides a convenient way for administrators to create a duplicate of an existing cluster, retaining the same configuration, libraries, and so on. This allows administrators to quickly provision identical clusters for each user or user group with similar configuration requirements. The feature provides a pseudo-templating capability and, over time, makes maintaining the configurations for different users and groups more convenient.
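If you want to script the cloning pattern, one approach is to read an existing cluster's spec with the `clusters/get` endpoint and resubmit its configuration fields to `clusters/create` under a new name. This is a sketch: the workspace URL and token are placeholders, and the exact set of fields worth copying may vary across API versions.

```python
import json
import urllib.request

HOST = "https://example.cloud.databricks.com"   # placeholder workspace URL
TOKEN = "dapi-REDACTED"                         # placeholder access token

# Fields that define the configuration, as opposed to runtime state
# such as cluster_id, state, or start time.
CONFIG_FIELDS = {"spark_version", "node_type_id", "num_workers", "autoscale",
                 "aws_attributes", "autotermination_minutes", "spark_conf"}

def config_from(spec, new_name):
    """Extract a reusable create-cluster payload from a clusters/get response."""
    payload = {k: v for k, v in spec.items() if k in CONFIG_FIELDS}
    payload["cluster_name"] = new_name
    return payload

def api(path, payload=None):
    """POST payload (or GET if None) to the workspace REST API."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(HOST + path, data=data,
                                 headers={"Authorization": f"Bearer {TOKEN}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def clone_cluster(cluster_id, new_name):
    spec = api(f"/api/2.0/clusters/get?cluster_id={cluster_id}")
    return api("/api/2.0/clusters/create", config_from(spec, new_name))
```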
- Amazon EC2 On-Demand Pricing
- Databricks Serverless: Next Generation Resource Management for Apache Spark
- Query Watchdog: Handling Disruptive Queries in Spark SQL
- Running Apache Spark Clusters with Spot Instances in Databricks
- Persistent Clusters: Simplifying Cluster Management for Analytics
- Best Practices for Coarse Grained Data Security in Databricks