This article covers best practices supporting principles of cost optimization, organized by principle.
Delta Lake comes with many performance improvements that can significantly speed up a workload (compared to using Parquet, ORC, and JSON). See Optimization recommendations on Databricks. If the workload also runs on a job cluster, this directly leads to a shorter runtime of the cluster and lower costs.
A job is a way to run non-interactive code in a Databricks cluster. For example, you can run an extract, transform, and load (ETL) workload interactively or on a schedule. Of course, you can also run jobs interactively in the notebook UI. However, on job clusters, the non-interactive workloads will cost significantly less than on all-purpose clusters. See the pricing overview to compare “Jobs Compute” and “All-Purpose Compute”.
An additional advantage is that every job or workflow runs on a new cluster, isolating workloads from one another.
Multitask workflows can reuse compute resources for all tasks, so that the cluster startup time only appears once per workflow. See Use Databricks compute with your jobs.
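As a rough illustration of compute reuse across tasks, the sketch below builds a job definition in the shape of the Jobs API 2.1, where several tasks point to one shared job cluster through `job_cluster_key`. The job name, notebook paths, runtime version, and node type are example values chosen for illustration and are not taken from this article.

```python
import json

# Hypothetical multitask job: both tasks reference the same job cluster,
# so the cluster is started once for the whole workflow.
job_spec = {
    "name": "nightly-etl",  # hypothetical job name
    "job_clusters": [
        {
            "job_cluster_key": "shared_etl_cluster",
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",  # example runtime version
                "node_type_id": "i3.xlarge",          # example instance type
                "num_workers": 4,
            },
        }
    ],
    "tasks": [
        {
            "task_key": "ingest",
            "job_cluster_key": "shared_etl_cluster",
            "notebook_task": {"notebook_path": "/Jobs/ingest"},  # hypothetical path
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],
            "job_cluster_key": "shared_etl_cluster",
            "notebook_task": {"notebook_path": "/Jobs/transform"},  # hypothetical path
        },
    ],
}

# Collect the distinct clusters that the tasks will run on.
clusters_used = {task["job_cluster_key"] for task in job_spec["tasks"]}
print(json.dumps(job_spec, indent=2))
```

Because every task resolves to the same `job_cluster_key`, the startup time is paid once rather than once per task.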
The Databricks platform provides different runtimes that are optimized for data engineering tasks (Databricks Runtime) or for machine learning (Databricks Runtime for Machine Learning). The runtimes are built to provide the best selection of libraries for the tasks and ensure that all provided libraries are up to date and work together optimally. Databricks Runtime is released on a regular cadence and offers performance improvements between major releases. These performance improvements often lead to cost savings due to more efficient usage of cluster resources.
Virtual machines with GPUs can dramatically speed up computational processes for deep learning, but have a significantly higher price than CPU-only machines. Use GPU instances only for workloads that have GPU-accelerated libraries.
Most workloads do not use GPU-accelerated libraries, so they do not benefit from GPU-enabled instances. Workspace admins can restrict GPU machines and clusters to prevent unnecessary use. See the blog post “Are GPUs Really Expensive? Benchmarking GPUs for Inference on Databricks Clusters”.
Spot instances use excess cloud virtual machine capacity that is available at a lower price. To save cost, Databricks supports creating clusters using spot instances. It is recommended to always have the first instance (the Spark driver) as an on-demand virtual machine. Spot instances are a good choice for workloads that can tolerate a longer runtime when one or more spot instances are evicted by the cloud provider.
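A minimal sketch of that recommendation on AWS, assuming the cluster is created through the Clusters API: `first_on_demand` set to 1 keeps the driver on an on-demand instance, and `availability` requests spot capacity for the remaining nodes. The helper name, runtime version, and node type are illustrative assumptions.

```python
def spot_cluster_spec(num_workers: int) -> dict:
    """Build a cluster spec with an on-demand driver and spot workers (AWS)."""
    return {
        "spark_version": "13.3.x-scala2.12",  # example runtime version
        "node_type_id": "i3.xlarge",          # example instance type
        "num_workers": num_workers,
        "aws_attributes": {
            # The first node (the driver) is placed on an on-demand instance.
            "first_on_demand": 1,
            # Remaining nodes use spot capacity, falling back to on-demand
            # instances when spot capacity cannot be acquired.
            "availability": "SPOT_WITH_FALLBACK",
        },
    }


spec = spot_cluster_spec(num_workers=8)
print(spec["aws_attributes"])
```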
Autoscaling allows your workloads to use the right amount of compute required to complete your jobs.
Compute autoscaling has limitations when scaling down cluster size for Structured Streaming workloads. Databricks recommends using Delta Live Tables with Enhanced Autoscaling for streaming workloads. See What is Enhanced Autoscaling?
Enable autoscaling for batch workloads.
Enable autoscaling for SQL warehouses.
Use Delta Live Tables Enhanced Autoscaling.
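For a batch workload, autoscaling is typically expressed by giving the cluster a worker range instead of a fixed worker count. The sketch below assumes the Clusters API shape, where an `autoscale` block with `min_workers` and `max_workers` takes the place of `num_workers`; the helper name and the concrete values are illustrative.

```python
def autoscaling_cluster_spec(min_workers: int, max_workers: int) -> dict:
    """Build a cluster spec that scales between min_workers and max_workers."""
    if not 1 <= min_workers <= max_workers:
        raise ValueError("expected 1 <= min_workers <= max_workers")
    return {
        "spark_version": "13.3.x-scala2.12",  # example runtime version
        "node_type_id": "i3.xlarge",          # example instance type
        # A worker range instead of a fixed "num_workers" value.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
    }


batch_spec = autoscaling_cluster_spec(min_workers=2, max_workers=8)
print(batch_spec["autoscale"])
```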
Databricks provides a number of features to help control costs by reducing idle resources and controlling when compute resources can be deployed.
Configure auto termination for all interactive clusters. After a specified idle time, the cluster shuts down. See Automatic termination.
For use cases where clusters are only needed during business hours, the clusters can be configured with auto termination, and a scheduled process can restart the cluster (and potentially prewarm data if required) in the morning before users are back at their desktops. See CACHE SELECT.
If a starting time that is significantly shorter than a full cluster start would be acceptable, consider using cluster pools. See Best practices: pools. Databricks pools reduce cluster start and auto-scaling times by maintaining a set of idle, ready-to-use instances. When a cluster is attached to a pool, cluster nodes are created using the pool’s idle instances. If the pool has no idle instances, the pool expands by allocating a new instance from the instance provider in order to accommodate the cluster’s request. When a cluster releases an instance, it returns to the pool and is free for another cluster to use. Only clusters attached to a pool can use that pool’s idle instances.
Databricks does not charge DBUs while instances are idle in the pool, resulting in cost savings. Instance provider billing does apply.
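The trade-off can be sketched with a pool definition and a small cost estimate. The field names follow the Instance Pools API as far as we are aware; the pool name, sizes, and the hourly VM price are hypothetical values used only to make the arithmetic concrete.

```python
# Hypothetical pool definition that keeps two warm instances available.
pool_spec = {
    "instance_pool_name": "shared-warm-pool",  # hypothetical name
    "node_type_id": "i3.xlarge",               # example instance type
    "min_idle_instances": 2,
    "max_capacity": 20,
    "idle_instance_autotermination_minutes": 30,
}


def idle_infrastructure_cost(idle_instances: int, hours: float,
                             vm_price_per_hour: float) -> float:
    """Cloud provider cost of idle pool instances (no DBUs are charged)."""
    return idle_instances * hours * vm_price_per_hour


# Assumed VM price of 0.50 per hour, for illustration only.
daily_idle_cost = idle_infrastructure_cost(
    pool_spec["min_idle_instances"], hours=24, vm_price_per_hour=0.5
)
print(daily_idle_cost)
```

Keeping `min_idle_instances` small limits the instance provider charges that accrue while the pool waits for clusters.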
Cluster policies can enforce many cost-specific restrictions for clusters. See Operational Excellence - Use cluster policies. For example:
Enable cluster autoscaling with a set minimum number of worker nodes.
Enable cluster auto termination with a reasonable value (for example, 1 hour) to avoid paying for idle times.
Ensure that only cost-efficient VM instances can be selected. Follow the best practices for cluster configuration (see Best practices: Cluster configuration).
Apply a spot instance strategy.
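The restrictions listed above could be combined into one policy definition roughly as follows. This is a sketch that assumes the cluster policy definition format (attribute paths mapped to `fixed`, `range`, or `allowlist` rules) on AWS; the limits and instance types are example values, not recommendations from this article.

```python
import json

# Hypothetical policy definition covering the four restrictions above.
policy_definition = {
    # Autoscaling with a set minimum number of workers.
    "autoscale.min_workers": {"type": "fixed", "value": 1},
    "autoscale.max_workers": {"type": "range", "maxValue": 10, "defaultValue": 4},
    # Auto termination after at most one hour of idle time.
    "autotermination_minutes": {"type": "range", "maxValue": 60, "defaultValue": 60},
    # Only selected instance types can be chosen (example values).
    "node_type_id": {
        "type": "allowlist",
        "values": ["i3.xlarge", "i3.2xlarge"],
        "defaultValue": "i3.xlarge",
    },
    # Spot strategy: on-demand driver, spot workers with fallback.
    "aws_attributes.first_on_demand": {"type": "fixed", "value": 1},
    "aws_attributes.availability": {"type": "fixed", "value": "SPOT_WITH_FALLBACK"},
}

# The definition is serialized to a JSON string before it is submitted.
policy_json = json.dumps(policy_definition, indent=2)
print(policy_json)
```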
The account console lets you view billable usage. As a Databricks account owner or account admin, you can also use the account console to download billable usage logs. To access this data programmatically, you can use the Account API to download the logs. Alternatively, you can configure daily delivery of billable usage logs in CSV file format to an AWS S3 storage bucket.
As a best practice, monitor the full costs (including VMs, storage, and network infrastructure). This can be done with cloud provider cost management tools or with third-party tools.
Photon provides extremely fast query performance at low cost – from data ingestion, ETL, streaming, data science and interactive queries – directly on your data lake. Photon is compatible with Apache Spark APIs, so getting started is as easy as turning it on – no code changes and no lock-in. Compared to Apache Spark, Photon provides an additional 2x speedup as measured by the TPC-DS 1TB benchmark. Customers have observed 3x–8x speedups on average, based on their workloads, compared to the latest DBR versions.
From a cost perspective, Photon workloads use about 2x–3x more DBUs per hour than Spark workloads. Given the observed speedup, this could lead to significant cost savings. Evaluate jobs that run regularly to determine whether they are not only faster but also cheaper with Photon.
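Whether Photon is cheaper for a given job depends on how the higher DBU rate compares with the shorter runtime. The following sketch is a simplified model, not an official pricing formula: it assumes cost scales linearly with runtime and that only the DBU portion of the hourly cost is multiplied. The function name and the `dbu_share` parameter are assumptions introduced for illustration.

```python
def relative_photon_cost(dbu_multiplier: float, speedup: float,
                         dbu_share: float = 1.0) -> float:
    """Cost of a Photon run relative to the same run without Photon.

    dbu_multiplier: how many times more DBUs per hour Photon consumes.
    speedup:        how many times faster the job finishes with Photon.
    dbu_share:      fraction of the baseline hourly cost that is DBUs;
                    the remainder is treated as unchanged VM cost.
    A result below 1.0 means the Photon run is cheaper.
    """
    hourly_factor = dbu_share * dbu_multiplier + (1.0 - dbu_share)
    return hourly_factor / speedup


# 2x the DBUs per hour with a 4x speedup: half the DBU cost.
print(relative_photon_cost(dbu_multiplier=2, speedup=4))
# 3x the DBUs per hour with only a 2x speedup: 50% more expensive.
print(relative_photon_cost(dbu_multiplier=3, speedup=2))
```

Under this model, the break-even point for DBU cost alone is a speedup equal to the DBU multiplier.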
BI workloads typically use data in bursts and generate multiple concurrent queries. For example, someone using a BI tool might update a dashboard, write a query, or simply analyze query results without interacting further with the platform. This example demonstrates two requirements:
Terminate clusters during idle periods to save costs.
Have compute resources available quickly (for both start-up and scale-up) to satisfy user queries when they request new or updated data with the BI tool.
Non-serverless Databricks SQL warehouses have a startup time of minutes, so many users tend to accept the higher cost and do not terminate them during idle periods. On the other hand, serverless SQL warehouses start and scale up in seconds, so both immediate availability and termination during idle times can be achieved. This results in a great user experience and overall cost savings.
Additionally, serverless SQL warehouses scale down earlier than non-serverless warehouses, resulting in lower costs.
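As an illustration, a serverless warehouse with a short idle timeout might be described as follows. The field names reflect the SQL Warehouses API as far as we are aware; the warehouse name, size, scaling range, and timeout are example values.

```python
# Hypothetical SQL warehouse definition for bursty BI workloads.
warehouse_spec = {
    "name": "bi-dashboards",            # hypothetical name
    "cluster_size": "Small",            # example size
    "enable_serverless_compute": True,  # fast start-up and scale-up
    "auto_stop_mins": 10,               # stop after 10 idle minutes
    "min_num_clusters": 1,              # scale between 1 and 4 clusters
    "max_num_clusters": 4,              # to absorb concurrent queries
}

print(warehouse_spec)
```

A short `auto_stop_mins` value is practical here because a serverless warehouse restarts in seconds when the next query arrives.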
To monitor cost and accurately attribute Databricks usage to your organization’s business units and teams (for example, for chargebacks), you can tag clusters and pools. These tags propagate to detailed DBU usage reports and to cloud provider VMs and blob storage instances for cost analysis.
Ensure that cost control and attribution are already in mind when setting up workspaces and clusters for teams and use cases. This streamlines tagging and improves the accuracy of cost attributions.
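A small sketch of tag-based attribution: a cluster spec carries `custom_tags`, and usage is then summed per tag value for chargeback. The tag keys, the usage records, and their field names are hypothetical; real usage reports have their own schema.

```python
from collections import defaultdict

# Cluster spec fragment with tags identifying the owning team.
cluster_spec = {
    "cluster_name": "finance-etl",  # hypothetical name
    "custom_tags": {"team": "finance", "cost_center": "cc-1234"},
}

# Hypothetical usage records; not the actual billable usage log schema.
usage_records = [
    {"tags": {"team": "finance"}, "dbus": 10.0},
    {"tags": {"team": "marketing"}, "dbus": 4.0},
    {"tags": {"team": "finance"}, "dbus": 5.5},
    {"tags": {}, "dbus": 2.0},  # untagged usage
]


def dbus_by_tag(records: list, tag_key: str) -> dict:
    """Sum DBUs per value of tag_key; untagged usage goes to 'untagged'."""
    totals = defaultdict(float)
    for record in records:
        totals[record["tags"].get(tag_key, "untagged")] += record["dbus"]
    return dict(totals)


team_totals = dbus_by_tag(usage_records, "team")
print(team_totals)
```

The `untagged` bucket makes gaps in tagging visible, which is one reason to plan tags when workspaces and clusters are first set up.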
For the overall costs, DBU, virtual machine, disk, and any associated network costs must be considered. For serverless SQL warehouses this is simpler since the DBU costs already include virtual machine and disk costs.
Traditionally, when people think about streaming, terms such as “real-time,” “24/7,” or “always on” come to mind. If data ingestion happens in “real-time”, the underlying cluster needs to run 24/7, producing consumption costs every single hour of the day.
However, not every use case that is based on a continuous stream of events needs these events to be added to the analytics data set immediately. If the business requirement for the use case only needs fresh data every few hours or every day, then this requirement can be achieved with only several runs a day, leading to a significant cost reduction for the workload. Databricks recommends using Structured Streaming with trigger AvailableNow for incremental workloads that do not have low latency requirements. See Configuring incremental batch processing.
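A minimal PySpark sketch of this pattern, assuming Delta tables as source and sink: the query processes everything that has arrived since the last checkpoint and then stops, so it can run on a scheduled job cluster instead of an always-on one. The function name and all paths are hypothetical, and `spark` is assumed to be an existing `SparkSession`.

```python
def run_incremental_batch(spark, source_path: str, target_path: str,
                          checkpoint_path: str):
    """Process all data available now, then stop the stream."""
    stream = spark.readStream.format("delta").load(source_path)
    query = (
        stream.writeStream.format("delta")
        # The checkpoint records progress so each run only picks up new data.
        .option("checkpointLocation", checkpoint_path)
        # Consume what is currently available, then terminate.
        .trigger(availableNow=True)
        .start(target_path)
    )
    # Block until the available data has been processed.
    query.awaitTermination()
    return query
```

Scheduling this function a few times a day keeps the streaming bookkeeping (checkpoints, exactly-once sink writes) while paying for compute only during each run.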
Databricks runs one executor per worker node. Therefore, the terms executor and worker are used interchangeably in the context of the Databricks architecture. People often think of cluster size in terms of the number of workers, but there are other important factors to consider:
Total executor cores (compute): The total number of cores across all executors. This determines the maximum parallelism of a cluster.
Total executor memory: The total amount of RAM across all executors. This determines how much data can be stored in memory before spilling it to disk.
Executor local storage: The type and amount of local disk storage. Local disk is primarily used in the case of spills during shuffles and caching.
Additional considerations include worker instance type and size, which also influence the preceding factors. When sizing your cluster, consider the following:
How much data will your workload consume?
What’s the computational complexity of your workload?
Where are you reading data from?
How is the data partitioned in external storage?
How much parallelism do you need?
Details and examples can be found under Cluster sizing considerations.
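To make the sizing factors concrete, the sketch below compares two hypothetical cluster shapes by total executor cores and total executor memory. The class name and the per-worker figures are illustrative assumptions, not the specifications of any particular instance type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterShape:
    workers: int
    cores_per_worker: int
    memory_gb_per_worker: int

    @property
    def total_cores(self) -> int:
        """Total executor cores: the upper bound on task parallelism."""
        return self.workers * self.cores_per_worker

    @property
    def total_memory_gb(self) -> int:
        """Total executor memory available before spilling to disk."""
        return self.workers * self.memory_gb_per_worker


# Two shapes with different worker counts but identical totals.
few_large = ClusterShape(workers=2, cores_per_worker=16, memory_gb_per_worker=128)
many_small = ClusterShape(workers=8, cores_per_worker=4, memory_gb_per_worker=32)

print(few_large.total_cores, few_large.total_memory_gb)
print(many_small.total_cores, many_small.total_memory_gb)
```

Both shapes offer the same total cores and memory, which is why the number of workers alone is an incomplete measure of cluster size; the questions above help decide which shape fits a given workload.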