Best practices for cost optimization

This article covers best practices supporting principles of cost optimization, organized by principle.

1. Choose optimal resources

Use performance optimized data formats

To get the most out of the Databricks Data Intelligence Platform, you must use Delta Lake as your storage framework. It helps build simpler and more reliable ETL pipelines, and comes with many performance enhancements that can significantly speed up workloads compared to using Parquet, ORC, and JSON. See Optimization recommendations on Databricks. If the workload is also running on a job compute, this directly translates into shorter uptime of compute resources leading to lower costs.

Use job compute

A job is a way to run non-interactive code on a Databricks compute instance. For example, you can run an extract, transform, and load (ETL) workload interactively or on a schedule. Of course, you can also run jobs interactively in the notebook UI. However, on job compute, the non-interactive workloads will cost significantly less than on all-purpose compute. See the pricing overview to compare Jobs Compute and All-Purpose Compute.

An additional benefit for some jobs is that each job or workflow can run on a new compute instance, isolating workloads from each other. However, multitask workflows can also reuse compute resources for all tasks, so the compute startup time occurs only once per workflow. See Configure compute for jobs.

Use SQL warehouse for SQL workloads

For interactive SQL workloads, a Databricks SQL warehouse is the most cost-efficient engine. See the pricing overview. All SQL warehouses come with Photon by default, which accelerates your existing SQL and DataFrame API calls and reduces your overall cost per workload.

In addition, serverless SQL warehouses support intelligent workload management (IWM), a set of features that enhances Databricks SQL serverless ability to process large numbers of queries quickly and cost-effectively.

Use up-to-date runtimes for your workloads

The Databricks platform provides different runtimes that are optimized for data engineering tasks (Databricks Runtime) or machine learning tasks (Databricks Runtime for Machine Learning). The runtimes are built to provide the best selection of libraries for the tasks, and to ensure that all libraries provided are up-to-date and work together optimally. The Databricks Runtimes are released on a regular cadence, providing performance improvements between major releases. These performance improvements often result in cost savings due to more efficient use of compute resources.

Only use GPUs for the right workloads

Virtual machines with GPUs can dramatically speed up computations for deep learning, but are significantly more expensive than CPU-only machines. Use GPU instances only for workloads that have GPU-accelerated libraries.

Most workloads do not use GPU-accelerated libraries, so they do not benefit from GPU-enabled instances. Workspace administrators can restrict GPU machines and compute resources to prevent unnecessary usage. See the blog post “Are GPUs Really Expensive? Benchmarking GPUs for Inference on Databricks Clusters”.

Use serverless services for your workloads

BI use cases

BI workloads typically consume data in bursts and generate multiple concurrent queries. For example, someone using a BI tool might update a dashboard or write a query and then simply analyze the results without further interaction with the platform. In this scenario the data platform:

Terminates idle compute resources to save costs.
Quickly provides the compute resources when the user requests new or updated data with the BI tool.

Non-serverless Databricks SQL warehouses have a startup time of minutes, so many users tend to accept the higher cost and do not terminate them during idle periods. On the other hand, serverless SQL warehouses start and scale up in seconds, so both instant availability and idle termination can be achieved. This results in a great user experience and overall cost savings.

Additionally, serverless SQL warehouses scale down earlier than non-serverless warehouses, again, resulting in lower costs.

ML and AI model serving

Most models are served as a REST API for integration into your web or client application; the model serving service receives varying loads of requests over time, and a model serving platform should always provide sufficient resources, but only as many as are actually needed (upscaling and downscaling).

Mosaic AI Model Serving uses serverless compute and provides a highly available and low latency service for deploying models. The service automatically scales up or down to meet changes in demand, reducing infrastructure costs while optimizing latency performance.

Use the right instance type

Using the latest generation of cloud instance types almost always provides performance benefits, as they offer the best performance and the latest features.

For example, Graviton2-based Amazon EC2 instances can deliver a significantly better price-performance than comparable Amazon EC2 instances.

Based on your workloads, it is also important to choose the right instance family to get the best performance/price ratio. Some simple rules of thumb are:

Memory optimized for ML, heavy shuffle and spill workloads
Compute optimized for structured streaming workloads and maintenance jobs (such as optimize and vacuum)
Storage optimized for workloads that benefit from caching, such as ad-hoc and interactive data analysis
GPU optimized for specific ML and DL workloads
General purpose in the absence of specific requirements

Choose the most efficient compute size

Databricks runs one executor per worker node. Therefore, the terms executor and worker are used interchangeably in the context of the Databricks architecture. People often think of cluster size in terms of the number of workers, but there are other important factors to consider:

Total executor cores (compute): The total number of cores across all executors. This determines the maximum parallelism of a compute instance.
Total executor memory: The total amount of RAM across all executors. This determines how much data can be stored in memory before spilling it to disk.
Executor local storage: The type and amount of local disk storage. Local disk is primarily used in the case of spills during shuffles and caching.

Additional considerations include worker instance type and size, which also influence the preceding factors. When sizing your compute, consider the following:

How much data will your workload consume?
What’s the computational complexity of your workload?
Where are you reading data from?
How is the data partitioned in external storage?
How much parallelism do you need?

Details and examples can be found under Compute sizing considerations.

Evaluate performance-optimized query engines

Photon is a high-performance Databricks-native vectorized query engine that speeds up your SQL workloads and DataFrame API calls (for data ingestion, ETL, streaming, data science, and interactive queries). Photon is compatible with Apache Spark APIs, so getting started is as easy as turning it on – no code changes and no lock-in.

The observed speedup can lead to significant cost savings, and jobs that run regularly should be evaluated to see whether they are not only faster but also cheaper with Photon.

2. Dynamically allocate resources

Use auto-scaling compute

With autoscaling, Databricks dynamically reallocates workers to account for the characteristics of your job. Certain parts of your pipeline may be more computationally intensive than others, and Databricks automatically adds additional workers during those phases of your job (and removes them when they’re no longer needed). Autoscaling can reduce overall costs compared to a statically sized compute instance.

Compute auto-scaling has limitations when scaling down cluster size for structured streaming workloads. Databricks recommends using DLT with enhanced autoscaling for streaming workloads.

Use auto termination

Databricks provides several features to help control costs by reducing idle resources and controlling when compute resources can be deployed.

Configure auto termination for all interactive compute resources. After a specified idle time, the compute resource shuts down. See Automatic termination.
For use cases where compute is needed only during business hours, compute resources can be configured with auto termination, and a scheduled process can restart compute (and possibly prewarm data if necessary) in the morning before users are back at their desktops. See CACHE SELECT.
If compute startup times are too long, consider using cluster pools, see Pool best practices. Databricks pools are a set of idle, ready-to-use instances. When cluster nodes are created using the idle instances, cluster start and auto-scaling times are reduced. If the pools have no idle instances, the pools expand by allocating a new instance from the instance provider in order to accommodate the cluster’s request.

Databricks does not charge Databricks Units (DBUs) while instances are idle in the pool, resulting in cost savings. Instance provider billing does apply.

Use compute policies to control costs

Compute policies can enforce many cost-specific restrictions for compute resources. See Operational Excellence - Use compute policies. For example:

Enable cluster autoscaling with a set minimum number of worker nodes.
Enable cluster auto termination with a reasonable value (for example, 1 hour) to avoid paying for idle times.
Ensure that only cost-efficient VM instances can be selected. Follow the best practices for cluster configuration. See Compute configuration recommendations.
Apply a spot instance strategy.

3. Monitor and control cost

Cost management in Databricks is a critical aspect of optimizing cloud spending while maintaining performance. The process can be broken down into three key areas:

Setup
Monitoring
Management

The following best practices cover these three areas.

Setup tagging for cost attribution

To monitor costs in general and to accurately attribute Databricks usage to your organization’s business units and teams (for example, for chargebacks in your organization), you can tag workspaces, clusters, SQL warehouses, and pools.

In the setup phase, organizations should implement effective tagging practices. This involves creating tag naming conventions across the organization. It is important to use both general tags that attribute usage to specific user groups and more granular tags that provide highly specific insights, for example based on roles, products, or services.

Start tagging from the very beginning of using Databricks. In addition to the default tags set by Databricks, as a minimum, set up the custom tags _Business Units_ and _Projects_ and populate them for your specific organization. If you need to differentiate between development, quality assurance, and production costs, consider adding the tag Environment to workspaces and compute resources.

The tags propagate both to usage logs and to cloud provider resources for cost analysis. Total costs include Databricks Units (DBUs) plus virtual machine, disk, and associated network costs. Note that for serverless services, the DBU cost already includes the virtual machine costs.

Since adding tags only affects future usage, it is better to start with a more detailed tagging structure. It is always possible to ignore tags if practical use over time shows that they have no impact on cost understanding and attribution. But missing tags can't be added to past events.

Set up budgets and alerts to enable monitoring of account spending

Budgets allow you to monitor usage across your entire account. They provide a way to set financial targets and allow you to track either account-wide spending or apply filters to track the spending of specific teams, projects or workspaces. If your account uses serverless compute, be sure to use budget policies to attribute your account’s serverless usage. See Attribute serverless usage with budget policies.

It is recommended that you set up email notifications when the monthly budget is reached to avoid unexpected overspends.

Monitor costs to align spending with expectations

Cost observability dashboards help to visualize spending patterns, and budget policies help attribute serverless compute usage to specific users, groups, or projects, enabling more accurate cost allocation. To stay on top of spending, Databricks offers a range of tools and features to track and analyze costs:

Monitor usage in the account console: Databricks offers cost management AI/BI dashboards in the account console, that can be imported by account admins to any Unity Catalog-enabled workspace in their account. This allows you to monitor either the account usage or a single workspace usage.
Use budgets to monitor account spending: Budgets enable you to monitor usage across your account.
Budget policies can be used to attribute serverless usage by applying tags to any serverless compute activity incurred by a user assigned to the policy.
Monitor and manage Delta Sharing egress costs: Unlike other data sharing platforms, Delta Sharing does not require data replication. This model has many advantages, but it means that your cloud vendor may charge data egress fees when you share data across clouds or regions. See Monitor and manage Delta Sharing egress costs (for providers) to monitor and manage egress charges.
Monitor costs using system tables: The system table system.billing.usage allows to monitor costs. Custom tags applied to workspaces and compute resources are propagated to this system table. You can monitor the costs of serverless compute, job costs, and model serving costs.

Download billable usage for local analysis: You can use the Account REST API to download the billable usage logs in CSV format for the specified account and date range.

Manage costs to align usage with organizational needs

Cost management goes beyond technical implementation to include broader organisational strategies:

Develop and schedule a housekeeping job to (incrementally) apply or clean up tags. The job needs to be resilient to not get interrupted by issues of single resources. All changes should be written to an audit log.
Conduct regular cost audits to review all active resources, their spend and their alignment with organisational needs. Sharing monthly cost reports helps track consumption increases and anomalies, and encourages proactive cost management across all teams.
Optimize resource allocation through strategies such as autoscaling and auto-termination, which dynamically allocate resources based on workload requirements; see the other best practices in this chapter.
Educate teams on the cost implications of their resource usage and trained on best practices for cost optimisation.
Use compute policies as a tool to control the type and size of compute resources certain users can create and access.

Overall, cost optimisation needs to be seen as an ongoing process and strategies need to be revisited regularly in the event of scaling, new projects or unexpected cost spikes. Use both Databricks' native cost management capabilities and third-party tools for comprehensive control and optimisation.

4. Design cost-effective workloads

Balance always-on and triggered streaming

Traditionally, when people think about streaming, terms such as “real-time”, “24/7,” or “always on” come to mind. If data ingestion happens in real-time, the underlying compute resources must run 24/7, incurring costs every single hour of the day.

However, not every use case that relies on a continuous stream of events requires those events to be immediately added to the analytics data set. If the business requirement for the use case only requires fresh data every few hours or every day, then that requirement can be met with only a few runs per day, resulting in a significant reduction in workload cost. Databricks recommends using Structured Streaming with the AvailableNow trigger for incremental workloads that do not have low latency requirements. See Configuring incremental batch processing.

Balance between on-demand and capacity excess instances

Spot instances take advantage of excess virtual machine resources in the cloud that are available at a lower price. To save costs, Databricks supports creating clusters using spot instances. It is recommended that the first instance (the Spark driver) should always be an on-demand virtual machine. Spot instances are a good choice for workloads where it is acceptable to take longer because one or more spot instances have been evicted by the cloud provider.

Also, consider using the Fleet instance type. If a cluster uses one of these fleet instance types, Databricks selects the matching physical AWS instance types with the best price and availability to use in your cluster.

1. Choose optimal resources​

Use performance optimized data formats​

Use job compute​

Use SQL warehouse for SQL workloads​

Use up-to-date runtimes for your workloads​

Only use GPUs for the right workloads​

Use serverless services for your workloads​

Use the right instance type​

Choose the most efficient compute size​

Evaluate performance-optimized query engines​

2. Dynamically allocate resources​

Use auto-scaling compute​

Use auto termination​

Use compute policies to control costs​

3. Monitor and control cost​

Setup tagging for cost attribution​

Set up budgets and alerts to enable monitoring of account spending​

Monitor costs to align spending with expectations​

Manage costs to align usage with organizational needs​

4. Design cost-effective workloads​

Balance always-on and triggered streaming​

Balance between on-demand and capacity excess instances​