Compute configuration recommendations
This article includes recommendations and best practices related to compute configuration.
If your workload is supported, Databricks recommends using serverless compute rather than configuring your own compute resource. Serverless compute is the simplest and most reliable compute option. It requires no configuration, is always available, and scales according to your workload. Serverless compute is a compute option for notebooks, jobs, and Lakeflow Declarative Pipelines. See Connect to serverless compute.
Additionally, data analysts can use serverless SQL warehouses to query and explore data on Databricks. See What are Serverless SQL warehouses?.
Select an appropriate access mode
Classic all-purpose and jobs compute have an access mode setting that determines who can attach to and use the compute resource. To access data governed by Unity Catalog, the compute resource must use either standard or dedicated access mode.
Standard compute can be shared by multiple users and groups while still enforcing user isolation and all user- and group-level data access permissions. This makes it an easier-to-manage, cost-effective option for most workloads, especially ones that enforce fine-grained access control.
Dedicated compute is recommended if you need access to features not available on standard compute, such as RDD APIs, GPU instances, R, or Databricks Container Services. For more information, see Standard compute requirements and limitations.
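As a concrete illustration, here is a minimal sketch using the Databricks SDK for Python, assuming the SDK is installed and workspace authentication is configured. In the Clusters API, standard and dedicated access modes have historically been exposed as the `USER_ISOLATION` and `SINGLE_USER` values of `data_security_mode`; the cluster name and sizing below are placeholders.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute

w = WorkspaceClient()  # reads credentials from the environment or config file

# Standard access mode (shareable, enforces user isolation) corresponds to
# USER_ISOLATION; dedicated access mode corresponds to SINGLE_USER.
cluster = w.clusters.create(
    cluster_name="shared-analytics",  # placeholder name
    spark_version=w.clusters.select_spark_version(latest=True),
    node_type_id=w.clusters.select_node_type(local_disk=True),
    num_workers=2,
    data_security_mode=compute.DataSecurityMode.USER_ISOLATION,
    autotermination_minutes=60,
).result()  # blocks until the compute is running
```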
Use compute policies
If you are creating new compute from scratch, Databricks recommends using compute policies. Compute policies let you create preconfigured compute resources designed for specific purposes, such as personal compute, shared compute, power users, and jobs. Policies limit the decisions you need to make when configuring compute settings.
If you don't have access to policies, contact your workspace admin. See Default policies and policy families.
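For example, here is a minimal sketch with the Databricks SDK for Python that looks up a policy by name and creates compute governed by it. The policy name is hypothetical; attributes the policy fixes cannot be overridden, so only the remaining settings need to be supplied.

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Hypothetical policy name; substitute one that exists in your workspace.
# Raises StopIteration if no policy with that name is found.
policy = next(p for p in w.cluster_policies.list() if p.name == "Shared Compute")

cluster = w.clusters.create(
    cluster_name="policy-governed-compute",
    policy_id=policy.policy_id,  # the policy constrains the other settings
    spark_version=w.clusters.select_spark_version(latest=True),
    node_type_id=w.clusters.select_node_type(local_disk=True),
    num_workers=1,
    autotermination_minutes=60,
).result()
```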
Assess whether you would benefit from Photon
Many workloads benefit from Photon, but it is most beneficial for SQL workloads and DataFrame operations involving complex transformations, such as joins, aggregations, and data scans on large tables. Workloads with frequent disk access, wide tables, or repeated data processing also see improved performance.
Simple batch ETL jobs that do not involve wide transformations or large data volumes may see minimal impact from enabling Photon, especially if queries typically complete in under two seconds.
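If Photon looks like a fit, you can enable it when the compute is created. A minimal sketch with the Databricks SDK for Python; the cluster name and sizing are placeholders:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute

w = WorkspaceClient()

cluster = w.clusters.create(
    cluster_name="photon-etl",
    spark_version=w.clusters.select_spark_version(latest=True),
    # Choose a worker type that supports Photon.
    node_type_id=w.clusters.select_node_type(photon_worker_capable=True),
    num_workers=4,
    runtime_engine=compute.RuntimeEngine.PHOTON,  # enable the Photon engine
    autotermination_minutes=60,
).result()
```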
Compute sizing considerations
The following recommendations assume that you have unrestricted cluster creation. Workspace admins should only grant this privilege to advanced users.
People often think of compute size in terms of the number of workers, but there are other important factors to consider:
- Total executor cores (compute): The total number of cores across all executors. This determines the maximum parallelism of a compute resource.
- Total executor memory: The total amount of RAM across all executors. This determines how much data can be stored in memory before spilling it to disk.
- Executor local storage: The type and amount of local disk storage. Local disk is primarily used for spills during shuffles and for caching.
Additional considerations include worker instance type and size, which also influence the factors above. When sizing your compute, consider:
- How much data will your workload consume?
- What's the computational complexity of your workload?
- Where are you reading data from?
- How is the data partitioned in external storage?
- How much parallelism do you need?
Answering these questions will help you determine optimal compute configurations based on workloads.
There's a balancing act between the number of workers and the size of worker instance types. Configuring compute with two workers, each with 16 cores and 128 GB of RAM, provides the same total compute and memory (32 cores, 256 GB of RAM) as configuring compute with eight workers, each with 4 cores and 32 GB of RAM.
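A quick back-of-the-envelope check (plain Python, no Databricks APIs) makes the equivalence concrete:

```python
# Two compute shapes with identical totals but different worker counts.
configs = {
    "fewer-larger": {"workers": 2, "cores_per_worker": 16, "gb_per_worker": 128},
    "more-smaller": {"workers": 8, "cores_per_worker": 4, "gb_per_worker": 32},
}

for name, c in configs.items():
    total_cores = c["workers"] * c["cores_per_worker"]
    total_gb = c["workers"] * c["gb_per_worker"]
    print(f"{name}: {total_cores} cores, {total_gb} GB RAM")

# Both shapes print 32 cores and 256 GB of RAM. The totals match, but
# shuffle traffic, per-executor memory, and failure blast radius differ.
```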
Compute configuration examples
The following examples show compute recommendations based on specific types of workloads. These examples also include configurations to avoid and why those configurations are not suitable for the workload types.
Data analysis
Data analysts typically perform processing requiring data from multiple partitions, leading to many shuffle operations. A compute resource with a smaller number of larger nodes can reduce the network and disk I/O needed to perform these shuffles.
A single-node compute with a large VM type is likely the best choice, particularly for a single analyst.
Analytical workloads will likely require reading the same data repeatedly, so recommended node types are storage optimized with disk cache enabled or instances with local storage.
Additional features recommended for analytical workloads include the following (a configuration sketch follows the list):
- Enable auto termination to ensure compute is terminated after a period of inactivity.
- Consider enabling autoscaling based on the analyst's typical workload.
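Putting these recommendations together, here is a minimal single-node sketch with the Databricks SDK for Python; the memory threshold and cluster name are placeholders, and the Spark settings follow the documented single-node configuration. If the analyst's workload varies, a multi-node variant could pass `autoscale=compute.AutoScale(min_workers=1, max_workers=4)` instead of `num_workers`.

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Single-node compute: zero workers, the driver runs all Spark work locally.
cluster = w.clusters.create(
    cluster_name="analyst-single-node",
    spark_version=w.clusters.select_spark_version(latest=True),
    # Prefer a large node with local disk for repeated reads and caching.
    node_type_id=w.clusters.select_node_type(local_disk=True, min_memory_gb=128),
    num_workers=0,
    spark_conf={
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    custom_tags={"ResourceClass": "SingleNode"},
    autotermination_minutes=60,  # terminate after an hour of inactivity
).result()
```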
Basic batch ETL
For simple batch ETL jobs that don't require wide transformations, such as joins or aggregations, use instances with lower requirements for memory and storage. This might result in cost savings over other worker types.
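As a sketch (Databricks SDK for Python; the job name, notebook path, and node criteria are placeholders), a basic batch ETL job with modest per-task compute might look like this:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute, jobs

w = WorkspaceClient()

job = w.jobs.create(
    name="basic-batch-etl",  # placeholder
    tasks=[
        jobs.Task(
            task_key="ingest",
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/etl/ingest"),
            new_cluster=compute.ClusterSpec(
                spark_version=w.clusters.select_spark_version(latest=True),
                # Modest general-purpose nodes: no wide transformations, so
                # large memory and local storage are not required.
                node_type_id=w.clusters.select_node_type(min_cores=4),
                num_workers=2,
            ),
        )
    ],
)
```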
Complex batch ETL
For a complex ETL job, such as one that requires unions and joins across multiple tables, Databricks recommends using fewer workers to reduce the amount of data shuffled. To compensate for having fewer workers, increase the size of your instances.
Complex transformations can be compute-intensive. If you observe significant spill to disk or OOM errors, increase the amount of memory available on your instances.
Optionally, use pools to decrease compute launch times and reduce total runtime when running job pipelines.
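A pool-backed sketch with the Databricks SDK for Python follows; the pool ID is a placeholder for a pool of large, storage-optimized instances created in advance. When compute draws from a pool, the instance type comes from the pool, so `node_type_id` is omitted:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

pool_id = "1234-567890-pool123"  # hypothetical pool ID

cluster = w.clusters.create(
    cluster_name="complex-etl",
    spark_version=w.clusters.select_spark_version(latest=True),
    instance_pool_id=pool_id,  # driver and workers draw from the pool
    num_workers=4,             # fewer, larger workers to limit shuffling
    autotermination_minutes=30,
).result()
```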
Training machine learning models
To train machine learning models, Databricks recommends creating a compute resource using the Personal compute policy.
You should use a single-node compute with a large node type for initial experimentation with training machine learning models. Having fewer nodes reduces the impact of shuffles.
Adding more workers can help with stability, but you should avoid adding too many workers because of the overhead of shuffling data.
Recommended worker types are storage optimized with disk caching enabled, or an instance with local storage to account for repeated reads of the same data and to enable caching of training data.
Additional features recommended for machine learning workloads include the following (a policy sketch follows the list):
- Enable auto termination to ensure compute is terminated after a period of inactivity.
- Use pools, which allow restricting compute to pre-approved instance types.
- Ensure consistent compute configurations using policies.
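For example, here is a minimal policy sketch using the Databricks SDK for Python and the cluster policy definition language; the policy name and instance types are hypothetical:

```python
import json

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Restrict ML training compute to pre-approved instance types and
# force auto termination after an hour of inactivity.
definition = {
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},
    "autotermination_minutes": {"type": "fixed", "value": 60},
}

policy = w.cluster_policies.create(
    name="ml-training-compute",
    definition=json.dumps(definition),
)
```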