Classic compute configuration best practices
This page outlines best practices for configuring classic compute resources. For most new workloads, Databricks recommends using serverless compute, which requires no configuration. If your workload isn't supported on serverless compute (see Serverless limitations), use the following best practices to configure a classic compute resource.
Structured Streaming workflows have specific configuration recommendations. See Production considerations for Structured Streaming.
Access mode
Classic compute resources can be assigned to either standard or dedicated access mode, which determines who can attach to and use the compute resource.
Databricks recommends using standard access mode for most workloads. Standard compute can be shared by multiple users and groups while enforcing user isolation and all data access permissions. This makes it an easier-to-manage, cost-effective option for most workloads.
Only use dedicated access mode if your workload is affected by a standard compute limitation, for example if it requires ML Runtime on GPU, RDD APIs, or R. For more information, see Standard compute requirements and limitations.
If Unity Catalog is enabled, do not set spark.databricks.passthrough.enabled. Credential passthrough is a legacy access mode that is not compatible with Unity Catalog.
See Access modes.
Databricks Runtime version
Use the latest long-term support (LTS) Databricks Runtime version. LTS versions receive extended security patches and bug fixes, ensuring your workloads stay stable and compatible with the latest platform features.
Only select a machine learning runtime if your workload uses GPUs, distributed ML training, or AutoML. Databricks Runtime for ML installs a large set of libraries. If your workload does not need them, they can conflict with your own dependencies, causing errors or silent correctness issues. See Train AI and ML models.
Configuration hygiene
These practices keep your compute configurations clean and your workloads portable.
Avoid using init scripts
Init scripts can introduce unexpected behaviors, including library conflicts that break workloads and make environments less predictable. Instead, add libraries to your compute policies, use %pip install in notebooks, or define dependencies in an environment spec. See Add libraries to a policy.
Avoid hardcoding Spark configurations
Avoid hardcoding Spark configurations (such as spark.executor.memory or spark.dynamicAllocation.*) in compute or job definitions. Hardcoded values override the built-in optimizations that Databricks provides, often leading to wasted spend or degraded performance. Use notebook-scoped session configs only when you have a specific reason to override a default.
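One way to catch hardcoded values early is to scan compute or job definitions before they are deployed. The following is a minimal sketch, assuming the definition is available as a Python dictionary whose Spark settings live under a `spark_conf` key, as in Clusters API payloads. The function name `find_hardcoded_spark_conf`, the flagged prefixes, and the example definition are hypothetical and cover only the examples named above.

```python
from typing import Any, Dict, List

# Prefixes taken from the examples in this section (an assumption of this
# sketch); extend the tuple to match your own review rules.
FLAGGED_PREFIXES = ("spark.executor.memory", "spark.dynamicAllocation.")


def find_hardcoded_spark_conf(compute_spec: Dict[str, Any]) -> List[str]:
    """Return the flagged Spark configuration keys set in a compute definition."""
    spark_conf = compute_spec.get("spark_conf") or {}
    # str.startswith accepts a tuple, so one call checks every prefix.
    return sorted(key for key in spark_conf if key.startswith(FLAGGED_PREFIXES))


# Hypothetical compute definition used only for illustration.
example_spec = {
    "num_workers": 4,
    "spark_conf": {
        "spark.executor.memory": "8g",
        "spark.dynamicAllocation.enabled": "true",
        "spark.sql.session.timeZone": "UTC",
    },
}

flagged = find_hardcoded_spark_conf(example_spec)
print(flagged)
```

A check like this could run in code review or CI so that any override has to be justified explicitly rather than copied forward from an older definition.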
Avoid compute-local storage paths
Do not store data on compute-local paths, which do not persist beyond the compute's lifecycle. Instead, use Unity Catalog volumes or temporary storage. See What are volumes?.
Avoid DBFS mounts
DBFS mounts lack proper access control lists (ACLs). Instead, use Unity Catalog volumes or workspace file systems (WSFS). See What are volumes?.
Avoid installing compute-scoped libraries
Installing libraries at the compute level creates environment drift across jobs. Instead, use %pip install in notebooks or define dependencies in an environment spec. This also makes classic workloads easier to migrate to serverless.
Performance
Assess whether you would benefit from Photon
Many workloads benefit from Photon, but it is most beneficial for SQL workloads and DataFrame operations involving complex transformations, such as joins, aggregations, and data scans on large tables. Workloads with frequent disk access, wide tables, or repeated data processing also see improved performance.
Simple batch ETL jobs that do not involve wide transformations or large data volumes may see minimal impact from enabling Photon, especially if queries typically complete in under two seconds.
Use autoscaling
Configure autoscaling so that the compute resource can dynamically add and remove worker nodes during long-running job runs. See Enable autoscaling.
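As a rough sketch of what an autoscaling range looks like in a compute definition, the helper below builds an `autoscale` block with `min_workers` and `max_workers` bounds, the field names used in Clusters API payloads. The helper name `autoscale_block`, its sanity checks, and the example bounds are assumptions made for illustration, not Databricks rules.

```python
from typing import Dict


def autoscale_block(min_workers: int, max_workers: int) -> Dict[str, Dict[str, int]]:
    """Build an autoscale fragment for a compute definition.

    The checks below are this sketch's own sanity checks: a range whose
    upper bound does not exceed its lower bound leaves no room to scale.
    """
    if min_workers < 1:
        raise ValueError("min_workers must be at least 1")
    if max_workers <= min_workers:
        raise ValueError("max_workers must be greater than min_workers")
    return {"autoscale": {"min_workers": min_workers, "max_workers": max_workers}}


# Hypothetical bounds: scale between 2 and 8 workers.
fragment = autoscale_block(min_workers=2, max_workers=8)
print(fragment)
```

The fragment can be merged into a larger compute definition in place of a fixed worker count.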
Use instance pools to reduce start times
Instance pools reserve compute resources from your cloud provider. Pools decrease new cluster start time and ensure compute resource availability. See Pool configuration reference.
Cost optimization
Use compute policies
Databricks recommends using compute policies. Compute policies let you create preconfigured compute resources designed for specific purposes, such as personal compute, shared compute, power users, and jobs. Policies limit the decisions you need to make when configuring compute settings.
If you don't have access to policies, contact your workspace admin. See Default policies and policy families.
Use spot instances
Configure spot instances for workloads that have lax latency requirements to optimize costs. See Spot instances.
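The sketch below shows one possible spot configuration, assuming an AWS workspace and a compute definition expressed as a Python dictionary in the shape of a Clusters API payload. It uses the `aws_attributes` fields `availability` and `first_on_demand`; the helper name `spot_aws_attributes` and the default value are hypothetical.

```python
from typing import Any, Dict


def spot_aws_attributes(first_on_demand: int = 1) -> Dict[str, Any]:
    """Build an aws_attributes fragment that prefers spot capacity.

    first_on_demand keeps that many nodes (starting with the driver) on
    on-demand instances; SPOT_WITH_FALLBACK uses on-demand capacity when
    spot instances cannot be acquired.
    """
    if first_on_demand < 0:
        raise ValueError("first_on_demand cannot be negative")
    return {
        "aws_attributes": {
            "first_on_demand": first_on_demand,
            "availability": "SPOT_WITH_FALLBACK",
        }
    }


spot_fragment = spot_aws_attributes()
print(spot_fragment)
```

Keeping the driver on an on-demand instance is a common trade-off: losing a spot worker slows a job down, whereas losing the driver ends it.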
Configure availability zones
Specify an availability zone (AZ) if your organization has purchased reserved instances, or use Auto-AZ to retry in other availability zones if AWS returns insufficient capacity errors. See Availability zones.
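The two options above can be sketched as a single choice of `zone_id` in the `aws_attributes` block of a compute definition, assuming an AWS workspace and a definition expressed as a Python dictionary in the shape of a Clusters API payload. The helper name `zone_attributes` and the zone `us-west-2a` are hypothetical.

```python
from typing import Any, Dict, Optional


def zone_attributes(reserved_zone: Optional[str] = None) -> Dict[str, Any]:
    """Pin compute to a specific zone, or fall back to Auto-AZ.

    Pass the zone that holds your reserved instances, or omit the
    argument to set zone_id to "auto", which selects Auto-AZ.
    """
    zone_id = reserved_zone if reserved_zone else "auto"
    return {"aws_attributes": {"zone_id": zone_id}}


# Hypothetical zone holding reserved instances.
pinned = zone_attributes("us-west-2a")
# No reserved capacity: let Auto-AZ choose the zone.
auto = zone_attributes()
print(pinned, auto)
```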
Compute sizing considerations
The following recommendations assume that you have unrestricted cluster creation. Workspace admins should only grant this privilege to advanced users.
People often think of compute size in terms of the number of workers, but there are other important factors to consider:
- Total executor cores (compute): The total number of cores across all executors. This determines the maximum parallelism of a compute resource.
- Total executor memory: The total amount of RAM across all executors. This determines how much data can be stored in memory before spilling it to disk.
- Executor local storage: The type and amount of local disk storage. Local disk is primarily used in the case of spills during shuffles and caching.
Additional considerations include worker instance type and size, which also influence the factors above. When sizing your compute, consider:
- How much data will your workload consume?
- What's the computational complexity of your workload?
- Where are you reading data from?
- How is the data partitioned in external storage?
- How much parallelism do you need?
There is a balancing act between the number of workers and the size of worker instance types. For example, two workers with 16 cores and 128 GB of RAM each provide the same total compute and memory (32 cores and 256 GB of RAM) as eight workers with 4 cores and 32 GB of RAM each.
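The arithmetic behind this comparison can be checked with a few lines of Python. This is only an illustrative sketch: the `WorkerLayout` class is hypothetical and models nothing beyond worker count, cores, and memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerLayout:
    """A worker count plus per-worker cores and memory."""

    workers: int
    cores_per_worker: int
    memory_gb_per_worker: int

    @property
    def total_cores(self) -> int:
        return self.workers * self.cores_per_worker

    @property
    def total_memory_gb(self) -> int:
        return self.workers * self.memory_gb_per_worker


# The two layouts compared in the paragraph above.
few_large = WorkerLayout(workers=2, cores_per_worker=16, memory_gb_per_worker=128)
many_small = WorkerLayout(workers=8, cores_per_worker=4, memory_gb_per_worker=32)

for layout in (few_large, many_small):
    print(layout.workers, "workers:", layout.total_cores, "cores,",
          layout.total_memory_gb, "GB RAM")
```

Both layouts total 32 cores and 256 GB of RAM. The totals alone do not capture how much data moves between nodes during shuffles, which is why the workload examples that follow weigh the number of workers as well as their size.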
Compute configuration examples
The following examples show compute recommendations based on specific types of workloads. These examples also include configurations to avoid and why those configurations are not suitable for the workload types.
All of the examples in this section could benefit from using serverless compute rather than spinning up a new compute resource. If your workload isn't supported on serverless, use the recommendations below to help configure your classic compute resource.
Data analysis
Data analysts typically perform processing requiring data from multiple partitions, leading to many shuffle operations. A compute resource with a smaller number of larger nodes can reduce the network and disk I/O needed to perform these shuffles.
A single-node compute resource with a large VM type is likely the best choice, particularly for a single analyst.
Analytical workloads will likely require reading the same data repeatedly, so recommended node types are storage optimized with disk cache enabled or instances with local storage.
Additional features recommended for analytical workloads include:
- Enable auto termination to ensure compute is terminated after a period of inactivity.
- Consider enabling autoscaling based on the analyst's typical workload.
Basic batch ETL
For simple batch ETL jobs that don't require wide transformations, such as joins or aggregations, use instances with lower requirements for memory and storage. This might result in cost savings over other worker types.
Complex batch ETL
For a complex ETL job, such as one that requires unions and joins across multiple tables, Databricks recommends using fewer workers to reduce the amount of data shuffled. To compensate for having fewer workers, increase the size of your instances.
Complex transformations can be compute-intensive. If you observe significant spill to disk or out-of-memory (OOM) errors, increase the amount of memory available on your instances.
Optionally, use instance pools to decrease compute launch times and reduce total runtime when running job pipelines.
Training machine learning models
To train machine learning models, Databricks recommends creating a compute resource using the Personal compute policy.
Use a single-node compute resource with a large node type for initial experimentation. Having fewer nodes reduces the impact of shuffles.
Adding more workers can help with stability, but avoid adding too many workers because of the overhead of shuffling data.
Recommended worker types are storage-optimized instances with disk caching enabled or instances with local storage. Both options account for repeated reads of the same data and enable caching of training data.
Additional features recommended for machine learning workloads include:
- Enable auto termination to ensure compute is terminated after a period of inactivity.
- Use instance pools, which allow restricting compute to a pre-approved instance type.
- Ensure consistent compute configurations using policies.