Compute creation cheat sheet
This article provides clear, opinionated guidance for compute creation. By using the right compute types for your workflow, you can improve performance and save on costs.
Best Practice | Impact |
---|---|
If you are new to Databricks, start with general-purpose instance types | Selecting the appropriate instance type for the workload results in higher efficiency. |
Use shared access mode unless your required functionality isn't supported | Compute with shared access mode can be used by multiple users, with data isolation among users. |
Use Graviton instance types if they are available | According to AWS, instance types with Graviton processors have the best price-to-performance ratio of any instance type. |
Use the latest generation of instance types if there is enough availability | The latest generation of instance types provides the best performance and the latest features. |
Set your balance of on-demand and spot instances based on how quickly you need your workload to run | Spot instances save on cost but can affect the overall run time of an operation if the spot instances are reclaimed. |
Choose the size of your nodes and the number of workers based on the types of operations your workload performs | For example, if you expect a lot of shuffles, it can be more efficient to use a single large node instead of multiple smaller nodes. |
Run VACUUM on a cluster with autoscaling set to 1-4 workers, where each worker has 8 cores. Select a driver with 8 to 32 cores, and increase the driver size if you get out-of-memory (OOM) errors. A configuration sketch follows this table. | VACUUM operations run in two phases, the second of which is driver-heavy. On an undersized cluster, the operation can slow down and might not succeed. |
Assess whether your batch workflow would benefit from Photon | Photon provides faster queries and reduces your total cost per workload. |
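
The sketch below shows one way to apply several of these recommendations when creating compute programmatically. It is a minimal example, assuming the Databricks SDK for Python with authentication already configured; the cluster name, Databricks Runtime version, and instance types are illustrative placeholders, so substitute values available in your workspace.

```python
# Minimal sketch: create an autoscaling cluster sized for VACUUM, using
# Graviton instance types, Photon, shared access mode, and spot instances
# with on-demand fallback. Runtime version and node types are placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute

w = WorkspaceClient()  # assumes authentication is already configured

cluster = w.clusters.create_and_wait(
    cluster_name="vacuum-maintenance",
    spark_version="14.3.x-scala2.12",           # pick a current LTS runtime
    node_type_id="m6gd.2xlarge",                # 8-core Graviton worker (placeholder)
    driver_node_type_id="m6gd.4xlarge",         # 16-core driver; increase if you hit OOM errors
    autoscale=compute.AutoScale(min_workers=1, max_workers=4),
    data_security_mode=compute.DataSecurityMode.USER_ISOLATION,  # shared access mode
    runtime_engine=compute.RuntimeEngine.PHOTON,
    aws_attributes=compute.AwsAttributes(
        first_on_demand=1,                      # keep the driver on-demand
        availability=compute.AwsAvailability.SPOT_WITH_FALLBACK,
    ),
)
print(f"Created cluster {cluster.cluster_id}")
```

Raising `first_on_demand` shifts the balance toward on-demand capacity for faster, more predictable runs, while leaving more of the cluster on spot instances lowers cost at the risk of delays if instances are reclaimed.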