Estimated time to complete: 20 minutes
To run your data analysis workflows in Databricks notebooks, you must attach your notebook to a cluster. Usually, a Databricks administrator creates clusters for you. However, if you are asked to create one yourself, the following videos provide a useful overview of clusters: what they are, and how to choose the correct type for your work.
A Databricks cluster is a set of computation resources that performs the heavy lifting of all of the data workloads you run in Databricks. These workloads can be run as commands in notebooks, commands run from BI tools that are connected to Databricks, or automated jobs that you’ve scheduled. Clusters perform the processing of these workloads and then return results or save them out to data stores.
A cluster consists of multiple nodes (individual machines) that operate on your workloads in parallel. Every cluster has one driver node, which delegates tasks and oversees the execution of your workload. A cluster typically also has several worker nodes, which perform the processing. If a worker node in a Databricks cluster is lost for any reason, the driver can reallocate the unfinished work to the remaining nodes.
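The driver and worker roles described above can be illustrated with a small toy model. This is only a sketch in plain Python, not how Databricks or Spark schedules work internally: the function `run_on_cluster`, the worker names, and the round-robin assignment are all hypothetical choices made for illustration.

```python
from collections import deque


def run_on_cluster(tasks, workers, lost_worker=None):
    """Toy model of a driver delegating tasks to worker nodes.

    If `lost_worker` is given, that worker is treated as lost and the
    driver reassigns its unfinished tasks to the workers that remain.
    """
    # Driver: plan the work by handing out tasks round-robin.
    assignments = {worker: deque() for worker in workers}
    for i, task in enumerate(tasks):
        assignments[workers[i % len(workers)]].append(task)

    # Driver: if a worker is lost, reallocate its tasks to the survivors.
    if lost_worker is not None:
        orphaned = assignments.pop(lost_worker)
        survivors = list(assignments)
        if not survivors:
            raise RuntimeError("no worker nodes left to run the workload")
        for i, task in enumerate(orphaned):
            assignments[survivors[i % len(survivors)]].append(task)

    # Workers: process their tasks (here, "processing" squares a number).
    results = {}
    for worker, queue in assignments.items():
        for task in queue:
            results[task] = (worker, task * task)
    return results


results = run_on_cluster(
    tasks=[1, 2, 3, 4, 5, 6],
    workers=["worker-0", "worker-1", "worker-2"],
    lost_worker="worker-1",
)
for task, (worker, value) in sorted(results.items()):
    print(f"task {task} ran on {worker} and returned {value}")
```

Every task still completes even though one worker was lost, because the driver moved that worker's tasks onto the remaining nodes before processing.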
Clusters can be customized at every level: custom images, configurations, initialization scripts, and security controls.
The way that you configure your clusters depends on the workloads you’re running on them. This video reviews the standard configuration options:
For more information about cluster configurations, including advanced configurations, we recommend reviewing Configure clusters.
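To make the idea of a cluster configuration concrete, the following sketch builds a cluster specification as a Python dictionary and prints it as JSON. The field names (`cluster_name`, `spark_version`, `node_type_id`, `num_workers`, `autoscale`, `autotermination_minutes`) follow the Databricks Clusters API as commonly documented, but treat them as assumptions and verify them against Configure clusters before use. The helper `build_cluster_spec` is hypothetical, and the values in angle brackets are placeholders, not real runtime versions or node types.

```python
import json


def build_cluster_spec(name, spark_version, node_type_id,
                       min_workers, max_workers,
                       autotermination_minutes=60):
    """Build a cluster specification dictionary (hypothetical helper).

    Uses a fixed-size cluster when min_workers == max_workers,
    and an autoscaling range otherwise.
    """
    if min_workers < 0 or max_workers < min_workers:
        raise ValueError("need 0 <= min_workers <= max_workers")

    spec = {
        "cluster_name": name,
        "spark_version": spark_version,   # placeholder: a Databricks runtime version
        "node_type_id": node_type_id,     # placeholder: a cloud-specific machine type
        "autotermination_minutes": autotermination_minutes,
    }
    if min_workers == max_workers:
        spec["num_workers"] = max_workers
    else:
        spec["autoscale"] = {
            "min_workers": min_workers,
            "max_workers": max_workers,
        }
    return spec


spec = build_cluster_spec(
    name="analysis-cluster",
    spark_version="<runtime-version>",
    node_type_id="<node-type>",
    min_workers=2,
    max_workers=8,
)
print(json.dumps(spec, indent=2))
```

The sketch only assembles the configuration; it does not contact a workspace. Whether you choose a fixed number of workers or an autoscaling range depends on the workload you plan to run on the cluster.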
As you start creating clusters, you may find the number of available options and capabilities overwhelming. This video presents best practices for configuring clusters and highlights common scenarios you might encounter when you create your clusters.
For more information about fine-tuning and enhancing the jobs you run on Databricks, we encourage you to review the following content: