When you configure a cluster’s AWS instances, you can choose the availability zone, the spot bid price, the EBS volume type and size, and IAM roles. To specify these configurations:
On the cluster configuration page, click the Advanced Options toggle.
At the bottom of the page, click the Instances tab.
Choosing a specific availability zone for a cluster is useful primarily if your organization has purchased reserved instances in specific availability zones. Read more about AWS availability zones.
You can specify the bid price to use when launching spot instances as a percentage of the corresponding on-demand price. By default, Databricks bids at 100% of the on-demand price. Read more about AWS spot pricing.
AWS charges for the spot instance at the spot market price, not at the bid price. AWS uses the bid price to terminate the spot instance if the spot market price surges above the bid price. The higher your bid value, the less likely it is for AWS to terminate your spot instances.
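The relationship between the bid percentage, the bid price, and termination can be sketched in a few lines of Python. The function names and prices below are hypothetical and chosen only for illustration; the sketch models the rules described above, not the actual AWS spot market.

```python
def bid_price(on_demand_price: float, bid_percent: float = 100.0) -> float:
    """Return the hourly bid for an on-demand price and a bid percentage."""
    return on_demand_price * bid_percent / 100.0


def spot_outcome(market_price: float, on_demand_price: float,
                 bid_percent: float = 100.0):
    """Return (running, hourly_charge) under the rules described above.

    The instance keeps running while the market price does not exceed the
    bid, and it is billed at the market price, never at the bid price.
    """
    bid = bid_price(on_demand_price, bid_percent)
    if market_price > bid:
        return False, 0.0  # terminated; charges after termination are not modeled
    return True, market_price


# Hypothetical prices in USD per hour.
print(spot_outcome(market_price=0.30, on_demand_price=1.00))
print(spot_outcome(market_price=1.20, on_demand_price=1.00))
print(spot_outcome(market_price=1.20, on_demand_price=1.00, bid_percent=150))
```

In the last call, raising the bid to 150% keeps the instance running through the same price surge, but the charge is still the market price.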
This section describes the default EBS volume settings for worker nodes, how to add shuffle volumes, and how to configure a cluster so that Databricks automatically allocates EBS volumes.
To configure EBS volumes, click the Instances tab in the cluster configuration and select an option in the EBS Volume Type drop-down list.
Databricks provisions EBS volumes for every worker node as follows:
- A 30 GB unencrypted EBS instance root volume used only by the host operating system and Databricks internal services.
- A 150 GB encrypted EBS container root volume used by the Spark worker. This hosts Spark services and logs.
- (HIPAA only) A 75 GB encrypted EBS worker log volume that stores logs for Databricks internal services.
To add shuffle volumes, select General Purpose SSD in the EBS Volume Type drop-down list.
By default, Spark shuffle outputs go to the instance local disk. For instance types that do not have a local disk, or if you want to increase your Spark shuffle storage space, you can specify additional EBS volumes. This is particularly useful for preventing out-of-disk-space errors when you run Spark jobs that produce large shuffle outputs.
Databricks encrypts these EBS volumes for both on-demand and spot instances. Read more about AWS EBS volumes.
Ensure that your AWS EBS limits are high enough to satisfy the runtime requirements for all workers in all clusters. For information on the default EBS limits and how to change them, see Amazon Elastic Block Store (EBS) Limits.
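To check a planned configuration against your EBS limits, you can estimate the capacity a cluster will request. The sketch below assumes the `aws_attributes` field names used by the Databricks Clusters REST API (`ebs_volume_type`, `ebs_volume_count`, `ebs_volume_size`); verify them against the API reference for your workspace. The helper names and the numbers are hypothetical.

```python
import json

# Default volumes provisioned for every worker, per the list above (in GB).
# The HIPAA-only 75 GB log volume is not included here.
DEFAULT_VOLUMES_GB = (30, 150)


def shuffle_volume_attributes(count: int, size_gb: int) -> dict:
    """Build an aws_attributes fragment requesting extra SSD shuffle volumes."""
    return {
        "ebs_volume_type": "GENERAL_PURPOSE_SSD",
        "ebs_volume_count": count,
        "ebs_volume_size": size_gb,
    }


def total_ebs_gb(num_workers: int, attrs: dict) -> int:
    """Estimate the EBS capacity (GB) used by all workers of one cluster."""
    shuffle_gb = attrs["ebs_volume_count"] * attrs["ebs_volume_size"]
    return num_workers * (sum(DEFAULT_VOLUMES_GB) + shuffle_gb)


attrs = shuffle_volume_attributes(count=2, size_gb=100)
print(json.dumps({"aws_attributes": attrs}, indent=2))
print(total_ebs_gb(num_workers=8, attrs=attrs))  # 8 * (30 + 150 + 200) = 3040
```

Summing this estimate over all clusters that may run at the same time gives a rough figure to compare against your account limits.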
If you don’t want to allocate a fixed number of EBS volumes at cluster creation time, use autoscaling local storage. With autoscaling local storage, Databricks monitors the amount of free disk space available on your cluster’s Spark workers. If a worker begins to run too low on disk, Databricks automatically attaches a new EBS volume to the worker before it runs out of disk space. EBS volumes are attached up to a limit of 5 TB of total disk space per instance (including the instance’s local storage).
To configure autoscaling storage, select Enable autoscaling local storage in the Autopilot Options box.
The EBS volumes attached to an instance are detached only when the instance is returned to AWS. That is, EBS volumes are never detached from an instance as long as it is part of a running cluster. To scale down EBS usage, Databricks recommends using this feature in a cluster configured with Autoscaling or Automatic termination.
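The attach-only behavior can be illustrated with a toy simulation. This is a minimal sketch, not the Databricks implementation: the low-space threshold and the size of each attached volume are assumptions, the `Worker` class is hypothetical, and 5 TB is treated as 5000 GB.

```python
from dataclasses import dataclass, field

MAX_TOTAL_GB = 5 * 1000    # 5 TB cap per instance, including local storage
LOW_FREE_FRACTION = 0.10   # assumption: attach when less than 10% is free
VOLUME_SIZE_GB = 500       # assumption: size of each attached volume


@dataclass
class Worker:
    local_gb: int
    used_gb: float = 0.0
    ebs_volumes_gb: list = field(default_factory=list)

    @property
    def total_gb(self) -> int:
        return self.local_gb + sum(self.ebs_volumes_gb)

    def write(self, gb: float) -> None:
        """Consume disk space, then attach volumes while free space is low."""
        self.used_gb += gb
        while (self.total_gb - self.used_gb) < LOW_FREE_FRACTION * self.total_gb:
            room = MAX_TOTAL_GB - self.total_gb
            if room <= 0:
                break  # cap reached; no further volumes are attached
            self.ebs_volumes_gb.append(min(VOLUME_SIZE_GB, room))

    def delete(self, gb: float) -> None:
        """Free disk space; attached volumes are never detached here."""
        self.used_gb = max(0.0, self.used_gb - gb)


w = Worker(local_gb=1000)
w.write(950)    # free space drops below the threshold, so a volume is attached
print(w.ebs_volumes_gb)
w.delete(900)   # usage shrinks, but the volume stays attached
print(w.ebs_volumes_gb)
```

Because volumes are released only when the instance is returned, the second print shows the same volume list as the first, which is why pairing this feature with autoscaling or automatic termination helps reduce EBS usage.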
Databricks uses Throughput Optimized HDD (st1) to extend the local storage of an instance. The default AWS capacity limit for these volumes is 20 TiB. To avoid hitting this limit, administrators should request an increase in this limit based on their usage requirements.
If you created your Databricks account prior to version 2.44 (that is, before Apr 27, 2017) and want to use autoscaling local storage (enabled by default in Cluster Mode), you must add volume permissions to the IAM role or keys used to create your account. In particular, you must add volume permissions such as ec2:DescribeVolumes. For the complete list of permissions and instructions on how to update your existing IAM role or keys, see AWS Account.
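One way to check whether an existing policy already grants a required action is to scan its `Allow` statements. The sketch below is a simplified illustration: it assumes the standard IAM policy JSON layout, ignores `Deny` statements, resources, and conditions, and uses a hypothetical example policy. Only `ec2:DescribeVolumes` is taken from the text above; extend the required set from the complete list in AWS Account.

```python
from fnmatch import fnmatchcase


def missing_actions(policy: dict, required: set) -> set:
    """Return the required actions not matched by any Allow statement."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):   # a single statement may be a bare object
        statements = [statements]
    patterns = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):   # Action may be a string or a list
            actions = [actions]
        patterns.extend(action.lower() for action in actions)
    # IAM action names are case-insensitive and may contain * wildcards.
    return {
        action for action in required
        if not any(fnmatchcase(action.lower(), pattern) for pattern in patterns)
    }


# Hypothetical policy that lacks the volume permission.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["ec2:DescribeInstances"], "Resource": "*"}
    ],
}
print(missing_actions(policy, {"ec2:DescribeVolumes"}))
```

An empty result means every required action is matched by at least one `Allow` statement under these simplifying assumptions.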
To securely access AWS resources without using AWS keys, you can launch Databricks clusters with IAM roles. See Secure Access to S3 Buckets Using IAM Roles for details on how to create and configure IAM roles. Once you have created an IAM role, select it in the IAM Role drop-down list.
Once a cluster launches with an IAM role, anyone who has attach permissions to this cluster can access the underlying resources controlled by this role. To guard against unwanted access, you can use Cluster Access Control to restrict permissions to the cluster.