This article describes how to manage Databricks clusters, including displaying, editing, starting, terminating, deleting, controlling access, and monitoring performance and logs.
To view the clusters in your workspace, click Compute in the sidebar.
The two leftmost columns indicate whether the cluster is pinned and the status of the cluster. Hover over the status to get more information.
30 days after a cluster is terminated, it is permanently deleted. To keep an all-purpose cluster configuration after a cluster has been terminated for more than 30 days, an administrator can pin the cluster. Up to 100 clusters can be pinned.
Admins can pin a cluster from the cluster list or the cluster detail page by clicking the pin icon.
You can also invoke the Clusters API endpoint to pin a cluster programmatically.
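As a minimal sketch of the programmatic route, the following builds (but does not send) a request to the Clusters API pin endpoint, `POST /api/2.0/clusters/pin`. The workspace URL, token, and cluster ID are hypothetical placeholders, and the helper name `build_pin_request` is introduced here only for illustration.

```python
import json
import urllib.request


def build_pin_request(host: str, token: str, cluster_id: str) -> urllib.request.Request:
    """Build (but do not send) a POST request to the Clusters API pin endpoint."""
    body = json.dumps({"cluster_id": cluster_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/clusters/pin",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical workspace URL, token, and cluster ID, for illustration only.
request = build_pin_request(
    "https://example.cloud.databricks.com", "dapi-EXAMPLE", "1234-567890-abcde123"
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Pinning requires admin rights, so the token would need to belong to a workspace admin.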
Sometimes it can be helpful to view your cluster configuration as JSON. This is especially useful when you want to create similar clusters using the Clusters API. When you view an existing cluster, go to the Configuration tab, click JSON in the top right of the tab, copy the JSON, and paste it into your API call. JSON view is read-only.
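The copy-and-adapt workflow might look like the sketch below. The JSON shown is a hypothetical example of what you might copy from the Configuration tab, not output from a real cluster, and the assumption that a copied `cluster_id` should be dropped before reuse is ours.

```python
import json

# Hypothetical JSON copied from the Configuration tab; all values are examples.
COPIED_JSON = """
{
  "cluster_id": "1234-567890-abcde123",
  "cluster_name": "etl-nightly",
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "num_workers": 4,
  "autotermination_minutes": 60
}
"""


def derive_create_payload(copied_json: str, new_name: str) -> dict:
    """Return a payload for a similar cluster, renamed and without the old ID."""
    spec = json.loads(copied_json)
    # Assumption: an ID, if present, identifies the source cluster and
    # should not be sent when creating a new one.
    spec.pop("cluster_id", None)
    spec["cluster_name"] = new_name
    return spec


payload = derive_create_payload(COPIED_JSON, "etl-nightly-copy")
print(json.dumps(payload, indent=2))
```

The resulting dictionary could then serve as the body of a `POST /api/2.0/clusters/create` call.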
You can edit a cluster configuration from the cluster details UI. You can also invoke the Clusters API endpoint to edit the cluster programmatically.
Notebooks and jobs that were attached to the cluster remain attached after editing.
Libraries installed on the cluster remain installed after editing.
If you edit any attribute of a running cluster (except for the cluster size and permissions), you must restart it. This can disrupt users who are currently using the cluster.
You can edit only running or terminated clusters. You can, however, update permissions on the cluster details page for clusters in other states.
To clone an existing cluster, select Clone from the cluster’s kebab menu (also known as the three-dot menu).
After you select Clone, the cluster creation UI opens pre-populated with the cluster configuration. The following attributes are not included in the clone:
Cluster access control within the admin settings page allows workspace admins to give fine-grained cluster access to other users. There are two types of cluster access control:
Cluster-creation permission: Workspace admins can choose which users are allowed to create clusters.
Cluster-level permissions: A user who has the Can manage permission for a cluster can configure whether other users can attach to, restart, resize, and manage that cluster.
To edit permissions for a cluster, select Edit Permissions from that cluster’s kebab menu.
For more on cluster access control and cluster-level permissions, see Cluster access control.
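Cluster-level permissions can also be expressed as a request body for the Permissions API. The sketch below only builds the body; the user names are hypothetical, and the helper `build_acl_payload` is introduced for illustration. The permission level names are those the Permissions API uses for clusters.

```python
# Permission levels for clusters as named in the Databricks Permissions API.
CLUSTER_PERMISSION_LEVELS = ("CAN_ATTACH_TO", "CAN_RESTART", "CAN_MANAGE")


def build_acl_payload(grants: dict) -> dict:
    """Map {user_name: permission_level} to an access_control_list body."""
    for user, level in grants.items():
        if level not in CLUSTER_PERMISSION_LEVELS:
            raise ValueError(f"Unknown permission level for {user}: {level}")
    return {
        "access_control_list": [
            {"user_name": user, "permission_level": level}
            for user, level in grants.items()
        ]
    }


# Hypothetical users, for illustration only.
acl = build_acl_payload(
    {"analyst@example.com": "CAN_ATTACH_TO", "lead@example.com": "CAN_MANAGE"}
)
print(acl)
```

A body like this would typically be sent with `PATCH /api/2.0/permissions/clusters/<cluster-id>`, which adds to the existing permissions rather than replacing them.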
To save cluster resources, you can terminate a cluster. The terminated cluster’s configuration is stored so that it can be reused (or, in the case of jobs, autostarted) at a later time. You can manually terminate a cluster or configure the cluster to terminate automatically after a specified period of inactivity. When the number of terminated clusters exceeds 150, the oldest clusters are deleted.
Unless a cluster is pinned or restarted, it is automatically and permanently deleted 30 days after termination.
Terminated clusters appear in the cluster list with a gray circle at the left of the cluster name.
When you run a job on a New Job Cluster (which is usually recommended), the cluster terminates and is unavailable for restarting when the job is complete. On the other hand, if you schedule a job to run on an Existing All-Purpose Cluster that has been terminated, that cluster will autostart.
You can manually terminate a cluster from the cluster list (by clicking the square on the cluster’s row) or the cluster detail page (by clicking Terminate).
You can also set auto termination for a cluster. During cluster creation, you can specify an inactivity period in minutes after which you want the cluster to terminate.
If the difference between the current time and the last command run on the cluster is more than the inactivity period specified, Databricks automatically terminates that cluster.
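The rule above amounts to a simple time comparison. The following is an illustrative sketch of that comparison only, not Databricks' actual implementation; the function name and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_past_inactivity_period(
    last_command_at: datetime, now: datetime, inactivity_minutes: int
) -> bool:
    """True when more time has passed since the last command than the period allows."""
    return now - last_command_at > timedelta(minutes=inactivity_minutes)


# Hypothetical timestamps: last command at 09:00 UTC, checked at 10:01 UTC.
last_command = datetime(2024, 1, 10, 9, 0, tzinfo=timezone.utc)
check_time = datetime(2024, 1, 10, 10, 1, tzinfo=timezone.utc)

print(is_past_inactivity_period(last_command, check_time, inactivity_minutes=60))
print(is_past_inactivity_period(last_command, check_time, inactivity_minutes=120))
```

With a 60-minute period the cluster in this example is eligible for termination; with a 120-minute period it is not.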
A cluster is considered inactive when all commands on the cluster, including Spark jobs, Structured Streaming, and JDBC calls, have finished executing. This does not include commands run by SSH-ing into the cluster and running bash commands.
Clusters do not report activity resulting from the use of DStreams. This means that an auto-terminating cluster may be terminated while it is running DStreams. Turn off auto termination for clusters running DStreams or consider using Structured Streaming.
The auto termination feature monitors only Spark jobs, not user-defined local processes. Therefore, if all Spark jobs have completed, a cluster may be terminated, even if local processes are running.
Idle clusters continue to accumulate DBU and cloud instance charges during the inactivity period before termination.
You can configure automatic termination in the create cluster UI. Ensure that the box is checked, and enter the number of minutes in the Terminate after ___ of minutes of inactivity setting.
You can opt out of auto termination by clearing the Auto Termination checkbox or by specifying an inactivity period of
Auto termination is best supported in the latest Spark versions. Older Spark versions have known limitations which can result in inaccurate reporting of cluster activity. For example, clusters running JDBC, R, or streaming commands can report a stale activity time that leads to premature cluster termination. Please upgrade to the most recent Spark version to benefit from bug fixes and improvements to auto termination.
Sometimes a cluster is terminated unexpectedly, not as a result of a manual termination or a configured automatic termination.
For a list of termination reasons and remediation steps, see the Knowledge Base.
Deleting a cluster terminates the cluster and removes its configuration. To delete a cluster, select Delete from the cluster’s kebab menu.
You cannot undo this action.
To delete a pinned cluster, it must first be unpinned by an administrator.
You can also invoke the Clusters API endpoint to delete a cluster programmatically.
You can restart a previously terminated cluster from the cluster list, the cluster detail page, or a notebook. You can also invoke the Clusters API endpoint to start a cluster programmatically.
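The programmatic operations mentioned in this article map onto Clusters API 2.0 paths roughly as sketched below. One naming detail is easy to trip over: the `clusters/delete` path terminates a cluster, while `clusters/permanent-delete` removes its configuration. The action names used as dictionary keys and the helper `action_url` are our own labels for illustration; each endpoint is called with `POST` and a body containing the `cluster_id`.

```python
# Clusters API 2.0 paths for the lifecycle actions described in this article.
# Note the naming: "delete" terminates, "permanent-delete" removes the configuration.
CLUSTER_ACTION_PATHS = {
    "start": "/api/2.0/clusters/start",
    "restart": "/api/2.0/clusters/restart",
    "terminate": "/api/2.0/clusters/delete",
    "delete": "/api/2.0/clusters/permanent-delete",
    "pin": "/api/2.0/clusters/pin",
    "unpin": "/api/2.0/clusters/unpin",
}


def action_url(host: str, action: str) -> str:
    """Return the full URL for a cluster lifecycle action."""
    try:
        path = CLUSTER_ACTION_PATHS[action]
    except KeyError:
        raise ValueError(f"Unsupported cluster action: {action!r}") from None
    return host.rstrip("/") + path


# Hypothetical workspace URL, for illustration only.
print(action_url("https://example.cloud.databricks.com/", "start"))
print(action_url("https://example.cloud.databricks.com/", "terminate"))
```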
Databricks identifies a cluster using its unique cluster ID. When you start a terminated cluster, Databricks re-creates the cluster with the same ID, automatically installs all the libraries, and reattaches the notebooks.
When you restart a cluster, it gets the latest images for the compute resource containers and the VM hosts. It is important to schedule regular restarts for long-running clusters such as those used for processing streaming data.
It is your responsibility to restart all compute resources regularly to keep the image up-to-date with the latest image version.
If you enable the compliance security profile for your account or your workspace, long-running clusters are automatically restarted after 25 days. Databricks recommends that workspace admins restart clusters manually during a scheduled maintenance window. This reduces the risk of an auto-restart disrupting a scheduled job.
If your workspace is part of the public preview of automatic cluster update, the 25-day limit does not apply. Clusters restart only if needed during the scheduled maintenance windows.
If you are a workspace admin, you can run a script that determines how long each of your clusters has been running, and optionally, restart them if they are older than a specified number of days. Databricks provides this script as a notebook.
If your workspace is part of the public preview of automatic cluster update, you might not need this script. Clusters restart automatically if needed during the scheduled maintenance windows.
The first lines of the script define configuration parameters:
min_age_output: The maximum number of days that a cluster can run. Default is 1.
When the restart option is set to True, the script restarts clusters with age greater than the number of days specified by min_age_output. The default is False, which identifies the long-running clusters but does not restart them.
Replace the placeholder values, including REPLACE_WITH_KEY, with a secret scope and key name. For more details of setting up the secrets, see the notebook.
If you set the restart option to True, the script automatically restarts eligible clusters, which can cause active jobs to fail and reset open notebooks. To reduce the risk of disrupting your workspace’s business-critical jobs, plan a scheduled maintenance window and be sure to notify the workspace users.
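The core check that such a script performs can be sketched as follows. This is not the Databricks-provided notebook; it is an illustration that assumes each cluster record carries `cluster_id`, `state`, and a `start_time` in epoch milliseconds, as in Clusters API list responses. The function name and sample records are hypothetical.

```python
from datetime import datetime, timedelta, timezone

min_age_output = 1  # Maximum number of days a cluster may run (name from the script).


def epoch_ms(moment: datetime) -> int:
    """Convert a datetime to epoch milliseconds."""
    return int(moment.timestamp() * 1000)


def find_long_running(clusters, now, max_age_days):
    """Return (cluster_id, age_in_days) for running clusters older than max_age_days."""
    results = []
    for cluster in clusters:
        if cluster.get("state") != "RUNNING":
            continue
        # Assumption: start_time is epoch milliseconds, as in Clusters API responses.
        started = datetime.fromtimestamp(cluster["start_time"] / 1000, tz=timezone.utc)
        age_days = (now - started) / timedelta(days=1)
        if age_days > max_age_days:
            results.append((cluster["cluster_id"], round(age_days, 1)))
    return results


# Hypothetical cluster records, for illustration only.
now = datetime(2024, 1, 10, tzinfo=timezone.utc)
sample_clusters = [
    {"cluster_id": "0101-aaa", "state": "RUNNING",
     "start_time": epoch_ms(datetime(2024, 1, 1, tzinfo=timezone.utc))},
    {"cluster_id": "0109-bbb", "state": "RUNNING",
     "start_time": epoch_ms(datetime(2024, 1, 9, 12, tzinfo=timezone.utc))},
    {"cluster_id": "0102-ccc", "state": "TERMINATED",
     "start_time": epoch_ms(datetime(2024, 1, 2, tzinfo=timezone.utc))},
]
long_running = find_long_running(sample_clusters, now, min_age_output)
print(long_running)
```

Keeping identification separate from restarting mirrors the script's default behavior of reporting long-running clusters without touching them.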
When a job assigned to a terminated cluster is scheduled to run, or you connect to a terminated cluster from a JDBC/ODBC interface, the cluster is automatically restarted. See Create a job and JDBC connect.
Cluster autostart allows you to configure clusters to auto-terminate without requiring manual intervention to restart the clusters for scheduled jobs. Furthermore, you can schedule cluster initialization by scheduling a job to run on a terminated cluster.
If your cluster was created in Databricks platform version 2.70 or earlier, there is no autostart: jobs scheduled to run on terminated clusters will fail.
You can view detailed information about Spark jobs by selecting the Spark UI tab on the cluster details page.
If you restart a terminated cluster, the Spark UI displays information for the restarted cluster, not the historical information for the terminated cluster.
Databricks provides three kinds of logging of cluster-related activity:
Cluster event logs, which capture cluster lifecycle events like creation, termination, and configuration edits.
Apache Spark driver and worker logs, which you can use for debugging.
Cluster init-script logs, which are valuable for debugging init scripts.
This section discusses cluster event logs and driver and worker logs. For details about init-script logs, see Init script logging.
The cluster event log displays important cluster lifecycle events that are triggered manually by user actions or automatically by Databricks. Such events affect the operation of a cluster as a whole and the jobs running in the cluster.
For supported event types, see the Clusters API data structure.
Events are stored for 60 days, which is comparable to other data retention times in Databricks.
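Events can also be retrieved through the Clusters API events endpoint, `POST /api/2.0/clusters/events`. The sketch below builds a query body and clamps the requested window to the 60-day retention period, since older events would not be available. The helper name, cluster ID, and dates are hypothetical.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 60  # Events are stored for 60 days, per the text above.


def build_events_query(cluster_id, now, lookback_days, event_types=None):
    """Build a body for the cluster events endpoint, limited to the retention window."""
    lookback = min(lookback_days, RETENTION_DAYS)
    start = now - timedelta(days=lookback)
    query = {
        "cluster_id": cluster_id,
        "start_time": int(start.timestamp() * 1000),  # epoch milliseconds
        "end_time": int(now.timestamp() * 1000),
        "order": "DESC",
    }
    if event_types:
        query["event_types"] = list(event_types)
    return query


# Hypothetical cluster ID and date; asks for 90 days but is clamped to 60.
query = build_events_query(
    "1234-567890-abcde123",
    datetime(2024, 3, 1, tzinfo=timezone.utc),
    lookback_days=90,
    event_types=["TERMINATING", "EDITED"],
)
print(query)
```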
The direct print and log statements from your notebooks, jobs, and libraries go to the Spark driver logs. You can access these log files from the Driver logs tab on the cluster details page. Click the name of a log file to download it.
These logs have three outputs:
To view Spark worker logs, use the Spark UI tab. You can also configure a log delivery location for the cluster. Both worker and cluster logs are delivered to the location you specify.
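In a Clusters API payload, the log delivery location is set through the `cluster_log_conf` field. The sketch below builds that fragment for a DBFS destination and derives where logs would land, assuming the `<destination>/<cluster-id>/driver` and `<destination>/<cluster-id>/executor` layout described in the Clusters API reference. The destination, cluster ID, and helper names are hypothetical.

```python
def build_log_conf(destination: str) -> dict:
    """Return the cluster_log_conf fragment for a DBFS log destination."""
    return {"cluster_log_conf": {"dbfs": {"destination": destination}}}


def delivered_log_paths(destination: str, cluster_id: str) -> dict:
    """Return the expected driver and executor log locations.

    Assumption: logs are laid out as <destination>/<cluster-id>/driver
    and <destination>/<cluster-id>/executor.
    """
    base = f"{destination.rstrip('/')}/{cluster_id}"
    return {"driver": f"{base}/driver", "executor": f"{base}/executor"}


# Hypothetical destination and cluster ID, for illustration only.
log_conf = build_log_conf("dbfs:/cluster-logs")
paths = delivered_log_paths("dbfs:/cluster-logs/", "1234-567890-abcde123")
print(log_conf)
print(paths)
```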
To help you monitor the performance of Databricks clusters, Databricks provides access to metrics from the cluster details page. For Databricks Runtime 12.2 and below, Databricks provides access to Ganglia metrics. For Databricks Runtime 13.0 and above, cluster metrics are provided by Databricks.
You can also install Datadog agents on cluster nodes to send Datadog metrics to your Datadog account.
Cluster metrics is the default monitoring tool for Databricks Runtime 13.0 and above. To access the cluster metrics UI, navigate to the Metrics tab on the cluster details page.
You can view historical metrics by selecting a time range using the date picker filter. Metrics are collected every minute. You can also get the latest metrics by clicking the Refresh button. For more information, see View live and historical cluster metrics.
Ganglia metrics are only available for Databricks Runtime 12.2 and below.
To access the Ganglia UI, navigate to the Metrics tab on the cluster details page. CPU metrics are available in the Ganglia UI for all Databricks runtimes. GPU metrics are available for GPU-enabled clusters.
To view live metrics, click the Ganglia UI link.
To view historical metrics, click a snapshot file. The snapshot contains aggregated metrics for the hour preceding the selected time.
Ganglia isn’t supported with Docker containers. If you use a Docker container with your cluster, Ganglia metrics will not be available.
You can install Datadog agents on cluster nodes to send Datadog metrics to your Datadog account. The following notebook demonstrates how to install a Datadog agent on a cluster using a cluster-scoped init script.
To install the Datadog agent on all clusters, manage the cluster-scoped init script using a cluster policy.
This feature is available on Databricks Runtime 8.0 and above.
Because spot instances can reduce costs, creating clusters using spot instances rather than on-demand instances is a common way to run jobs. However, spot instances can be preempted by cloud provider scheduling mechanisms. Preemption of spot instances can cause issues with jobs that are running, including:
Shuffle fetch failures
Shuffle data loss
RDD data loss
You can enable decommissioning to help address these issues. Decommissioning takes advantage of the notification that the cloud provider usually sends before a spot instance is decommissioned. When a spot instance containing an executor receives a preemption notification, the decommissioning process will attempt to migrate shuffle and RDD data to healthy executors. The duration before the final preemption is typically 30 seconds to 2 minutes, depending on the cloud provider.
Databricks recommends enabling data migration when decommissioning is also enabled. Generally, the likelihood of errors, including shuffle fetch failures, shuffle data loss, and RDD data loss, decreases as more data is migrated. Data migration can also lead to less re-computation and saved costs.
Decommissioning is best effort and does not guarantee that all data can be migrated before final preemption. Decommissioning cannot guarantee against shuffle fetch failures when running tasks are fetching shuffle data from the executor.
With decommissioning enabled, task failures caused by spot instance preemption are not added to the total number of failed attempts, because the cause of the failure is external to the task and will not result in job failure.
To enable decommissioning on a cluster, enter the following properties in the Spark tab under Advanced Options in the cluster configuration UI.
To enable decommissioning for applications, enter this property in the Spark config field:
To enable shuffle data migration during decommissioning, enter these properties in the Spark config field:
spark.storage.decommission.enabled true
spark.storage.decommission.shuffleBlocks.enabled true
To enable RDD cache data migration during decommissioning, enter these properties in the Spark config field:
spark.storage.decommission.enabled true
spark.storage.decommission.rddBlocks.enabled true
When RDD StorageLevel replication is set to more than 1, Databricks does not recommend enabling RDD data migration since the replicas ensure RDDs will not lose data.
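The Spark config entries can be assembled with a small helper, sketched below. The three `spark.storage.decommission.*` properties come from the text above. `spark.decommission.enabled` is the Apache Spark property that turns on decommissioning itself; this article does not show it, so its inclusion here is an assumption. The helper names are hypothetical.

```python
def decommission_spark_conf(migrate_shuffle: bool = True, migrate_rdd: bool = True) -> dict:
    """Return Spark properties for decommissioning with optional data migration."""
    # Assumption: spark.decommission.enabled (an Apache Spark property) is the
    # switch for decommissioning itself; this article does not list it.
    conf = {"spark.decommission.enabled": "true"}
    if migrate_shuffle or migrate_rdd:
        conf["spark.storage.decommission.enabled"] = "true"
    if migrate_shuffle:
        conf["spark.storage.decommission.shuffleBlocks.enabled"] = "true"
    if migrate_rdd:
        conf["spark.storage.decommission.rddBlocks.enabled"] = "true"
    return conf


def to_spark_config_field(conf: dict) -> str:
    """Render properties one per line, as 'key value', for the Spark config field."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


# Example: migrate shuffle data only, as might suit RDDs replicated more than once.
print(to_spark_config_field(decommission_spark_conf(migrate_rdd=False)))
```

Passing `migrate_rdd=False` reflects the recommendation above for RDDs whose StorageLevel replication is greater than 1.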
To enable decommissioning for workers, enter this property in the Environment Variables field: