Manage Clusters

This topic describes how to manage Databricks clusters, including displaying, editing, starting, terminating, deleting, controlling access, and monitoring performance and logs.

Display clusters

To display the clusters in your workspace, click the clusters icon in the sidebar.

The Clusters page displays two lists: Interactive Clusters and Automated Clusters. Each list includes:

  • Cluster name
  • State
  • Number of nodes
  • Type of driver and worker nodes
  • Databricks Runtime version
  • Cluster creator or job owner

In addition to the common cluster information, the Interactive Clusters list shows the number of notebooks and libraries attached to the cluster. Above the list is the number of pinned clusters.

../_images/interactive-clusters.png
../_images/automated-clusters.png

An icon to the left of an interactive cluster name indicates the cluster's state and whether the cluster is pinned, is a high concurrency cluster, or has table access control enabled:

  • Pinned
  • Starting, Terminating
  • Standard cluster
    • Running
    • Terminated
  • High concurrency cluster
    • Running
    • Terminated
  • Access Denied
    • Running
    • Terminated
  • Table ACLs enabled
    • Running
    • Terminated

Links and buttons at the far right of an interactive cluster provide access to the Spark UI and logs and the terminate, restart, clone, permissions, and delete actions.

../_images/interactive-cluster-actions.png

Links and buttons at the far right of an automated cluster provide access to the Job Run page, Spark UI and logs, and the terminate, clone, and permissions actions.

../_images/job-cluster-actions.png

Filter cluster list

You can filter the cluster lists using the buttons and Filter field at the top right:

../_images/cluster-filters.png
  • To display only clusters that you created, click Created by me.
  • To display only clusters that are accessible to you (if cluster access control is enabled), click Accessible by me.
  • To filter by a string that appears in any field, type the string in the Filter text box.

Pin a cluster

To keep an interactive cluster configuration even after a cluster has been terminated for more than 30 days, an administrator can pin the cluster. Up to 20 clusters can be pinned.

You can pin a cluster from the:

  • Cluster list

    To pin or unpin a cluster, click the pin icon to the left of the cluster name.

    ../_images/pin-list.png
  • Cluster detail page

    To pin or unpin a cluster, click the pin icon to the right of the cluster name.

    ../_images/pin-detail.png

You can also invoke the Pin API endpoint to programmatically pin a cluster.
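For example, a minimal sketch of calling the Pin endpoint with the Python requests library; the workspace URL, token environment variables, and cluster ID are placeholders:

    import os
    import requests

    # Placeholders: set DATABRICKS_HOST (e.g. https://<workspace>.cloud.databricks.com)
    # and DATABRICKS_TOKEN (a personal access token) in your environment.
    host = os.environ["DATABRICKS_HOST"]
    headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

    # Pin the cluster so its configuration is retained beyond the 30-day limit.
    resp = requests.post(
        f"{host}/api/2.0/clusters/pin",
        headers=headers,
        json={"cluster_id": "1234-567890-abcde123"},  # replace with your cluster ID
    )
    resp.raise_for_status()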

View a cluster configuration as a JSON file

Sometimes it can be helpful to view your cluster configuration as JSON. This is especially useful when you want to create similar clusters using the Clusters API. To view an existing cluster's configuration, go to the Configuration tab, click JSON in the top right of the tab, copy the JSON, and paste it into your API call. The JSON view is read-only.

../_images/cluster-json-aws.png
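For example, a minimal sketch of pasting the copied configuration into a Clusters API create call using the Python requests library; the workspace URL, token environment variables, and configuration values are placeholders:

    import os
    import requests

    host = os.environ["DATABRICKS_HOST"]   # placeholder, e.g. https://<workspace>.cloud.databricks.com
    headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

    # Paste the JSON copied from the Configuration tab here (placeholder values shown).
    cluster_spec = {
        "cluster_name": "cloned-from-json",
        "spark_version": "5.5.x-scala2.11",
        "node_type_id": "i3.xlarge",
        "num_workers": 2,
    }

    resp = requests.post(f"{host}/api/2.0/clusters/create", headers=headers, json=cluster_spec)
    resp.raise_for_status()
    print(resp.json()["cluster_id"])  # ID of the newly created cluster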

Edit a cluster

You edit a cluster configuration from the cluster detail page.

../_images/cluster-edit.png

You can also invoke the Edit API endpoint to programmatically edit the cluster.
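For example, a minimal sketch of an Edit call with the Python requests library; the edit endpoint expects a full cluster specification rather than only the changed fields, and all values below are placeholders:

    import os
    import requests

    host = os.environ["DATABRICKS_HOST"]   # placeholder workspace URL
    headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

    # Supply the complete specification of the cluster being edited (placeholder values).
    resp = requests.post(
        f"{host}/api/2.0/clusters/edit",
        headers=headers,
        json={
            "cluster_id": "1234-567890-abcde123",
            "cluster_name": "shared-analytics",
            "spark_version": "5.5.x-scala2.11",
            "node_type_id": "i3.xlarge",
            "num_workers": 4,   # resized from the previous worker count
        },
    )
    resp.raise_for_status()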

Note

  • Notebooks and jobs that were attached to the cluster remain attached after editing.
  • Libraries installed on the cluster remain installed after editing.
  • If you edit any attribute of a running cluster (except for the cluster size and permissions), you must restart it. This can disrupt users who are currently using the cluster.
  • You can edit only running or terminated clusters. You can, however, update permissions for clusters that are not in those states on the cluster details page.

For detailed information about cluster configuration properties you can edit, see Cluster Configurations.

Clone a cluster

You can create a new cluster by cloning an existing cluster.

  • Cluster list

    ../_images/clone-list.png
  • Cluster detail page

    ../_images/clone-details.png

The cluster creation form opens prepopulated with the cluster configuration. The following attributes from the existing cluster are not included in the clone:

  • Cluster permissions
  • Installed libraries
  • Attached notebooks

Control access to clusters

Cluster access control allows admins and delegated users to give fine-grained cluster access to other users. Broadly, there are two types of cluster access control:

  1. Cluster creation permission: Admins can choose which users are allowed to create clusters.

    ../_images/acl-allow-user.png
  2. Cluster-level permissions: A user who has the Can manage permission for a cluster can configure whether other users can attach to, restart, resize, and manage that cluster.

    ../_images/acl-list.png

To learn how to configure cluster access control and cluster-level permissions, see Cluster Access Control.
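Cluster-level permissions can also be managed programmatically through the Permissions REST API. A minimal sketch of granting a user the Can Attach To permission; the endpoint path, permission level name, and all values are assumptions to verify against your workspace's API reference:

    import os
    import requests

    host = os.environ["DATABRICKS_HOST"]   # placeholder workspace URL
    headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}
    cluster_id = "1234-567890-abcde123"    # placeholder cluster ID

    # PATCH adds to the existing access control list instead of replacing it.
    resp = requests.patch(
        f"{host}/api/2.0/permissions/clusters/{cluster_id}",
        headers=headers,
        json={
            "access_control_list": [
                {"user_name": "someone@example.com", "permission_level": "CAN_ATTACH_TO"}
            ]
        },
    )
    resp.raise_for_status()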

Start a cluster

In addition to creating a new cluster, you can start a previously terminated cluster. This re-creates the terminated cluster with its original configuration.

You can start a cluster from the:

  • Cluster list:

    ../_images/start-list.png
  • Cluster detail page:

    ../_images/start-details.png
  • Notebook cluster attach dropdown:

    ../_images/start-from-notebook.png

You can also invoke the Start API endpoint to programmatically start a cluster.
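For example, a minimal sketch of a Start call with the Python requests library; the workspace URL, token environment variables, and cluster ID are placeholders:

    import os
    import requests

    host = os.environ["DATABRICKS_HOST"]   # placeholder workspace URL
    headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

    # Start a previously terminated cluster; it keeps its original cluster ID.
    resp = requests.post(
        f"{host}/api/2.0/clusters/start",
        headers=headers,
        json={"cluster_id": "1234-567890-abcde123"},  # replace with your cluster ID
    )
    resp.raise_for_status()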

Databricks identifies a cluster with a unique cluster ID. When you start a terminated cluster, Databricks re-creates the cluster with the same ID, automatically installs all the libraries, and re-attaches the notebooks.

Cluster autostart for jobs

When a job assigned to an existing terminated cluster is scheduled to run or you connect to a terminated cluster from a JDBC/ODBC interface, the cluster is automatically restarted. See Create a job and JDBC connect.

Cluster autostart allows you to configure clusters to autoterminate without requiring manual intervention to restart the clusters for scheduled jobs. Furthermore, you can schedule cluster initialization by scheduling a job to run on a terminated cluster.

Before a cluster is restarted automatically, cluster and job access control permissions are checked.

Note

If your cluster was created in Databricks platform version 2.70 or earlier, there is no autostart: jobs scheduled to run on terminated clusters will fail.

Terminate a cluster

To save cluster resources, you can terminate a cluster. A terminated cluster cannot run notebooks or jobs, but its configuration is stored so that it can be reused (or—in the case of some types of jobs—autostarted) at a later time. You can manually terminate a cluster or configure the cluster to automatically terminate after a specified period of inactivity. Databricks records information whenever a cluster is terminated.

../_images/termination-reason.png

Note

When you run a job on a New Automated Cluster (which is usually recommended), the cluster terminates and is unavailable for restarting when the job is complete. On the other hand, if you schedule a job to run on an Existing Interactive Cluster that has been terminated, that cluster will autostart.

Databricks retains the configuration information for up to 70 interactive clusters terminated in the last 30 days and up to 30 automated clusters recently terminated by the job scheduler. To keep an interactive cluster configuration even after it has been terminated for more than 30 days, an administrator can pin a cluster to the cluster list.

Manual termination

You can manually terminate a cluster from the:

  • Cluster list

    ../_images/terminate-list.png
  • Cluster detail page

    ../_images/terminate-details.png
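You can also terminate a cluster programmatically. A minimal sketch using the Clusters API delete endpoint, which terminates the cluster but keeps its configuration (unlike permanent delete); the values below are placeholders:

    import os
    import requests

    host = os.environ["DATABRICKS_HOST"]   # placeholder workspace URL
    headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

    # Terminate the cluster; its configuration is retained and it can be started again later.
    resp = requests.post(
        f"{host}/api/2.0/clusters/delete",
        headers=headers,
        json={"cluster_id": "1234-567890-abcde123"},  # replace with your cluster ID
    )
    resp.raise_for_status()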

Automatic termination

You can also set auto termination for a cluster. During cluster creation, you can specify an inactivity period in minutes after which you want the cluster to terminate. If the difference between the current time and the last command run on the cluster is more than the inactivity period specified, Databricks automatically terminates that cluster.

A cluster is considered inactive when all commands on the cluster, including Spark jobs, Structured Streaming, and JDBC calls, have finished executing. This does not include commands run by SSH-ing into the cluster and running bash commands.

Warning

  • Clusters do not report activity resulting from the use of DStreams. This means that an autoterminating cluster may be terminated while it is running DStreams. Turn off auto termination for clusters running DStreams or consider using Structured Streaming.
  • The auto termination feature monitors only Spark jobs, not user-defined local processes. Therefore, if all Spark jobs have completed, a cluster may be terminated even if local processes are running.

Configure automatic termination

You configure automatic termination in the Auto Termination field in the Autopilot Options box on the cluster creation page:

../_images/autopilot-aws.png

Important

The default value of the auto terminate setting depends on whether you choose to create a standard or high concurrency cluster:

  • Standard clusters are configured to terminate automatically after 120 minutes.
  • High concurrency clusters are configured to not terminate automatically.

You can opt out of auto termination by clearing the Auto Termination checkbox or by specifying an inactivity period of 0.
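Auto termination can also be set when a cluster is created through the Clusters API, using the autotermination_minutes attribute of the cluster specification. A minimal fragment with placeholder values:

    # Fragment of a Clusters API create request body (placeholder values).
    cluster_spec = {
        "cluster_name": "nightly-etl",
        "spark_version": "5.5.x-scala2.11",
        "node_type_id": "i3.xlarge",
        "num_workers": 2,
        "autotermination_minutes": 120,   # 0 opts out of auto termination
    }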

Note

Auto termination is best supported in the latest Spark versions. Older Spark versions have known limitations which can result in inaccurate reporting of cluster activity. For example, clusters running JDBC, R, or streaming commands can report a stale activity time that leads to premature cluster termination. Please upgrade to the most recent Spark version to benefit from bug fixes and improvements to auto termination.

Unexpected termination

Sometimes a cluster is terminated unexpectedly, not as a result of a manual termination or a configured automatic termination. For a list of termination reasons and remediation steps, see the Knowledge Base.

Delete a cluster

Deleting a cluster terminates the cluster and removes its configuration.

Warning

You cannot undo this action.

You cannot delete a pinned cluster. To delete a pinned cluster, an administrator must first unpin it.

To delete a cluster, click the delete icon in the cluster actions on the Clusters page.

../_images/delete-list.png

You can also invoke the Permanent Delete API endpoint to programmatically delete a cluster.
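For example, a minimal sketch of a Permanent Delete call with the Python requests library; the workspace URL, token environment variables, and cluster ID are placeholders:

    import os
    import requests

    host = os.environ["DATABRICKS_HOST"]   # placeholder workspace URL
    headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

    # Permanently deletes the cluster and removes its configuration; this cannot be undone.
    resp = requests.post(
        f"{host}/api/2.0/clusters/permanent-delete",
        headers=headers,
        json={"cluster_id": "1234-567890-abcde123"},  # replace with your cluster ID
    )
    resp.raise_for_status()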

View cluster information in the Apache Spark UI

Detailed information about Spark jobs is displayed in the Spark UI, which you can access from:

  • The cluster list: click the Spark UI link on the cluster row.
  • The cluster details page: click the Spark UI tab.

The Spark UI displays cluster history for both active and terminated clusters.

../_images/spark-ui-aws.png

Note

If a terminated cluster is restarted, the Spark UI displays information for the restarted cluster, not the historical information for the terminated cluster.

View cluster logs

Databricks provides three kinds of logging of cluster-related activity:

  • Cluster event logs
  • Cluster driver and worker logs
  • Cluster init-script logs

This section discusses cluster event logs and driver and worker logs. For details about init-script logs, see Cluster-scoped init script logs.

Cluster event logs

The cluster event log displays important cluster lifecycle events that are triggered manually by user actions or automatically by Databricks. Such events affect the operation of a cluster as a whole and the jobs running in the cluster.

For supported event types, see the REST API ClusterEventType data structure.

Events are stored for 60 days, which is comparable to other data retention times in Databricks.
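Event logs can also be retrieved programmatically from the Events endpoint of the Clusters API. A minimal sketch that fetches recent events of selected types with the Python requests library; the workspace URL, token environment variables, cluster ID, and chosen event types are placeholders:

    import os
    import requests

    host = os.environ["DATABRICKS_HOST"]   # placeholder workspace URL
    headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

    # Fetch the most recent STARTING and TERMINATING events for one cluster.
    resp = requests.post(
        f"{host}/api/2.0/clusters/events",
        headers=headers,
        json={
            "cluster_id": "1234-567890-abcde123",   # replace with your cluster ID
            "event_types": ["STARTING", "TERMINATING"],
            "limit": 25,
        },
    )
    resp.raise_for_status()
    for event in resp.json().get("events", []):
        print(event["timestamp"], event["type"])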

View a cluster event log

  1. Click the clusters icon in the sidebar.

  2. Click a cluster name.

  3. Click the Event Log tab.

    ../_images/cluster-event-log.png

To filter the events, click the menu dropdown in the Filter by Event Type… field and select one or more event type checkboxes.

Use Select all to make it easier to filter by excluding particular event types.

../_images/cluster-event-log-filter.gif

View event details

For more information about an event, click its row in the log and then click the JSON tab for details.

../_images/cluster-event-details.png

Cluster driver and worker logs

The direct print and log statements from your notebooks, jobs, and libraries go to the Spark driver logs. These logs have three outputs:

  • Standard output
  • Standard error
  • Log4j logs

To access these driver log files from the UI, go to the Driver Logs tab on the cluster details page.

../_images/driver-logs.png

Log files are rotated periodically. Older log files appear at the top of the page, listed with timestamp information. You can download any of the logs for troubleshooting.

To view Spark worker logs, you can use the Spark UI. You can also configure a log delivery location for the cluster. Both worker and cluster logs are delivered to the location you specify.

Monitor performance

To help you monitor the performance of Databricks clusters, Databricks provides access to Ganglia metrics from the cluster details page.

You can also install Datadog agents on cluster nodes to send Datadog metrics to your Datadog account.

Ganglia metrics

To access the Ganglia UI, navigate to the Metrics tab on the cluster details page. CPU metrics are available in the Ganglia UI for all Databricks runtimes. GPU metrics are available for GPU-enabled clusters running Databricks Runtime 4.1 and above.

../_images/metrics-tab.png

To view live metrics, click the Ganglia UI link.

To view historical metrics, click a snapshot file. The snapshot contains aggregated metrics for the hour preceding the selected time.

Note

Ganglia metrics are not supported on clusters with table access control enabled.

Configure metrics collection

By default, Databricks collects Ganglia metrics every 15 minutes. To configure the collection period, set the DATABRICKS_GANGLIA_SNAPSHOT_PERIOD_MINUTES environment variable using an init script or in the spark_env_vars field in the Cluster Create API.
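For example, a minimal fragment of a Cluster Create API request body that sets the collection period to 5 minutes; every attribute other than the environment variable name is a placeholder:

    # Fragment of a Clusters API create request body (placeholder values).
    cluster_spec = {
        "cluster_name": "metrics-test",
        "spark_version": "5.5.x-scala2.11",
        "node_type_id": "i3.xlarge",
        "num_workers": 2,
        "spark_env_vars": {
            "DATABRICKS_GANGLIA_SNAPSHOT_PERIOD_MINUTES": "5"   # collect every 5 minutes
        },
    }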

Datadog metrics

../_images/datadog-metrics.png

You can install Datadog agents on cluster nodes to send Datadog metrics to your Datadog account. The following notebook demonstrates how to install a Datadog agent on a cluster using a cluster-scoped init script.

To install the Datadog agent on all clusters, use a global init script after testing the cluster-scoped init script.

Install Datadog agent init script notebook
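As a rough illustration of the pattern (not a substitute for the notebook above), a cluster-scoped init script can be written to DBFS from a notebook cell with dbutils.fs.put. The Datadog install command, script URL, and API key handling below are assumptions; confirm the current installation instructions in Datadog's documentation:

    # Run in a Databricks notebook cell: writes a cluster-scoped init script to DBFS.
    # The install script URL and DD_API_KEY handling are assumptions to verify with Datadog.
    init_script = (
        "#!/bin/bash\n"
        "set -e\n"
        'DD_API_KEY="<your-datadog-api-key>" '   # placeholder; manage the key securely
        'bash -c "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script.sh)"\n'
    )
    dbutils.fs.put("dbfs:/databricks/init-scripts/datadog-install.sh", init_script, True)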