Configure classic compute for Lakeflow Declarative Pipelines
This page contains instructions for configuring classic compute for Lakeflow Declarative Pipelines. For a reference of the JSON schema, see the clusters definition in the Pipeline API reference.
To create a pipeline that runs on classic compute, users must first have permission to deploy classic compute, either unrestricted creation permission or access to a compute policy. Serverless pipelines do not require compute creation permissions. By default, all workspace users can use serverless pipelines.
Because the Lakeflow Declarative Pipelines runtime manages the lifecycle of pipeline compute and runs a custom version of Databricks Runtime, you cannot manually set some compute settings in a pipeline configuration, such as the Spark version or cluster names. See Cluster attributes that are not user settable.
Select a compute policy
Workspace admins can configure compute policies to provide users with access to classic compute resources for Lakeflow Declarative Pipelines. Compute policies are optional. Check with your workspace administrator if you lack the compute privileges required for Lakeflow Declarative Pipelines. See Define limits on Lakeflow Declarative Pipelines compute.
When using the Pipelines API, to ensure that compute policy default values are correctly applied, set "apply_policy_default_values": true in the clusters definition:
{
  "clusters": [
    {
      "label": "default",
      "policy_id": "<policy-id>",
      "apply_policy_default_values": true
    }
  ]
}
Configure compute tags
You can add custom tags to your pipeline's classic compute resources. Tags allow you to monitor the cost of compute resources used by various groups in your organization. Databricks applies these tags to cloud resources and to usage logs recorded in the usage system tables. You can add tags using the Cluster tags UI setting or by editing the JSON configuration of your pipeline.
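For example, the following snippet shows one way to add tags through the custom_tags field of the clusters definition in your pipeline's JSON configuration. This is a minimal sketch; the tag keys and values (team, cost_center) are placeholders for your own tag names:
{
  "clusters": [
    {
      "label": "default",
      "custom_tags": {
        "team": "data-engineering",
        "cost_center": "1234"
      }
    }
  ]
}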
Select instance types to run a pipeline
By default, Lakeflow Declarative Pipelines selects the instance types for your pipeline's driver and worker nodes. You can optionally configure the instance types. For example, select instance types to improve pipeline performance or address memory issues when running your pipeline.
To configure instance types when you create or edit a pipeline in the Lakeflow Declarative Pipelines UI:
- Click the Settings button.
- In the Advanced section of the pipeline settings, in the Worker type and Driver type drop-down menus, select the instance types for the pipeline.
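You can also set instance types in your pipeline's JSON configuration with the node_type_id and driver_node_type_id fields of the clusters definition. The following is a minimal sketch; the instance type names are examples, so substitute types available in your cloud:
{
  "clusters": [
    {
      "label": "default",
      "node_type_id": "n1-highmem-16",
      "driver_node_type_id": "n1-standard-4"
    }
  ]
}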
Configure separate settings for the update and maintenance clusters
Each declarative pipeline has two associated compute resources: an update cluster that processes pipeline updates and a maintenance cluster that runs daily maintenance tasks (including predictive optimization). By default, your compute configurations apply to both of these clusters. Using the same settings for both clusters improves the reliability of maintenance runs by ensuring that required configurations such as data access credentials for a storage location are applied to the maintenance cluster.
To apply settings to only one of the two clusters, add the label field to the setting JSON object. There are three possible values for the label field:
- maintenance: Applies the setting only to the maintenance cluster.
- updates: Applies the setting only to the update cluster.
- default: Applies the setting to both the update and maintenance clusters. This is the default value if the label field is omitted.
If there is a conflicting setting, the setting with the updates or maintenance label overrides the setting defined with the default label.
The daily maintenance cluster is used only in certain cases:
- Pipelines stored in the Hive metastore.
- Pipelines in workspaces that have not accepted the serverless compute terms of service. If you need assistance accepting the terms, contact your Databricks representative.
Example: Define a setting for the update cluster
The following example defines a Spark configuration parameter that is added only to the configuration for the updates cluster:
{
  "clusters": [
    {
      "label": "default",
      "autoscale": {
        "min_workers": 1,
        "max_workers": 5,
        "mode": "ENHANCED"
      }
    },
    {
      "label": "updates",
      "spark_conf": {
        "key": "value"
      }
    }
  ]
}
Example: Configure instance types for the update cluster
To avoid assigning unnecessary resources to the maintenance cluster, this example uses the updates label to set the instance types for only the updates cluster.
{
  "clusters": [
    {
      "label": "updates",
      "node_type_id": "n1-highmem-16",
      "driver_node_type_id": "n1-standard-4",
      "...": "..."
    }
  ]
}
Delay compute shutdown
To control cluster shutdown behavior, you can use development or production mode or use the pipelines.clusterShutdown.delay setting in the pipeline configuration. The following example sets the pipelines.clusterShutdown.delay value to 60 seconds:
{
  "configuration": {
    "pipelines.clusterShutdown.delay": "60s"
  }
}
When production mode is enabled, the default value for pipelines.clusterShutdown.delay is 0 seconds. When development mode is enabled, the default value is 2 hours.
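If you manage a pipeline through the Pipelines API, the mode is controlled by the top-level development field in the pipeline settings. The following is a minimal sketch that assumes you want production mode combined with an explicit shutdown delay:
{
  "development": false,
  "configuration": {
    "pipelines.clusterShutdown.delay": "60s"
  }
}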
Because Lakeflow Declarative Pipelines compute resources automatically shut down when not in use, you cannot use a compute policy that sets autotermination_minutes. Doing so results in an error.
Create a single-node compute
A single-node compute has a driver node that acts as both master and worker. It is intended for workloads that use small amounts of data or are not distributed.
To create a single-node compute, set num_workers to 0. For example:
{
  "clusters": [
    {
      "num_workers": 0
    }
  ]
}