Configure pipeline settings for Delta Live Tables
This article provides details on configuring pipeline settings for Delta Live Tables. Delta Live Tables provides a user interface for configuring and editing pipeline settings. The UI also provides an option to display and edit settings in JSON.
Note
You can configure most settings with either the UI or a JSON specification. Some advanced options are only available using the JSON configuration.
Databricks recommends familiarizing yourself with Delta Live Tables settings using the UI. If necessary, you can directly edit the JSON configuration in the workspace. JSON configuration files are also useful when deploying pipelines to new environments or when using the CLI or REST API.
For a full reference to the Delta Live Tables JSON configuration settings, see Delta Live Tables pipeline configurations.
Choose a product edition
Select the Delta Live Tables product edition with the features best suited for your pipeline requirements. The following product editions are available:
Core to run streaming ingest workloads. Select the Core edition if your pipeline doesn’t require advanced features such as change data capture (CDC) or Delta Live Tables expectations.
Pro to run streaming ingest and CDC workloads. The Pro product edition supports all of the Core features, plus support for workloads that require updating tables based on changes in source data.
Advanced to run streaming ingest workloads, CDC workloads, and workloads that require expectations. The Advanced product edition supports the features of the Core and Pro editions, and also supports enforcement of data quality constraints with Delta Live Tables expectations.
You can select the product edition when you create or edit a pipeline. You can select a different edition for each pipeline. See the Delta Live Tables product page.
Note
If your pipeline includes features not supported by the selected product edition, for example, expectations, you will receive an error message with the reason for the error. You can then edit the pipeline to select the appropriate edition.
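When you edit the JSON settings directly, the product edition is controlled by the top-level edition field. The following is a minimal sketch; the pipeline name and the ADVANCED value are illustrative:
{
  "name": "Example pipeline",
  "edition": "ADVANCED"
}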
Choose a pipeline mode
The pipeline mode determines whether your pipeline is updated continuously or with manual triggers. See Continuous vs. triggered pipeline execution.
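In the JSON settings, the pipeline mode corresponds to the continuous flag. A minimal sketch; set the value to true for a continuous pipeline or false (the default) for a triggered pipeline:
{
  "continuous": false
}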
Select a cluster policy
Users must have permissions to deploy compute to configure and update Delta Live Tables pipelines. Workspace admins can configure cluster policies to provide users with access to compute resources for Delta Live Tables. See Define limits on Delta Live Tables pipeline clusters.
Note
Cluster policies are optional. Check with your workspace administrator if you lack compute privileges required for Delta Live Tables.
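If your administrator provides a cluster policy, you can reference it from a cluster configuration in the pipeline JSON. A sketch, assuming <policy-id> is the ID of an existing policy in your workspace:
{
  "clusters": [
    {
      "label": "default",
      "policy_id": "<policy-id>",
      "apply_policy_default_values": true
    }
  ]
}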
Configure source code libraries
You can use the file selector in the Delta Live Tables UI to configure the source code defining your pipeline. Pipeline source code is defined in Databricks notebooks or in SQL or Python scripts stored in workspace files. When you create or edit your pipeline, you can add one or more notebooks or workspace files or a combination of notebooks and workspace files.
Because Delta Live Tables automatically analyzes dataset dependencies to construct the processing graph for your pipeline, you can add source code libraries in any order.
You can also modify the JSON file to include Delta Live Tables source code defined in SQL and Python scripts stored in workspace files. The following example includes notebooks and workspace files from Databricks Repos:
{
"name": "Example pipeline 3",
"storage": "dbfs:/pipeline-examples/storage-location/example3",
"libraries": [
{ "notebook": { "path": "/example-notebook_1" } },
{ "notebook": { "path": "/example-notebook_2" } },
{ "file": { "path": "/Repos/<user-name>@databricks.com/Apply_Changes_Into/apply_changes_into.sql" } },
{ "file": { "path": "/Repos/<user-name>@databricks.com/Apply_Changes_Into/apply_changes_into.py" } }
]
}
Specify a storage location
You can specify a storage location for a pipeline that publishes to the Hive metastore. The primary motivation for specifying a location is to control the object storage location for data written by your pipeline.
Because all tables, data, checkpoints, and metadata for Delta Live Tables pipelines are fully managed by Delta Live Tables, most interaction with Delta Live Tables datasets happens through tables registered to the Hive metastore or Unity Catalog.
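For example, the following sketch sets the storage location with the top-level storage field; the path shown is illustrative:
{
  "name": "Example pipeline",
  "storage": "dbfs:/pipeline-examples/storage-location/example"
}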
Specify a target schema for pipeline output tables
While optional, you should specify a target to publish tables created by your pipeline anytime you move beyond development and testing for a new pipeline. Publishing a pipeline to a target makes datasets available for querying elsewhere in your Databricks environment. See Publish data from Delta Live Tables pipelines to the Hive metastore or Use Unity Catalog with your Delta Live Tables pipelines.
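In the JSON settings, the target schema is set with the top-level target field. A minimal sketch; the schema name is illustrative:
{
  "name": "Example pipeline",
  "target": "example_schema"
}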
Configure your compute settings
Each Delta Live Tables pipeline has two associated clusters:
The updates cluster processes pipeline updates.
The maintenance cluster runs daily maintenance tasks.
The configuration used by these clusters is determined by the clusters attribute specified in your pipeline settings.
You can add compute settings that apply to only a specific type of cluster by using cluster labels. There are three labels you can use when configuring pipeline clusters:
Note
The cluster label setting can be omitted if you are defining only one cluster configuration. The default label is applied to cluster configurations if no setting for the label is provided. The cluster label setting is required only if you need to customize settings for different cluster types.
The default label defines compute settings to apply to both the updates and maintenance clusters. Applying the same settings to both clusters improves the reliability of maintenance runs by ensuring that required configurations, for example, data access credentials for a storage location, are applied to the maintenance cluster.
The maintenance label defines compute settings to apply to only the maintenance cluster. You can also use the maintenance label to override settings configured by the default label.
The updates label defines settings to apply to only the updates cluster. Use the updates label to configure settings that should not be applied to the maintenance cluster.
Settings defined using the default and updates labels are merged to create the final configuration for the updates cluster. If the same setting is defined using both default and updates labels, the setting defined with the updates label overrides the setting defined with the default label.
The following example defines a Spark configuration parameter that is added only to the configuration for the updates cluster:
{
"clusters": [
{
"label": "default",
"autoscale": {
"min_workers": 1,
"max_workers": 5,
"mode": "ENHANCED"
}
},
{
"label": "updates",
"spark_conf": {
"key": "value"
}
}
]
}
Delta Live Tables provides similar options for cluster settings as other compute on Databricks. Like other pipeline settings, you can modify the JSON configuration for clusters to specify options not present in the UI. See Clusters.
Note
Because the Delta Live Tables runtime manages the lifecycle of pipeline clusters and runs a custom version of Databricks Runtime, you cannot manually set some cluster settings in a pipeline configuration, such as the Spark version or cluster names. See Cluster attributes that are not user settable.
You can configure Delta Live Tables pipelines to leverage Photon. See What is Photon?.
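In the JSON settings, Photon is controlled by the top-level photon flag. A minimal sketch:
{
  "photon": true
}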
Select instance types to run a pipeline
By default, Delta Live Tables selects the instance types for the driver and worker nodes that run your pipeline, but you can also manually configure the instance types. For example, you might want to select instance types to improve pipeline performance or address memory issues when running your pipeline. You can configure instance types when you create or edit a pipeline with the REST API, or in the Delta Live Tables UI.
To configure instance types when you create or edit a pipeline in the Delta Live Tables UI:
Click the Settings button.
On the Pipeline settings page, click the JSON button.
Enter the instance type configurations in the cluster configuration:
Note
To avoid assigning unnecessary resources to the maintenance cluster, this example uses the updates label to set the instance types for only the updates cluster. To assign the instance types to both updates and maintenance clusters, use the default label or omit the setting for the label. The default label is applied to pipeline cluster configurations if no setting for the label is provided. See Configure your compute settings.
{
"clusters": [
{
"label": "updates",
"node_type_id": "r6i.xlarge",
"driver_node_type_id": "i3.large",
"..." : "..."
}
]
}
Use autoscaling to increase efficiency and reduce resource usage
Use Enhanced Autoscaling to optimize the cluster utilization of your pipelines. Enhanced Autoscaling adds resources only if the system determines those resources will increase pipeline processing speed. Resources are freed when no longer needed, and clusters are shut down as soon as all pipeline updates are complete.
Use the following guidelines when configuring Enhanced Autoscaling for production pipelines:
Leave the Min workers setting at the default.
Set the Max workers setting to a value based on budget and pipeline priority, as in the example following this list.
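For example, a production pipeline might keep the default minimum and cap the number of workers based on budget; the max_workers value shown is illustrative:
{
  "clusters": [
    {
      "label": "updates",
      "autoscale": {
        "min_workers": 1,
        "max_workers": 8,
        "mode": "ENHANCED"
      }
    }
  ]
}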
Delay compute shutdown
Because a Delta Live Tables cluster automatically shuts down when not in use, referencing a cluster policy that sets autotermination_minutes
in your cluster configuration results in an error. To control cluster shutdown behavior, you can use development or production mode or use the pipelines.clusterShutdown.delay
setting in the pipeline configuration. The following example sets the pipelines.clusterShutdown.delay
value to 60 seconds:
{
"configuration": {
"pipelines.clusterShutdown.delay": "60s"
}
}
When production mode is enabled, the default value for pipelines.clusterShutdown.delay is 0 seconds. When development mode is enabled, the default value is 2 hours.
Create a single node cluster
If you set num_workers to 0 in cluster settings, the cluster is created as a Single Node cluster. Configuring an autoscaling cluster and setting min_workers to 0 and max_workers to 0 also creates a Single Node cluster.
If you configure an autoscaling cluster and set only min_workers to 0, then the cluster is not created as a Single Node cluster. The cluster has at least one active worker at all times until terminated.
An example cluster configuration to create a Single Node cluster in Delta Live Tables:
{
"clusters": [
{
"num_workers": 0
}
]
}
Configure cluster tags
You can use cluster tags to monitor usage for your pipeline clusters. Add cluster tags in the Delta Live Tables UI when you create or edit a pipeline, or by editing the JSON settings for your pipeline clusters.
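In the JSON settings, tags are added with the custom_tags attribute in a cluster configuration. A sketch; the tag keys and values shown are illustrative:
{
  "clusters": [
    {
      "label": "updates",
      "custom_tags": {
        "team": "data-engineering",
        "cost_center": "1234"
      }
    }
  ]
}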
Cloud storage configuration
You use AWS instance profiles to configure access to S3 storage in AWS. To add an instance profile in the Delta Live Tables UI, click Advanced when you create or edit a pipeline and select an instance profile in the Instance profile dropdown menu.
You can also configure an AWS instance profile by editing the JSON settings for your pipeline clusters when you create or edit a pipeline with the Delta Live Tables API or in the Delta Live Tables UI:
On the Pipeline details page for your pipeline, click the Settings button. The Pipeline settings page appears.
Click the JSON button.
Enter the instance profile configuration in the aws_attributes.instance_profile_arn field in the cluster configuration:
{
"clusters": [
{
"aws_attributes": {
"instance_profile_arn": "arn:aws:..."
}
}
]
}
You can also configure instance profiles when you create cluster policies for your Delta Live Tables pipelines. For an example, see the knowledge base.
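The following sketch shows a cluster policy definition that fixes the instance profile for pipeline clusters; the ARN is a placeholder, and the policy must still be referenced from your pipeline cluster configuration:
{
  "aws_attributes.instance_profile_arn": {
    "type": "fixed",
    "value": "arn:aws:iam::<account-id>:instance-profile/<profile-name>",
    "hidden": true
  }
}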
Parameterize pipelines
The Python and SQL code that defines your datasets can be parameterized by the pipeline’s settings. Parameterization enables the following use cases:
Separating long paths and other variables from your code.
Reducing the amount of data processed in development or staging environments to speed up testing.
Reusing the same transformation logic to process data from multiple data sources.
The following example uses the startDate configuration value to limit the development pipeline to a subset of the input data:
CREATE OR REFRESH LIVE TABLE customer_events
AS SELECT * FROM sourceTable WHERE date > '${mypipeline.startDate}';
import dlt
from pyspark.sql.functions import col

@dlt.table
def customer_events():
    # Read the start date from the pipeline configuration
    start_date = spark.conf.get("mypipeline.startDate")
    # Keep only events after the configured start date
    return spark.read.table("sourceTable").where(col("date") > start_date)
{
"name": "Data Ingest - DEV",
"configuration": {
"mypipeline.startDate": "2021-01-02"
}
}
{
"name": "Data Ingest - PROD",
"configuration": {
"mypipeline.startDate": "2010-01-02"
}
}
Pipelines trigger interval
You can use pipelines.trigger.interval to control the trigger interval for a flow updating a table or an entire pipeline. Because a triggered pipeline processes each table only once, the pipelines.trigger.interval setting is used only with continuous pipelines.
Databricks recommends setting pipelines.trigger.interval on individual tables because of different defaults for streaming versus batch queries. Set the value on a pipeline only when your processing requires controlling updates for the entire pipeline graph.
You set pipelines.trigger.interval on a table using spark_conf in Python, or SET in SQL:
@dlt.table(
spark_conf={"pipelines.trigger.interval" : "10 seconds"}
)
def <function-name>():
return (<query>)
SET pipelines.trigger.interval='10 seconds';
CREATE OR REFRESH LIVE TABLE TABLE_NAME
AS SELECT ...
To set pipelines.trigger.interval on a pipeline, add it to the configuration object in the pipeline settings:
{
"configuration": {
"pipelines.trigger.interval": "10 seconds"
}
}
Add email notifications for pipeline events
You can configure one or more email addresses to receive notifications when the following occurs:
A pipeline update completes successfully.
A pipeline update fails, either with a retryable or a non-retryable error. Select this option to receive a notification for all pipeline failures.
A pipeline update fails with a non-retryable (fatal) error. Select this option to receive a notification only when a non-retryable error occurs.
A single data flow fails.
To configure email notifications when you create or edit a pipeline:
Click Add notification.
Enter one or more email addresses to receive notifications.
Click the check box for each notification type to send to the configured email addresses.
Click Add notification.
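If you manage pipeline settings as JSON, notifications can also be expressed in the configuration. The following is a sketch based on the notifications object in the pipeline settings; the email address and alert names shown are illustrative, so check them against the Delta Live Tables pipeline configurations reference:
{
  "notifications": [
    {
      "email_recipients": ["user@example.com"],
      "alerts": [
        "on-update-failure",
        "on-update-fatal-failure",
        "on-flow-failure"
      ]
    }
  ]
}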