
Configure classic compute for Lakeflow Declarative Pipelines

This page contains instructions for configuring classic compute for Lakeflow Declarative Pipelines. For a reference of the JSON schema, see the clusters definition in the Pipeline API reference.

To create a pipeline that runs on classic compute, users must first have permission to deploy classic compute, either unrestricted creation permission or access to a compute policy. Serverless pipelines do not require compute creation permissions. By default, all workspace users can use serverless pipelines.

note

Because the Lakeflow Declarative Pipelines runtime manages the lifecycle of pipeline compute and runs a custom version of Databricks Runtime, you cannot manually set some compute settings in a pipeline configuration, such as the Spark version or cluster names. See Cluster attributes that are not user settable.

Select a compute policy

Workspace admins can configure compute policies to provide users with access to classic compute resources for Lakeflow Declarative Pipelines. Compute policies are optional. Check with your workspace administrator if you lack the compute privileges required for Lakeflow Declarative Pipelines. See Define limits on Lakeflow Declarative Pipelines compute.

When using the Pipelines API, to ensure that compute policy default values are correctly applied, set "apply_policy_default_values": true in the clusters definition:

JSON
{
  "clusters": [
    {
      "label": "default",
      "policy_id": "<policy-id>",
      "apply_policy_default_values": true
    }
  ]
}

Configure compute tags

You can add custom tags to your pipeline's classic compute resources. Tags allow you to monitor the cost of compute resources used by various groups in your organization. Databricks applies these tags to cloud resources and to usage logs recorded in the usage system tables. You can add tags using the Cluster tags UI setting or by editing the JSON configuration of your pipeline.
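
For example, the following is a minimal sketch of adding tags through the JSON configuration of a pipeline's cluster definition, assuming the standard cluster custom_tags field; the tag keys and values shown (team, cost_center) are placeholders for your own tags:

JSON
{
  "clusters": [
    {
      "label": "default",
      "custom_tags": {
        "team": "data-engineering",
        "cost_center": "1234"
      }
    }
  ]
}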

Select instance types to run a pipeline

By default, Lakeflow Declarative Pipelines selects the instance types for your pipeline's driver and worker nodes. You can optionally configure the instance types. For example, select instance types to improve pipeline performance or address memory issues when running your pipeline.

To configure instance types when you create or edit a pipeline in the Lakeflow Declarative Pipelines UI:

  1. Click the Settings button.
  2. In the Advanced section of the pipeline settings, in the Worker type and Driver type drop-down menus, select the instance types for the pipeline.
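
You can also set instance types in the pipeline's JSON configuration. The following is a minimal sketch that applies instance types to both the update and maintenance clusters by using the default label; the instance type names are examples only and depend on your cloud provider:

JSON
{
  "clusters": [
    {
      "label": "default",
      "node_type_id": "n1-highmem-16",
      "driver_node_type_id": "n1-standard-4"
    }
  ]
}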

Configure separate settings for the update and maintenance clusters

Each declarative pipeline has two associated compute resources: an update cluster that processes pipeline updates and a maintenance cluster that runs daily maintenance tasks (including predictive optimization). By default, your compute configurations apply to both of these clusters. Using the same settings for both clusters improves the reliability of maintenance runs by ensuring that required configurations such as data access credentials for a storage location are applied to the maintenance cluster.

To apply settings to only one of the two clusters, add the label field to the setting JSON object. There are three possible values for the label field:

  • maintenance: Applies the setting only to the maintenance cluster.
  • updates: Applies the setting only to the update cluster.
  • default: Applies the setting to both the update and maintenance clusters. This is the default value if the label field is omitted.

If there is a conflicting setting, the setting with the updates or maintenance label overrides the setting defined with the default label.

note

The daily maintenance cluster is used only in certain cases:

  • Pipelines stored in the Hive metastore.
  • Pipelines in workspaces that have not accepted the serverless compute terms of service. If you need assistance accepting the terms, contact your Databricks representative.

Example: Define a setting for the update cluster

The following example defines a Spark configuration parameter that is added only to the configuration for the update cluster:

JSON
{
  "clusters": [
    {
      "label": "default",
      "autoscale": {
        "min_workers": 1,
        "max_workers": 5,
        "mode": "ENHANCED"
      }
    },
    {
      "label": "updates",
      "spark_conf": {
        "key": "value"
      }
    }
  ]
}

Example: Configure instance types for the update cluster

To avoid assigning unnecessary resources to the maintenance cluster, this example uses the updates label to set the instance types for only the update cluster.

JSON
{
  "clusters": [
    {
      "label": "updates",
      "node_type_id": "n1-highmem-16",
      "driver_node_type_id": "n1-standard-4",
      "...": "..."
    }
  ]
}

Delay compute shutdown

To control cluster shutdown behavior, you can use development or production mode, or set pipelines.clusterShutdown.delay in the pipeline configuration. The following example sets the pipelines.clusterShutdown.delay value to 60 seconds:

JSON
{
  "configuration": {
    "pipelines.clusterShutdown.delay": "60s"
  }
}

When production mode is enabled, the default value for pipelines.clusterShutdown.delay is 0 seconds. When development mode is enabled, the default value is 2 hours.

note

Because Lakeflow Declarative Pipelines compute resources automatically shut down when not in use, you cannot use a compute policy that sets autotermination_minutes. Doing so results in an error.

Create a single node compute

A single-node compute resource has a driver node that acts as both the master and worker. This is intended for workloads that use small amounts of data or are not distributed.

To create a single-node compute, set num_workers to 0. For example:

JSON
{
  "clusters": [
    {
      "num_workers": 0
    }
  ]
}