Configure classic compute for Lakeflow Declarative Pipelines
This page contains instructions for configuring classic compute for Lakeflow Declarative Pipelines. For a reference of the JSON schema, see the clusters definition in the Pipeline API reference.
To create a pipeline that runs on classic compute, users must first have permission to deploy classic compute, either unrestricted creation permission or access to a compute policy. Serverless pipelines do not require compute creation permissions. By default, all workspace users can use serverless pipelines.
Because the Lakeflow Declarative Pipelines runtime manages the lifecycle of pipeline compute and runs a custom version of Databricks Runtime, you cannot manually set some compute settings in a pipeline configuration, such as the Spark version or cluster names. See Cluster attributes that are not user settable.
Select a compute policy
Workspace admins can configure compute policies to provide users with access to classic compute resources for Lakeflow Declarative Pipelines. Compute policies are optional. Check with your workspace administrator if you lack the compute privileges required for Lakeflow Declarative Pipelines. See Define limits on Lakeflow Declarative Pipelines compute.
When using the Pipelines API, to ensure that compute policy default values are correctly applied, set "apply_policy_default_values": true in the clusters definition:
{
  "clusters": [
    {
      "label": "default",
      "policy_id": "<policy-id>",
      "apply_policy_default_values": true
    }
  ]
}
Configure compute tags
You can add custom tags to your pipeline's classic compute resources. Tags allow you to monitor the cost of compute resources used by various groups in your organization. Databricks applies these tags to cloud resources and to usage logs recorded in the usage system tables. You can add tags using the Cluster tags UI setting or by editing the JSON configuration of your pipeline.
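For example, the following snippet shows one way to add tags through the custom_tags field of the clusters definition in your pipeline's JSON configuration. This is a minimal sketch; the tag keys and values (team, cost_center) are placeholders for your own tag names:
{
  "clusters": [
    {
      "label": "default",
      "custom_tags": {
        "team": "data-engineering",
        "cost_center": "1234"
      }
    }
  ]
}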
Select instance types to run a pipeline
By default, Lakeflow Declarative Pipelines selects the instance types for your pipeline's driver and worker nodes. You can optionally configure the instance types. For example, select instance types to improve pipeline performance or address memory issues when running your pipeline.
To configure instance types when you create or edit a pipeline in the Lakeflow Declarative Pipelines UI:
- Click the Settings button.
- In the Advanced section of the pipeline settings, in the Worker type and Driver type drop-down menus, select the instance types for the pipeline.
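You can also set instance types in your pipeline's JSON configuration with the node_type_id and driver_node_type_id fields of the clusters definition. The following is a minimal sketch; the instance type names are examples, so substitute types available in your cloud:
{
  "clusters": [
    {
      "label": "default",
      "node_type_id": "n1-highmem-16",
      "driver_node_type_id": "n1-standard-4"
    }
  ]
}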
Configure separate settings for the update and maintenance clusters
Each declarative pipeline has two associated compute resources: an update cluster that processes pipeline updates and a maintenance cluster that runs daily maintenance tasks (including predictive optimization). By default, your compute configurations apply to both of these clusters. Using the same settings for both clusters improves the reliability of maintenance runs by ensuring that required configurations such as data access credentials for a storage location are applied to the maintenance cluster.
To apply settings to only one of the two clusters, add the label field to the setting JSON object. There are three possible values for the label field:
- maintenance: Applies the setting only to the maintenance cluster.
- updates: Applies the setting only to the update cluster.
- default: Applies the setting to both the update and maintenance clusters. This is the default value if the label field is omitted.
If there is a conflicting setting, the setting with the updates or maintenance label overrides the setting defined with the default label.
The daily maintenance cluster is used only in certain cases:
- Pipelines stored in the Hive metastore.
- Pipelines in workspaces that have not accepted the serverless compute terms of service. If you need assistance accepting the terms, contact your Databricks representative.
Example: Define a setting for the update cluster
The following example defines a Spark configuration parameter that is added only to the configuration for the updates cluster:
{
  "clusters": [
    {
      "label": "default",
      "autoscale": {
        "min_workers": 1,
        "max_workers": 5,
        "mode": "ENHANCED"
      }
    },
    {
      "label": "updates",
      "spark_conf": {
        "key": "value"
      }
    }
  ]
}
Example: Configure instance types for the update cluster
To avoid assigning unnecessary resources to the maintenance cluster, this example uses the updates label to set the instance types for only the updates cluster.
{
  "clusters": [
    {
      "label": "updates",
      "node_type_id": "n1-highmem-16",
      "driver_node_type_id": "n1-standard-4",
      "...": "..."
    }
  ]
}
Delay compute shutdown
To control cluster shutdown behavior, you can use development or production mode or use the pipelines.clusterShutdown.delay setting in the pipeline configuration. The following example sets the pipelines.clusterShutdown.delay value to 60 seconds:
{
  "configuration": {
    "pipelines.clusterShutdown.delay": "60s"
  }
}
When production mode is enabled, the default value for pipelines.clusterShutdown.delay is 0 seconds. When development mode is enabled, the default value is 2 hours.
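If you manage a pipeline through the Pipelines API, the mode is controlled by the top-level development field in the pipeline settings. The following is a minimal sketch that assumes you want production mode combined with an explicit shutdown delay:
{
  "development": false,
  "configuration": {
    "pipelines.clusterShutdown.delay": "60s"
  }
}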
Because Lakeflow Declarative Pipelines compute resources automatically shut down when not in use, you cannot use a compute policy that sets autotermination_minutes. Doing so results in an error.
Create a single-node compute
A single-node compute has a driver node that acts as both master and worker. It is intended for workloads that use small amounts of data or are not distributed.
To create a single-node compute, set num_workers to 0. For example:
{
  "clusters": [
    {
      "num_workers": 0
    }
  ]
}