Configure Lakeflow Declarative Pipelines
This article describes the basic configuration for Lakeflow Declarative Pipelines using the workspace UI.
The configuration instructions in this article use Unity Catalog. For instructions for configuring pipelines with legacy Hive metastore, see Use Lakeflow Declarative Pipelines with legacy Hive metastore.
This article discusses functionality for the current default publishing mode for pipelines. Pipelines created before February 5, 2025, might use the legacy publishing mode and the LIVE virtual schema. See LIVE schema (legacy).
The UI has an option to display and edit settings in JSON. You can configure most settings with either the UI or a JSON specification. Some advanced options are only available using the JSON configuration.
JSON configuration files are also helpful when deploying pipelines to new environments or using the CLI or REST API.
For a complete reference to the Lakeflow Declarative Pipelines JSON configuration settings, see Lakeflow Declarative Pipelines configurations.
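For example, the same settings the UI shows as JSON can be sent to the REST API when deploying a pipeline to a new environment. The following is a minimal sketch using Python and the requests library; the workspace URL, token handling, paths, and several field values are illustrative assumptions, so check the configuration reference for the exact fields your pipeline needs.

```python
# Minimal sketch: create a pipeline from a JSON-style spec via the REST API.
# Workspace URL, token, paths, and field values are illustrative placeholders.
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # hypothetical workspace URL
TOKEN = "<personal-access-token>"                        # or use another supported auth method

spec = {
    "name": "my_etl_pipeline",
    "catalog": "main",            # default catalog for unqualified datasets
    "schema": "etl_dev",          # default schema for unqualified datasets
    "serverless": True,
    "channel": "CURRENT",
    "development": False,
    "libraries": [
        {"file": {"path": "/Workspace/Users/me@example.com/my_etl_pipeline/transformations/ingest.py"}}
    ],
    "configuration": {"source_path": "/Volumes/main/landing/orders"},  # arbitrary key-value parameters
}

resp = requests.post(
    f"{HOST}/api/2.0/pipelines",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=spec,
)
resp.raise_for_status()
print(resp.json())  # the response includes the new pipeline_id
```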
Configure a new pipeline
To configure a new pipeline, do the following:
- At the top of the sidebar, click New and then select ETL pipeline.
- At the top, give your pipeline a unique name.
- Under the name, you can see the default catalog and schema that have been chosen for you. Change these to give your pipeline different defaults.
  The default catalog and the default schema are where datasets are read from or written to when you do not qualify them with a catalog or schema in your code. See Database objects in Databricks for more information.
- Select your preferred option to create a pipeline:
- Start with sample code in SQL to create a new pipeline and folder structure, including sample code in SQL.
- Start with sample code in Python to create a new pipeline and folder structure, including sample code in Python.
- Start with a single transformation to create a new pipeline and folder structure, with a new blank code file.
- Add existing assets to create a pipeline that you can associate with existing code files in your workspace.
- Create a source-controlled project to create a pipeline with a new Databricks Asset Bundles project, or to add the pipeline to an existing bundle.
You can have both SQL and Python source code files in your ETL pipeline. When you create a new pipeline and choose a language for the sample code, that choice only determines the language of the sample code included in your pipeline by default (a sketch of a Python source file appears after these steps).
- When you make your selection, you are redirected to the newly created pipeline.
The ETL pipeline is created with the following default settings:
- Unity Catalog
- Current channel
- Serverless compute
- Development mode off. This setting only affects scheduled runs of the pipeline. Running the pipeline from the editor always defaults to using development mode.
This configuration is recommended for many use cases, including development and testing, and is well-suited to production workloads that should run on a schedule. For details on scheduling pipelines, see Pipeline task for jobs.
You can adjust these settings from the pipeline toolbar.
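For illustration, a Python transformation file created by the sample-code or single-transformation options might look roughly like the following sketch. The table name, source path, and Auto Loader options are assumptions, not the exact sample Databricks generates.

```python
# transformations/raw_orders.py (illustrative file name and contents)
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Orders ingested incrementally from cloud storage with Auto Loader.")
def raw_orders():
    return (
        spark.readStream.format("cloudFiles")            # spark is available in pipeline source files
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/landing/orders")            # hypothetical source location
        .withColumn("_ingested_at", F.current_timestamp())
    )
```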
Alternatively, you can create an ETL pipeline from the workspace browser:
- Click Workspace in the left side panel.
- Select any folder, including Git folders.
- Click Create in the upper-right corner, and click ETL pipeline.
You can also create an ETL pipeline from the jobs and pipelines page:
- In your workspace, click Jobs & Pipelines in the sidebar.
- Under New, click ETL Pipeline.
Compute configuration options
Databricks recommends always using Enhanced autoscaling. Default values for other compute configurations work well for many pipelines.
Use the following settings to customize compute configurations (a sketch of the corresponding cluster settings follows this list):
- Workspace admins can configure a Cluster policy. Compute policies allow admins to control what compute options are available to users. See Select a compute policy.
- You can optionally configure Cluster mode to run with Fixed size or Legacy autoscaling. See Optimize the cluster utilization of Lakeflow Declarative Pipelines with Autoscaling.
- For workloads with autoscaling enabled, configure Min workers and Max workers to set limits for scaling behavior. See Configure classic compute for Lakeflow Declarative Pipelines.
- You can optionally turn off Photon acceleration. See What is Photon?.
- Use Cluster tags to help monitor costs associated with Lakeflow Declarative Pipelines. See Configure compute tags.
- Configure Instance types to specify the type of virtual machines used to run your pipeline. See Select instance types to run a pipeline.
- Select a Worker type optimized for the workloads configured in your pipeline.
- You can optionally select a Driver type that differs from your worker type. This can be useful for reducing costs in pipelines with large worker types and low driver compute utilization or for choosing a larger driver type to avoid out-of-memory issues in workloads with many small workers.
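If you use classic compute instead of serverless, these options map to the clusters section of the pipeline's JSON settings. The following sketch shows that section as a Python literal to match the earlier example; the node types, worker counts, policy ID, and tag values are illustrative and cloud-specific.

```python
# Sketch of the "clusters" entry in a pipeline spec for classic compute.
# Node types, worker counts, policy ID, and tag values are illustrative.
clusters = [
    {
        "label": "default",                    # settings for the update clusters
        "node_type_id": "i3.xlarge",           # worker instance type (cloud-specific)
        "driver_node_type_id": "i3.2xlarge",   # optional: a driver type different from workers
        "autoscale": {
            "min_workers": 1,
            "max_workers": 5,
            "mode": "ENHANCED",                # enhanced autoscaling
        },
        "custom_tags": {"team": "data-eng", "cost_center": "1234"},
        # "policy_id": "<policy-id>",          # optionally apply a compute policy
    }
]
```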
Set the run-as user
Run-as user allows you to change the identity that a pipeline uses to run, and the ownership of the tables it creates or updates. This is useful in situations where the original user who created the pipeline has been deactivated—for example, if they left the company. In those cases, the pipeline can stop working, and the tables it published can become inaccessible to others. By updating the pipeline to run as a different identity—such as a service principal—and reassigning ownership of the published tables, you can restore access and ensure the pipeline continues to function. Running pipelines as service principals is considered a best practice because they are not tied to individual users, making them more secure, stable, and reliable for automated workloads.
Required permissions
For the user making the change:
- CAN_MANAGE permissions on the pipeline
- CAN_USE role on the service principal (if setting run-as to a service principal)
For the run-as user or service principal:
- Workspace Access:
  - Workspace access permission to operate within the workspace
  - Can use permission on cluster policies used by the pipeline
  - Compute creation permission in the workspace
- Source Code Access:
  - Can read permission on all notebooks included in the pipeline source code
  - Can read permission on workspace files if the pipeline uses them
- Unity Catalog Permissions (for pipelines using Unity Catalog; example grants are sketched after this list):
  - USE CATALOG on the target catalog
  - USE SCHEMA and CREATE TABLE on the target schema
  - MODIFY permission on existing tables that the pipeline updates
  - CREATE SCHEMA permission if the pipeline creates new schemas
- Legacy Hive metastore Permissions (for pipelines using Hive metastore):
  - SELECT and MODIFY permissions on target databases and tables
- Additional Cloud Storage Access (if applicable):
  - Permissions to read from source storage locations
  - Permissions to write to target storage locations
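For the Unity Catalog permissions above, a catalog owner or admin could grant the run-as service principal access with statements along these lines. This is a sketch: the catalog (main), schema (etl), table (orders), and the service principal application ID are placeholders, and it assumes you run it somewhere a Spark session is available, such as a notebook or the SQL editor.

```python
# Illustrative Unity Catalog grants for a run-as service principal.
# Replace the catalog, schema, table, and application ID with your own values.
principal = "`11111111-2222-3333-4444-555555555555`"  # service principal application ID

for stmt in [
    f"GRANT USE CATALOG ON CATALOG main TO {principal}",
    f"GRANT USE SCHEMA, CREATE TABLE ON SCHEMA main.etl TO {principal}",
    f"GRANT MODIFY ON TABLE main.etl.orders TO {principal}",   # existing table the pipeline updates
    f"GRANT CREATE SCHEMA ON CATALOG main TO {principal}",     # only if the pipeline creates schemas
]:
    spark.sql(stmt)
```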
How to set the run-as user
You can set the run-as user through the pipeline settings from the pipeline monitoring page or the pipeline editor. To change the user from the pipeline monitoring page:
- Click Jobs & Pipelines to open the list of pipelines, and select the name of the pipeline you wish to edit.
- On the pipeline monitoring page, click Settings.
- In the Pipeline settings sidebar, click Edit next to Run as.
- In the edit widget, select one of the following options:
- Your own user account
- A service principal for which you have CAN_USE permission
- Click Save to apply the changes.
When you successfully update the run-as user:
- The pipeline identity changes to use the new user or service principal for all future runs.
- In Unity Catalog pipelines, the owner of tables published by the pipeline is updated to match the new run-as identity.
- Future pipeline updates will use the permissions and credentials of the new run-as identity.
- Continuous pipelines automatically restart with the new identity. Triggered pipelines do not automatically restart, and the run-as change can interrupt an active update.
If the update of run-as fails, you receive an error message explaining the reason for the failure. Common issues include insufficient permissions on the service principal.
Other configuration considerations
The following configuration options are also available for pipelines:
- The Advanced product edition gives you access to all Lakeflow Declarative Pipelines features. You can optionally run pipelines using the Pro or Core product editions. See Choose a product edition.
- You might choose to use the Continuous pipeline mode when running pipelines in production. See Triggered vs. continuous pipeline mode.
- If your workspace is not configured for Unity Catalog or your workload needs to use legacy Hive metastore, see Use Lakeflow Declarative Pipelines with legacy Hive metastore.
- Add Notifications for email updates based on success or failure conditions. See Add email notifications for pipeline events.
- Use the Configuration field to set key-value pairs for the pipeline. These configurations serve two purposes:
- Set arbitrary parameters you can reference in your source code (a short example follows this list). See Use parameters with Lakeflow Declarative Pipelines.
- Configure pipeline settings and Spark configurations. See Lakeflow Declarative Pipelines properties reference.
- Configure Tags. Tags are key-value pairs for the pipeline that are visible in the Workflows list. Pipeline tags are not associated with billing.
- Use the Preview channel to test your pipeline against pending Lakeflow Declarative Pipelines runtime changes and trial new features.
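Values added to the Configuration field are exposed to your source code through the Spark configuration. A minimal sketch, assuming a parameter named source_path has been added to the pipeline's Configuration field (the key name and path are made up):

```python
# Read a pipeline parameter set in the Configuration field (key name is illustrative).
# In the pipeline settings, Configuration might contain: source_path = /Volumes/main/landing/orders
import dlt

source_path = spark.conf.get("source_path")  # raises an error if the key is not set
# source_path = spark.conf.get("source_path", "/Volumes/main/landing/orders")  # or supply a default

@dlt.table
def parameterized_orders():
    return spark.read.json(source_path)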
Choose a product edition
Select the Lakeflow Declarative Pipelines product edition with the best features for your pipeline requirements. The following product editions are available:
- Core to run streaming ingest workloads. Select the Core edition if your pipeline doesn't require advanced features such as change data capture (CDC) or Lakeflow Declarative Pipelines expectations.
- Pro to run streaming ingest and CDC workloads. The Pro product edition supports all of the Core features, plus support for workloads that require updating tables based on changes in source data.
- Advanced to run streaming ingest workloads, CDC workloads, and workloads that require expectations. The Advanced product edition supports the features of the Core and Pro editions and includes data quality constraints with Lakeflow Declarative Pipelines expectations.
You can select the product edition when you create or edit a pipeline. You can choose a different edition for each pipeline. See the Lakeflow Declarative Pipelines product page.
Note: If your pipeline includes features not supported by the selected product edition, such as expectations, you will receive an error message explaining the reason for the failure. You can then edit the pipeline to select the appropriate edition.
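As a point of reference when choosing an edition, expectations are declarative data quality rules attached to a dataset. A hedged Python sketch, with made-up table, column, and rule names:

```python
# Sketch of a table with expectations (requires the Advanced product edition).
# Table, column, and rule names are illustrative.
import dlt

@dlt.table
@dlt.expect("valid_order_id", "order_id IS NOT NULL")    # record violations in metrics, keep the rows
@dlt.expect_or_drop("positive_amount", "amount > 0")     # drop rows that fail the constraint
def cleaned_orders():
    return spark.read.table("raw_orders")
```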
Configure source code
You can use the asset browser in the Lakeflow Pipelines Editor to configure the source code defining your pipeline. Pipeline source code is defined in SQL or Python scripts stored in workspace files. When you create or edit your pipeline, you can add one or more files. By default, pipeline source code is located in the transformations folder in your pipeline's root folder.
Because Lakeflow Declarative Pipelines automatically analyzes dataset dependencies to construct the processing graph for your pipeline, you can add source code assets in any order.
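For example, a downstream table can be declared before the table it reads from, in the same file or in a different one; the dependency is inferred from the read. A sketch with hypothetical dataset names and source path:

```python
# Two datasets in one source file; the downstream table is defined first, which is
# fine because dependencies are inferred from the reads, not from file order.
import dlt

@dlt.table
def orders_by_day():
    # Reading another dataset in the same pipeline creates the dependency edge.
    return spark.read.table("raw_orders").groupBy("order_date").count()

@dlt.table
def raw_orders():
    return spark.read.json("/Volumes/main/landing/orders")  # hypothetical source path
```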
For more details on using the Lakeflow Pipelines Editor, see Develop and debug ETL pipelines with the Lakeflow Pipelines Editor.
Manage external dependencies for pipelines that use Python
Lakeflow Declarative Pipelines supports using external dependencies in your pipelines, such as Python packages and libraries. To learn about options and recommendations for using dependencies, see Manage Python dependencies for Lakeflow Declarative Pipelines.
Use Python modules stored in your Databricks workspace
In addition to implementing your Python code in pipeline source code files, you can use Databricks Git Folders or workspace files to store your code as Python modules. Storing your code as Python modules is especially useful when you have common functionality you want to use in multiple pipelines or notebooks in the same pipeline. To learn how to use Python modules with your pipelines, see Import Python modules from Git folders or workspace files.
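As a sketch of the workspace-files pattern: store the module in a workspace folder, append that folder to sys.path, and import it from your pipeline source code. The folder path, module, and helper function below are hypothetical.

```python
# Illustrative: reuse a helper from /Workspace/Shared/pipeline_utils/common_transforms.py.
import sys
sys.path.append("/Workspace/Shared/pipeline_utils")    # hypothetical workspace folder

import dlt
from common_transforms import add_audit_columns        # hypothetical shared helper

@dlt.table
def enriched_orders():
    # Apply the shared transformation to a dataset defined elsewhere in the pipeline.
    return add_audit_columns(spark.read.table("raw_orders"))
```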