Configure pipeline settings for Delta Live Tables
This article provides details on configuring pipeline settings for Delta Live Tables. Delta Live Tables provides a user interface for configuring and editing pipeline settings. The UI also provides an option to display and edit settings in JSON.
Note
You can configure most settings with either the UI or a JSON specification. Some advanced options are only available using the JSON configuration.
Databricks recommends familiarizing yourself with Delta Live Tables settings using the UI. If necessary, you can directly edit the JSON configuration in the workspace. JSON configuration files are also useful when deploying pipelines to new environments or when using the CLI or REST API.
For a full reference to the Delta Live Tables JSON configuration settings, see Delta Live Tables pipeline configurations.
Choose a product edition
Select the Delta Live Tables product edition with the features best suited for your pipeline requirements. The following product editions are available:
Core: to run streaming ingest workloads. Select the Core edition if your pipeline doesn’t require advanced features such as change data capture (CDC) or Delta Live Tables expectations.
Pro: to run streaming ingest and CDC workloads. The Pro product edition supports all of the Core features, plus support for workloads that require updating tables based on changes in source data.
Advanced: to run streaming ingest workloads, CDC workloads, and workloads that require expectations. The Advanced product edition supports the features of the Core and Pro editions, and also supports enforcement of data quality constraints with Delta Live Tables expectations.
You can select the product edition when you create or edit a pipeline. You can select a different edition for each pipeline. See the Delta Live Tables product page.
Note
If your pipeline includes features not supported by the selected product edition, such as expectations, you receive an error message explaining the reason for the failure. You can then edit the pipeline to select the appropriate edition.
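If you edit the JSON settings directly, the product edition can be set with the edition field. A minimal sketch, with an illustrative pipeline name:
{
  "name": "Example pipeline",
  "edition": "ADVANCED"
}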
Choose a pipeline mode
The pipeline mode determines whether your pipeline is updated continuously or with manual triggers. See Continuous vs. triggered pipeline execution.
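In the JSON settings, the pipeline mode corresponds to the continuous flag. A minimal sketch of a triggered pipeline; the name is illustrative:
{
  "name": "Example pipeline",
  "continuous": false
}
Setting continuous to true runs the pipeline continuously; setting it to false (the default) updates the pipeline only when triggered.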
Select a cluster policy
To configure and update Delta Live Tables pipelines, users must have permission to deploy compute. Workspace admins can configure cluster policies to provide users with access to compute resources for Delta Live Tables. See Define limits on Delta Live Tables pipeline clusters.
Note
Cluster policies are optional. Check with your workspace administrator if you lack compute privileges required for Delta Live Tables.
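If your administrator provides a cluster policy, you can reference it from the cluster settings in the pipeline JSON, typically with the policy_id attribute. A minimal sketch; the policy ID is a placeholder supplied by your workspace admin:
{
  "clusters": [
    {
      "label": "default",
      "policy_id": "<policy-id>"
    }
  ]
}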
Configure source code libraries
You can use the file selector in the Delta Live Tables UI to configure the source code defining your pipeline. Pipeline source code is defined in Databricks notebooks or in SQL or Python scripts stored in workspace files. When you create or edit your pipeline, you can add one or more notebooks, workspace files, or a combination of the two.
Because Delta Live Tables automatically analyzes dataset dependencies to construct the processing graph for your pipeline, you can add source code libraries in any order.
You can also modify the JSON file to include Delta Live Tables source code defined in SQL and Python scripts stored in workspace files. The following example includes notebooks and workspace files from Databricks Repos:
{
"name": "Example pipeline 3",
"storage": "dbfs:/pipeline-examples/storage-location/example3",
"libraries": [
{ "notebook": { "path": "/example-notebook_1" } },
{ "notebook": { "path": "/example-notebook_2" } },
{ "file": { "path": "/Repos/<user_name>@databricks.com/Apply_Changes_Into/apply_changes_into.sql" } },
{ "file": { "path": "/Repos/<user_name>@databricks.com/Apply_Changes_Into/apply_changes_into.py" } }
]
}
Specify a storage location
You can choose to specify a storage location for a pipeline that publishes to the Hive metastore. The primary motivation for specifying a location is to control the object storage location for data written by your pipeline.
Because all tables, data, checkpoints, and metadata for Delta Live Tables pipelines are fully managed by Delta Live Tables, most interaction with Delta Live Tables datasets happens through tables registered to the Hive metastore or Unity Catalog.
Specify a target schema for pipeline output tables
While optional, you should specify a target to publish tables created by your pipeline anytime you move beyond development and testing for a new pipeline. Publishing a pipeline to a target makes datasets available for querying elsewhere in your Databricks environment. See Publish data from Delta Live Tables pipelines to the Hive metastore or Use Unity Catalog with your Delta Live Tables pipelines.
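In the JSON settings, the publishing target corresponds to the target field. A minimal sketch, assuming an illustrative schema named sales:
{
  "name": "Example pipeline",
  "target": "sales"
}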
Configure your compute settings
Each Delta Live Tables pipeline has two associated clusters.
The default cluster is used to process pipeline updates.
The maintenance cluster runs daily maintenance tasks.
Compute settings in the Delta Live Tables UI primarily target the default cluster used for pipeline updates. If you specify a storage location requiring data access credentials, you must ensure that the maintenance cluster also has these permissions configured.
Delta Live Tables provides similar options for cluster settings as other compute on Databricks. Like other pipeline settings, you can modify the JSON configuration for clusters to specify options not present in the UI. See Clusters.
Note
You cannot set the Spark version in cluster configurations. Delta Live Tables clusters run on a custom version of Databricks Runtime that is continually updated to include the latest features. Manually setting a version may result in pipeline failures.
You can configure Delta Live Tables pipelines to leverage Photon. See Photon runtime.
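As a sketch of options you might set in the JSON rather than the UI, the following assigns an instance type to the default cluster and enables Photon for the pipeline. The node type is illustrative; use one available in your workspace:
{
  "clusters": [
    {
      "label": "default",
      "node_type_id": "c5.2xlarge"
    }
  ],
  "photon": true
}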
Use autoscaling to increase efficiency and reduce resource usage
Use Enhanced Autoscaling to optimize the cluster utilization of your pipelines. Enhanced Autoscaling adds additional resources only if the system determines those resources will increase pipeline processing speed. Resources are freed when they are no longer needed, and clusters are shut down as soon as all pipeline updates are complete.
Use the following guidelines when configuring Enhanced Autoscaling for production pipelines:
Leave the Min workers setting at the default.
Set the Max workers setting to a value based on budget and pipeline priority.
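In the JSON settings, Enhanced Autoscaling corresponds to an autoscale block on the default cluster with mode set to ENHANCED. A minimal sketch; the worker counts are illustrative:
{
  "clusters": [
    {
      "label": "default",
      "autoscale": {
        "min_workers": 1,
        "max_workers": 5,
        "mode": "ENHANCED"
      }
    }
  ]
}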
Delay compute shutdown
Because a Delta Live Tables cluster automatically shuts down when not in use, referencing a cluster policy that sets autotermination_minutes in your cluster configuration results in an error. To control cluster shutdown behavior, you can use development or production mode or use the pipelines.clusterShutdown.delay setting in the pipeline configuration. The following example sets the pipelines.clusterShutdown.delay value to 60 seconds:
{
"configuration": {
"pipelines.clusterShutdown.delay": "60s"
}
}
When production mode is enabled, the default value for pipelines.clusterShutdown.delay is 0 seconds. When development mode is enabled, the default value is 2 hours.
Create a single node cluster
If you set num_workers to 0 in cluster settings, the cluster is created as a Single Node cluster. Configuring an autoscaling cluster and setting min_workers to 0 and max_workers to 0 also creates a Single Node cluster.
If you configure an autoscaling cluster and set only min_workers to 0, then the cluster is not created as a Single Node cluster. The cluster has at least 1 active worker at all times until terminated.
An example cluster configuration to create a Single Node cluster in Delta Live Tables:
{
"clusters": [
{
"label": "default",
"num_workers": 0
}
]
}
Configure cluster tags
You can use cluster tags to monitor usage for your pipeline clusters. Add cluster tags in the Delta Live Tables UI when you create or edit a pipeline, or by editing the JSON settings for your pipeline clusters.
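In the JSON settings, cluster tags are defined with the custom_tags attribute on a cluster. A minimal sketch with illustrative tag keys and values:
{
  "clusters": [
    {
      "label": "default",
      "custom_tags": {
        "team": "data-engineering",
        "cost_center": "1234"
      }
    }
  ]
}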
Cloud storage configuration
You use instance profiles to configure access to S3 storage in AWS. To add an instance profile in the Delta Live Tables UI, click Advanced when you create or edit a pipeline and select an instance profile in the Instance profile dropdown menu.
You can also configure an AWS instance profile by editing the JSON settings for your pipeline clusters when you create or edit a pipeline with the Delta Live Tables API or in the Delta Live Tables UI:
On the Pipeline details page for your pipeline, click the Settings button. The Pipeline settings page appears.
Click the JSON button.
Enter the instance profile configuration in the aws_attributes.instance_profile_arn field in the cluster configuration:
{
"clusters": [
{
"label": "default",
"aws_attributes": {
"instance_profile_arn": "arn:aws:..."
}
},
{
"label": "maintenance",
"aws_attributes": {
"instance_profile_arn": "arn:aws:..."
}
}
]
}
When configuring an instance profile in the JSON settings, you must specify the instance profile configuration for the default and maintenance clusters.
You can also configure instance profiles when you create cluster policies for your Delta Live Tables pipelines. For an example, see the knowledge base.
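As a rough sketch (see the knowledge base for a complete example), a cluster policy that fixes the instance profile for pipeline clusters might pin the attribute to a fixed value. The ARN is a placeholder:
{
  "aws_attributes.instance_profile_arn": {
    "type": "fixed",
    "value": "arn:aws:iam::<account-id>:instance-profile/<profile-name>"
  }
}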
Parameterize pipelines
The Python and SQL code that defines your datasets can be parameterized by the pipeline’s settings. Parameterization enables the following use cases:
Separating long paths and other variables from your code.
Reducing the amount of data processed in development or staging environments to speed up testing.
Reusing the same transformation logic to process data from multiple data sources.
The following example uses the startDate configuration value to limit the development pipeline to a subset of the input data:
CREATE OR REFRESH LIVE TABLE customer_events
AS SELECT * FROM sourceTable WHERE date > '${mypipeline.startDate}';
import dlt
from pyspark.sql.functions import col

@dlt.table
def customer_events():
    # Read the configured start date and filter the source table from the metastore
    start_date = spark.conf.get("mypipeline.startDate")
    return spark.read.table("sourceTable").where(col("date") > start_date)
{
"name": "Data Ingest - DEV",
"configuration": {
"mypipeline.startDate": "2021-01-02"
}
}
{
"name": "Data Ingest - PROD",
"configuration": {
"mypipeline.startDate": "2010-01-02"
}
}
Pipelines trigger interval
You can use pipelines.trigger.interval to control the trigger interval for a flow updating a table or an entire pipeline. Because a triggered pipeline processes each table only once, the pipelines.trigger.interval setting is used only with continuous pipelines.
Databricks recommends setting pipelines.trigger.interval on individual tables because of different defaults for streaming versus batch queries. Set the value on a pipeline only when your processing requires controlling updates for the entire pipeline graph.
You set pipelines.trigger.interval on a table using spark_conf in Python, or SET in SQL:
@dlt.table(
spark_conf={"pipelines.trigger.interval" : "10 seconds"}
)
def <function-name>():
return (<query>)
SET pipelines.trigger.interval='10 seconds';
CREATE OR REFRESH LIVE TABLE TABLE_NAME
AS SELECT ...
To set pipelines.trigger.interval on a pipeline, add it to the configuration object in the pipeline settings:
{
"configuration": {
"pipelines.trigger.interval": "10 seconds"
}
}
Add email notifications for pipeline events
You can configure one or more email addresses to receive notifications when the following occurs:
A pipeline update completes successfully.
Each time a pipeline update fails with a retryable error.
A pipeline update fails with a non-retryable (fatal) error.
A single data flow fails.
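In the JSON settings, email notifications are configured with a notifications list that pairs recipients with alert types. A minimal sketch; the email address is illustrative, and the alert names correspond to the events listed above:
{
  "notifications": [
    {
      "email_recipients": ["ops-team@example.com"],
      "alerts": [
        "on-update-success",
        "on-update-failure",
        "on-update-fatal-failure",
        "on-flow-failure"
      ]
    }
  ]
}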