Delta Live Tables settings

Delta Live Tables settings specify one or more notebooks that implement a pipeline and the parameters that control how the pipeline runs in a given environment, for example, development, staging, or production. Delta Live Tables settings are expressed as JSON and can be modified in the Delta Live Tables UI.

Settings

Fields

id

Type: string

A globally unique identifier for this pipeline. The identifier is assigned by the system and cannot be changed.

name

Type: string

A user-friendly name for this pipeline. The name can be used to identify pipeline jobs in the UI.

storage

Type: string

A location on DBFS or cloud storage where output data and metadata required for pipeline execution are stored. Tables and metadata are stored in subdirectories of this location.

When the storage setting is not specified, the system will default to a location in dbfs:/pipelines/.

The storage setting cannot be changed after a pipeline is created.

configuration

Type: object

An optional list of settings to add to the Spark configuration of the cluster that will run the pipeline. These settings are read by the Delta Live Tables runtime and available to pipeline queries through the Spark configuration.

Elements must be formatted as key:value pairs.

See Parameterize pipelines for an example of using the configuration object.
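For example, a configuration object with custom keys might look like the following sketch. The key mypipeline.environment is a hypothetical custom key used only for illustration; spark.sql.shuffle.partitions is a standard Spark configuration property:

{
  "configuration": {
    "spark.sql.shuffle.partitions": "100",
    "mypipeline.environment": "development"
  }
}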

libraries

Type: array of objects

An array of notebooks containing the pipeline code and required artifacts. See Configure multiple notebooks in a pipeline for an example.

clusters

Type: array of objects

An array of specifications for the clusters to run the pipeline. See Cluster configuration for more detail.

If this is not specified, the system automatically selects a default cluster configuration for the pipeline.

continuous

Type: boolean

A flag indicating whether to run the pipeline continuously.

The default value is false.

target

Type: string

The name of a database for persisting pipeline output data. Configuring the target setting allows you to view and query the pipeline output data from the Databricks UI.
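For example, to publish the pipeline's tables to a database named sales_reporting (a hypothetical database name used here for illustration), add the following to the pipeline settings:

{
  "target": "sales_reporting"
}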

channel

Type: string

The version of the Delta Live Tables runtime to use. The supported values are:

  • preview to test your pipeline with upcoming changes to the runtime version.

  • current to use the current runtime version.

The channel field is optional. The default value is current. Databricks recommends using the current runtime version for production workloads.

edition

Type: string

The Delta Live Tables product edition used to run the pipeline. This setting allows you to choose the product edition that best matches the requirements of your pipeline:

  • core to run streaming ingest workloads.

  • pro to run streaming ingest and change data capture (CDC) workloads.

  • advanced to run streaming ingest workloads, CDC workloads, and workloads that require Delta Live Tables expectations to enforce data quality constraints.

The edition field is optional. The default value is advanced.

photon

Type: boolean

A flag indicating whether to use the Photon runtime to run the pipeline. Photon is the Databricks high-performance Spark engine. Photon-enabled pipelines are billed at a different rate than non-Photon pipelines.

The photon field is optional. The default value is false.
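As an illustration, the channel, edition, and photon fields described above might be combined in the pipeline settings as in the following sketch (the values are examples chosen from the supported options listed above, not recommendations):

{
  "channel": "preview",
  "edition": "advanced",
  "photon": true
}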

Pipelines trigger interval

You can use pipelines.trigger.interval to control the trigger interval for a flow updating a table or an entire pipeline. Because a triggered pipeline processes each table only once, the pipelines.trigger.interval setting is used only with continuous pipelines.

Databricks recommends setting pipelines.trigger.interval on individual tables because of different defaults for streaming versus batch queries. Set the value on a pipeline only when your processing requires controlling updates for the entire pipeline graph.

You set pipelines.trigger.interval on a table using spark_conf in Python, or SET in SQL:

Python:

import dlt

@dlt.table(
  spark_conf={"pipelines.trigger.interval" : "10 seconds"}
)
def <function-name>():
    return (<query>)

SQL:

SET pipelines.trigger.interval='10 seconds';

CREATE OR REFRESH LIVE TABLE TABLE_NAME
AS SELECT ...

To set pipelines.trigger.interval on a pipeline, add it to the configuration object in the pipeline settings:

{
  "configuration": {
    "pipelines.trigger.interval": "10 seconds"
  }
}

pipelines.trigger.interval

The default is based on flow type:

  • Five seconds for streaming queries.

  • One minute for complete queries when all input data is from Delta sources.

  • Ten minutes for complete queries when some data sources may be non-Delta. See Tables and views in continuous pipelines.

The value is a number plus the time unit. The following are the valid time units:

  • second, seconds

  • minute, minutes

  • hour, hours

  • day, days

You can use the singular or plural unit when defining the value, for example:

  • {"pipelines.trigger.interval" : "1 hour"}

  • {"pipelines.trigger.interval" : "10 seconds"}

  • {"pipelines.trigger.interval" : "30 second"}

  • {"pipelines.trigger.interval" : "1 minute"}

  • {"pipelines.trigger.interval" : "10 minutes"}

  • {"pipelines.trigger.interval" : "10 minute"}

Cluster configuration

You can configure clusters used by managed pipelines with the same JSON format as the create cluster API. You can specify configuration for two different cluster types: a default cluster where all processing is performed and a maintenance cluster where daily maintenance tasks are run. Each cluster is identified using the label field.

Specifying cluster properties is optional, and the system uses defaults for any missing values.

Note

  • You cannot set the Spark version in cluster configurations. Delta Live Tables clusters run on a custom version of Databricks Runtime that is continually updated to include the latest features.

  • Because a Delta Live Tables cluster automatically shuts down when not in use, referencing a cluster policy that sets autotermination_minutes in your cluster configuration results in an error. To control cluster shutdown behavior, you can use development or production mode or use the pipelines.clusterShutdown.delay setting in the pipeline configuration. The following example sets the pipelines.clusterShutdown.delay value to 60 seconds:

    {
      "configuration": {
        "pipelines.clusterShutdown.delay": "60s"
      }
    }
    
  • If you set num_workers to 0 in cluster settings, the cluster is created as a Single Node cluster. Configuring an autoscaling cluster and setting min_workers to 0 and max_workers to 0 also creates a Single Node cluster.

    If you configure an autoscaling cluster and set only min_workers to 0, then the cluster is not created as a Single Node cluster. The cluster has at least 1 active worker at all times until terminated.

    An example cluster configuration to create a Single Node cluster in Delta Live Tables:

    {
      "clusters": [
        {
          "label": "default",
          "num_workers": 0
        }
      ]
    }
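    A cluster configured with autoscaling where both min_workers and max_workers are set to 0 also results in a Single Node cluster, as described above. A sketch of that variant:

    {
      "clusters": [
        {
          "label": "default",
          "autoscale": {
            "min_workers": 0,
            "max_workers": 0
          }
        }
      ]
    }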
    

Note

If you need an instance profile or other configuration to access your storage location, specify it for both the default cluster and the maintenance cluster.

An example configuration for a default cluster and a maintenance cluster:

{
  "clusters": [
    {
      "label": "default",
      "node_type_id": "c5.4xlarge",
      "driver_node_type_id": "c5.4xlarge",
      "num_workers": 20,
      "spark_conf": {
        "spark.databricks.io.parquet.nativeReader.enabled": "false"
      },
      "aws_attributes": {
        "instance_profile_arn": "arn:aws:..."
      }
    },
    {
      "label": "maintenance",
      "aws_attributes": {
        "instance_profile_arn": "arn:aws:..."
      }
    }
  ]
}

Cluster policies

Note

When using cluster policies to configure Delta Live Tables clusters, Databricks recommends applying a single policy to both the default and maintenance clusters.

To configure a cluster policy for a pipeline cluster, create a policy with the cluster_type field set to dlt:

{
  "cluster_type": {
    "type": "fixed",
    "value": "dlt"
  }
}

In the pipeline settings, set the cluster policy_id field to the value of the policy identifier. The following example configures the default and maintenance clusters using the cluster policy with the identifier C65B864F02000008.

{
  "clusters": [
    {
      "label": "default",
      "policy_id": "C65B864F02000008",
      "autoscale": {
        "min_workers": 1,
        "max_workers": 5
      }
    },
    {
      "label": "maintenance",
      "policy_id": "C65B864F02000008"
    }
  ]
}

For an example of creating and using a cluster policy, see Define limits on pipeline clusters.

Examples

Configure a pipeline and cluster

The following example configures a triggered pipeline implemented in example_notebook_1, using DBFS for storage, and running on a small cluster with a single worker:

{
  "name": "Example pipeline 1",
  "storage": "dbfs:/pipeline-examples/storage-location/example1",
  "clusters": [
    {
      "num_workers": 1,
      "spark_conf": {}
    }
  ],
  "libraries": [
    {
      "notebook": {
         "path": "/Users/user@databricks.com/example_notebook_1"
      }
    }
  ],
  "continuous": false
}

Configure multiple notebooks in a pipeline

Use the libraries field to configure a pipeline with multiple notebooks. You can add notebooks in any order, because Delta Live Tables automatically analyzes dataset dependencies to construct the processing graph for your pipeline. The following example creates a pipeline that includes the datasets defined in example-notebook_1 and example-notebook_2:

{
  "name": "Example pipeline 3",
  "storage": "dbfs:/pipeline-examples/storage-location/example3",
  "libraries": [
    { "notebook": { "path": "/example-notebook_1" } },
    { "notebook": { "path": "/example-notebook_2" } }
  ]
}

Create a development workflow with Delta Live Tables

You can create separate Delta Live Tables pipelines for development, staging, and production, allowing you to test and debug your transformation logic without affecting the consumers of the data you produce. To do this, create separate pipelines that target different databases but use the same underlying code.

You can combine this functionality with Databricks Repos to create a fully isolated development environment and a simple workflow to push from development to production.

The development pipeline publishes to a user-specific target database:

{
  "name": "Data Ingest - DEV user@databricks",
  "target": "customers_dev_user",
  "libraries": [
    { "notebook": { "path": "/Repos/user@databricks.com/ingestion/etl.py" } }
  ]
}

The production pipeline runs the same code but publishes to the production database:

{
  "name": "Data Ingest - PROD",
  "target": "customers",
  "libraries": [
    { "notebook": { "path": "/Repos/production/ingestion/etl.py" } }
  ]
}

Parameterize pipelines

The Python and SQL code that defines your datasets can be parameterized by the pipeline’s settings. Parameterization enables the following use cases:

  • Separating long paths and other variables from your code.

  • Reducing the amount of data that is processed in development or staging environments to speed up testing.

  • Reusing the same transformation logic to process data from multiple data sources.

The following example uses the startDate configuration value to limit the development pipeline to a subset of the input data:

SQL:

CREATE OR REFRESH LIVE TABLE customer_events
AS SELECT * FROM sourceTable WHERE date > '${mypipeline.startDate}';

Python:

import dlt
from pyspark.sql.functions import col

@dlt.table
def customer_events():
  start_date = spark.conf.get("mypipeline.startDate")
  return dlt.read("sourceTable").where(col("date") > start_date)

The development pipeline settings limit the input to a recent date range:

{
  "name": "Data Ingest - DEV",
  "configuration": {
    "mypipeline.startDate": "2021-01-02"
  }
}

The production pipeline settings process the full history:

{
  "name": "Data Ingest - PROD",
  "configuration": {
    "mypipeline.startDate": "2010-01-02"
  }
}