How to develop and test Delta Live Tables pipelines

This article describes patterns you can use to develop and test Delta Live Tables pipelines. Through the pipeline settings, Delta Live Tables allows you to specify configurations to isolate pipelines in developing, testing, and production environments. The recommendations in this article are applicable for both SQL and Python code development.

Use development mode to run pipeline updates

Delta Live Tables provides a UI toggle to control whether your pipeline updates run in development or production mode. This mode controls how pipeline updates are processed, including:

  • Development mode does not immediately terminate compute resources after an update succeeds or fails. You can reuse the same compute resources to run multiple updates of the pipeline without waiting for a cluster to start.

  • Development mode does not automatically retry on task failure, allowing you to immediately detect and fix logical or syntactic errors in your pipeline.

Databricks recommends using development mode during development and testing and always switching to production mode when deploying to a production environment.

See Development and production modes.

Specify a target schema during all development lifecycle phases

All datasets in a Delta Live Tables pipeline reference the LIVE virtual schema, which is not accessible outside the pipeline. If a target schema is specified, the LIVE virtual schema points to the target schema. To review the results written out to each table during an update, you must specify a target schema.

You must specify a target schema that is unique to your environment. Each table in a given schema can only be updated by a single pipeline.

By creating separate pipelines for development, testing, and production with different targets, you can keep these environments isolated. Using the target schema parameter allows you to remove logic that uses string interpolation or other widgets or parameters to control data sources and targets.

See Publish data from Delta Live Tables pipelines to the Hive metastore.

Use Databricks Repos to manage Delta Live Tables pipelines

Databricks recommends using Repos during Delta Live Tables pipeline development, testing, and deployment to production. Repos enables the following:

  • Keeping track of how code is changing over time.

  • Merging changes that are being made by multiple developers.

  • Software development practices such as code reviews.

Databricks recommends configuring a single Git repository for all code related to a pipeline.

Each developer should have their own Databricks Repo configured for development. During development, the user configures their own pipeline from their Databricks Repo and tests new logic using development datasets and isolated schema and locations. As development work is completed, the user commits and pushes changes back to their branch in the central Git repository and opens a pull request against the testing or QA branch.

The resulting branch should be checked out in a Databricks Repo and a pipeline configured using test datasets and a development schema. Assuming logic runs as expected, a pull request or release branch should be prepared to push the changes to production.

While Repos can be used to synchronize code across environments, pipeline settings need to be kept up to date either manually or using tools like Terraform.

This workflow is similar to using Repos for CI/CD in all Databricks jobs. See CI/CD techniques with Git and Databricks Repos.

Segment libraries for ingestion and transformation steps

Databricks recommends isolating queries that ingest data from transformation logic that enriches and validates data. You can then organize libraries used for ingesting data from development or testing data sources in a separate directory from production data ingestion logic, allowing you to easily configure pipelines for various environments. You can then use smaller datasets for testing, accelerating development. See Create sample datasets for development and testing.

You can also use parameters to control data sources for development, testing, and production. See Control data sources with parameters.

Because Delta Live Tables pipelines use the LIVE virtual schema for managing all dataset relationships, by configuring development and testing pipelines with ingestion libraries that load sample data, you can substitute sample datasets using production table names to test code. The same transformation logic can be used in all environments.

Create sample datasets for development and testing

Databricks recommends creating development and test datasets to test pipeline logic with both expected data and potential malformed or corrupt records. There are multiple ways to create datasets that can be useful for development and testing, including the following:

  • Select a subset of data from a production dataset.

  • Use anonymized or artificially generated data for sources containing PII.

  • Create test data with well-defined outcomes based on downstream transformation logic.

  • Anticipate potential data corruption, malformed records, and upstream data changes by creating records that break data schema expectations.

For example, if you have a notebook that defines a dataset using the following code:

CREATE OR REFRESH STREAMING TABLE input_data AS SELECT * FROM cloud_files("/production/data", "json")

You could create a sample dataset containing specific records using a query like the following:

SELECT "2021/09/04" AS date, 22.4 as sensor_reading UNION ALL
SELECT "2021/09/05" AS date, 21.5 as sensor_reading

The following example demonstrates filtering published data to create a subset of the production data for development or testing:

CREATE OR REFRESH LIVE TABLE input_data AS SELECT * FROM prod.input_data WHERE date > current_date() - INTERVAL 1 DAY

To use these different datasets, create multiple pipelines with the notebooks implementing the transformation logic. Each pipeline can read data from the LIVE.input_data dataset but is configured to include the notebook that creates the dataset specific to the environment.

Control data sources with parameters

You can reference parameters set during pipeline configuration from within your libraries. These parameters are set as key-value pairs in the Compute > Advanced > Configurations portion of the pipeline settings UI. This pattern allows you to specify different data sources in different configurations of the same pipeline.

For example, you can specify different paths in development, testing, and production configurations for a pipeline using the variable data_source_path and then reference it using the following code:

AS (
    _metadata.file_path AS source_file_path
    FROM cloud_files( '${data_source_path}', 'csv',
            map("header", "true"))
import dlt
from pyspark.sql.functions import col

data_source_path = spark.conf.get("data_source_path")

def bronze():
    return (spark.readStream
        .option("cloudFiles.format", "csv")
        .option("header", True)
        .load(data_source_path )
        .select("*", col("_metadata.file_path").alias("source_file_name"))

This pattern is especially useful if you need to test how ingestion logic might handle changes to schema or malformed data during initial ingestion. You can use the identical code throughout your entire pipeline in all environments while switching out datasets.