Develop DLT pipelines

Developing and testing pipeline code differs from other Apache Spark workloads. This article provides an overview of supported functionality, best practices, and considerations when developing pipeline code. For more recommendations and best practices, see Applying software development & DevOps best practices to DLT pipelines.

note

You must add source code to a pipeline configuration to validate code or run an update. See Configure a DLT pipeline.

What files are valid for pipeline source code?

DLT pipeline code can be Python or SQL. You can have a mix of Python and SQL source code files backing a single pipeline, but each file can only contain one language. See Develop pipeline code with Python and Develop pipeline code with SQL.

You can use notebooks and workspace files when specifying source code for a pipeline. Workspace files represent Python or SQL scripts authored in your preferred IDE or the Databricks file editor. See What are workspace files?.

If you develop Python code as modules or libraries, you must install and import the code and then call methods from a Python notebook or workspace file configured as source code. See Manage Python dependencies for DLT pipelines.

note

If you need to use arbitrary SQL commands in a Python notebook, you can use the syntax pattern spark.sql("<QUERY>") to run SQL from Python code.

Unity Catalog functions allow you to register arbitrary Python user-defined functions for use in SQL. See User-defined functions (UDFs) in Unity Catalog.
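
For example, a minimal sketch of the spark.sql pattern in Python pipeline source code; the dataset name and query below are hypothetical:

Python
import dlt

@dlt.table(comment="Dataset defined by running SQL from Python")
def sql_defined_dataset():
    # spark.sql returns a DataFrame, so its result can back a dataset definition.
    # The query and source table name are illustrative only.
    return spark.sql("SELECT id, amount FROM catalog.schema.source_table WHERE amount > 0")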

Overview of DLT development features

DLT extends and leverages many Databricks features, and introduces new features and concepts. The following list provides a brief overview of the concepts and features that support pipeline code development:

  • Development mode: New pipelines are configured to run in development mode by default. Databricks recommends using development mode for interactive development and testing. See Development and production modes.
  • Validate: A Validate update verifies the correctness of pipeline source code without running an update on any tables. See Check a pipeline for errors without waiting for tables to update.
  • Notebooks: Notebooks configured as source code for a DLT pipeline provide interactive options for validating code and running updates. See Develop and debug DLT pipelines in notebooks.
  • Parameters: Leverage parameters in source code and pipeline configurations to simplify testing and extensibility, as shown in the sketch after this list. See Use parameters with DLT pipelines.
  • Databricks Asset Bundles: Databricks Asset Bundles allow you to move pipeline configurations and source code between workspaces. See Convert a DLT pipeline into a Databricks Asset Bundle project.
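
As a sketch of the Parameters item above, pipeline configuration values can be read from Python source code with spark.conf.get; the configuration key, default value, and orders table below are hypothetical:

Python
import dlt

@dlt.table(comment="Orders filtered by a configurable start date")
def filtered_orders():
    # "mypipeline.start_date" is a hypothetical key set in the pipeline configuration;
    # the second argument is the default used when the key is not set.
    start_date = spark.conf.get("mypipeline.start_date", "2025-01-01")
    # "orders" is a hypothetical upstream table.
    return spark.read.table("orders").where(f"order_date >= '{start_date}'")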

Create sample datasets for development and testing

Databricks recommends creating development and test datasets to test pipeline logic with expected data and potentially malformed or corrupt records. There are multiple ways to create datasets that can be useful for development and testing, including the following:

  • Select a subset of data from a production dataset.
  • Use anonymized or artificially generated data for sources containing PII.
  • Create test data with well-defined outcomes based on downstream transformation logic.
  • Anticipate potential data corruption, malformed records, and upstream data changes by creating records that break data schema expectations.

For example, if you have a notebook that defines a dataset using the following code:

SQL
CREATE OR REFRESH STREAMING TABLE input_data
AS SELECT * FROM STREAM read_files(
  "/production/data",
  format => "json"
)

You could create a sample dataset containing specific records using a query like the following:

SQL
CREATE OR REFRESH MATERIALIZED VIEW input_data AS
SELECT "2021/09/04" AS date, 22.4 AS sensor_reading UNION ALL
SELECT "2021/09/05" AS date, 21.5 AS sensor_reading

The following example demonstrates filtering published data to create a subset of the production data for development or testing:

SQL
CREATE OR REFRESH MATERIALIZED VIEW input_data AS
SELECT * FROM prod.input_data
WHERE date > current_date() - INTERVAL 1 DAY

To use these different datasets, create multiple pipelines that include the notebooks implementing the shared transformation logic. Each pipeline reads data from the input_data dataset but includes the notebook that creates the dataset specific to its environment.
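
For example, a sketch of shared transformation logic that reads input_data without depending on which environment-specific definition the pipeline includes (the dataset and column names are illustrative):

Python
import dlt

@dlt.table(comment="Shared transformation logic used by the development, test, and production pipelines")
def cleaned_sensor_data():
    # input_data may be the production stream, a hand-written sample, or a filtered subset,
    # depending on which source code file the pipeline is configured to include.
    return spark.read.table("input_data").where("sensor_reading IS NOT NULL")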

How do DLT datasets process data?

The following list describes how materialized views, streaming tables, and views process records through their defined queries:

  • Streaming table: Each record is processed exactly once. This assumes an append-only source.
  • Materialized view: Records are processed as required to return accurate results for the current data state. Materialized views should be used for data processing tasks such as transformations, aggregations, or pre-computing slow queries and frequently used computations.
  • View: Records are processed each time the view is queried. Use views for intermediate transformations and data quality checks that should not be published to public datasets.

Declare your first datasets in DLT

DLT introduces new syntax for Python and SQL. To learn the basics of pipeline syntax, see Develop pipeline code with Python and Develop pipeline code with SQL.

note

DLT separates dataset definitions from update processing, and DLT notebooks are not intended for interactive execution.
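
As a minimal Python sketch of the three dataset types (the source table, dataset names, and columns are illustrative): a @dlt.table function that returns a streaming DataFrame defines a streaming table, one that returns a batch DataFrame defines a materialized view, and @dlt.view defines a view:

Python
import dlt

@dlt.table(comment="Streaming table: the function returns a streaming DataFrame")
def raw_events():
    # "catalog.schema.events_source" is a hypothetical upstream Delta table.
    return spark.readStream.table("catalog.schema.events_source")

@dlt.view(comment="View: recomputed each time another dataset in the pipeline queries it")
def recent_events():
    return spark.read.table("raw_events").where("event_date >= date_sub(current_date(), 7)")

@dlt.table(comment="Materialized view: the function returns a batch DataFrame")
def daily_event_counts():
    return spark.read.table("recent_events").groupBy("event_date").count()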

How do you configure DLT pipelines?

The settings for DLT pipelines fall into two broad categories:

  1. Configurations that define a collection of notebooks or files (known as source code) that use DLT syntax to declare datasets.
  2. Configurations that control pipeline infrastructure, dependency management, how updates are processed, and how tables are saved in the workspace.

Most configurations are optional, but some require careful attention, especially when configuring production pipelines. These include the following:

  • To make data available outside the pipeline, you must declare a target schema to publish to the Hive metastore, or a target catalog and target schema to publish to Unity Catalog.
  • Data access permissions are configured through the cluster used for execution. Ensure your cluster has appropriate permissions configured for data sources and the target storage location, if specified.

For details on using Python and SQL to write source code for pipelines, see DLT SQL language reference and DLT Python language reference.

For more on pipeline settings and configurations, see Configure a DLT pipeline.

Deploy your first pipeline and trigger updates

Before processing data with DLT, you must configure a pipeline. After a pipeline is configured, you can trigger an update to calculate results for each dataset in your pipeline. To get started using DLT pipelines, see Tutorial: Run your first DLT pipeline.

What is a pipeline update?

Pipelines deploy infrastructure and recompute data state when you start an update. An update does the following:

  • Starts a cluster with the correct configuration.
  • Discovers all the tables and views defined and checks for any analysis errors such as invalid column names, missing dependencies, and syntax errors.
  • Creates or updates tables and views with the most recent data available.

Pipelines can be run continuously or on a schedule depending on your use case’s cost and latency requirements. See Run an update on a DLT pipeline.

Ingest data with DLT

DLT supports all data sources available in Databricks.

Databricks recommends using streaming tables for most ingestion use cases. For files arriving in cloud object storage, Databricks recommends Auto Loader. You can directly ingest data with DLT from most message buses.
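
For example, a minimal sketch of Auto Loader ingestion into a streaming table; the source path is hypothetical:

Python
import dlt

@dlt.table(comment="JSON files ingested incrementally from cloud storage with Auto Loader")
def raw_orders():
    # "/Volumes/main/landing/orders" is a hypothetical path where new files arrive.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/landing/orders")
    )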

For more information about configuring access to cloud storage, see Cloud storage configuration.

For formats not supported by Auto Loader, you can use Python or SQL to query any format supported by Apache Spark. See Load data with DLT.

Monitor and enforce data quality

You can use expectations to specify data quality controls on the contents of a dataset. Unlike a CHECK constraint in a traditional database, which prevents adding any records that fail the constraint, expectations provide flexibility when processing data that fails data quality requirements. This flexibility allows you to process and store both data that you expect to be messy and data that must meet strict quality requirements. See Manage data quality with pipeline expectations.
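
For example, a minimal sketch of an expectation that drops records failing a quality check (the dataset and column names are illustrative):

Python
import dlt

@dlt.table(comment="Sensor readings that pass a basic quality check")
@dlt.expect_or_drop("valid_reading", "sensor_reading IS NOT NULL AND sensor_reading > 0")
def validated_sensor_data():
    return spark.read.table("input_data")

Related decorators such as dlt.expect (retain violating records and report metrics) and dlt.expect_or_fail (stop the update on violation) cover the other enforcement modes.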

DLT extends the functionality of Delta Lake. Because tables created and managed by DLT are Delta tables, they have the same guarantees and features provided by Delta Lake. See What is Delta Lake?.

DLT adds several table properties in addition to the many table properties that can be set in Delta Lake. See DLT properties reference and Delta table properties reference.

How tables are created and managed by DLT

Databricks automatically manages tables created with DLT, determining how updates need to be processed to correctly compute the current state of a table and performing a number of maintenance and optimization tasks.

For most operations, you should allow DLT to process all updates, inserts, and deletes to a target table. For details and limitations, see Retain manual deletes or updates.

Maintenance tasks performed by DLT

DLT performs maintenance tasks within 24 hours of a table being updated. Maintenance can improve query performance and reduce cost by removing old versions of tables. By default, the system performs a full OPTIMIZE operation followed by VACUUM. You can disable OPTIMIZE for a table by setting pipelines.autoOptimize.managed = false in the table properties for the table. Maintenance tasks are performed only if a pipeline update has run in the 24 hours before the maintenance tasks are scheduled.
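
For example, a sketch of disabling automatic OPTIMIZE for a table defined in Python; the dataset names are illustrative:

Python
import dlt

@dlt.table(
    comment="Table excluded from automatic OPTIMIZE maintenance",
    table_properties={"pipelines.autoOptimize.managed": "false"},
)
def archived_events():
    return spark.read.table("raw_events")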

Delta Live Tables is now DLT

The product formerly known as Delta Live Tables is now DLT.

Limitations

For a list of limitations, see DLT Limitations.

For a list of requirements and limitations that are specific to using DLT with Unity Catalog, see Use Unity Catalog with your DLT pipelines.
