What is Delta Live Tables?
Delta Live Tables is a declarative framework for building reliable, maintainable, and testable data processing pipelines. You define the transformations to perform on your data and Delta Live Tables manages task orchestration, cluster management, monitoring, data quality, and error handling.
Instead of defining your data pipelines using a series of separate Apache Spark tasks, you define streaming tables and materialized views that the system should create and keep up to date. Delta Live Tables manages how your data is transformed based on queries you define for each processing step. You can also enforce data quality with Delta Live Tables expectations, which allow you to define expected data quality and specify how to handle records that fail those expectations.
What are Delta Live Tables datasets?
Delta Live Tables datasets are the streaming tables, materialized views, and views maintained as the results of declarative queries. The following table describes how each dataset is processed:
Dataset type: How are records processed through defined queries?
Streaming table: Each record is processed exactly once. This assumes an append-only source.
Materialized view: Records are processed as required to return accurate results for the current data state. Materialized views should be used for data sources with updates, deletions, or aggregations, and for change data capture (CDC) processing.
View: Records are processed each time the view is queried. Use views for intermediate transformations and data quality checks that should not be published to public datasets.
The following sections provide more detailed descriptions of each dataset type. To learn more about selecting dataset types to implement your data processing requirements, see When to use views, materialized views, and streaming tables.
Streaming table
A streaming table is a Delta table with extra support for streaming or incremental data processing. Streaming tables allow you to process a growing dataset, handling each row only once. Because most datasets grow continuously over time, streaming tables are good for most ingestion workloads. Streaming tables are optimal for pipelines that require data freshness and low latency. Streaming tables can also be useful for massive-scale transformations, because results can be calculated incrementally as new data arrives, keeping results up to date without fully recomputing all source data with each update. Streaming tables are designed for data sources that are append-only.
Although, by default, streaming tables require append-only data sources, when a streaming source is another streaming table that requires updates or deletes, you can override this behavior with the skipChangeCommits flag.
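The append-only requirement and the effect of skipChangeCommits can be illustrated with a small pure-Python model. This is a conceptual sketch, not how Delta Live Tables is implemented: the Commit and StreamingReader classes and the "append"/"change" labels are hypothetical names introduced here. The reader remembers how far it has read so that each commit is handled once, rejects commits that update or delete rows by default, and skips them when the flag is set.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    """One transaction in a source table's log (hypothetical model)."""
    version: int
    kind: str   # "append", or "change" for an update/delete
    rows: list


@dataclass
class StreamingReader:
    """Reads each commit once, remembering how far it has read."""
    skip_change_commits: bool = False
    next_version: int = 0
    target: list = field(default_factory=list)

    def update(self, log):
        for commit in log:
            if commit.version < self.next_version:
                continue  # already processed in an earlier update
            if commit.kind == "append":
                self.target.extend(commit.rows)
            elif not self.skip_change_commits:
                raise ValueError(
                    f"commit {commit.version} is not append-only")
            # With the flag set, a change commit is skipped entirely:
            # its updates and deletes are not propagated to the target.
            self.next_version = commit.version + 1


log = [Commit(0, "append", ["a", "b"]), Commit(1, "append", ["c"])]
reader = StreamingReader(skip_change_commits=True)
reader.update(log)

log.append(Commit(2, "change", ["b-deleted"]))
log.append(Commit(3, "append", ["d"]))
reader.update(log)       # only commits 2 and 3 are examined
print(reader.target)     # ['a', 'b', 'c', 'd']
```

Note that skipping a change commit means the target no longer mirrors the source: the deletion in commit 2 is never applied downstream.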
Materialized view
A materialized view (or live table) is a view where the results have been precomputed. Materialized views are refreshed according to the update schedule of the pipeline in which they’re contained. Materialized views are powerful because they can handle any changes in the input. Each time the pipeline updates, query results are recalculated to reflect changes in upstream datasets that might have occurred because of compliance, corrections, aggregations, or general CDC. Delta Live Tables implements materialized views as Delta tables, but abstracts away complexities associated with efficient application of updates, allowing users to focus on writing queries.
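As a rough illustration of why a materialized view can absorb updates and deletes, the sketch below recomputes an aggregate from the current state of its source on every refresh. The orders data and the refresh_totals function are hypothetical, and the full recomputation is a simplification: the actual system may apply changes more efficiently.

```python
from collections import defaultdict


def refresh_totals(orders):
    """Recompute per-customer totals from the current source state."""
    totals = defaultdict(int)
    for order in orders.values():
        totals[order["customer"]] += order["amount"]
    return dict(totals)


orders = {
    1: {"customer": "acme", "amount": 100},
    2: {"customer": "acme", "amount": 50},
    3: {"customer": "globex", "amount": 70},
}
before = refresh_totals(orders)   # {'acme': 150, 'globex': 70}

orders[2]["amount"] = 20          # a correction to an existing record
del orders[3]                     # a deletion, e.g. for compliance
after = refresh_totals(orders)    # {'acme': 120}
```

Because the result is derived from the current data rather than from a stream of appended rows, the correction and the deletion are both reflected after the next refresh.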
Views
All views in Databricks compute results from source datasets as they are queried, leveraging caching optimizations when available. Delta Live Tables does not publish views to the catalog, so views can be referenced only within the pipeline in which they are defined. Views are useful as intermediate queries that should not be exposed to end users or systems. Databricks recommends using views to enforce data quality constraints or transform and enrich datasets that drive multiple downstream queries.
Declare your first datasets in Delta Live Tables
Delta Live Tables introduces new syntax for Python and SQL. To get started with Delta Live Tables syntax, use one of the following tutorials:
Tutorial: Declare a data pipeline with SQL in Delta Live Tables
Tutorial: Declare a data pipeline with Python in Delta Live Tables
Delta Live Tables separates dataset definitions from update processing, and Delta Live Tables notebooks are not intended for interactive execution. See What is a Delta Live Tables pipeline?
What is a Delta Live Tables pipeline?
A pipeline is the main unit used to configure and run data processing workflows with Delta Live Tables.
A pipeline contains materialized views and streaming tables declared in Python or SQL source files. Delta Live Tables infers the dependencies between these tables, ensuring updates occur in the correct order. For each dataset, Delta Live Tables compares the current state with the desired state and proceeds to create or update datasets using efficient processing methods.
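Ordering updates by inferred dependencies amounts to a topological sort of the dataset graph. The following sketch shows the idea with Python's standard graphlib module. The dataset names are hypothetical, and the dependency map is written by hand here, whereas Delta Live Tables derives it from the queries you declare.

```python
from graphlib import TopologicalSorter

# Map each dataset to the datasets it reads from (hypothetical names).
dependencies = {
    "raw_orders": set(),
    "raw_customers": set(),
    "clean_orders": {"raw_orders"},
    "orders_by_customer": {"clean_orders", "raw_customers"},
}

# static_order() yields every dataset after all of its upstream datasets.
update_order = list(TopologicalSorter(dependencies).static_order())
print(update_order)
```

Any order produced this way updates raw_orders before clean_orders, and both clean_orders and raw_customers before orders_by_customer. A circular dependency raises graphlib.CycleError.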
The settings of Delta Live Tables pipelines fall into two broad categories:
Configurations that define a collection of notebooks or files (known as source code or libraries) that use Delta Live Tables syntax to declare datasets.
Configurations that control pipeline infrastructure, how updates are processed, and how tables are saved in the workspace.
Most configurations are optional, but some require careful attention, especially when configuring production pipelines. These include the following:
To make data available outside the pipeline, you must declare a target schema to publish to the Hive metastore or a target catalog and target schema to publish to Unity Catalog.
Data access permissions are configured through the cluster used for execution. Make sure your cluster has appropriate permissions configured for data sources and the target storage location, if specified.
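For orientation, the two categories of settings might look like the following when written out as a Python dictionary that mirrors a pipeline's JSON settings. Treat this as a sketch under assumptions: the field names (name, libraries, catalog, target, continuous) are given as recalled from the pipeline settings format and should be checked against the settings reference, and every value is a hypothetical placeholder.

```python
import json

pipeline_settings = {
    # Hypothetical pipeline name.
    "name": "example_pipeline",
    # Source code: notebooks or files that declare datasets.
    "libraries": [
        {"notebook": {"path": "/Repos/example/pipelines/orders"}},
    ],
    # Publishing: target catalog and schema (Unity Catalog case).
    "catalog": "main",
    "target": "sales",
    # Update processing: run on a trigger rather than continuously.
    "continuous": False,
}

print(json.dumps(pipeline_settings, indent=2))
```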
For details on using Python and SQL to write source code for pipelines, see Delta Live Tables SQL language reference and Delta Live Tables Python language reference.
For more on pipeline settings and configurations, see Configure pipeline settings for Delta Live Tables.
Deploy your first pipeline and trigger updates
Before processing data with Delta Live Tables, you must configure a pipeline. Once a pipeline is configured, you can trigger an update to calculate results for each dataset in your pipeline. To get started using Delta Live Tables pipelines, see Tutorial: Run your first Delta Live Tables pipeline.
What is a pipeline update?
Pipelines deploy infrastructure and recompute data state when you start an update. An update does the following:
Starts a cluster with the correct configuration.
Discovers all the tables and views defined, and checks for any analysis errors such as invalid column names, missing dependencies, and syntax errors.
Creates or updates tables and views with the most recent data available.
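The phases above can be modeled in a few lines of Python. This is a simplified sketch: analyze and run_update are hypothetical functions, only the missing-dependency check is modeled, and every source is assumed to be defined inside the pipeline.

```python
def analyze(definitions):
    """Return analysis errors for datasets that read undefined datasets."""
    errors = []
    for name, sources in definitions.items():
        for source in sorted(sources):
            if source not in definitions:
                errors.append(f"{name}: missing dependency '{source}'")
    return errors


def run_update(definitions):
    """Model the phases of an update and return a log of what happened."""
    log = ["start cluster", f"discovered {len(definitions)} datasets"]
    errors = analyze(definitions)
    if errors:
        # In this model, analysis errors stop the update before any
        # dataset is created or updated.
        raise ValueError("; ".join(errors))
    log.extend(f"update {name}" for name in definitions)
    return log


ok = run_update({"raw": set(), "clean": {"raw"}})
problems = analyze({"clean": {"raw"}})
print(problems)   # ["clean: missing dependency 'raw'"]
```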
Pipelines can be run continuously or on a schedule depending on your use case’s cost and latency requirements. See Run an update on a Delta Live Tables pipeline.
Ingest data with Delta Live Tables
Delta Live Tables supports all data sources available in Databricks.
Databricks recommends using streaming tables for most ingestion use cases. For files arriving in cloud object storage, Databricks recommends Auto Loader. You can directly ingest data with Delta Live Tables from most message buses.
For more information about configuring access to cloud storage, see Cloud storage configuration.
For formats not supported by Auto Loader, you can use Python or SQL to query any format supported by Apache Spark. See Load data with Delta Live Tables.
Monitor and enforce data quality
You can use expectations to specify data quality controls on the contents of a dataset. Unlike a CHECK constraint in a traditional database, which prevents adding any records that fail the constraint, expectations provide flexibility when processing data that fails data quality requirements. This flexibility allows you to process and store data that you expect to be messy and data that must meet strict quality requirements. See Manage data quality with Delta Live Tables.
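To make the contrast with a CHECK constraint concrete, the sketch below models three ways a failed expectation can be handled: keep the record and count the violation, drop the record, or fail the update. It is a plain-Python illustration rather than the Delta Live Tables API. The apply_expectation function and its on_violation values are hypothetical names chosen for this example.

```python
def apply_expectation(records, predicate, on_violation="warn"):
    """Return (kept_records, failed_count) for one expectation."""
    if on_violation not in ("warn", "drop", "fail"):
        raise ValueError(f"unknown action: {on_violation}")
    kept, failed = [], 0
    for record in records:
        if predicate(record):
            kept.append(record)
            continue
        failed += 1
        if on_violation == "fail":
            raise RuntimeError(f"expectation violated by {record!r}")
        if on_violation == "warn":
            kept.append(record)   # keep the record, only count it
        # "drop": the record is discarded
    return kept, failed


rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": -5}]


def valid_amount(row):
    return row["amount"] >= 0


warned = apply_expectation(rows, valid_amount, "warn")   # both rows kept
dropped = apply_expectation(rows, valid_amount, "drop")  # only id 1 kept
```

A CHECK constraint corresponds only to the "fail" behavior. The other two modes are what allow messy data to be stored while the number of violations is still tracked.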
How tables are created and managed by Delta Live Tables
Databricks automatically manages tables created with Delta Live Tables, determining how updates need to be processed to correctly compute the current state of a table and performing a number of maintenance and optimization tasks.
For most operations, you should allow Delta Live Tables to process all updates, inserts, and deletes to a target table. For details and limitations, see Retain manual deletes or updates.
Maintenance tasks performed by Delta Live Tables
Delta Live Tables performs maintenance tasks within 24 hours of a table being updated. Maintenance can improve query performance and reduce cost by removing old versions of tables. By default, the system performs a full OPTIMIZE operation followed by VACUUM. You can disable OPTIMIZE for a table by setting pipelines.autoOptimize.managed = false in the table properties for the table. Maintenance tasks are performed only if a pipeline update has run in the 24 hours before the maintenance tasks are scheduled.
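The 24-hour condition can be read as a simple time-window check. The following sketch assumes the rule is evaluated against the timestamp of the most recent pipeline update. The maintenance_due function is a hypothetical helper, not part of any Databricks API.

```python
from datetime import datetime, timedelta

MAINTENANCE_WINDOW = timedelta(hours=24)


def maintenance_due(last_update, scheduled_at):
    """True if a pipeline update ran in the 24 hours before scheduled_at."""
    elapsed = scheduled_at - last_update
    return timedelta(0) <= elapsed <= MAINTENANCE_WINDOW


scheduled = datetime(2024, 1, 2, 3, 0)
recent = maintenance_due(scheduled - timedelta(hours=5), scheduled)   # True
stale = maintenance_due(scheduled - timedelta(hours=30), scheduled)   # False
```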
To ensure the maintenance cluster has the required storage location access, you must apply the security configurations required to access your storage locations to both the default and maintenance clusters. See Configure your compute settings.
The following limitations apply:
All tables created and updated by Delta Live Tables are Delta tables.
Delta Live Tables tables can only be defined once, meaning they can only be the target of a single operation in all Delta Live Tables pipelines.
Identity columns are not supported with tables that are the target of APPLY CHANGES INTO and might be recomputed during updates for materialized views. For this reason, Databricks recommends only using identity columns with streaming tables in Delta Live Tables. See Use identity columns in Delta Lake.
A Databricks workspace is limited to 100 concurrent pipeline updates.
Delta Live Tables has full support in the Databricks REST API. See Delta Live Tables API guide.
For pipeline and table settings, see Delta Live Tables properties reference.