Tutorial: Declare a data pipeline with SQL in Delta Live Tables
This tutorial shows you how to use SQL syntax to declare a data pipeline with Delta Live Tables. Databricks recommends Delta Live Tables with SQL as the preferred way for SQL users to build new ETL, ingestion, and transformation pipelines on Databricks. SQL syntax for Delta Live Tables extends standard Spark SQL with many new keywords, constructs, and table-valued functions. These additions to standard SQL allow users to declare dependencies between datasets and deploy production-grade infrastructure without needing to learn any new tooling or additional concepts.
For users familiar with Spark DataFrames who need support for more extensive testing and metaprogramming operations, Databricks recommends using Python for Delta Live Tables. See Tutorial: Declare a data pipeline with Python in Delta Live Tables.
Note
You cannot mix languages within a Delta Live Tables source file. You can use multiple notebooks or files with different languages in a pipeline.
Where do you run Delta Live Tables SQL queries?
You must add your SQL files to a pipeline configuration to process query logic. To learn about executing logic defined in Delta Live Tables, see Tutorial: Run your first Delta Live Tables pipeline.
While you can use notebooks or SQL files to write Delta Live Tables SQL queries, Delta Live Tables is not designed to run interactively in notebook cells. Executing a cell that contains Delta Live Tables syntax in a Databricks notebook returns a message about whether the query is syntactically valid, but does not run query logic.
Declare a Delta Live Tables pipeline with SQL
This tutorial uses SQL syntax to declare a Delta Live Tables pipeline on a dataset containing Wikipedia clickstream data to:
Read the raw JSON clickstream data into a table.
Read the records from the raw data table and use Delta Live Tables expectations to create a new table that contains cleansed data.
Use the records from the cleansed data table to make Delta Live Tables queries that create derived datasets.
This code demonstrates a simplified example of the medallion architecture. See What is the medallion lakehouse architecture?.
Copy the SQL code and paste it into a new notebook. You can add the example code to a single cell of the notebook or multiple cells. To review options for creating notebooks, see Create a notebook.
Create a table from files in object storage
Delta Live Tables supports loading data from all formats supported by Databricks. See Interact with external data on Databricks.
All Delta Live Tables SQL statements use CREATE OR REFRESH syntax and semantics. When you update a pipeline, Delta Live Tables determines whether the logically correct result for the table can be accomplished through incremental processing or if full recomputation is required.
The following example creates a table by loading data from JSON files stored in object storage:
CREATE OR REFRESH LIVE TABLE clickstream_raw
COMMENT "The raw wikipedia clickstream dataset, ingested from /databricks-datasets."
AS SELECT * FROM json.`/databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed-json/2015_2_clickstream.json`;
Add a table from an upstream dataset to the pipeline
You can use the live virtual schema to query data from other datasets declared in your current Delta Live Tables pipeline. Declaring new tables in this way creates a dependency that Delta Live Tables automatically resolves before executing updates. The live schema is a custom keyword implemented in Delta Live Tables that can be substituted for a target schema if you want to publish your datasets. See Publish data from Delta Live Tables pipelines.
The following code also includes examples of monitoring and enforcing data quality with expectations. See Manage data quality with Delta Live Tables.
CREATE OR REFRESH LIVE TABLE clickstream_prepared(
  CONSTRAINT valid_current_page EXPECT (current_page_title IS NOT NULL),
  CONSTRAINT valid_count EXPECT (click_count > 0) ON VIOLATION FAIL UPDATE
)
COMMENT "Wikipedia clickstream data cleaned and prepared for analysis."
AS SELECT
  curr_title AS current_page_title,
  CAST(n AS INT) AS click_count,
  prev_title AS previous_page_title
FROM live.clickstream_raw;
Create an enriched data view
Because Delta Live Tables processes updates to pipelines as a series of dependency graphs, you can create highly enriched views that power dashboards, BI, and analytics by declaring tables with specific business logic.
Live tables are equivalent conceptually to materialized views. Whereas traditional views on Spark execute logic each time the view is queried, live tables store the most recent version of query results in data files. Because Delta Live Tables manages updates for all datasets in a pipeline, you can schedule pipeline updates to match latency requirements for materialized views and know that queries against these tables contain the most recent version of data available.
The following code creates an enriched materialized view of upstream data:
CREATE OR REFRESH LIVE TABLE top_spark_referers
COMMENT "A table containing the top pages linking to the Apache Spark page."
AS SELECT
  previous_page_title AS referrer,
  click_count
FROM live.clickstream_prepared
WHERE current_page_title = 'Apache_Spark'
ORDER BY click_count DESC
LIMIT 10;
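If a derived dataset is only needed as an intermediate result by other datasets in the same pipeline, you can declare a view instead of a table. Views are not persisted or published to a target schema. The following is a minimal sketch built on the same cleansed data; the view name popular_pages and the click threshold are illustrative, not part of the tutorial:

-- Illustrative sketch: the view name and threshold are hypothetical.
-- Views are only available within the pipeline and are not persisted.
CREATE TEMPORARY LIVE VIEW popular_pages
COMMENT "Pages with more than 100 recorded clicks."
AS SELECT
  current_page_title,
  click_count
FROM live.clickstream_prepared
WHERE click_count > 100;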
Example using other Delta Live Tables options
Delta Live Tables materialized views and streaming tables support other options not shown in the examples above. The following example specifies the schema for the target table, including the use of Delta Lake generated columns, and defines partition columns for the target table.
Note
For tables less than 1 TB in size, Databricks recommends letting Delta Live Tables control data organization. Unless you expect your table to grow beyond a terabyte, you should generally not specify partition columns.
CREATE OR REFRESH LIVE TABLE sales
  (customer_id STRING,
  customer_name STRING,
  number_of_line_items STRING,
  order_datetime STRING,
  order_number LONG,
  order_day_of_week STRING GENERATED ALWAYS AS (dayofweek(order_datetime))
  )
PARTITIONED BY (order_day_of_week)
COMMENT "Raw data on sales"
AS SELECT * FROM ...
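The examples in this tutorial all declare live tables, which behave as materialized views over their source data. For incremental ingestion from a location that continuously receives new files, Delta Live Tables also supports streaming tables. The following is a minimal sketch, assuming new JSON files arrive in the source directory over time; the table name is illustrative, and Auto Loader (the cloud_files function) is used as the source:

-- Illustrative sketch: the table name is hypothetical.
-- cloud_files (Auto Loader) incrementally discovers new files in the source directory.
CREATE OR REFRESH STREAMING LIVE TABLE clickstream_streaming_raw
COMMENT "Incrementally ingests raw clickstream JSON files as they arrive."
AS SELECT * FROM cloud_files(
  "/databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed-json/",
  "json"
);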