Tutorial: Delta Lake
August 30, 2024
This tutorial introduces common Delta Lake operations on Databricks, including creating a table, upserting to a table, reading from and writing to a table, updating and deleting rows, displaying table history, querying an earlier version of a table (time travel), optimizing a table, z-ordering by columns, and cleaning up snapshots with vacuum.
You can run the example Python, Scala, and SQL code in this article from within a notebook attached to a Databricks compute resource such as a cluster. You can also run the SQL code in this article from within a query associated with a SQL warehouse in Databricks SQL.
Prepare the source data
This tutorial relies on a dataset called People 10 M. It contains 10 million fictitious records that hold facts about people, like first and last names, date of birth, and salary. This tutorial assumes that this dataset is in a Unity Catalog volume that is associated with your target Databricks workspace.
To get the People 10 M dataset for this tutorial, do the following:
1. Go to the People 10 M page in Kaggle.
2. Click Download to download a file named archive.zip to your local machine.
3. Extract the file named export.csv from the archive.zip file. The export.csv file contains the data for this tutorial.
To upload the export.csv file into the volume, do the following:
1. On the sidebar, click Catalog.
2. In Catalog Explorer, browse to and open the volume where you want to upload the export.csv file.
3. Click Upload to this volume.
4. Drag and drop, or browse to and select, the export.csv file on your local machine.
5. Click Upload.
In the following code examples, replace /Volumes/main/default/my-volume/export.csv with the path to the export.csv file in your target volume.
Create a table
All tables created on Databricks use Delta Lake by default. Databricks recommends using Unity Catalog managed tables.
In the following code examples, replace the table name main.default.people_10m with your target three-part catalog, schema, and table name in Unity Catalog.
Note
Delta Lake is the default for all reads, writes, and table creation commands on Databricks.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, TimestampType
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("firstName", StringType(), True),
    StructField("middleName", StringType(), True),
    StructField("lastName", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("birthDate", TimestampType(), True),
    StructField("ssn", StringType(), True),
    StructField("salary", IntegerType(), True)
])
df = spark.read.format("csv").option("header", True).schema(schema).load("/Volumes/main/default/my-volume/export.csv")
# Create the table if it does not exist. Otherwise, replace the existing table.
df.writeTo("main.default.people_10m").createOrReplace()
# If you know the table does not already exist, you can call this instead:
# df.write.saveAsTable("main.default.people_10m")
The preceding operations create a new managed table. For information about available options when you create a Delta table, see CREATE TABLE.
In Databricks Runtime 13.3 LTS and above, you can use CREATE TABLE LIKE to create a new empty Delta table that duplicates the schema and table properties for a source Delta table. This can be especially useful when promoting tables from a development environment into production, as shown in the following code example:
CREATE TABLE main.default.people_10m_prod LIKE main.default.people_10m
To create an empty table, you can also use the DeltaTableBuilder API in Delta Lake for Python and Scala. Compared to equivalent DataFrameWriter APIs, these APIs make it easier to specify additional information like column comments, table properties, and generated columns.
Preview
This feature is in Public Preview.
from delta.tables import DeltaTable

(DeltaTable.createIfNotExists(spark)
    .tableName("main.default.people_10m")
    .addColumn("id", "INT")
    .addColumn("firstName", "STRING")
    .addColumn("middleName", "STRING")
    .addColumn("lastName", "STRING", comment = "surname")
    .addColumn("gender", "STRING")
    .addColumn("birthDate", "TIMESTAMP")
    .addColumn("ssn", "STRING")
    .addColumn("salary", "INT")
    .execute()
)
Upsert to a table
To merge a set of updates and insertions into an existing Delta table, you use the DeltaTable.merge method for Python and Scala, and the MERGE INTO statement for SQL. The following example takes data from the source table and merges it into the target Delta table. When there is a matching row in both tables, Delta Lake updates the data column using the given expression. When there is no matching row, Delta Lake adds a new row. This operation is known as an upsert.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType
from datetime import date
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("firstName", StringType(), True),
    StructField("middleName", StringType(), True),
    StructField("lastName", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("birthDate", DateType(), True),
    StructField("ssn", StringType(), True),
    StructField("salary", IntegerType(), True)
])

data = [
    (9999998, 'Billy', 'Tommie', 'Luppitt', 'M', date.fromisoformat('1992-09-17'), '953-38-9452', 55250),
    (9999999, 'Elias', 'Cyril', 'Leadbetter', 'M', date.fromisoformat('1984-05-22'), '906-51-2137', 48500),
    (10000000, 'Joshua', 'Chas', 'Broggio', 'M', date.fromisoformat('1968-07-22'), '988-61-6247', 90000),
    (20000001, 'John', '', 'Doe', 'M', date.fromisoformat('1978-01-14'), '345-67-8901', 55500),
    (20000002, 'Mary', '', 'Smith', 'F', date.fromisoformat('1982-10-29'), '456-78-9012', 98250),
    (20000003, 'Jane', '', 'Doe', 'F', date.fromisoformat('1981-06-25'), '567-89-0123', 89900)
]
people_10m_updates = spark.createDataFrame(data, schema)
people_10m_updates.createTempView("people_10m_updates")
# ...
from delta.tables import DeltaTable
deltaTable = DeltaTable.forName(spark, 'main.default.people_10m')
(deltaTable.alias("people_10m")
.merge(
people_10m_updates.alias("people_10m_updates"),
"people_10m.id = people_10m_updates.id")
.whenMatchedUpdateAll()
.whenNotMatchedInsertAll()
.execute()
)
In SQL, if you specify *, this updates or inserts all columns in the target table, assuming that the source table has the same columns as the target table. If the target table doesn’t have the same columns, the query throws an analysis error.
You must specify a value for every column in your table when you perform an insert operation (for example, when there is no matching row in the existing dataset). However, you do not need to update all values.
To see the results, query the table.
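For example, a minimal Python sketch (assuming the table name and the id values used in the merge example above) displays the upserted rows:

# Show only the rows affected by the merge (ids 9999998 and above).
display(spark.table("main.default.people_10m").where("id >= 9999998"))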
Read a table
You access data in Delta tables by the table name or the table path, as shown in the following examples:
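For instance, a minimal Python sketch that reads the table by name might look like this (a path-based read would instead pass the table's storage location to spark.read.load):

# Read the Delta table by its three-part name and display it.
people_df = spark.read.table("main.default.people_10m")
display(people_df)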
Write to a table
Delta Lake uses standard syntax for writing data to tables.
To atomically add new data to an existing Delta table, use the append mode as shown in the following examples:
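A minimal Python sketch of an append, reusing the df DataFrame loaded from export.csv earlier, might look like this:

# Atomically append the rows in df to the existing Delta table.
df.write.mode("append").saveAsTable("main.default.people_10m")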
To replace all the data in a table, use the overwrite mode as in the following examples:
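Similarly, a sketch of an overwrite of the same table, again assuming the df DataFrame from earlier, could be:

# Atomically replace all data in the table with the contents of df.
df.write.mode("overwrite").saveAsTable("main.default.people_10m")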
Update a table
You can update data that matches a predicate in a Delta table. For example, in the example people_10m table, to change an abbreviation in the gender column from M or F to Male or Female, you can run the following:
from delta.tables import *
from pyspark.sql.functions import *
deltaTable = DeltaTable.forName(spark, "main.default.people_10m")
# Declare the predicate by using a SQL-formatted string.
deltaTable.update(
condition = "gender = 'F'",
set = { "gender": "'Female'" }
)
# Declare the predicate by using Spark SQL functions.
deltaTable.update(
condition = col('gender') == 'M',
set = { 'gender': lit('Male') }
)
Delete from a table
You can remove data that matches a predicate from a Delta table. For instance, in the example people_10m table, to delete all rows corresponding to people with a value in the birthDate column from before 1955, you can run the following:
from delta.tables import *
from pyspark.sql.functions import *
deltaTable = DeltaTable.forName(spark, "main.default.people_10m")
# Declare the predicate by using a SQL-formatted string.
deltaTable.delete("birthDate < '1955-01-01'")
# Declare the predicate by using Spark SQL functions.
deltaTable.delete(col('birthDate') < '1955-01-01')
Important
Deletion removes the data from the latest version of the Delta table but does not remove it from the physical storage until the old versions are explicitly vacuumed. See vacuum for details.
Display table history
To view the history of a table, you use the DeltaTable.history method for Python and Scala, and the DESCRIBE HISTORY statement in SQL, which provides provenance information, including the table version, operation, user, and so on, for each write to a table.
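A minimal Python sketch of displaying a table's history might look like the following:

from delta.tables import *

deltaTable = DeltaTable.forName(spark, "main.default.people_10m")

# Each row in the history describes one write to the table.
display(deltaTable.history())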
Query an earlier version of the table (time travel)
Delta Lake time travel allows you to query an older snapshot of a Delta table.
To query an older version of a table, specify the table’s version or timestamp. For example, to query version 0 or timestamp 2024-05-15T22:43:15.000+00:00 from the preceding history, use the following:
from delta.tables import *
deltaTable = DeltaTable.forName(spark, "main.default.people_10m")
deltaHistory = deltaTable.history()
display(deltaHistory.where("version == 0"))
# Or:
display(deltaHistory.where("timestamp == '2024-05-15T22:43:15.000+00:00'"))
For timestamps, only date or timestamp strings are accepted, for example, "2024-05-15T22:43:15.000+00:00" or "2024-05-15 22:43:15".
DataFrameReader options allow you to create a DataFrame from a Delta table that is fixed to a specific version or timestamp of the table, for example:
df = spark.read.option('versionAsOf', 0).table("main.default.people_10m")
# Or:
df = spark.read.option('timestampAsOf', '2024-05-15T22:43:15.000+00:00').table("main.default.people_10m")
display(df)
For details, see Work with Delta Lake table history.
Optimize a table
After you have performed multiple changes to a table, you might have a lot of small files. To improve the speed of read queries, you can use the optimize operation to collapse small files into larger ones:
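A minimal Python sketch of the optimize operation, assuming the same people_10m table, might look like this:

from delta.tables import *

deltaTable = DeltaTable.forName(spark, "main.default.people_10m")

# Compact small data files into larger ones.
deltaTable.optimize().executeCompaction()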
Z-order by columns
To improve read performance further, you can collocate related information in the same set of files by z-ordering. Delta Lake data-skipping algorithms use this collocation to dramatically reduce the amount of data that needs to be read. To z-order data, you specify the columns to order on in the z-order by operation. For example, to collocate by gender, run:
from delta.tables import *
deltaTable = DeltaTable.forName(spark, "main.default.people_10m")
deltaTable.optimize().executeZOrderBy("gender")
For the full set of options available when running the optimize operation, see Optimize data file layout.
Clean up snapshots with VACUUM
Delta Lake provides snapshot isolation for reads, which means that it is safe to run an optimize operation even while other users or jobs are querying the table. Eventually, however, you should clean up old snapshots. You can do this by running the vacuum operation:
from delta.tables import *
deltaTable = DeltaTable.forName(spark, "main.default.people_10m")
deltaTable.vacuum()
For details on using the vacuum operation effectively, see Remove unused data files with vacuum.