Update Delta Lake Tables

Delta Lake on Databricks supports several statements to facilitate updating Delta Lake tables.

UPDATE

The UPDATE statement allows you to apply expressions to change the value of columns when a row matches a predicate. For example, you can use UPDATE to fix a spelling mistake in the eventType:

UPDATE events SET eventType = 'click' WHERE eventType = 'clck'

Similar to delete, update operations automatically make use of the partitioning of the table when possible.

Upserts (MERGE INTO)

The MERGE INTO statement allows you to merge a set of updates and insertions into an existing dataset. For example, the following statement takes a stream of updates and merges it into the events table. When there is already an event present with the same eventId, Delta Lake updates the data column using the given expression. When there is no matching event, Delta Lake adds a new row.

Here’s a worked example:

MERGE INTO events
USING updates
ON events.eventId = updates.eventId
WHEN MATCHED THEN
  UPDATE SET
    events.data = updates.data
WHEN NOT MATCHED
  THEN INSERT (date, eventId, data) VALUES (date, eventId, data)

You must specify a value for every column in your table when you perform an INSERT (for example, when there is no matching row in the existing dataset). However, you do not need to update all values.

Important

MERGE INTO requires that the update table is small. There are no requirements on the destination table size. If your workload does not satisfy this requirement, try using separate INSERT and UPDATE statements.

Tip

You should add as much information to the ON condition in MERGE INTO to both reduce the amount of work and reduce the chances of transaction conflicts. For example, suppose you have a table that is partitioned by country and date and you use MERGE to update information for the last day country by country. If you’re updating country='USA', then you can write a MERGE statement such as:

MERGE INTO target_table
USING source
ON target_table.user_id = source.user_id AND target_table.date = current_date() AND country = 'USA'
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

This lets you break a very large MERGE operation into smaller chunks and run them all in parallel to get better performance or meet SLAs.

DELETE FROM

Delta Lake tables allow you to remove data that matches a predicate. For instance, to delete all events from before 2017, you can run the following DML:

DELETE FROM events WHERE date < '2017-01-01'

Delete operations automatically make use of the partitioning of the table when possible. This optimization means that it will be significantly faster to delete data based on partition predicates.