Selectively overwrite data with Delta Lake
Databricks leverages Delta Lake functionality to support two distinct options for selective overwrites:
- The `replaceWhere` option atomically replaces all records that match a given predicate.
- You can replace directories of data based on how tables are partitioned using dynamic partition overwrites.
For most operations, Databricks recommends using `replaceWhere` to specify which data to overwrite.
If data has been accidentally overwritten, you can use restore to undo the change.
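For example, here is a minimal sketch of undoing an accidental overwrite with `RESTORE` (the table name and version number are illustrative; choose the version from the table's history):

```sql
-- Review the table history to find the version committed just before the bad write.
DESCRIBE HISTORY events;

-- Restore the table to that earlier version (10 is only a placeholder).
RESTORE TABLE events TO VERSION AS OF 10;
```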
Arbitrary selective overwrite with replaceWhere
You can selectively overwrite only the data that matches an arbitrary expression.
Using SQL for this operation requires Databricks Runtime 12.2 LTS or above.
The following command atomically replaces events in January in the target table, which is partitioned by `start_date`, with the data in `replace_data`:
Python:

```python
(replace_data.write
  .mode("overwrite")
  .option("replaceWhere", "start_date >= '2017-01-01' AND end_date <= '2017-01-31'")
  .table("events")
)
```

Scala:

```scala
replace_data.write
  .mode("overwrite")
  .option("replaceWhere", "start_date >= '2017-01-01' AND end_date <= '2017-01-31'")
  .table("events")
```

SQL:

```sql
INSERT INTO TABLE events
REPLACE WHERE start_date >= '2017-01-01' AND end_date <= '2017-01-31'
SELECT * FROM replace_data
```
This sample code writes out the data in `replace_data`, validates that all rows match the predicate, and performs an atomic replacement using `overwrite` semantics. If any values in the operation fall outside the constraint, the operation fails with an error by default.

You can change this behavior to overwrite values within the predicate range and insert records that fall outside the specified range. To do so, disable the constraint check by setting `spark.databricks.delta.replaceWhere.constraintCheck.enabled` to `false` using one of the following settings:
Python:

```python
spark.conf.set("spark.databricks.delta.replaceWhere.constraintCheck.enabled", False)
```

Scala:

```scala
spark.conf.set("spark.databricks.delta.replaceWhere.constraintCheck.enabled", false)
```

SQL:

```sql
SET spark.databricks.delta.replaceWhere.constraintCheck.enabled=false
```
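As a sketch of the resulting behavior, reusing the table and data from the example above: with the constraint check disabled, rows in `replace_data` that fall inside the January range replace the existing data in that range, and rows outside the range are inserted instead of causing the write to fail:

```sql
SET spark.databricks.delta.replaceWhere.constraintCheck.enabled=false;

-- Rows matching the predicate overwrite existing data in that range;
-- rows outside the predicate are inserted rather than raising an error.
INSERT INTO TABLE events
REPLACE WHERE start_date >= '2017-01-01' AND end_date <= '2017-01-31'
SELECT * FROM replace_data;
```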
Legacy behavior
The legacy default behavior had `replaceWhere` overwrite data matching a predicate over partition columns only. With this legacy model, the following command would atomically replace the month of January in the target table, which is partitioned by date, with the data in `df`:
Python:

```python
(df.write
  .mode("overwrite")
  .option("replaceWhere", "birthDate >= '2017-01-01' AND birthDate <= '2017-01-31'")
  .table("people10m")
)
```

Scala:

```scala
df.write
  .mode("overwrite")
  .option("replaceWhere", "birthDate >= '2017-01-01' AND birthDate <= '2017-01-31'")
  .table("people10m")
```
If you want to fall back to the old behavior, you can disable the `spark.databricks.delta.replaceWhere.dataColumns.enabled` flag:
Python:

```python
spark.conf.set("spark.databricks.delta.replaceWhere.dataColumns.enabled", False)
```

Scala:

```scala
spark.conf.set("spark.databricks.delta.replaceWhere.dataColumns.enabled", false)
```

SQL:

```sql
SET spark.databricks.delta.replaceWhere.dataColumns.enabled=false
```
Dynamic partition overwrites
Dynamic partition overwrite updates only the partitions for which the write commits new data. It overwrites all existing data in those partitions and leaves others unchanged.
Databricks supports two approaches:
- `REPLACE USING` (recommended): Works across all compute types, including Databricks SQL warehouses, serverless compute, and classic compute. Doesn't require setting a Spark session configuration.
- `partitionOverwriteMode` (legacy): Requires classic compute and setting a Spark session configuration. Not supported on Databricks SQL or serverless compute.
The sections below demonstrate how to use each approach.
Dynamic partition overwrites with REPLACE USING
This feature is in Public Preview.
Databricks Runtime 16.3 and above supports dynamic partition overwrites for partitioned tables using `REPLACE USING`. This method lets you selectively overwrite data across all compute types, without needing to set a Spark session configuration. `REPLACE USING` enables compute-independent, atomic overwrite behavior that works on Databricks SQL warehouses, serverless compute, and classic compute.

`REPLACE USING` only overwrites the partitions targeted by the incoming data. All other partitions remain unchanged.
The following example demonstrates how to use dynamic partition overwrite with `REPLACE USING`. Currently, you can only use SQL, not Python or Scala. For details, see INSERT in the SQL language reference.
```sql
INSERT INTO TABLE events
REPLACE USING (event_id, start_date)
SELECT * FROM source_data
```
Keep the following constraints and behaviors in mind for dynamic partition overwrites:
- You must specify the full set of the table's partition columns in the `USING` clause.
- Always validate that the written data touches only the expected partitions; a single row in the wrong partition can unintentionally overwrite that entire partition. One way to check the incoming data is sketched below.
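One way to do that check, shown here as a sketch using the table and columns from the example above, is to look at the distinct partition values in the source before writing:

```sql
-- Each distinct combination of the columns named in the USING clause
-- corresponds to a partition that the write will overwrite.
SELECT DISTINCT event_id, start_date
FROM source_data;
```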
Dynamic partition overwrites with partitionOverwriteMode (legacy)
This feature is in Public Preview.
Databricks Runtime 11.3 LTS and above supports dynamic partition overwrites for partitioned tables using overwrite mode: either `INSERT OVERWRITE` in SQL, or a DataFrame write with `df.write.mode("overwrite")`. This type of overwrite is only available on classic compute, not Databricks SQL warehouses or serverless compute.
Configure dynamic partition overwrite mode by setting the Spark session configuration `spark.sql.sources.partitionOverwriteMode` to `dynamic`. Alternatively, you can set the `DataFrameWriter` option `partitionOverwriteMode` to `dynamic`. If present, the query-specific option overrides the mode defined in the session configuration. The default for `spark.sql.sources.partitionOverwriteMode` is `static`.
The following example demonstrates using `partitionOverwriteMode`:
SQL:

```sql
SET spark.sql.sources.partitionOverwriteMode=dynamic;
INSERT OVERWRITE TABLE default.people10m SELECT * FROM morePeople;
```

Python:

```python
(df.write
  .mode("overwrite")
  .option("partitionOverwriteMode", "dynamic")
  .saveAsTable("default.people10m")
)
```

Scala:

```scala
df.write
  .mode("overwrite")
  .option("partitionOverwriteMode", "dynamic")
  .saveAsTable("default.people10m")
```
Keep the following constraints and behaviors in mind for `partitionOverwriteMode`:
- You can't set `overwriteSchema` to `true`.
- You can't specify both `partitionOverwriteMode` and `replaceWhere` in the same `DataFrameWriter` operation.
- If you specify a `replaceWhere` condition using a `DataFrameWriter` option, Delta Lake applies that condition to control which data is overwritten. This option takes precedence over the `partitionOverwriteMode` session-level configuration.
- Always validate that the written data touches only the expected partitions; a single row in the wrong partition can unintentionally overwrite that entire partition. A pre-flight check is sketched below.
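As with `REPLACE USING`, a quick pre-flight check is to list the partition values present in the incoming data before running the dynamic overwrite. The sketch below assumes, purely for illustration, that `people10m` is partitioned by a `gender` column; substitute your table's actual partition columns:

```sql
-- gender stands in for the table's partition column(s) (hypothetical here).
-- Only the partitions whose values appear in this result are overwritten
-- when partitionOverwriteMode is dynamic.
SELECT DISTINCT gender
FROM morePeople;
```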