Selectively overwrite data with Delta Lake

Databricks leverages Delta Lake functionality to support two distinct options for selective overwrites:

  • The replaceWhere option atomically replaces all records that match a given predicate.
  • Dynamic partition overwrites replace directories of data based on how the table is partitioned.

For most operations, Databricks recommends using replaceWhere to specify which data to overwrite.

important

If data has been accidentally overwritten, you can use restore to undo the change.
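
For example, the following minimal sketch rolls a table back to an earlier version; the table name and version number are placeholders, and you can use DESCRIBE HISTORY to find the version to restore:

Python
# Sketch: undo an accidental overwrite by restoring the table to an earlier version.
# The table name and version number below are placeholders.
spark.sql("RESTORE TABLE events TO VERSION AS OF 1")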

Arbitrary selective overwrite with replaceWhere

You can selectively overwrite only the data that matches an arbitrary expression.

note

SQL requires Databricks Runtime 12.2 LTS or above.

The following command atomically replaces events in January in the target table, which is partitioned by start_date, with the data in replace_data:

Python
(replace_data.write
  .mode("overwrite")
  .option("replaceWhere", "start_date >= '2017-01-01' AND end_date <= '2017-01-31'")
  .table("events")
)

This sample code writes out the data in replace_data, validates that all rows match the predicate, and performs an atomic replacement using overwrite semantics. If any rows in replace_data fall outside the specified predicate, the operation fails with an error by default.

You can change this behavior to overwrite values within the predicate range and insert records that fall outside the specified range. To do so, disable the constraint check by setting spark.databricks.delta.replaceWhere.constraintCheck.enabled to false:

Python
spark.conf.set("spark.databricks.delta.replaceWhere.constraintCheck.enabled", False)

Legacy behavior

Legacy default behavior had replaceWhere overwrite data matching a predicate over partition columns only. With this legacy model, the following command would atomically replace the month of January in the target table people10m, which is partitioned by birthDate, with the data in df:

Python
(df.write
  .mode("overwrite")
  .option("replaceWhere", "birthDate >= '2017-01-01' AND birthDate <= '2017-01-31'")
  .table("people10m")
)

If you want to fall back to the old behavior, you can disable the spark.databricks.delta.replaceWhere.dataColumns.enabled flag:

Python
spark.conf.set("spark.databricks.delta.replaceWhere.dataColumns.enabled", False)

Dynamic partition overwrites

Dynamic partition overwrite updates only the partitions for which the write commits new data. It overwrites all existing data in those partitions and leaves others unchanged.

Databricks supports two approaches:

  • REPLACE USING (recommended) - Works across all compute types, including Databricks SQL warehouses, serverless compute, and classic compute. Doesn't require setting a Spark session configuration.
  • partitionOverwriteMode (legacy) - Requires classic compute and setting a Spark session configuration. Not supported on Databricks SQL or serverless compute.

The sections below demonstrate how to use each approach.

Dynamic partition overwrites with REPLACE USING

Preview

This feature is in Public Preview.

Databricks Runtime 16.3 and above supports dynamic partition overwrites for partitioned tables using REPLACE USING. This method lets you selectively overwrite data across all compute types, without needing to set a Spark session configuration. REPLACE USING enables compute-independent, atomic overwrite behavior that works on Databricks SQL warehouses, serverless compute, and classic compute.

REPLACE USING only overwrites the partitions targeted by the incoming data. All other partitions remain unchanged.

The following example demonstrates how to use dynamic partition overwrite with REPLACE USING. Currently, you can only use SQL, not Python or Scala. For details, see INSERT in the SQL language reference.

SQL
INSERT INTO TABLE events
REPLACE USING (event_id, start_date)
SELECT * FROM source_data

Keep the following constraints and behaviors in mind for dynamic partition overwrites:

  • You must specify the full set of the table's partition columns in the USING clause.
  • Always validate that the data written touches only the expected partitions, as shown in the sketch after this list. A single row in the wrong partition can unintentionally overwrite the entire partition.
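
Before running the INSERT, you can inspect which partition values the incoming data contains. The following is a minimal sketch, assuming the source_data table and start_date partition column from the example above:

Python
# Sketch: list the distinct partition values in the incoming data so you can
# confirm that only the expected partitions will be overwritten.
spark.table("source_data").select("start_date").distinct().show()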

Dynamic partition overwrites with partitionOverwriteMode (legacy)

Preview

This feature is in Public Preview.

Databricks Runtime 11.3 LTS and above supports dynamic partition overwrites for partitioned tables using overwrite mode: either INSERT OVERWRITE in SQL, or a DataFrame write with df.write.mode("overwrite"). This type of overwrite is only available for classic compute, not Databricks SQL warehouses or serverless compute.

Configure dynamic partition overwrite mode by setting the Spark session configuration spark.sql.sources.partitionOverwriteMode to dynamic. Alternatively, you can set the DataFrameWriter option partitionOverwriteMode to dynamic. If present, the query-specific option overrides the mode defined in the session configuration. The default for spark.sql.sources.partitionOverwriteMode is static.

The following example demonstrates using partitionOverwriteMode:

SQL
SET spark.sql.sources.partitionOverwriteMode=dynamic;
INSERT OVERWRITE TABLE default.people10m SELECT * FROM morePeople;
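
The per-query alternative sets the partitionOverwriteMode option on the DataFrameWriter instead of the session configuration. The following is a minimal sketch, assuming more_people is a DataFrame containing the replacement rows:

Python
# Sketch: per-query dynamic partition overwrite using the DataFrameWriter option.
# more_people is a placeholder DataFrame; only the partitions it writes to are
# overwritten, and all other partitions are left unchanged.
(more_people.write
  .mode("overwrite")
  .option("partitionOverwriteMode", "dynamic")
  .saveAsTable("default.people10m")
)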

Keep the following constraints and behaviors in mind for partitionOverwriteMode:

  • You can't set overwriteSchema to true.
  • You can't specify both partitionOverwriteMode and replaceWhere in the same DataFrameWriter operation.
  • If you specify a replaceWhere condition using a DataFrameWriter option, Delta Lake applies that condition to control which data is overwritten. This option takes precedence over the partitionOverwriteMode session-level configuration, as shown in the sketch after this list.
  • Always validate that the data written touches only the expected partitions. A single row in the wrong partition can unintentionally overwrite the entire partition.
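
The following minimal sketch illustrates the precedence rule above, reusing the df and people10m names from the legacy replaceWhere example. Although the session configuration requests dynamic partition overwrite, the query-level replaceWhere option controls which rows are replaced:

Python
# Sketch: the session configuration asks for dynamic partition overwrite, but the
# query-level replaceWhere option takes precedence and determines which rows are
# overwritten.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(df.write
  .mode("overwrite")
  .option("replaceWhere", "birthDate >= '2017-01-01' AND birthDate <= '2017-01-31'")
  .saveAsTable("people10m")
)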