Delta Lake schema validation

Delta Lake automatically validates that the schema of the DataFrame being written is compatible with the schema of the table. Delta Lake uses the following rules to determine whether a write from a DataFrame to a table is compatible (a sketch after this list illustrates a failing write):

  • All DataFrame columns must exist in the target table. If there are columns in the DataFrame not present in the table, an exception is raised. Columns present in the table but not in the DataFrame are set to null.

  • DataFrame column data types must match the column data types in the target table. If they don’t match, an exception is raised.

  • DataFrame column names cannot differ only by case. This means that you cannot have columns such as “Foo” and “foo” defined in the same table. While Spark can be run in case-sensitive or case-insensitive (the default) mode, Parquet is case sensitive when storing and returning column information. Delta Lake is case-preserving but case-insensitive when storing the schema, and it has this restriction to avoid potential mistakes, data corruption, or data loss.
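
The following is a minimal PySpark sketch of the first rule; the session configuration, the path /tmp/delta/events, and the column names are illustrative assumptions, not part of the rules above:

    from pyspark.sql import SparkSession

    # Hypothetical session; assumes the delta-spark package is installed.
    spark = (
        SparkSession.builder.appName("delta-schema-validation")
        .config("spark.sql.extensions",
                "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    # Create a two-column Delta table at an illustrative path.
    spark.createDataFrame([(1, "click")], ["id", "event"]) \
        .write.format("delta").save("/tmp/delta/events")

    # The DataFrame below has a column (`ts`) that the table lacks, so
    # the append fails schema validation instead of writing mismatched
    # data.
    extra = spark.createDataFrame(
        [(2, "view", "2024-01-01")], ["id", "event", "ts"]
    )
    try:
        extra.write.format("delta").mode("append").save("/tmp/delta/events")
    except Exception as e:
        print(type(e).__name__)  # AnalysisException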

Delta Lake supports DDL to add new columns explicitly, as well as the ability to update the schema automatically.
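
As a sketch of both paths, reusing the session and illustrative table from above: explicit DDL through ALTER TABLE, and opt-in automatic schema update through the mergeSchema write option.

    # Explicit DDL: add a new column to the table's schema.
    spark.sql("ALTER TABLE delta.`/tmp/delta/events` ADD COLUMNS (ts STRING)")

    # Automatic schema update: opt in per write with mergeSchema so the
    # append may introduce columns the table does not yet have (here,
    # the illustrative column `source`).
    rows = spark.createDataFrame(
        [(3, "view", "2024-01-02", "web")], ["id", "event", "ts", "source"]
    )
    rows.write.format("delta") \
        .mode("append") \
        .option("mergeSchema", "true") \
        .save("/tmp/delta/events")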

If you specify other options, such as partitionBy, in combination with append mode, Delta Lake validates that they match the existing table and throws an error for any mismatch. When partitionBy is not present, appends automatically follow the partitioning of the existing data.
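
For example, a minimal sketch with a hypothetical partitioned table at /tmp/delta/users, reusing the session from above:

    # Create a table partitioned by country.
    users = spark.createDataFrame([(1, "us"), (2, "de")], ["id", "country"])
    users.write.format("delta").partitionBy("country").save("/tmp/delta/users")

    more = spark.createDataFrame([(3, "fr")], ["id", "country"])

    # Appending with a conflicting partitionBy fails validation, because
    # the declared partitioning must match the existing table.
    try:
        more.write.format("delta").mode("append") \
            .partitionBy("id").save("/tmp/delta/users")
    except Exception as e:
        print(type(e).__name__)  # AnalysisException

    # Omitting partitionBy succeeds; the append follows the existing
    # partitioning by country.
    more.write.format("delta").mode("append").save("/tmp/delta/users")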

Note

INSERT syntax provides schema enforcement and supports schema evolution. If a column’s data type cannot be safely cast to the data type in your Delta Lake table, a runtime exception is thrown. If schema evolution is enabled, new columns must be the last columns of your schema (or of a nested column) for the schema to evolve.
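
A minimal sketch of the cast behavior, reusing the session from the first sketch with an illustrative table at /tmp/delta/ints:

    # An illustrative table with a narrow integer column.
    spark.sql("CREATE TABLE delta.`/tmp/delta/ints` (n INT) USING DELTA")

    # The literal 1 casts safely to INT, so this succeeds.
    spark.sql("INSERT INTO delta.`/tmp/delta/ints` VALUES (1)")

    # 3000000000 overflows INT; with Spark 3's default ANSI store
    # assignment policy, the unsafe cast throws at runtime instead of
    # silently writing a corrupted value.
    try:
        spark.sql("INSERT INTO delta.`/tmp/delta/ints` VALUES (3000000000)")
    except Exception as e:
        print(type(e).__name__)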

Schema validation for Delta Lake merge

merge automatically validates that the schema of the data generated by insert and update expressions is compatible with the schema of the table. It uses the following rules to determine whether the merge operation is compatible (a sketch follows the list):

  • For update and insert actions, the specified target columns must exist in the target Delta table.

  • For updateAll and insertAll actions, the source dataset must have all the columns of the target Delta table. The source dataset can have extra columns and they are ignored.

  • For all actions, if the data types generated by the expressions producing the target columns differ from those of the corresponding columns in the target Delta table, merge tries to cast them to the types in the table.
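
The following is a minimal sketch of these rules using the Python DeltaTable API, reusing the session and table from the earlier sketches; the extra column name `debug` is an illustrative assumption:

    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, "/tmp/delta/events")

    # The source carries every target column plus an extra `debug`
    # column. With updateAll/insertAll the extra column is ignored;
    # a missing target column would instead fail validation.
    source = spark.createDataFrame(
        [(1, "click", "2024-03-01", "web", "x")],
        ["id", "event", "ts", "source", "debug"],
    )

    (target.alias("t")
        .merge(source.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())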

See Automatic schema evolution for Delta Lake merge.