Automatic type widening with Auto Loader

Preview

This feature is in Public Preview in Databricks Runtime 16.4 and above.

Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage. It also reduces pipeline maintenance by automatically handling complex schema changes. For example, you can configure Auto Loader to automatically detect the schema of loaded data, allowing you to initialize tables without explicitly declaring the data schema. You can also evolve the table schema as new columns are introduced, eliminating the need to manually track and apply schema changes over time. Auto Loader can even rescue data that is unexpected (for example, because of differing data types) in a rescued data column, helping you avoid data loss.

However, the rescued data column requires that you manually deal with any data type changes.

To automatically handle some of these data type changes, use type widening in Auto Loader. Delta Lake supports several data type widening changes without requiring a data rewrite or user intervention. See Delta Lake Type widening. The schema evolution mode addNewColumnsWithTypeWidening automatically evolves the schema on compatible data type changes.

You can widen primitive types like int to long, float to double, and more. Type widening is available for all file formats with schema evolution support in Auto Loader. This includes text formats (like JSON, CSV, or XML) and binary formats (like Avro or Parquet). There is no change in schema evolution behavior for existing schema evolution modes (such as addNewColumns, rescue, failOnNewColumns, or none).

Supported type changes

The following type changes are supported:

| Source type | Supported wider types |
| --- | --- |
| byte | short, int, long, decimal, double |
| short | int, long, decimal, double |
| int | long, decimal, double |
| long | decimal |
| float | double |
| decimal | decimal with greater precision and scale |
| date | timestampNTZ (only for Parquet files) |

When widening any numeric type to decimal, Auto Loader widens to decimal with precision equal to or greater than the starting precision. If you increase the scale, the total precision increases by a corresponding amount.

The starting precision of integer types is the following:

| Type | Starting precision |
| --- | --- |
| byte | 10 |
| short | 10 |
| int | 10 |
| long | 20 |

For example, if a column currently has type int (starting precision 10) and a file is read in which that column has type decimal(5, 2), Auto Loader widens the column to decimal(12, 2).
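The precision rule above can be sketched in plain Python. The helper name `widened_decimal` and the `STARTING_PRECISION` table are illustrative (taken from the table above), not an Auto Loader API:

```python
# Illustrative sketch of the decimal-widening precision rule described
# above; widened_decimal is a hypothetical helper, not an Auto Loader API.

STARTING_PRECISION = {"byte": 10, "short": 10, "int": 10, "long": 20}

def widened_decimal(source_type, incoming_precision, incoming_scale):
    """Return the (precision, scale) an integer column is widened to
    when it meets a decimal value."""
    start = STARTING_PRECISION[source_type]
    # The result must cover the starting precision plus the new scale,
    # and at least the incoming decimal's own precision.
    precision = max(start + incoming_scale, incoming_precision)
    return precision, incoming_scale

print(widened_decimal("int", 5, 2))  # int meeting decimal(5, 2) -> (12, 2)
```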

Prerequisites

To use type widening with Auto Loader, you must meet the following requirements:

  • Use Databricks Runtime 16.4 or above.
  • If the write sink is a Delta Lake table, enable type widening for the Delta Lake table using one of the following methods:
    • If using an existing table:

      ```sql
      ALTER TABLE <table_name> SET TBLPROPERTIES ('delta.enableTypeWidening' = 'true')
      ```
    • If creating a new table with type widening enabled:

      ```sql
      CREATE TABLE T(c1 INT) TBLPROPERTIES('delta.enableTypeWidening' = 'true')
      ```

For more information about type widening in Delta Lake tables, see Type widening.

Enable type widening with schema evolution

To use type widening with Auto Loader, specify addNewColumnsWithTypeWidening when using schema evolution. Auto Loader detects the addition of new columns and type changes as it processes your data.

```python
query = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.inferColumnTypes", True)
    .option("cloudFiles.schemaLocation", <schemaPath>)
    .option("cloudFiles.schemaEvolutionMode", "addNewColumnsWithTypeWidening")
    .load(<inputPath>)
    .writeStream
    .option("mergeSchema", "true")
    .option("checkpointLocation", <checkpointPath>)
    .trigger(availableNow=True)
    .toTable("table_name")
)
```

When Auto Loader detects a new column or a type change that's supported by type widening, the stream stops with an UnknownFieldException. Before your stream throws this error, Auto Loader performs schema inference on the latest micro-batch of data and updates the schema location with the latest schema by widening the existing columns or merging new columns to the end of the schema.
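Because the stream stops rather than evolving in place, a common pattern is to restart it in a loop so the restarted stream picks up the updated schema from the schema location. The following is a minimal sketch, assuming a `start_stream()` function that builds and starts the pipeline shown above; the retry wrapper and the message-text matching are illustrative, not an Auto Loader API:

```python
# Minimal restart-loop sketch. start_stream is a placeholder for a function
# that builds and starts the Auto Loader query shown above; matching on the
# exception message text is illustrative.

def run_with_schema_retries(start_stream, max_retries=3):
    for attempt in range(max_retries + 1):
        query = start_stream()
        try:
            query.awaitTermination()
            return  # stream finished (for example, trigger availableNow)
        except Exception as exc:
            # On UnknownFieldException the schema location already holds the
            # widened/merged schema, so a restart picks it up.
            if "UnknownFieldException" in str(exc) and attempt < max_retries:
                continue
            raise
```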

Schema evolution behavior on data type changes

If you ingest a CSV file with the following content, Auto Loader infers the schema as STRUCT<id INT, name STRING, _rescued_data STRING>:

```csv
id, name
1, John
2, Mary
```

The target table looks as follows:

| id | name | _rescued_data |
| --- | --- | --- |
| 1 | John | NULL |
| 2 | Mary | NULL |

Now, ingest another CSV file where the values in the id column are wider than INT type:

```csv
id, name, age
2147483648, Bob, 25
```
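The id value in this file is one past the 32-bit signed integer maximum, which is why an INT column can no longer hold it:

```python
# 2147483648 does not fit in a 32-bit signed INT, so the id column must
# either widen to BIGINT (long) or have the value rescued, depending on
# the schema evolution mode.
INT_MAX = 2**31 - 1        # 2147483647
BIGINT_MAX = 2**63 - 1

value = 2147483648
print(value > INT_MAX)      # True: too wide for INT
print(value <= BIGINT_MAX)  # True: fits in BIGINT
```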

The following table explains the behavior and output with different schema evolution modes in Auto Loader:

| Mode | Behavior with supported widenable data type change |
| --- | --- |
| addNewColumns (default) | Data type does not evolve, and the stream does not fail due to the data type change. Columns with type-mismatched values are set to NULL, and the mismatched values are added to the rescued data column. The stream fails on new columns. |
| rescue | Schema does not evolve, and the stream does not fail due to any schema changes. Columns with type-mismatched values are set to NULL, and the mismatched values are added to the rescued data column. |
| failOnNewColumns | Data type does not evolve, and the stream does not fail due to the data type change. Columns with type-mismatched values are set to NULL, and the mismatched values are added to the rescued data column. The stream fails on new columns without evolving the schema. |
| none | Does not evolve the schema; new columns are ignored, and data is not rescued unless the rescuedDataColumn option is set. The stream does not fail due to schema changes. |
| addNewColumnsWithTypeWidening | The stream fails. New columns are added to the schema, and supported data type changes are widened. Unsupported data type changes (for example, int to string) are added to the rescued data column. |

Example results

The following shows the inferred schema and values for each schema evolution mode after ingesting the second CSV file:

addNewColumns

Schema: id: INT, name: STRING, age: INT, _rescued_data: STRING

| id | name | age | _rescued_data |
| --- | --- | --- | --- |
| 1 | John | NULL | NULL |
| 2 | Mary | NULL | NULL |
| NULL | Bob | 25 | {"id": 2147483648} |

rescue

Schema: id: INT, name: STRING, _rescued_data: STRING

| id | name | _rescued_data |
| --- | --- | --- |
| 1 | John | NULL |
| 2 | Mary | NULL |
| NULL | Bob | {"age": 25, "id": 2147483648} |

failOnNewColumns

Schema: id: INT, name: STRING, _rescued_data: STRING

| id | name | _rescued_data |
| --- | --- | --- |
| 1 | John | NULL |
| 2 | Mary | NULL |
| NULL | Bob | {"id": 2147483648} |

none

Schema: id: INT, name: STRING

| id | name |
| --- | --- |
| 1 | John |
| 2 | Mary |
| NULL | Bob |

addNewColumnsWithTypeWidening

Schema: id: BIGINT, name: STRING, age: INT, _rescued_data: STRING

| id | name | age | _rescued_data |
| --- | --- | --- | --- |
| 1 | John | NULL | NULL |
| 2 | Mary | NULL | NULL |
| 2147483648 | Bob | 25 | NULL |

Limitations

  • The option prefersDecimal cannot be set to false when using addNewColumnsWithTypeWidening. When addNewColumnsWithTypeWidening is specified, the default value of prefersDecimal is true.
  • date to timestampNTZ widening is only supported for Parquet files.