Configure Delta Lake to control data file size

Delta Lake provides options for manually or automatically configuring the target file size for writes and for OPTIMIZE operations.

Limit rows written in a data file

You can use the SQL session configuration spark.sql.files.maxRecordsPerFile to specify the maximum number of records to write to a single file for a Delta Lake table. Specifying a value of zero or a negative value represents no limit.
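For example, assuming a SparkSession named spark (as in a Databricks notebook), a minimal sketch of setting this configuration before writing:

# Limit each data file to at most 10,000 records for writes in this
# session; a value of 0 or a negative value means no limit.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 10000)

# Equivalent SQL form:
spark.sql("SET spark.sql.files.maxRecordsPerFile = 10000")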

In Databricks Runtime 10.5 and above, you can also use the DataFrameWriter option maxRecordsPerFile when using the DataFrame APIs to write to a Delta Lake table. When maxRecordsPerFile is specified, the value of the SQL session configuration spark.sql.files.maxRecordsPerFile is ignored.

Python

(df.write
  .mode("append")
  .option("maxRecordsPerFile", "10000")
  .save("/tmp/delta/people10m")
)

Scala

df.write
  .mode("append")
  .option("maxRecordsPerFile", "10000")
  .save("/tmp/delta/people10m")

Set a target file size

Note

Available in Databricks Runtime 8.2 and above.

If you want to tune the size of files in your Delta table, set the table property delta.targetFileSize to the desired size. If this property is set, all data layout optimization operations make a best-effort attempt to generate files of the specified size. These operations include OPTIMIZE (with or without Z-ordering), Auto Compaction, and Optimized Writes.

Table property: delta.targetFileSize
Type: Size in bytes or higher units.
Description: The target file size. For example, 104857600 (bytes) or 100mb.
Default value: None

For existing tables, you can set and unset properties using the SQL command ALTER TABLE SET TBLPROPERTIES. You can also set these properties automatically when creating new tables using Spark session configurations. See Delta table properties reference for details.
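For example, a minimal sketch (assuming a SparkSession named spark and a placeholder table named people10m):

# Ask data layout optimization operations to aim for 100 MB files.
spark.sql("ALTER TABLE people10m SET TBLPROPERTIES ('delta.targetFileSize' = '100mb')")

# Unset the property to return to the default behavior.
spark.sql("ALTER TABLE people10m UNSET TBLPROPERTIES ('delta.targetFileSize')")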

Autotune file size based on workload

Note

Available in Databricks Runtime 8.2 and above.

To minimize the need for manual tuning, Databricks can automatically tune the file size of Delta tables based on the workloads operating on the table. Databricks can automatically detect if a Delta table has frequent MERGE operations that rewrite files, and may choose to reduce the size of rewritten files in anticipation of further file rewrites in the future. For example, when executing a MERGE operation, if 9 of the last 10 operations on the table were also MERGEs, then Optimized Writes and Auto Compaction used by MERGE (if enabled) generate smaller files than they would otherwise. This helps reduce the duration of future MERGE operations.

Autotune is activated after a few rewrite operations have occurred. However, if you anticipate that a Delta table will experience frequent MERGE, UPDATE, or DELETE operations and want this tuning immediately, you can explicitly tune file sizes for rewrites by setting the table property delta.tuneFileSizesForRewrites. Set this property to true to always use lower file sizes for all data layout optimization operations on the table. Set it to false to never tune to lower file sizes, that is, to prevent auto-detection from being activated.

Table property: delta.tuneFileSizesForRewrites
Type: Boolean
Description: Whether to tune file sizes for data layout optimization.
Default value: None

For existing tables, you can set and unset properties using the SQL command ALTER TABLE SET TBLPROPERTIES. You can also set these properties automatically when creating new tables using Spark session configurations. See Delta table properties reference for details.
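For example, a minimal sketch (assuming a SparkSession named spark and a placeholder table named people10m):

# Always use smaller rewritten files for data layout optimization on this
# table, instead of waiting for workload-based auto-detection.
spark.sql("ALTER TABLE people10m SET TBLPROPERTIES ('delta.tuneFileSizesForRewrites' = 'true')")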

Autotune file size based on table size

Note

Available in Databricks Runtime 8.4 and above.

To minimize the need for manual tuning, Databricks automatically tunes the file size of Delta tables based on the size of the table. Databricks will use smaller file sizes for smaller tables and larger file sizes for larger tables so that the number of files in the table does not grow too large. Databricks does not autotune tables that you have tuned with a specific target size or based on a workload with frequent rewrites.

The target file size is based on the current size of the Delta table. For tables smaller than 2.56 TB, the autotuned target file size is 256 MB. For tables with a size between 2.56 TB and 10 TB, the target size will grow linearly from 256 MB to 1 GB. For tables larger than 10 TB, the target file size is 1 GB.
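As an illustration only, the following sketch approximates this scaling with a clamped linear interpolation. It is not the exact algorithm Databricks uses, so the computed values can differ slightly from the table below.

def approx_target_file_size_mb(table_size_tb: float) -> float:
    # Rough approximation of the autotuned target file size (in MB) for a
    # table of the given size (in TB): 256 MB below 2.56 TB, 1 GB above
    # 10 TB, and a linear ramp in between. Not the exact Databricks logic.
    if table_size_tb <= 2.56:
        return 256.0
    if table_size_tb >= 10:
        return 1024.0
    fraction = (table_size_tb - 2.56) / (10 - 2.56)
    return 256.0 + fraction * (1024.0 - 256.0)

print(round(approx_target_file_size_mb(5)))  # ~508 MB; the table below shows 512 MB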

Note

When the target file size for a table grows, existing files are not re-optimized into larger files by the OPTIMIZE command. A large table can therefore always have some files that are smaller than the target size. If you need those smaller files to be optimized into larger files as well, you can configure a fixed target file size for the table using the delta.targetFileSize table property.
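For example, a minimal sketch (assuming a SparkSession named spark and a placeholder table named people10m) that pins the target size and then compacts the table:

# Pin the target file size so smaller files become compaction candidates,
# then rewrite them with OPTIMIZE.
spark.sql("ALTER TABLE people10m SET TBLPROPERTIES ('delta.targetFileSize' = '1gb')")
spark.sql("OPTIMIZE people10m")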

When a table is written incrementally, the target file sizes and file counts will be close to the following numbers, based on table size. The file counts in this table are only an example. The actual results will be different depending on many factors.

Table size    Target file size    Approximate number of files in table
10 GB         256 MB              40
1 TB          256 MB              4096
2.56 TB       256 MB              10240
3 TB          307 MB              12108
5 TB          512 MB              17339
7 TB          716 MB              20784
10 TB         1 GB                24437
20 TB         1 GB                34437
50 TB         1 GB                64437
100 TB        1 GB                114437