Configure Delta Lake to control data file size
Delta Lake provides options for manually or automatically configuring the target file size for writes and for `OPTIMIZE` operations.
Limit rows written in a data file
You can use the SQL session configuration `spark.sql.files.maxRecordsPerFile` to specify the maximum number of records to write to a single file for a Delta Lake table. A value of zero or a negative value represents no limit.
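For example, a minimal sketch that caps files written in the current session at 10,000 records each, reusing the table path from the examples below and assuming an existing DataFrame `df`:

```python
# Cap each data file written in this session at 10,000 records.
# Setting 0 or a negative value removes the limit.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 10000)

df.write.format("delta").mode("append").save("/tmp/delta/people10m")
```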
In Databricks Runtime 10.5 and above, you can also use the DataFrameWriter option `maxRecordsPerFile` when using the DataFrame APIs to write to a Delta Lake table. When `maxRecordsPerFile` is specified, the value of the SQL session configuration `spark.sql.files.maxRecordsPerFile` is ignored.
Python

```python
(df.write
  .mode("append")
  .option("maxRecordsPerFile", "10000")
  .save("/tmp/delta/people10m")
)
```

Scala

```scala
df.write
  .mode("append")
  .option("maxRecordsPerFile", "10000")
  .save("/tmp/delta/people10m")
```
Set a target file size
Note
Available in Databricks Runtime 8.2 and above.
If you want to tune the size of files in your Delta table, set the table property `delta.targetFileSize` to the desired size. If this property is set, all data layout optimization operations make a best-effort attempt to generate files of the specified size. Examples include `OPTIMIZE` (with or without Z-ordering), auto compaction, and optimized writes.
| Table property | Description |
|---|---|
| `delta.targetFileSize` | Type: Size in bytes or higher units. The target file size. For example, `104857600` (bytes) or `100mb`. Default value: None |
For existing tables, you can set and unset properties using the SQL command ALTER TABLE ... SET TBLPROPERTIES. You can also set these properties automatically when creating new tables using Spark session configurations. See Delta table properties reference for details.
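For example, a minimal sketch that pins the target file size on an existing table, issued through `spark.sql` and reusing the table path from the examples above:

```python
# Set a fixed target file size on an existing Delta table.
# Subsequent data layout optimization operations aim for this size.
spark.sql("""
    ALTER TABLE delta.`/tmp/delta/people10m`
    SET TBLPROPERTIES ('delta.targetFileSize' = '100mb')
""")
```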
Autotune file size based on workload
Note
Available in Databricks Runtime 8.2 and above.
To minimize the need for manual tuning, Databricks can automatically tune the file size of Delta tables based on workloads operating on the table. Databricks can automatically detect that a Delta table has frequent `MERGE` operations that rewrite files and may choose to reduce the size of rewritten files in anticipation of further file rewrites in the future. For example, when executing a `MERGE` operation, if 9 out of the last 10 operations on the table were also `MERGE` operations, then the optimized writes and auto compaction used by `MERGE` (if enabled) generate smaller files than they would otherwise. This helps reduce the duration of future `MERGE` operations.
Autotune is activated after a few rewrite operations have occurred. However, if you anticipate that a Delta table will experience frequent `MERGE`, `UPDATE`, or `DELETE` operations and want this tuning immediately, you can explicitly tune file sizes for rewrites by setting the table property `delta.tuneFileSizesForRewrites`. Set this property to `true` to always use lower file sizes for all data layout optimization operations on the table. Set it to `false` to never tune to lower file sizes; that is, auto-detection is never activated.
Table property |
---|
delta.tuneFileSizesForRewrites Type: Whether to tune file sizes for data layout optimization. Default value: None |
For existing tables, you can set and unset properties using the SQL command ALTER TABLE ... SET TBLPROPERTIES. You can also set these properties automatically when creating new tables using Spark session configurations. See Delta table properties reference for details.
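If you want new tables to opt into this tuning by default, a minimal sketch using the session-configuration mechanism mentioned above; it assumes the `spark.databricks.delta.properties.defaults.*` prefix described in the Delta table properties reference:

```python
# Assumption: table property defaults for newly created Delta tables can be
# set via spark.databricks.delta.properties.defaults.<property>.
spark.conf.set(
    "spark.databricks.delta.properties.defaults.tuneFileSizesForRewrites",
    "true",
)
```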
Autotune file size based on table size
Note
Available in Databricks Runtime 8.4 and above.
To minimize the need for manual tuning, Databricks automatically tunes the file size of Delta tables based on the size of the table. Databricks will use smaller file sizes for smaller tables and larger file sizes for larger tables so that the number of files in the table does not grow too large. Databricks does not autotune tables that you have tuned with a specific target size or based on a workload with frequent rewrites.
The target file size is based on the current size of the Delta table. For tables smaller than 2.56 TB, the autotuned target file size is 256 MB. For tables with a size between 2.56 TB and 10 TB, the target size will grow linearly from 256 MB to 1 GB. For tables larger than 10 TB, the target file size is 1 GB.
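Interpreted as a straight-line interpolation between those endpoints, the rule looks roughly like the sketch below; the sizes published in the table that follows are rounded, so expect small deviations. The function name and units are illustrative, not an actual API:

```python
def autotuned_target_file_size_mb(table_size_tb: float) -> float:
    """Sketch of the documented rule: 256 MB below 2.56 TB, 1 GB above 10 TB,
    and linear growth in between. Published values are rounded approximations."""
    if table_size_tb <= 2.56:
        return 256.0
    if table_size_tb >= 10.0:
        return 1024.0
    # Grow linearly from 256 MB at 2.56 TB to 1024 MB (1 GB) at 10 TB.
    fraction = (table_size_tb - 2.56) / (10.0 - 2.56)
    return 256.0 + fraction * (1024.0 - 256.0)
```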
Note
When the target file size for a table grows, existing files are not re-optimized into larger files by the `OPTIMIZE` command. A large table can therefore always have some files that are smaller than the target size. If you need to optimize those smaller files into larger files as well, you can configure a fixed target file size for the table using the `delta.targetFileSize` table property.
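A minimal sketch of that workflow, again issued through `spark.sql` with the table path from the earlier examples:

```python
# Pin a fixed target size, then re-run OPTIMIZE so that existing small
# files are compacted toward it.
spark.sql("""
    ALTER TABLE delta.`/tmp/delta/people10m`
    SET TBLPROPERTIES ('delta.targetFileSize' = '1gb')
""")
spark.sql("OPTIMIZE delta.`/tmp/delta/people10m`")
```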
When a table is written incrementally, the target file sizes and file counts will be close to the following numbers, based on table size. The file counts in this table are only an example. The actual results will be different depending on many factors.
| Table size | Target file size | Approximate number of files in table |
|---|---|---|
| 10 GB | 256 MB | 40 |
| 1 TB | 256 MB | 4096 |
| 2.56 TB | 256 MB | 10240 |
| 3 TB | 307 MB | 12108 |
| 5 TB | 512 MB | 17339 |
| 7 TB | 716 MB | 20784 |
| 10 TB | 1 GB | 24437 |
| 20 TB | 1 GB | 34437 |
| 50 TB | 1 GB | 64437 |
| 100 TB | 1 GB | 114437 |