The Databricks DBIO package provides transactional writes to cloud storage for Apache Spark jobs. This solves a number of performance and correctness issues that occur when Spark is used in a cloud-native setting (for example, writing directly to storage services).
With DBIO transactional commit, metadata files starting with _committed_<id> accompany the data files created by Spark jobs. Generally you should not alter these files directly; instead, use the VACUUM command to clean them up.
To clean up uncommitted files left over from Spark jobs, use the VACUUM command. Normally VACUUM runs automatically after a Spark job completes, but you can also run it manually if a job is aborted. The RETAIN clause controls the retention horizon: for example, VACUUM ... RETAIN 1 HOUR removes uncommitted files older than one hour.
- Avoid vacuuming with a retention horizon of less than one hour. Jobs may still be writing within that window, so a shorter horizon can cause data inconsistency.
Also see Vacuum.
```sql
-- recursively vacuum an output path
VACUUM '/path/to/output/directory' [RETAIN <N> HOURS]

-- vacuum all partitions of a catalog table
VACUUM tableName [RETAIN <N> HOURS]
```
```scala
// recursively vacuum an output path
spark.sql("VACUUM '/path/to/output/directory' [RETAIN <N> HOURS]")

// vacuum all partitions of a catalog table
spark.sql("VACUUM tableName [RETAIN <N> HOURS]")
```
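To make the retention-horizon idea concrete, here is a minimal, purely illustrative Python sketch of the cleanup logic: files belonging to a transaction with no matching _committed_<id> marker are treated as uncommitted and are removed once they are older than the horizon. This is not the DBIO implementation; the _started_<id> marker naming and the directory-scan approach are assumptions made for demonstration only.

```python
import os
import re
import time

def vacuum_uncommitted(directory, retain_hours=1.0):
    """Illustrative sketch of a VACUUM-style cleanup (not the real DBIO code).

    Removes leftover _started_<id> marker files whose transaction never
    produced a _committed_<id> marker, but only if they are older than the
    retention horizon (default: one hour, matching RETAIN 1 HOUR).
    """
    horizon = time.time() - retain_hours * 3600
    names = os.listdir(directory)

    # Transactions that wrote a _committed_<id> marker are considered committed.
    committed = {m.group(1) for n in names
                 if (m := re.match(r"_committed_(\w+)$", n))}

    removed = []
    for n in names:
        m = re.match(r"_started_(\w+)$", n)
        if m and m.group(1) not in committed:
            path = os.path.join(directory, n)
            # Respect the retention horizon: never touch recent files,
            # since the job that wrote them may still be running.
            if os.path.getmtime(path) < horizon:
                os.remove(path)
                removed.append(n)
    return removed
```

The horizon check is why a retention window shorter than one hour is risky: a still-running job looks identical to an aborted one until its _committed_<id> marker appears.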