The Delta cache accelerates data reads by creating copies of remote files in nodes’ local storage using a fast intermediate data format. The data is cached automatically whenever a file has to be fetched from a remote location. Successive reads of the same data are then performed locally, which results in significantly improved reading speed.
The Delta cache works for all Parquet files and is not limited to Delta Lake format files. The Delta cache supports reading Parquet files in Amazon S3, DBFS, HDFS, Azure Blob storage, Azure Data Lake Storage Gen1, and Azure Data Lake Storage Gen2. It does not support other file formats such as CSV, JSON, and ORC.
There are two types of caching available in Databricks: Delta caching and Spark caching. Here are the characteristics of each type:
- Type of stored data: The Delta cache contains local copies of remote data. It can improve the performance of a wide range of queries, but cannot be used to store results of arbitrary subqueries. The Spark cache can store the result of any subquery data and data stored in formats other than Parquet (such as CSV, JSON, and ORC).
- Performance: The data stored in the Delta cache can be read and operated on faster than the data in the Spark cache. This is because the Delta cache uses efficient decompression algorithms and outputs data in the optimal format for further processing using whole-stage code generation.
- Automatic vs manual control: When the Delta cache is enabled, data that has to be fetched from a remote source is automatically added to the cache. This process is fully transparent and does not require any action. However, to preload data into the cache beforehand, you can use the `CACHE SELECT` command (see Cache a subset of the data). When you use the Spark cache, you must manually specify the tables and queries to cache.
- Disk vs memory-based: The Delta cache is stored on the local disk, so that memory is not taken away from other operations within Spark. Due to the high read speeds of modern SSDs, the Delta cache can be fully disk-resident without a negative impact on its performance. In contrast, the Spark cache uses memory.
You can use Delta caching and Apache Spark caching at the same time.
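The way the two layers interact can be illustrated with a small, purely conceptual Python sketch (this is not how Databricks implements either cache; the class and names are invented for illustration): the disk-backed, Delta-style layer is populated automatically on every remote fetch, while the memory-backed, Spark-style layer only stores results you explicitly ask it to keep.

```python
class TwoTierReader:
    """Conceptual sketch: automatic local-disk cache plus a manual in-memory cache."""

    def __init__(self, remote):
        self.remote = remote          # simulates remote storage (e.g. S3)
        self.disk_cache = {}          # simulates the Delta cache on local SSDs
        self.memory_cache = {}        # simulates the Spark cache
        self.remote_reads = 0

    def read(self, path):
        # Delta-style behavior: the cache is consulted and populated automatically.
        if path not in self.disk_cache:
            self.remote_reads += 1
            self.disk_cache[path] = self.remote[path]  # cached on first fetch
        return self.disk_cache[path]

    def cache_result(self, key, value):
        # Spark-style behavior: results are stored only on explicit request.
        self.memory_cache[key] = value


reader = TwoTierReader(remote={"s3://bucket/part-0.parquet": b"data"})
reader.read("s3://bucket/part-0.parquet")   # fetched remotely, cached on local disk
reader.read("s3://bucket/part-0.parquet")   # served from the local disk cache
print(reader.remote_reads)                  # 1
```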
The following table summarizes the key differences between Delta and Apache Spark caching so that you can choose the best tool for your workflow:
|Feature|Delta cache|Apache Spark cache|
|---|---|---|
|Stored as|Local files on a worker node.|In-memory blocks, but it depends on the storage level.|
|Applied to|Any Parquet table stored on S3, WASB, and other file systems.|Any DataFrame or RDD.|
|Triggered|Automatically, on the first read (if the cache is enabled).|Manually, requires code changes.|
|Availability|Can be enabled or disabled with configuration flags; disabled on certain node types.|Always available.|
|Evicted|Automatically on any file change; manually when restarting a cluster.|Automatically in LRU fashion; manually with `unpersist`.|
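The LRU eviction that the Spark cache uses can be sketched in a few lines of Python (a generic illustration of the policy, not Spark's actual implementation):

```python
from collections import OrderedDict


class LRUCache:
    """Minimal LRU cache: when full, the least recently used entry is evicted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)         # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry


cache = LRUCache(capacity=2)
cache.put("block-1", "a")
cache.put("block-2", "b")
cache.get("block-1")          # touch block-1, so block-2 becomes least recently used
cache.put("block-3", "c")     # evicts block-2
print(cache.get("block-2"))   # None
```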
The Delta cache automatically detects when data files are created or deleted and updates its content accordingly. You can write, modify, and delete table data with no need to explicitly invalidate cached data.
The Delta cache automatically detects files that have been modified or overwritten after being cached. Any stale entries are automatically invalidated and evicted from the cache.
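One simple way to picture this invalidation behavior is a cache keyed by both file path and modification time, so that any overwrite produces a cache miss and the stale entry is dropped. The Python sketch below is a simplification invented for illustration, not the actual detection mechanism:

```python
import os
import tempfile


class FileCache:
    """Sketch: entries keyed by (path, mtime), so a modified file is re-read."""

    def __init__(self):
        self.entries = {}

    def read(self, path):
        key = (path, os.stat(path).st_mtime_ns)   # mtime change => cache miss
        if key not in self.entries:
            with open(path, "rb") as f:
                data = f.read()
            # Evict any stale entries for the same path, then cache the fresh data.
            self.entries = {k: v for k, v in self.entries.items() if k[0] != path}
            self.entries[key] = data
        return self.entries[key]


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "part-0.parquet")
    with open(path, "wb") as f:
        f.write(b"v1")
    cache = FileCache()
    assert cache.read(path) == b"v1"
    with open(path, "wb") as f:
        f.write(b"v2")                 # overwrite the file
    os.utime(path, ns=(1, 2))          # force a distinct modification time
    assert cache.read(path) == b"v2"   # stale entry invalidated automatically
```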
The recommended (and easiest) way to use Delta caching is to choose a Delta Cache Accelerated worker type (the i3en series) when you configure your cluster. Such workers are enabled and configured for Delta caching.
c5d, r5d, and z1d series workers are configured for Delta caching, but caching is not enabled on them by default. To enable it, see Enable or disable the Delta cache.
The Delta cache is configured to use at most half of the space available on the local SSDs provided with the worker nodes. For configuration options, see Configure the Delta cache.
To explicitly select a subset of data to be cached, use the following syntax:
```sql
CACHE SELECT column_name[, column_name, ...] FROM [db_name.]table_name [ WHERE boolean_expression ]
```
You don’t need to use this command for the Delta cache to work correctly (the data will be cached automatically when first accessed). But it can be helpful when you require consistent query performance.
For examples and more details, see the documentation for the `CACHE SELECT` command.
You can check the current state of the Delta cache for each of the executors in the Storage tab of the Spark UI.
The first table summarizes the following metrics for each of the active executor nodes:
- Disk Usage: The total size used by the Delta cache manager for storing Parquet data pages.
- Max Disk Usage Limit: The maximum size of the disk that can be allocated to the Delta cache manager for storing Parquet data pages.
- Percent Disk Usage: The fraction of disk space used by the Delta cache manager out of the maximum size that can be allocated for storing Parquet data pages. When a node reaches 100% disk usage, the cache manager discards the least recently used cache entries to make space for new data.
- Metadata Cache Size: The total size used for caching Parquet metadata (file footers).
- Max Metadata Cache Size Limit: The maximum size of the disk that can be allocated to the Delta cache manager for caching Parquet metadata (file footers).
- Percent Metadata Usage: The fraction of disk space used by the Delta cache manager out of the maximum size that can be allocated for Parquet metadata (file footers).
- Data Read from IO Cache (Cache Hits): The total size of Parquet data read from the IO cache for this node.
- Data Written to IO Cache (Cache Misses): The total size of Parquet data not found in and consequently written to the IO cache for this node.
- Cache Hit Ratio: The fraction of Parquet data read from the IO cache out of all Parquet data read for this node.
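The percentages and ratios in this table are straightforward ratios of the metrics above them. A short Python sketch, with made-up byte counts, shows how the numbers relate:

```python
def percent(used_bytes, limit_bytes):
    """Fraction of the configured limit currently in use, as a percentage."""
    return 100.0 * used_bytes / limit_bytes


def cache_hit_ratio(cache_hits_bytes, cache_misses_bytes):
    """Fraction of all Parquet reads that were served from the IO cache."""
    return cache_hits_bytes / (cache_hits_bytes + cache_misses_bytes)


# Hypothetical values for one executor node:
disk_usage = 40 * 2**30        # 40 GiB of cached Parquet data pages
max_disk_usage = 50 * 2**30    # 50 GiB limit
print(percent(disk_usage, max_disk_usage))   # 80.0

hits = 30 * 2**30              # bytes read from the IO cache
misses = 10 * 2**30            # bytes fetched remotely, then written to the cache
print(cache_hit_ratio(hits, misses))         # 0.75
```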
The second table summarizes the following metrics for all nodes across the cluster runtime, including nodes not currently active:
- Data Read from External Filesystem (All Formats): The total size of data in any format read from an external filesystem, that is, not from the IO cache.
- Data Read from IO Cache (Cache Hits): The total size of Parquet data read from the IO cache across the cluster runtime.
- Data Written to IO Cache (Cache Misses): The total size of Parquet data not found in and consequently written to the IO cache across the cluster runtime.
- Cache Hit Ratio: The fraction of total Parquet data read from IO cache out of all Parquet data read across the cluster runtime.
- Estimated Size of Repeatedly Read Data: The approximate size of data read two or more times. This column is displayed only when the corresponding configuration flag is enabled.
- Cache Metadata Manager Peak Disk Usage: The peak total size used by the Delta cache manager to run the IO cache.
Databricks recommends that you choose cache-accelerated worker instance types for your clusters. Such instances are automatically configured optimally for the Delta cache.
To configure how the Delta cache uses the worker nodes’ local storage, specify the following Spark configuration settings during cluster creation:
- `spark.databricks.io.cache.maxDiskUsage`: disk space per node reserved for cached data, in bytes
- `spark.databricks.io.cache.maxMetaDataCache`: disk space per node reserved for cached metadata, in bytes
- `spark.databricks.io.cache.compression.enabled`: whether the cached data should be stored in compressed format
For example:
```
spark.databricks.io.cache.maxDiskUsage 50g
spark.databricks.io.cache.maxMetaDataCache 1g
spark.databricks.io.cache.compression.enabled false
```
To enable and disable the Delta cache, run:
```python
spark.conf.set("spark.databricks.io.cache.enabled", "[true | false]")
```
Disabling the cache does not result in dropping the data that is already in the local storage. Instead, it prevents queries from adding new data to the cache and reading data from the cache.