Optimize performance with caching on Databricks

Databricks uses disk caching to accelerate data reads by creating copies of remote Parquet data files in nodes’ local storage using a fast intermediate data format. The data is cached automatically whenever a file has to be fetched from a remote location. Successive reads of the same data are then performed locally, which results in significantly improved reading speed. The cache works for all Parquet data files (including Delta Lake tables).

Delta cache renamed to disk cache

Disk caching on Databricks was formerly referred to as the Delta cache and the DBIO cache. Disk caching behavior is a proprietary Databricks feature. This name change seeks to resolve confusion that it was part of the Delta Lake protocol. Behavior remains unchanged with this rename, and is described below.

Automatic and manual caching

The Databricks disk cache differs from Apache Spark caching. Databricks recommends using automatic disk caching for most operations.

When the disk cache is enabled, data that has to be fetched from a remote source is automatically added to the cache. This process is fully transparent and does not require any action. However, to preload data into the cache beforehand, you can use the CACHE SELECT command (see Cache a subset of the data). When you use the Spark cache, you must manually specify the tables and queries to cache.

The disk cache contains local copies of remote data. It can improve the performance of a wide range of queries, but cannot be used to store results of arbitrary subqueries. The Spark cache can store the result of any subquery, as well as data stored in formats other than Parquet (such as CSV, JSON, and ORC).

The data stored in the disk cache can be read and operated on faster than the data in the Spark cache. This is because the disk cache uses efficient decompression algorithms and outputs data in the optimal format for further processing using whole-stage code generation.

Unlike the Spark cache, disk caching does not use system memory. Due to the high read speeds of modern SSDs, the disk cache can be fully disk-resident without a negative impact on its performance.

Summary

The following table summarizes the key differences between disk and Apache Spark caching so that you can choose the best tool for your workflow:

| Feature | Disk cache | Apache Spark cache |
| --- | --- | --- |
| Stored as | Local files on a worker node. | In-memory blocks, but it depends on storage level. |
| Applied to | Any Parquet table stored on S3, ABFS, and other file systems. | Any DataFrame or RDD. |
| Triggered | Automatically, on the first read (if cache is enabled). | Manually, requires code changes. |
| Evaluated | Lazily. | Lazily. |
| Force cache | CACHE SELECT command | .cache + any action to materialize the cache and .persist. |
| Availability | Can be enabled or disabled with configuration flags, enabled by default on certain node types. | Always available. |
| Evicted | Automatically in LRU fashion or on any file change, manually when restarting a cluster. | Automatically in LRU fashion, manually with unpersist. |
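Both caches are described as evicting entries in LRU (least recently used) fashion. The following is a minimal sketch of that policy in general, not the Databricks or Spark implementation. The class name LruFileCache, its methods, and the byte-based capacity are assumptions made for illustration.

```python
from collections import OrderedDict


class LruFileCache:
    """Toy LRU cache keyed by file path; capacity is counted in bytes.

    Illustrative only: this is not how Databricks or Spark implement eviction.
    """

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # path -> size, least recently used first

    def read(self, path, size):
        """Return True on a cache hit, False when the file had to be fetched."""
        if path in self._entries:
            self._entries.move_to_end(path)  # mark as most recently used
            return True
        # Miss: evict least recently used entries until the new file fits.
        while self._entries and self.used_bytes + size > self.max_bytes:
            _, evicted_size = self._entries.popitem(last=False)
            self.used_bytes -= evicted_size
        if size <= self.max_bytes:
            self._entries[path] = size
            self.used_bytes += size
        return False

    def cached_paths(self):
        """Return cached paths ordered from least to most recently used."""
        return list(self._entries)
```

In this sketch, reading a file that is already cached moves it to the most recently used position, so the entry that has gone longest without a read is the first one evicted when space runs out.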

Disk cache consistency

The disk cache automatically detects when data files are created, deleted, modified, or overwritten and updates its content accordingly. You can write, modify, and delete table data with no need to explicitly invalidate cached data. Any stale entries are automatically invalidated and evicted from the cache.
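One common way to get this kind of automatic invalidation is to store a version token (for example a modification time or checksum) next to each cached file and treat a mismatch as stale. The sketch below shows that general idea only. How the Databricks disk cache actually detects changes is not described here, and VersionedFileCache, version_token, and fetch are hypothetical names.

```python
class VersionedFileCache:
    """Toy cache that treats an entry as stale when the file's version token changes.

    Illustrative only: the real disk cache's change detection is not documented here.
    """

    def __init__(self):
        self._entries = {}  # path -> (version_token, cached_data)

    def read(self, path, version_token, fetch):
        """Return (data, hit). `fetch` is called only on a miss or a stale entry."""
        entry = self._entries.get(path)
        if entry is not None and entry[0] == version_token:
            return entry[1], True
        data = fetch(path)  # stale or missing: reload from the remote source
        self._entries[path] = (version_token, data)
        return data, False

    def drop_deleted(self, existing_paths):
        """Evict entries whose source files no longer exist."""
        for path in list(self._entries):
            if path not in existing_paths:
                del self._entries[path]

    def cached_paths(self):
        return sorted(self._entries)
```

The caller never invalidates anything explicitly in this model: a changed token causes a refetch on the next read, and deleted files are dropped when the set of existing paths is refreshed.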

Use disk caching

The recommended (and easiest) way to use disk caching is to choose a worker type with SSD volumes when you configure your cluster. Such workers are enabled and configured for disk caching.

However, c5d, r5d, and z1d series workers are configured for disk caching but do not have it enabled by default. To enable the disk cache on these workers, see Enable or disable the disk cache.

The disk cache is configured to use at most half of the space available on the local SSDs provided with the worker nodes. For configuration options, see Configure the disk cache.

Cache a subset of the data

To explicitly select a subset of data to be cached, use the following syntax:

CACHE SELECT column_name[, column_name, ...] FROM [db_name.]table_name [ WHERE boolean_expression ]

You don’t need to use this command for the disk cache to work correctly, because the data is cached automatically when it is first accessed. However, preloading can be helpful when you require consistent query performance.
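As a rough illustration of the syntax above, the following Python helper assembles a CACHE SELECT statement from its parts. The function name build_cache_select and the table sales.orders with its columns are hypothetical. The helper does no quoting or validation, so it is a sketch for trusted inputs only.

```python
def build_cache_select(columns, table, where=None):
    """Assemble a CACHE SELECT statement following the syntax shown above.

    Hypothetical helper: inputs are inserted as-is, with no escaping.
    """
    if not columns:
        raise ValueError("at least one column is required")
    statement = f"CACHE SELECT {', '.join(columns)} FROM {table}"
    if where:
        statement += f" WHERE {where}"
    return statement


# Hypothetical table and columns, used only to show the resulting statement.
statement = build_cache_select(["id", "amount"], "sales.orders", "amount > 100")
print(statement)
```

In a Databricks notebook, where a `spark` session is already defined, the resulting string could then be passed to `spark.sql(statement)` to preload the selected data.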

For examples and more details, see

Configure the disk cache

Databricks recommends that you choose cache-accelerated worker instance types for your clusters. Such instances are automatically configured optimally for the disk cache.

Note

When a worker is decommissioned, the Spark cache stored on that worker is lost. If autoscaling is enabled, the cache can therefore be unstable, and Spark rereads any missing partitions from the source as needed.

Configure disk usage

To configure how the disk cache uses the worker nodes’ local storage, specify the following Spark configuration settings during cluster creation:

  • spark.databricks.io.cache.maxDiskUsage: the disk space per node reserved for cached data, in bytes

  • spark.databricks.io.cache.maxMetaDataCache: the disk space per node reserved for cached metadata, in bytes

  • spark.databricks.io.cache.compression.enabled: whether the cached data is stored in compressed format

Example configuration:

spark.databricks.io.cache.maxDiskUsage 50g
spark.databricks.io.cache.maxMetaDataCache 1g
spark.databricks.io.cache.compression.enabled false
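If you generate cluster configurations from code, the three settings can be rendered from a few parameters. The sketch below is one way to do that. The function name render_disk_cache_conf is hypothetical, and the values are passed through unchanged, so the caller is assumed to supply sizes in a form the cluster accepts (such as the 50g and 1g used in the example above).

```python
def render_disk_cache_conf(max_disk_usage, max_metadata_cache, compression_enabled):
    """Render the disk cache settings as 'key value' lines for a cluster Spark config.

    Hypothetical helper: size values are passed through without validation.
    """
    settings = {
        "spark.databricks.io.cache.maxDiskUsage": max_disk_usage,
        "spark.databricks.io.cache.maxMetaDataCache": max_metadata_cache,
        # Spark config values are strings, so render the boolean as true/false.
        "spark.databricks.io.cache.compression.enabled": str(bool(compression_enabled)).lower(),
    }
    return "\n".join(f"{key} {value}" for key, value in settings.items())


conf_text = render_disk_cache_conf("50g", "1g", False)
print(conf_text)
```

With these arguments, the rendered text matches the example configuration shown above.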

Enable or disable the disk cache

To enable or disable the disk cache, run the following with the value set to either "true" or "false":

spark.conf.set("spark.databricks.io.cache.enabled", "[true | false]")

Disabling the cache does not result in dropping the data that is already in the local storage. Instead, it prevents queries from adding new data to the cache and reading data from the cache.
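The setting can also be wrapped in a small helper so that callers pass a boolean instead of a string. This is a minimal sketch: set_disk_cache_enabled and FakeConf are hypothetical names, and FakeConf only stands in for `spark.conf` so that the example runs outside a cluster.

```python
DISK_CACHE_ENABLED_KEY = "spark.databricks.io.cache.enabled"


def set_disk_cache_enabled(conf, enabled):
    """Set the disk cache flag on a Spark-style runtime config object."""
    conf.set(DISK_CACHE_ENABLED_KEY, "true" if enabled else "false")


class FakeConf:
    """Hypothetical stand-in for spark.conf so the sketch runs without a cluster."""

    def __init__(self):
        self.values = {}

    def set(self, key, value):
        self.values[key] = value


conf = FakeConf()
set_disk_cache_enabled(conf, False)
print(conf.values)
```

In a notebook, passing `spark.conf` instead of the stand-in would apply the setting to the current session. As described above, turning the flag off stops reads from and writes to the cache but leaves data that is already cached on local storage.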