Databricks IO Cache

The Databricks IO cache accelerates data reads by creating copies of remote files in nodes’ local storage, using a fast intermediate data format. The data is cached automatically whenever a file has to be fetched from a remote location. Successive reads of the same data are then performed locally, which results in significantly improved read speeds.

The Databricks IO cache supports reading Parquet files from DBFS, Amazon S3, HDFS, Azure Blob Storage, and Azure Data Lake. It does not support other file formats such as CSV, JSON, and ORC.

Databricks IO and RDD cache comparison

There are two types of cache available in Databricks:

  • Databricks IO (IO) cache
  • Apache Spark RDD (RDD) cache

You can use the IO cache and the RDD cache at the same time. In this section we outline the key differences between them, so that you can choose the best tool for your workflow.

Type of stored data

The IO cache contains local copies of remote data. It can improve the performance of a wide range of queries, but cannot be used to store results of arbitrary subqueries.

The RDD cache can store the results of any subquery, as well as data stored in formats other than Parquet (such as CSV, JSON, and ORC).

Performance

The data stored in the IO cache can be read and operated on faster than the data in the RDD cache. This is because the IO cache uses efficient decompression algorithms and outputs data in the optimal format for further processing using whole-stage code generation.
Automatic vs manual control

When the IO cache is enabled, data that has to be fetched from a remote source is automatically added to the cache. This process is fully transparent and does not require any action. However, to preload data into the cache beforehand, you can use the CACHE command (see Cache a subset of the data).

When you use the RDD cache, you must manually specify the tables and queries to cache.
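
For example, in Spark SQL you cache tables and query results manually with the CACHE TABLE statement. The table and query below are hypothetical, for illustration only:

```sql
-- Cache an entire table in the Spark RDD cache (hypothetical table name).
CACHE TABLE events;

-- Cache the result of an arbitrary query under a temporary name.
CACHE TABLE recent_events AS
  SELECT * FROM events WHERE event_date > '2018-01-01';
```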

Disk vs memory-based

The IO cache is stored entirely on the local disk, so memory is not taken away from other operations within Spark. Thanks to the high read speeds of modern SSDs, the IO cache can be fully disk-resident without a negative impact on its performance. In contrast, the RDD cache uses memory.

Note

When using AWS i3 instances as worker nodes with Databricks Runtime 3.3 and above, the IO cache is enabled by default and preconfigured to use at most half of the space available on the SSDs provided with the chosen instance type.

No further configuration is required and you can skip the following steps.

Configure the IO cache

To configure the IO cache, specify the following Spark configuration settings during cluster creation:

spark.databricks.io.cache.maxDiskUsage <disk space per node reserved for cached data in bytes>
spark.databricks.io.cache.maxMetaDataCache <disk space per node reserved for cached metadata in bytes>

For example:

spark.databricks.io.cache.maxDiskUsage 14000000000
spark.databricks.io.cache.maxMetaDataCache 300000000

Databricks recommends that you use the following values:

  • spark.databricks.io.cache.maxDiskUsage: 50% of the total available storage space per node
  • spark.databricks.io.cache.maxMetaDataCache: 2% of the space dedicated to spark.databricks.io.cache.maxDiskUsage
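
Applying these ratios can be sketched as follows; the 50% and 2% figures follow the recommendations above, while the 400 GB disk size is a hypothetical example, not a Databricks default:

```python
# Sketch: derive the recommended IO cache settings from a node's total
# local storage, using the 50% and 2% ratios recommended above.
def io_cache_settings(total_disk_bytes):
    max_disk_usage = total_disk_bytes // 2          # 50% of local storage
    max_metadata_cache = max_disk_usage * 2 // 100  # 2% of maxDiskUsage
    return {
        "spark.databricks.io.cache.maxDiskUsage": max_disk_usage,
        "spark.databricks.io.cache.maxMetaDataCache": max_metadata_cache,
    }

# Hypothetical node with 400 GB of local storage.
settings = io_cache_settings(400 * 10**9)
for key, value in settings.items():
    print(f"{key} {value}")
```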

Configure EBS local storage

By default, the IO cache data is stored on the instance’s local disk. For instance types that do not have a local disk, you can add extra EBS volumes during cluster creation to be used as cache storage:

../_images/ebs.png

Warning

The IO cache shares disk space with Spark shuffle outputs. Dedicating too large a portion of the storage to the cache can result in out-of-disk-space errors.

See EBS volumes for more details.

Enable the IO cache

To enable or disable the IO cache, run:

spark.conf.set("spark.databricks.io.cache.enabled", "[true | false]")

Disabling the cache does not drop the data that is already in local storage. Instead, it prevents queries from adding new data to the cache and from reading data from it.

Cache a subset of the data

To explicitly select a subset of the data to cache, use the following syntax:

CACHE SELECT column_name[, column_name, ...] FROM [db_name.]table_name [ WHERE boolean_expression ]

You don’t need to use this command for the Databricks IO cache to work correctly; the data is cached automatically when first accessed. However, it can be helpful when you require consistent query performance.
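
For example, to preload only the columns a query will touch, restricted to recent rows (the database, table, and column names below are hypothetical):

```sql
-- Preload two columns of a hypothetical table into the IO cache.
CACHE SELECT event_type, event_date
FROM sales.events
WHERE event_date > '2018-01-01';
```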

See Cache for examples and more details.

Monitor the IO cache

You can check the current state of the IO cache on each of the executors on the Storage tab of the Spark UI.

../_images/DBIO-SparkUI.png

When a node reaches 100% disk usage, the cache manager discards the least recently used cache entries to make space for new data.
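
The eviction behavior can be illustrated with a simplified least-recently-used policy over a byte budget. This is an illustrative sketch of LRU eviction, not the actual Databricks cache manager:

```python
from collections import OrderedDict

class LRUDiskCache:
    """Illustrative LRU cache with a byte budget, sketching the eviction
    behavior described above; not the actual Databricks cache manager."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        self.entries = OrderedDict()  # file path -> size in bytes

    def read(self, path, size):
        if path in self.entries:
            # Cache hit: mark the entry as most recently used.
            self.entries.move_to_end(path)
            return "hit"
        # Cache miss: evict least recently used entries until the new
        # file fits within the disk budget.
        while self.used + size > self.max_bytes and self.entries:
            _, evicted_size = self.entries.popitem(last=False)
            self.used -= evicted_size
        self.entries[path] = size
        self.used += size
        return "miss"

cache = LRUDiskCache(max_bytes=100)
cache.read("a.parquet", 60)  # miss: cached
cache.read("b.parquet", 60)  # miss: a.parquet evicted to make space
cache.read("a.parquet", 60)  # miss again, since it was evicted
```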