Databricks IO Cache

The DBIO cache accelerates data reads by creating copies of remote files in nodes’ local storage using fast intermediate data format. The data is cached automatically whenever a file has to be fetched from a remote location. Successive reads of the same data can then be executed locally, which results in significantly improved reading speed.

The DBIO cache currently supports reading Parquet files from DBFS, Amazon S3, HDFS, Azure Blob Storage and Azure Data Lake.

The Difference Between RDD and DBIO Cache

There are two types of cache available in Databricks:

  • the RDD cache which is a part of Apache Spark,
  • and the Databricks IO cache which is available only to Databricks customers.

Both the RDD cache and the Databricks IO cache can be used at the same time without issue. In this section we outline the key differences between them, so that the users can choose the best tool for their workflow.

Type of stored data

The RDD cache can be used to store the result of any subquery.

The DBIO cache is designed to speed-up scans by creating local copies of remote data. Therefore, it can improve the performance of a wide range of queries, but cannot be used to store results of arbitrary subqueries.

Performance
The data stored in the DBIO cache can be read and operated on faster than the data in the RDD cache. This is because the DBIO cache uses efficient decompression algorithms, and outputs data in the optimal format for further processing using whole-stage code generation.
Automatic vs manual control

When using the RDD cache it is necessary to manually choose tables or queries to be cached.

When using the DBIO cache the data is added to the cache automatically whenever it has to be fetched from a remote source. This process is fully transparent and does not require any action from the user. However, the users who would prefer to preload all the necessary data into the cache beforehand can do so using the CACHE command (see Caching Selected Subsets of the Data).

Disk vs memory-based
Unlike the RDD cache, the DBIO cache is stored entirely on the local disk, so that the memory is not taken away from other operations within Spark. Thanks to the high reading speeds of modern SSDs, the DBIO cache can be fully disk-resident without a negative impact on its performance.

Tip

  • Use the RDD cache for storing the results of manually chosen subqueries.
  • Use the DBIO cache to automatically improve the scan performance in workflows reading Parquet files.

Configuration

Note

When using the AWS i3 instances as worker nodes with Runtime 3.3 or newer, the DBIO cache is enabled by default and preconfigured to use at most half of space available on the NVMe SSDs provided with the chosen instance type.

No further configuration is required and the steps described below can be skipped.

The DBIO cache has to be configured during cluster creation with the following settings:

spark.databricks.io.cache.maxDiskUsage "{DISK SPACE PER NODE RESERVED FOR CACHED DATA}"
spark.databricks.io.cache.maxMetaDataCache "{DISK SPACE PER NODE RESERVED FOR CACHED METADATA}"

As a guideline, we recommend using the following values:

  • For spark.databricks.io.cache.maxDiskUsage: use 50% of the total available storage space per node
  • For spark.databricks.io.cache.maxMetaDataCache: use 2% of the space dedicated to spark.databricks.io.cache.maxDiskUsage

By default, the DBIO cache data is stored on the instance local disk. For instance types which do not have a local disk, you can add extra EBS volumes to be used for cache storage during cluster creation.

../_images/ebs.png

Note

The DBIO cache shares the same disk space with Spark shuffle outputs. Dedicating too large portion of the storage to the cache can result in out of disk space errors.

See AWS Configurations for more details.

Enabling the Cache

The DBIO cache can be enabled or disabled at any moment with the following command:

spark.conf.set("spark.databricks.io.cache.enabled", "{true/false}")

Disabling the cache does not result in dropping the data that is already in the local storage. Instead, it prevents the queries from adding new data to the cache and/or reading data from the cache.

Caching Selected Subsets of the Data

It is possible to explicitly select a subset of data to be cached using following syntax:

CACHE SELECT column_name[, column_name, ...] FROM [db_name.]table_name [ WHERE boolean_expression ]

While using this command is not necessary for the DBIO to work correctly (the data will be automatically cached when first accessed), it can be helpful when consistent query performance is required.

See Cache for examples and more details.

Monitoring

The current state of the DBIO cache on each of the executors can be checked in the Storage tab in SparkUI.

../_images/DBIO-SparkUI.png

When a node reaches 100% Disk Usage, it will start to evict the least recently used cache entries to make space for new data.