Where’s my data?

Databricks uses a shared responsibility model to create, configure, and access block storage volumes and object storage locations in your cloud account. Loading or saving data with Databricks results in files stored in either block storage or object storage. The following matrix provides a quick reference:

Operation                             Location
------------------------------------  --------------
UI data upload                        Object storage
DBFS file upload                      Object storage
Upload data with Auto Loader          Object storage
Upload data with COPY INTO            Object storage
Create table                          Object storage
Save data with Apache Spark           Object storage
Save data with pandas                 Block storage
Download data from web in a notebook  Block storage
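
The last two rows of the matrix show the contrast most directly. The following is a minimal sketch, assuming a Databricks notebook where the spark session is predefined; the output paths are illustrative placeholders, not required locations.

```python
df = spark.range(10)

# Save data with Apache Spark: a path without a scheme resolves to DBFS,
# so the written Delta files land in cloud object storage.
df.write.format("delta").mode("overwrite").save("/tmp/example_spark_output")

# Save data with pandas: the same-looking path is resolved by plain Python
# file APIs on the driver's local disk, i.e. block storage.
df.toPandas().to_csv("/tmp/example_pandas_output.csv", index=False)
```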

What is object storage?

In cloud computing, object storage or blob storage refers to storage containers that maintain data as objects, with each object consisting of data, metadata, and a globally unique resource identifier (URI). Data manipulation operations in object storage are often limited to create, read, update, and delete (CRUD) through a REST API; a short sketch of this pattern appears at the end of this section. Some object storage offerings include features such as versioning and lifecycle management. Object storage has the following benefits:

  • High availability, durability, and reliability.

  • Lower cost for storage compared to most other storage options.

  • Highly scalable, limited only by the total amount of storage available in a given cloud region.

Most cloud-based data lakes are built on top of open source data formats in cloud object storage.
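
For concreteness, here is a hedged sketch of the CRUD pattern mentioned above against one object storage offering, AWS S3, using the boto3 SDK. The bucket name and object key are placeholders, and credentials are assumed to already be configured in the environment.

```python
import boto3

s3 = boto3.client("s3")

# Create / update: write an object under a key.
s3.put_object(Bucket="my-example-bucket", Key="demo/hello.txt", Body=b"hello")

# Read: fetch the object back by key.
body = s3.get_object(Bucket="my-example-bucket", Key="demo/hello.txt")["Body"].read()

# Delete: remove the object.
s3.delete_object(Bucket="my-example-bucket", Key="demo/hello.txt")
```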

How does Databricks use object storage?

Object storage is the main form of storage used by Databricks for most operations. The Databricks Filesystem (DBFS) allows Databricks users to interact with files in object storage similar to how they would in any other file system. Unless you specifically configure a table against an external data system, all tables created in Databricks store data in cloud object storage.

Delta Lake files stored in cloud object storage provide the data foundation for the Databricks Lakehouse.
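
A brief sketch of what that looks like in practice, assuming a Databricks notebook where spark and dbutils are predefined; the table name is hypothetical.

```python
# Browse object storage through DBFS as if it were a local file system.
for entry in dbutils.fs.ls("/"):
    print(entry.path)

# A managed table created here keeps its data files in cloud object storage;
# DESCRIBE DETAIL shows the backing location.
spark.sql("CREATE TABLE IF NOT EXISTS demo_where_is_my_data (id INT)")
spark.sql("DESCRIBE DETAIL demo_where_is_my_data").select("location").show(truncate=False)
```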

How do you configure cloud object storage for Databricks?

Databricks uses cloud object storage to store data files and tables. During workspace deployment, Databricks configures a cloud object storage location known as the DBFS root. You can configure connections to other cloud object storage locations in your account.

In almost all cases, the data files you interact with using Apache Spark on Databricks are stored in cloud object storage. See the documentation on connecting to cloud object storage for guidance on configuring these connections.
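
Once such a connection is configured, reading from object storage looks like reading from any other path. The following sketch assumes a Databricks notebook with access already granted; the bucket and container names are placeholders.

```python
# Spark reads directly from object storage by URI.
df = spark.read.format("csv").option("header", "true").load(
    "s3://my-example-bucket/raw/events.csv"
    # Azure equivalent: "abfss://container@account.dfs.core.windows.net/raw/events.csv"
)
display(df)
```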

What is block storage?

In cloud computing, block storage or disk storage refers to storage volumes that correspond to traditional hard disk drives (HDDs) or solid state drives (SSDs), also known simply as “hard drives”. When block storage is deployed in a cloud computing environment, it is typically provisioned as a logical partition of one or more physical drives. Implementations vary slightly between product offerings and cloud vendors, but the following characteristics are typically found across implementations:

  • All virtual machines (VMs) require an attached block storage volume.

  • Files and programs installed to a block storage volume persist as long as the block storage volume persists.

  • Block storage volumes are often used for temporary data storage.

  • Block storage volumes attached to VMs are usually deleted alongside VMs.

How does Databricks use block storage?

When you start compute resources, Databricks configures and deploys VMs and attaches block storage volumes. This block storage stores ephemeral data files for the lifetime of the compute resource. These files include the operating system and installed libraries, in addition to data used by the disk cache. While Apache Spark uses block storage in the background for efficient parallelization and data loading, most code run on Databricks does not directly save or load data to block storage.

You can run arbitrary code such as Python or Bash commands that use the block storage attached to your driver node. See Access files on the driver filesystem.
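
For instance, the following minimal sketch writes and reads a file on the driver's block storage with plain Python file APIs; the /tmp path is just an example of a directory on the driver volume and disappears when the compute terminates.

```python
# Plain Python file I/O resolves against the driver's local block storage.
with open("/tmp/scratch.txt", "w") as f:
    f.write("ephemeral data on the driver's block storage volume")

with open("/tmp/scratch.txt") as f:
    print(f.read())
```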

In workspaces with workspace files enabled, Python users can save and load data and files stored alongside notebooks instead of needing to interact with block storage on the driver. See Programmatically interact with workspace files.
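
A small sketch of that workflow, assuming workspace files are enabled and that a hypothetical lookup.csv sits in the same workspace folder as the notebook.

```python
import pandas as pd

# With workspace files enabled, relative paths resolve next to the notebook,
# so small files can live alongside your code instead of on the driver's disk.
lookup = pd.read_csv("./lookup.csv")
print(lookup.head())
```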