Phase 5: Design storage architecture

In this phase, you design storage infrastructure for Databricks workspaces and Unity Catalog.

To process data on Databricks, cloud storage must be configured. There are two kinds of storage within Databricks:

  • Databricks-managed storage (default storage): Storage in the Databricks-owned cloud account. Used by serverless workspaces for workspace root storage and optionally for Unity Catalog catalogs.
  • Customer-managed storage: Storage in the customer's cloud account. Used by classic workspaces for workspace storage and Unity Catalog storage.

Both storage types use the same underlying cloud services.

Design workspace storage architecture

The workspace storage account/bucket is a mandatory part of a Databricks workspace and is used for several purposes:

  • Default storage for the Unity Catalog catalog created for the workspace.
  • Storage for other internally generated data used by platform services, such as the default locations for MLflow experiments and MLflow registered models, the Lakeflow Spark Declarative Pipelines default location, and Cloud Fetch results.

Workspace storage patterns

Using a customer-managed virtual network gives you more control over the workspace root bucket. You create the root bucket first and then assign it to a Databricks workspace, retaining full control over the bucket policy while ensuring the workspace can still access it.

Best practices for workspace storage

  • Use Unity Catalog external locations to override default workspace storage locations.
  • Disable file uploads from the web app unless absolutely necessary (configurable from the admin settings page).
  • Do not use the workspace root bucket (DBFS) for production customer data.
  • Understand the risks and workarounds for the root bucket.

DBFS migration

DBFS (Databricks File System) should not be used for new production data. For existing deployments:

  • Structured data: Migrate DBFS tables to Unity Catalog managed tables.
  • Unstructured data: Migrate files from DBFS to Unity Catalog volumes for POSIX-style file access.
  • Workspace files: Continue using workspace storage (not DBFS) for notebooks, libraries, and cluster logs.
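As an illustrative sketch of the structured-data migration, a legacy Hive table can be copied into a Unity Catalog managed table with a `CREATE TABLE ... AS SELECT` statement. The table names here (`hive_metastore.default.orders`, `main.sales.orders`) are hypothetical placeholders; in a Databricks notebook you would run the generated statement with `spark.sql()`:

```python
# Sketch: build the CTAS statement that migrates a DBFS-backed table in the
# legacy hive_metastore into a Unity Catalog managed table. Names are
# placeholders -- substitute your own catalog, schema, and table.

def migrate_table_ddl(src: str, dest: str) -> str:
    """Return a CREATE TABLE AS SELECT statement copying a legacy
    hive_metastore table into a Unity Catalog managed table."""
    return f"CREATE TABLE {dest} AS SELECT * FROM {src}"

ddl = migrate_table_ddl("hive_metastore.default.orders", "main.sales.orders")
# In a Databricks notebook: spark.sql(ddl)
```

For large tables, consider a deep clone or an incremental pipeline instead of a one-shot CTAS, so the cutover can be staged.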

Design Unity Catalog storage architecture

Unity Catalog metastores support three types of objects that determine how and where data is stored: managed, external, and foreign.

Storage object types

Managed objects

For managed objects, a managed storage location specifies a location in cloud object storage for storing data for managed tables and managed volumes. You can associate a managed storage location with a metastore, catalog, or schema. Managed storage locations at lower levels in the hierarchy override storage locations defined at higher levels when managed tables or managed volumes are created.
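The override behavior can be sketched as a simple resolution function: the most specific level that defines a managed location wins. The storage paths below are illustrative placeholders, not real defaults:

```python
# Sketch: how Unity Catalog resolves the managed storage location for a new
# managed table or volume -- schema-level overrides catalog-level, which
# overrides the metastore default.
from typing import Optional

def effective_managed_location(metastore: str,
                               catalog: Optional[str] = None,
                               schema: Optional[str] = None) -> str:
    """Return the storage root a new managed table or volume would use."""
    return schema or catalog or metastore

root = effective_managed_location(
    metastore="s3://metastore-root",
    catalog="s3://finance-catalog-root",
)
# The catalog-level location overrides the metastore default; a
# schema-level location would override both.
```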

important

Relying on default locations is not recommended, as it can lead to unintended data co-location and complicate access control and lifecycle management. Explicitly define storage locations at the catalog or schema level.

External objects

External objects store their data in external locations. External locations associate Unity Catalog storage credentials with cloud object storage containers (for example, Amazon S3 buckets, Azure containers, or Google Cloud Storage buckets). External locations are used to:

  • Define managed storage locations for catalogs and schemas.
  • Define locations for external tables and external volumes.
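As a sketch, the statements below define an external location over an S3 path and grant read access on it, using Databricks SQL. The credential name `finance_cred` is assumed to exist already (created via Catalog Explorer, the API, or Terraform); the location name, bucket, and group are hypothetical:

```python
# Sketch: Databricks SQL statements (as strings) that wire an existing
# storage credential to an external location and grant read access.
# In a notebook, execute each with spark.sql(); names are placeholders.

create_location = (
    "CREATE EXTERNAL LOCATION finance_raw "
    "URL 's3://finance-raw-bucket/landing' "
    "WITH (STORAGE CREDENTIAL finance_cred)"
)

grant_read = (
    "GRANT READ FILES ON EXTERNAL LOCATION finance_raw TO `data_engineers`"
)
```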

Foreign objects

A foreign catalog specifies a connection to an external data system for accessing remote tables and schemas. You can associate a foreign catalog with metadata from external sources such as Hive Metastore, AWS Glue, or Snowflake Horizon. Foreign catalogs provide read-only access to the remote system's database objects, enabling you to query external tables and schemas through Unity Catalog without replicating the data.

Cloud storage architecture

AWS storage architecture

In Databricks AWS accounts, a Unity Catalog metastore has:

  • Zero or one S3 bucket for the default storage of managed tables at the metastore level.
  • Zero or more buckets at the catalog or schema level for managed tables.
  • Zero or more S3 buckets for external tables.

Similar to workspaces, the S3 buckets for storage under Unity Catalog can belong to different AWS accounts. For all buckets (storage locations), you need to set up a cross-account IAM role (storage credential) with a trust relationship so that Unity Catalog can assume the role to access data in the bucket on behalf of Databricks users.
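The trust relationship above has a well-known shape: the role trusts a Databricks-published Unity Catalog principal and checks an external ID. The sketch below builds that JSON; the principal ARN is a placeholder (Databricks publishes the exact value in its storage-credential setup documentation), and the external ID is your Databricks account ID:

```python
import json

# Placeholder: replace with the Unity Catalog role ARN published in the
# Databricks storage-credential documentation for your deployment.
UC_MASTER_ROLE_ARN = "arn:aws:iam::<DATABRICKS-ACCOUNT>:role/<UC-MASTER-ROLE>"

def uc_trust_policy(databricks_account_id: str) -> dict:
    """Trust policy letting Unity Catalog assume the storage-credential
    role; the sts:ExternalId check scopes it to one Databricks account."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": UC_MASTER_ROLE_ARN},
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {"sts:ExternalId": databricks_account_id}
            },
        }],
    }

print(json.dumps(uc_trust_policy("<your-databricks-account-id>"), indent=2))
```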

Storage isolation patterns

Separate storage by environment

Use different storage containers for development, staging, and production environments. This provides clear boundaries and prevents accidental production data access from lower environments.

Separate storage by business unit

When business units require complete data segregation for governance or billing purposes, use separate storage containers with separate storage credentials for each business unit.

Separate storage by data domain

In data mesh architectures, each domain should have its own storage containers managed by domain-specific storage credentials.

Multi-region storage design

If multiple regions use Databricks, storage architecture must account for data locality and cross-region access patterns:

  • Deploy storage containers in the same region as the metastore for performance.
  • Use Databricks-to-Databricks (D2D) Delta Sharing to share data between regions.
  • Evaluate the frequency and volume of data access across regions to determine if data replication pipelines are needed.
  • Consider egress costs when accessing data across regions.

warning

Do not register shared tables as external tables in more than one metastore. The risk is that any changes to the schema, table properties, and comments that occur as a result of writes to metastore A will not register at all with metastore B. Tables in metastore B would have to be recreated to have the correct schema, and table properties and comments would be entirely disconnected. This can also cause consistency issues with the Delta Commit service. Use D2D Delta Sharing for sharing data between metastores.

Design access and authentication strategy

Databricks recommends using Unity Catalog to manage access to all data, and in particular recommends using managed tables wherever possible. Unity Catalog fully manages storage layout, metadata, and governance for managed tables.

To manage access to external cloud storage that holds tables and volumes, Unity Catalog uses external locations, which define a path to cloud storage and the credentials required to access that location. Beyond cloud storage, Unity Catalog also manages permissions for tables, models, and other assets, and can federate to external catalogs.

Authentication methods by cloud

AWS authentication architecture

In AWS, use cross-account IAM roles for Unity Catalog to access customer storage:

  1. Create an IAM role with a trust policy that allows Databricks to assume the role
  2. Attach IAM policies granting S3 permissions to specific S3 buckets or prefixes
  3. Register the IAM role ARN as a storage credential in Unity Catalog
  4. Create external locations that use the storage credential
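Step 2 can be sketched as a least-privilege policy scoped to a single prefix. Bucket and prefix names are placeholders, and the action list is a minimal read/write set; your setup may additionally need actions such as `s3:GetBucketLocation` or multipart-upload permissions:

```python
# Sketch: minimal IAM policy granting read/write on one S3 prefix only.
# Placeholders throughout; tighten or extend to match your requirements.

def s3_prefix_policy(bucket: str, prefix: str) -> dict:
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Object-level read/write, scoped to the prefix.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
            {   # Listing, constrained to the same prefix.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
        ],
    }
```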

Best practices for AWS authentication

  • Use separate IAM roles for different storage buckets or data domains.
  • Apply least-privilege permissions (only grant access to specific S3 prefixes).
  • Enable AWS CloudTrail to audit access to S3 buckets.
  • Use bucket policies as an additional security layer.

Design encryption strategy

Customer-managed storage can be encrypted using standard cloud practices. By default, data is encrypted at rest using the cloud provider's encryption and a Databricks-managed key.

Encryption options

Databricks-managed keys (default)

By default, your data is encrypted at rest using the cloud provider's encryption and a Databricks-managed key. This provides baseline encryption with no additional configuration required.

Customer-managed keys

For organizations that require customer-managed keys, every cloud supports them. Customer-managed keys are generally used for two purposes:

Purpose 1: Control plane and managed services encryption

Encrypt customer data in the Databricks control plane, default storage, and supported serverless services that store customer data at rest (such as vector search, query results, code, and secrets) with a key under customer control. If the customer removes the key or removes access to the key, all data related to that workspace on the control plane becomes inaccessible.

Purpose 2: Compute and data plane encryption

Encrypt customer data on the customer compute and data planes for specific services. Define a customer-managed key for storage encryption purposes so that Databricks uses it to encrypt the root bucket and storage volumes connected to clusters.

note

On AWS, S3 always encrypts data using an S3-managed KMS key transparently. A customer-managed key is only needed to explicitly encrypt data with a key managed by the customer, providing an additional security layer linked with access to the key itself.
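For Purpose 2 on AWS, layering a customer-managed key on a bucket typically means setting SSE-KMS as the bucket's default encryption. The sketch below builds the configuration in the shape that boto3's `put_bucket_encryption` expects as `ServerSideEncryptionConfiguration`; the key ARN is a placeholder:

```python
# Sketch: S3 default-encryption configuration that applies a
# customer-managed KMS key to all new objects in a bucket.

def sse_kms_default_encryption(kms_key_arn: str) -> dict:
    return {
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": kms_key_arn,
            },
            # S3 Bucket Keys reduce per-object KMS request costs.
            "BucketKeyEnabled": True,
        }]
    }
```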

Encryption patterns

Highly regulated environments

Use customer-managed keys for both control plane and compute/data plane encryption to maintain full control over encryption keys and meet compliance requirements.

Standard enterprise deployments

Use Databricks-managed keys for most workloads, reserving customer-managed keys for production workspaces with sensitive data.

Multi-tenant deployments

Consider using separate customer-managed keys for different business units or environments to provide encryption isolation.

Design storage network security

Network access to cloud storage can be limited as an additional security layer: if credentials are leaked, network access controls prevent their use from outside approved networks.

Network security patterns

  • Use S3 bucket policies to restrict access to specific VPCs or VPC endpoints.
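The pattern above can be sketched as a Deny statement keyed on `aws:SourceVpce`. Bucket name and endpoint IDs are placeholders; be careful with broad Deny statements, which can lock out Unity Catalog itself if the endpoint list is incomplete:

```python
# Sketch: bucket policy denying all S3 access unless the request arrives
# through one of the approved VPC endpoints. Placeholders throughout.
from typing import List

def vpce_only_bucket_policy(bucket: str, vpce_ids: List[str]) -> dict:
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyOutsideApprovedVpce",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                f"arn:aws:s3:::{bucket}",
                f"arn:aws:s3:::{bucket}/*",
            ],
            "Condition": {
                "StringNotEquals": {"aws:SourceVpce": vpce_ids}
            },
        }],
    }
```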

Best practices for storage network security

  • Limit storage access to specific virtual networks or subnets.
  • Use VPC endpoints for private connectivity.
  • Enable storage service logging to audit access attempts.
  • Configure network rules before granting broad storage permissions.

Hub-and-spoke storage design

The hub-and-spoke storage design pattern is a common architecture for enterprise Unity Catalog deployments. This pattern centralizes shared data assets in hub storage while allowing domain-specific data in spoke storage.

Hub-and-spoke storage characteristics

  • Hub storage: Contains organization-wide shared data assets (for example, customer master data, reference data, centrally curated datasets).
  • Spoke storage: Contains domain-specific data owned and managed by business units (for example, sales analytics, marketing campaigns).
  • Storage separation: Hub and domain catalogs use dedicated storage with separate storage credentials.
  • Managed tables preference: For structured data in the lakehouse, use managed tables.
  • Volumes for raw data: Use volumes to access landing, raw, or unstructured data (which can sit outside the lakehouse because third parties usually require access to these storage locations directly).
  • External tables for sharing: Use external tables to share data outside of the lakehouse (to other systems not able to use Delta Sharing or that need direct access to the storage location).
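The storage-separation characteristic above can be sketched as one catalog per hub or spoke, each pinned to its own managed location. Catalog names and bucket URLs are hypothetical; in a notebook each generated statement would be run with `spark.sql()`:

```python
# Sketch: generate CREATE CATALOG statements for a hub catalog plus
# per-domain spoke catalogs, each with a dedicated managed location.

hub_and_spokes = {
    "hub": "s3://acme-hub-root",
    "sales": "s3://acme-sales-root",
    "marketing": "s3://acme-marketing-root",
}

statements = [
    f"CREATE CATALOG IF NOT EXISTS {name} MANAGED LOCATION '{url}'"
    for name, url in hub_and_spokes.items()
]
# In a Databricks notebook: for s in statements: spark.sql(s)
```

Pairing each catalog with its own external location and storage credential (as listed above) keeps the blast radius of a leaked credential limited to one domain.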

Best practices for hub-and-spoke storage design

  • Use hub storage for organization-wide shared data that multiple domains consume.
  • Use spoke storage for domain-specific data owned by business units.
  • Separate storage credentials and external locations for hub and each spoke.
  • Use Databricks-managed Delta Sharing to share data from hub to spokes.
  • Document data ownership and lineage for hub and spoke storage.
  • Note: Metastore-level storage is now optional, and Databricks recommends not using it.

Storage architecture recommendations

Recommended

  • Use Unity Catalog managed tables and do not provide storage-level access to buckets.
  • Explicitly define storage locations at the catalog or schema level rather than relying on metastore-level defaults.
  • Separate storage by environment (for example, dev, staging, production) using different storage containers.
  • Use volumes for Unity Catalog-governed, POSIX-style file access.
  • Avoid legacy data access patterns such as mounting cloud storage and instance profiles wherever possible.
  • Evaluate whether customer-managed encryption keys (for both managed services and storage) are needed for increased control over data at rest.
  • Use Databricks-managed Delta Sharing to share tables across clouds and regions.

Avoid

  • Do not use the root bucket (DBFS) for storage of customer data.
  • Do not store production data on DBFS (Databricks File System).
  • Do not register external tables across regions (metastores).
  • Do not provide storage-level access (for example, S3 bucket access, ADLS container access) directly to users.
  • Do not rely on default metastore-level storage locations for production data.

Phase 5 outcomes

After completing Phase 5, you should have:

  • Storage architecture designed for workspaces (customer-managed vs Databricks-managed).
  • Unity Catalog storage architecture designed (managed vs external vs foreign objects).
  • Storage isolation strategy defined (by environment, business unit, or data domain).
  • Authentication strategy designed (for example, IAM roles, Access Connectors, or service accounts).
  • Encryption strategy selected (Databricks-managed keys vs customer-managed keys).
  • Storage network security patterns defined (for example, bucket policies, storage firewalls, VPC endpoints).
  • Multi-region storage considerations documented (if applicable).
  • Hub-and-spoke storage design evaluated (for enterprise deployments).
  • DBFS migration strategy planned (for existing deployments with DBFS data).

Next phase: Phase 6: Design Delta Lake architecture

Implementation guidance: For step-by-step instructions to implement your storage design, see Connect to cloud object storage using Unity Catalog.