Connect to cloud object storage and services using Unity Catalog
This article gives an overview of the cloud storage connections that are required to work with data using Unity Catalog, along with information about how Unity Catalog governs access to cloud storage and external cloud services.
Note
If your workspace was created before November 8, 2023, it might not be enabled for Unity Catalog. An account admin must enable Unity Catalog for your workspace. See Enable a workspace for Unity Catalog.
How does Unity Catalog use cloud storage?
Databricks recommends using Unity Catalog to manage access to all data that you have stored in cloud object storage. Unity Catalog provides a suite of tools to configure secure connections to cloud object storage. These connections provide access to complete the following actions:
Ingest raw data into a lakehouse.
Create and read managed tables and managed volumes of unstructured data in Unity Catalog-managed cloud storage.
Register or create external tables containing tabular data and external volumes containing unstructured data in cloud storage that is managed using your cloud provider.
Read and write unstructured data (as Unity Catalog volumes).
To be more specific, Unity Catalog uses cloud storage in two primary ways:
Default (or “managed”) storage locations for managed tables and managed volumes (unstructured, non-tabular data) that you create in Databricks. These managed storage locations can be defined at the metastore, catalog, or schema level. You create managed storage locations in your cloud provider, but their lifecycle is fully managed by Unity Catalog.
Storage locations where external tables and volumes are stored. These are tables and volumes whose access from Databricks is managed by Unity Catalog, but whose data lifecycle and file layout are managed using your cloud provider and other data platforms. Typically you use external tables to register large amounts of your existing data in Databricks, or if you also require write access to the data using tools outside of Databricks.
For more information about managed versus external tables and volumes, see What are tables and views? and What are Unity Catalog volumes?.
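Because a managed storage location can be defined at more than one level, it helps to see how the effective location for a managed table or volume is chosen. The following is a minimal sketch, assuming the most specific level wins (schema, then catalog, then metastore). The function name and the example bucket paths are hypothetical and are not part of any Databricks API.

```python
from typing import Optional


def resolve_managed_location(
    schema_location: Optional[str],
    catalog_location: Optional[str],
    metastore_location: Optional[str],
) -> str:
    """Return the first managed storage location that is defined.

    Assumption: levels are checked from most specific (schema) to
    least specific (metastore). All names here are illustrative.
    """
    for location in (schema_location, catalog_location, metastore_location):
        if location:
            return location
    raise ValueError("No managed storage location is defined at any level")


# Hypothetical example: no schema-level location, so the catalog level applies.
effective = resolve_managed_location(
    None,
    "s3://example-bucket/catalog-root",
    "s3://example-bucket/metastore-root",
)
print(effective)
```

In this sketch, a location set on the schema would take precedence over the catalog and metastore values.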
Warning
Do not give end users storage-level access to Unity Catalog managed tables or volumes. This compromises data security and governance.
Avoid granting users direct access to Amazon S3 or Cloudflare R2 buckets that are used as Unity Catalog managed storage. The only identity that should have access to data managed by Unity Catalog is the identity used by Unity Catalog. Granting direct access creates the following issues in your environment:
Access controls established in Unity Catalog can be circumvented by users who have direct access to S3 or R2 buckets.
Auditing, lineage, and other Unity Catalog monitoring features will not capture direct access.
The lifecycle of data is broken. That is, modifying, deleting, or evolving tables in Databricks will break the consumers that have direct access to storage, and writes outside of Databricks could result in data corruption.
Which cloud storage providers are supported?
Databricks on AWS supports both AWS S3 and Cloudflare R2 buckets as cloud storage locations for data assets registered in Unity Catalog. R2 is intended primarily for use cases in which you want to avoid data egress fees, such as Delta Sharing across clouds and regions. For more information, see Use Cloudflare R2 replicas or migrate storage to R2.
How does Unity Catalog govern access to cloud storage?
To manage access to the underlying cloud storage that holds tables and volumes, Unity Catalog uses a securable object called an external location, which defines a path to a cloud storage location and the credentials required to access that location. Those credentials are, in turn, defined in a Unity Catalog securable object called a storage credential. By granting and revoking access to external location securables in Unity Catalog, you control access to the data in the cloud storage location. By granting and revoking access to storage credential securables in Unity Catalog, you control the ability to create external location objects.
For details, see Manage access to cloud storage using Unity Catalog.
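The relationship between storage credentials, external locations, and grants can be illustrated with the SQL statements involved. The following is a minimal sketch that only builds statement text and does not connect to Databricks. The helper functions, the bucket URL, and the names raw_landing, example_cred, data-engineers, and platform-admins are all hypothetical. Check the exact statement syntax and privilege names against the SQL reference before running anything.

```python
def quote_ident(name: str) -> str:
    """Backtick-quote an identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def create_external_location_sql(location: str, url: str, credential: str) -> str:
    """Build a statement that binds a storage path to a storage credential."""
    if "'" in url:
        raise ValueError("URL must not contain a single quote")
    return (
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS {quote_ident(location)} "
        f"URL '{url}' "
        f"WITH (STORAGE CREDENTIAL {quote_ident(credential)})"
    )


def grant_sql(privilege: str, securable_type: str, securable: str, principal: str) -> str:
    """Build a GRANT statement for a Unity Catalog securable object."""
    return (
        f"GRANT {privilege} ON {securable_type} {quote_ident(securable)} "
        f"TO {quote_ident(principal)}"
    )


# Hypothetical names throughout; nothing here is executed against a workspace.
statements = [
    # Define the external location on top of an existing storage credential.
    create_external_location_sql(
        "raw_landing", "s3://example-bucket/raw", "example_cred"
    ),
    # Control access to the data under the external location's path.
    grant_sql("READ FILES", "EXTERNAL LOCATION", "raw_landing", "data-engineers"),
    # Control who can create external locations with this credential.
    grant_sql(
        "CREATE EXTERNAL LOCATION", "STORAGE CREDENTIAL", "example_cred", "platform-admins"
    ),
]

for statement in statements:
    print(statement)
```

The two grants mirror the two kinds of control described above: privileges on the external location govern access to the data, and privileges on the storage credential govern who can create external locations.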
Path-based access to cloud storage
Although Unity Catalog supports path-based access to external tables and external volumes using cloud storage URIs, Databricks recommends that users read and write all Unity Catalog tables using table names and access data in volumes using /Volumes paths. Volumes are the securable object that most Databricks users should use to interact directly with non-tabular data in cloud object storage. See What are Unity Catalog volumes?.
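As an illustration of the /Volumes convention, the following minimal sketch builds a volume path of the form /Volumes/<catalog>/<schema>/<volume>/<path>. The helper function and the names main, raw, and landing are hypothetical. The sketch only constructs the path string and does not read or write any data.

```python
from pathlib import PurePosixPath


def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build a /Volumes path from a three-level volume name and a relative path."""
    for name in (catalog, schema, volume):
        if not name or "/" in name:
            raise ValueError(f"Invalid name component: {name!r}")
    for part in parts:
        # An absolute part would discard the /Volumes prefix when joined.
        if part.startswith("/"):
            raise ValueError(f"Path parts must be relative: {part!r}")
    return str(PurePosixPath("/Volumes", catalog, schema, volume, *parts))


# Hypothetical catalog, schema, and volume names.
path = volume_path("main", "raw", "landing", "2024", "events.json")
print(path)
```

Code that uses a path like this refers to the governed volume rather than to a cloud storage URI, so access goes through the privileges granted on the volume.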
How does Unity Catalog govern access to other cloud services?
Unity Catalog governs access to non-storage services using a securable object called a service credential. A service credential encapsulates a long-term cloud credential that provides access to an external service that users need to connect to from Databricks.
Service credentials are not intended for governing access to cloud storage that is used as a Unity Catalog managed storage location or external storage location. For those use cases, use a storage credential, as described in How does Unity Catalog govern access to cloud storage?.
Note
Service credentials are the Unity Catalog alternative to instance profiles, with the advantage that access is not tied to a specific compute resource but instead to users, groups, or service principals.
For details, see: