Connect to cloud object storage using Unity Catalog

Databricks recommends using Unity Catalog to manage access to all data stored in cloud object storage.

Unity Catalog provides a suite of tools to configure secure connections to cloud object storage. These connections let you complete the following actions:

  • Ingest raw data into a lakehouse.

  • Create and read managed tables in secure cloud storage.

  • Register or create external tables containing tabular data.

  • Read and write unstructured data.

This article introduces the constructs used to configure and govern access to data in cloud object storage using Unity Catalog.

Warning

Do not give end users storage-level access to Unity Catalog managed tables or volumes. This compromises data security and governance.

Avoid granting users IAM roles that give direct access to AWS S3 buckets that are used as Unity Catalog managed storage. The only identity that should have access to data managed by Unity Catalog is the identity used by Unity Catalog. Ignoring this creates the following issues in your environment:

  • Access controls established in Unity Catalog can be circumvented by users who have direct access to S3.

  • Auditing, lineage, and other Unity Catalog monitoring features will not capture direct access.

  • Data lifecycle management breaks. That is, modifying, deleting, or evolving tables in Databricks can break consumers that access the storage directly.

Note

If your workspace was created before November 8, 2023, it may not be enabled for Unity Catalog. An account admin must enable Unity Catalog for your workspace. See Enable a workspace for Unity Catalog.

How does Unity Catalog connect object storage to Databricks?

Unity Catalog provides several layers of granularity to manage access to data in cloud object storage. Admins must complete the initial configuration, but they can grant privileges to other users for defining new connections and objects. See Manage privileges in Unity Catalog.

A storage credential represents an authentication and authorization mechanism for accessing data stored on your cloud tenant, using an IAM role. Each storage credential is subject to Unity Catalog access-control policies that control which users and groups can access the credential. If a user does not have access to a storage credential in Unity Catalog, the request fails and Unity Catalog does not attempt to authenticate to your cloud tenant on the user’s behalf. Permissions for storage credentials should only be granted to users that need to define external locations. See Create a storage credential.
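As a minimal sketch, granting access on an existing storage credential might look like the following Databricks SQL; the credential name `finance_cred` and the group `data_engineers` are hypothetical:

```sql
-- Allow a group to define external locations using an existing
-- storage credential (names are hypothetical).
GRANT CREATE EXTERNAL LOCATION
ON STORAGE CREDENTIAL finance_cred
TO `data_engineers`;

-- Review which principals hold privileges on the credential.
SHOW GRANTS ON STORAGE CREDENTIAL finance_cred;
```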

An external location is an object that combines a cloud storage path with a storage credential that authorizes access to the cloud storage path. Each external location is subject to Unity Catalog access-control policies that control which users and groups can access the external location. If a user does not have access to an external location in Unity Catalog, the request fails and Unity Catalog does not attempt to authenticate to your cloud tenant on the user’s behalf. Permission to create and use external locations should only be granted to users who need to create external tables, external volumes, or managed storage locations. See Create an external location.
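For illustration, creating an external location and granting scoped privileges on it might look like the following sketch; the location name, bucket URL, credential, and group are all hypothetical:

```sql
-- Pair a cloud storage path with a storage credential
-- (names and URL are hypothetical).
CREATE EXTERNAL LOCATION finance_raw
URL 's3://my-bucket/finance/raw'
WITH (STORAGE CREDENTIAL finance_cred)
COMMENT 'Raw finance data landing zone';

-- Grant only the privileges users need on the location.
GRANT READ FILES, CREATE EXTERNAL TABLE
ON EXTERNAL LOCATION finance_raw
TO `data_engineers`;
```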

A managed storage location is a location in cloud object storage associated with a metastore, catalog, or schema. Managed tables and managed volumes are created in managed storage locations. Databricks recommends configuring managed storage locations at the catalog level. You can optionally specify a managed storage location at the metastore level to provide default storage when no catalog-level storage is defined. If you need more granular isolation, you can specify managed storage locations at the schema level. See Specify a managed storage location in Unity Catalog and Unity Catalog best practices.
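The catalog-level and schema-level options above can be sketched as follows; the catalog, schema, and bucket paths are hypothetical, and the paths must already be covered by an external location you can access:

```sql
-- Create a catalog with its own managed storage location
-- (catalog name and URL are hypothetical).
CREATE CATALOG finance
MANAGED LOCATION 's3://my-bucket/finance/managed';

-- Optionally override the catalog default at the schema level
-- for more granular isolation.
CREATE SCHEMA finance.hr
MANAGED LOCATION 's3://my-bucket/finance/hr-managed';
```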

Volumes are the primary construct most Databricks users should use to interact directly with data in cloud object storage. See Create and work with volumes.
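A minimal sketch of creating a managed volume and working with its files follows; the three-level name is hypothetical, and `read_files` assumes a Databricks Runtime version that supports that function:

```sql
-- Create a managed volume inside a schema (names are hypothetical).
CREATE VOLUME finance.hr.landing;

-- Files placed in the volume are addressable under the /Volumes path.
SELECT * FROM read_files('/Volumes/finance/hr/landing/*.csv');
```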

Note

While Unity Catalog supports path-based access to external tables and external volumes using cloud storage URIs, Databricks recommends reading and writing all Unity Catalog tables using table names and accessing data in volumes using the provided /Volumes paths.
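The recommended name-based and /Volumes-based access patterns might look like this sketch, with hypothetical object names:

```sql
-- Recommended: read tables by their three-level name,
-- not by cloud storage URI.
SELECT * FROM finance.hr.employees;

-- Recommended: access file data through the /Volumes path.
SELECT * FROM read_files('/Volumes/finance/hr/landing/report.csv');
```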

Next steps

If you’re just getting started with Unity Catalog as an admin, see Set up and manage Unity Catalog.

If you’re a new user and your workspace is already enabled for Unity Catalog, see Tutorial: Create your first table and grant privileges in Unity Catalog.