Connect to Azure Data Lake Storage

note

Only SAP Databricks accounts deployed on Azure can connect to Azure Data Lake Storage (ADLS).

If your ADLS storage is firewalled, contact your Databricks account team for support on how to allowlist Databricks on those firewalls.

This article describes how to create a storage credential and external location to connect to Azure Data Lake Storage (ADLS).

A storage credential contains a long-term cloud credential that provides access to cloud storage. You reference storage credentials, along with the cloud storage path, when you create external locations in Unity Catalog to govern access to external storage.

Requirements

  • In SAP Databricks, you must have the CREATE STORAGE CREDENTIAL privilege on the Unity Catalog metastore attached to the workspace. Account admins and metastore admins have this privilege by default.
  • In your Azure tenant:
    • You must have access to an Azure Data Lake Storage container. To avoid egress charges, it should be in the same region as the workspace you will access the data from. The storage account must have a hierarchical namespace.
    • You must have the Contributor or Owner role on an Azure resource group.
    • You must be an Owner, or have the User Access Administrator Azure RBAC role, on the storage account.
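
If you are unsure which roles you hold, you can check your assignments with the Azure CLI. This is a minimal sketch; the subscription, resource group, storage account, and user values are placeholders you replace with your own.

Bash
# List your Azure RBAC role assignments on the storage account.
az role assignment list \
  --assignee "<your-user-principal-name>" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>" \
  --output table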

Create a storage credential that accesses Azure Data Lake Storage

You use a service principal to authorize access to your storage account: first create the service principal, then assign it permissions on the storage account, and finally create the storage credential.

Create a service principal

Registering a Microsoft Entra ID application and assigning appropriate permissions will create a service principal that can access Azure Data Lake Storage or Blob Storage resources.

To register a Microsoft Entra ID application, you must have the Application Administrator role or the Application.ReadWrite.All permission in Microsoft Entra ID.

  1. In the Azure portal, go to the Microsoft Entra ID service.
  2. Under Manage, click App Registrations.
  3. Click + New registration. Enter a name for the application and click Register.
  4. Click Certificates & Secrets.
  5. Click + New client secret.
  6. Add a description for the secret and click Add.
  7. Copy and save the value for the new secret.
  8. In the application registration overview, copy and save the Application (client) ID and Directory (tenant) ID.
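
If you prefer scripting to the portal, the same registration can be done with the Azure CLI. This is a minimal sketch; the application display name is a placeholder, and the same Microsoft Entra ID permissions apply.

Bash
# Register the application and capture its Application (client) ID.
APP_ID=$(az ad app create --display-name "<app-name>" --query appId -o tsv)

# Create the service principal for the application.
az ad sp create --id "$APP_ID"

# Create a client secret; copy and save the returned value.
az ad app credential reset --id "$APP_ID" --append --query password -o tsv

# Print the Directory (tenant) ID for the current subscription.
az account show --query tenantId -o tsv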

Assign permission to the storage account

You control access to storage resources by assigning the Microsoft Entra ID application to the storage account. To assign roles on a storage account, you must have the Owner or User Access Administrator Azure RBAC role on the storage account.

  1. In the Azure portal, go to the Storage accounts service.
  2. Select an Azure storage account to use with this application registration.
  3. Click Access Control (IAM).
  4. Click + Add and select Add role assignment from the dropdown menu.
  5. Set the Select field to the Microsoft Entra ID application name and set Role to Storage Blob Data Contributor.
  6. Click Save.
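
The same assignment can be scripted with the Azure CLI. A minimal sketch, assuming the $APP_ID variable from the previous step and placeholder resource names:

Bash
# Grant the service principal Storage Blob Data Contributor on the storage account.
az role assignment create \
  --assignee "$APP_ID" \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"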

To enable file event access on the storage account, you must have the Owner or User Access Administrator Azure RBAC role on the Azure resource group that your Azure Data Lake Storage account is in.

  1. Follow the steps above, but assign the additional roles Storage Queue Data Contributor and Storage Account Contributor.
  2. Navigate to the Azure resource group that your Azure Data Lake Storage account is in.
  3. Go to Access Control (IAM), click + Add, and select Add role assignment.
  4. Select the EventGrid EventSubscription Contributor role and click Next.
  5. Under Assign access to, select Service Principal.
  6. Click +Select Members, select your service principal, and click Review and Assign.
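
The file event role assignments can also be scripted. A sketch assuming the same placeholders and $APP_ID variable as above:

Bash
# Additional roles on the storage account for file events.
for ROLE in "Storage Queue Data Contributor" "Storage Account Contributor"; do
  az role assignment create \
    --assignee "$APP_ID" \
    --role "$ROLE" \
    --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"
done

# EventGrid EventSubscription Contributor at the resource group scope.
az role assignment create \
  --assignee "$APP_ID" \
  --role "EventGrid EventSubscription Contributor" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"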

Alternatively, you can limit access by granting only the Storage Queue Data Contributor role to the service principal and granting no roles on your resource group. In this case, SAP Databricks cannot configure file events on your behalf.

Create the storage credential

To create a storage credential using a service principal, you must be an SAP Databricks account admin. The account admin who creates the service principal storage credential can delegate ownership to another user or group to manage permissions on it.

You cannot add a service principal storage credential using Catalog Explorer. Instead, use the Storage Credentials API. For example:

Bash
curl -X POST -n \
  https://<databricks-instance>/api/2.1/unity-catalog/storage-credentials \
  -d '{
    "name": "<storage-credential-name>",
    "read_only": true,
    "azure_service_principal": {
      "directory_id": "<directory-id>",
      "application_id": "<application-id>",
      "client_secret": "<client-secret>"
    },
    "skip_validation": false
  }'

You can also create a storage credential by using Terraform. See databricks_storage_credential.
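
To confirm the credential was created, you can read it back with the same API. A minimal sketch; the credential name is whatever you chose above, and the secret value is not returned:

Bash
# Retrieve the storage credential by name.
curl -X GET -n \
  https://<databricks-instance>/api/2.1/unity-catalog/storage-credentials/<storage-credential-name>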

Create an external location

The external location is used to govern access to external storage. The following steps use Catalog Explorer; an equivalent API call is sketched after the steps.

  1. Log in to a workspace that is attached to the metastore.

  2. In the sidebar, click Catalog.

  3. On the Quick access page, click the External data > button, go to the External Locations tab, and click Create location.

  4. Enter an External location name.

  5. Select the Storage type: Azure Data Lake Storage or R2.

  6. Under URL, enter or select the path to the external location.

  7. Select the storage credential that grants access to the external location.

  8. (Optional) If you want users to have read-only access to the external location, click Advanced options and select Read only.

  9. (Optional) If the external location is intended for a Hive metastore federated catalog, click Advanced options and enable Fallback mode.

  10. (Optional) To enable the ability to subscribe to change notifications on the external location, click Advanced options and select Enable file events.

  11. Click Create.

  12. (Optional) Bind the external location to specific workspaces.

    By default, any privileged user can use the external location on any workspace attached to the metastore. If you want to allow access only from specific workspaces, go to the Workspaces tab and assign workspaces. See Bind an external location to one or more workspaces.

  13. Go to the Permissions tab to grant permission to use the external location.

    For anyone to use the external location, you must grant permissions:

    • To use the external location to add a managed storage location to a metastore, catalog, or schema, grant the CREATE MANAGED LOCATION privilege.
    • To create external tables or volumes, grant CREATE EXTERNAL TABLE or CREATE EXTERNAL VOLUME.
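
If you prefer the REST API to Catalog Explorer, the External Locations endpoint creates the same object. A minimal sketch; the abfss URL, names, and instance are placeholders, and the credential name must match the storage credential you created above:

Bash
# Create the external location, referencing the storage credential by name.
curl -X POST -n \
  https://<databricks-instance>/api/2.1/unity-catalog/external-locations \
  -d '{
    "name": "<external-location-name>",
    "url": "abfss://<container>@<storage-account>.dfs.core.windows.net/<path>",
    "credential_name": "<storage-credential-name>",
    "read_only": false
  }'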

Bind an external location to one or more workspaces

To assign an external location to specific workspaces, you can use Catalog Explorer.

Permissions required: Metastore admin, external location owner, or MANAGE on the external location.

Metastore admins can see all external locations in a metastore using Catalog Explorer, and external location owners can see all external locations that they own in a metastore, regardless of whether the external location is assigned to the current workspace. External locations that are not assigned to the workspace appear grayed out.

  1. Log in to a workspace that is linked to the metastore.

  2. In the sidebar, click Catalog.

  3. On the Quick access page, click the External data > button to go to the External Locations tab.

  4. Select the external location and go to the Workspaces tab.

  5. On the Workspaces tab, clear the All workspaces have access checkbox.

    If your external location is already bound to one or more workspaces, this checkbox is already cleared.

  6. Click Assign to workspaces and enter or find the workspaces you want to assign.

To revoke access, go to the Workspaces tab, select the workspace, and click Revoke. To allow access from all workspaces, select the All workspaces have access checkbox.
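
Workspace bindings can also be managed programmatically. A hedged sketch using the Unity Catalog workspace bindings endpoint; the external-location securable type in the path, the numeric <workspace-id>, and the binding_type value are assumptions to verify against the API reference for your release:

Bash
# Bind the external location to a specific workspace.
# Assumes the workspace bindings endpoint shape; <workspace-id> is the numeric workspace ID.
curl -X PATCH -n \
  https://<databricks-instance>/api/2.1/unity-catalog/bindings/external-location/<external-location-name> \
  -d '{
    "add": [{ "workspace_id": <workspace-id>, "binding_type": "BINDING_TYPE_READ_WRITE" }]
  }'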