Enable Hive metastore federation for AWS Glue metastores
Preview
This feature is in Public Preview.
This article shows how to federate an AWS Glue Hive metastore so that your organization can work with your Hive metastore tables using Unity Catalog.
For an overview of Hive metastore federation, see Hive metastore federation: enable Unity Catalog to govern tables registered in a Hive metastore.
Before you begin
Review the list of services and features supported by Hive metastore federation: Requirements, supported features, and limitations.
Specific requirements are listed for each step below.
Step 1: Connect Unity Catalog to your AWS Glue Hive metastore
To connect your AWS Glue Hive metastore to Databricks, you must:
Create an IAM role with a trust policy that gives Databricks temporary access to AWS Glue resources.
Create a service credential that registers the IAM role in Unity Catalog. A service credential is a Unity Catalog securable object that encapsulates a long-term cloud credential.
Create a Unity Catalog connection to your AWS Glue catalog from Databricks. A connection is a Unity Catalog securable object that specifies a path and credentials for accessing a database system.
Requirements
You must have the following:
The ability to create an IAM role that grants access to your AWS Glue catalog.
The region and account ID of the AWS account that contains your AWS Glue catalog.
The CREATE SERVICE CREDENTIAL and CREATE CONNECTION privileges on the Unity Catalog metastore. Metastore admins have these privileges by default.
Create the IAM role and service credential
Use the instructions in Step 1: Create an IAM role to create an IAM role that gives Databricks temporary access to your AWS Glue resources. Because this IAM role must be self-assuming, the process requires three steps, all of which are covered in those instructions.
Use the sample IAM policy for AWS Glue that is included in those instructions. The self-assuming role requirement is also explained there.
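As a rough, illustrative sketch, a minimal read-focused Glue policy might look like the following. This is not the authoritative policy: the sample policy in the linked instructions defines the exact actions (including any write actions needed if you modify tables through the federated catalog), the resource scoping, and the self-assuming trust policy, and it takes precedence over this example.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GlueReadOnlyForHmsFederationExample",
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:BatchGetPartition",
        "glue:GetUserDefinedFunctions"
      ],
      "Resource": "*"
    }
  ]
}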
Create the connection
To create a connection, you can use Catalog Explorer or the CREATE CONNECTION
SQL command in a Databricks notebook or the Databricks SQL query editor.
Note
You can also use the Databricks REST API or the Databricks CLI to create a connection. See POST /api/2.1/unity-catalog/connections and Unity Catalog commands.
In your Databricks workspace, click Catalog.
On the Quick access page, click Add data > Add a connection.
Enter a user-friendly Connection name.
Select a Connection type of Hive Metastore and Metastore type of AWS Glue.
Enter the following Connection details for your host database.
AWS Region: The region of the AWS account that holds the AWS Glue catalog.
AWS Account ID: The Account ID of the AWS account.
Credential: Select the service credential that you created in the previous step.
(Optional) Add a comment.
(Optional) Expand Advanced options and enter the fully qualified path to the Hive Warehouse directory.
You need this only if you don’t want to use the default location.
Click Create.
Run the following command in a notebook or the Databricks SQL query editor. Include warehouse_directory only if you don’t want to use the default Hive warehouse directory location for your Hive metastore.
CREATE CONNECTION <connection-name> TYPE glue
OPTIONS (
  aws_region '<aws-region-ID>',
  aws_account_id '<aws-account-ID>',
  credential '<service-credential-ID>',
  warehouse_directory '<path-to-warehouse-directory>'
);
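If you prefer the REST API or CLI mentioned in the note above, the request looks roughly like the following. This is a sketch, not the authoritative payload: the option keys mirror the SQL OPTIONS above, but confirm the exact payload shape and the connection_type value against the POST /api/2.1/unity-catalog/connections reference before using it.
# Sketch only: verify field names and the connection_type value in the REST API reference.
curl -X POST -H 'Authorization: Bearer <token>' \
  https://<workspace-URL>/api/2.1/unity-catalog/connections \
  --data '{
    "name": "<connection-name>",
    "connection_type": "GLUE",
    "options": {
      "aws_region": "<aws-region-ID>",
      "aws_account_id": "<aws-account-ID>",
      "credential": "<service-credential-ID>"
    }
  }'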
Step 2: Create external locations for data in your Hive metastore
In this step, you configure an external location in Unity Catalog to govern access to the cloud storage locations that hold the data registered in your AWS Glue Hive metastore.
External locations are Unity Catalog securable objects that associate storage credentials with cloud storage container paths. See External locations and storage credentials.
Requirements
For cloud storage and Databricks permission requirements, see “Before you begin” in Create an external location to connect cloud storage to Databricks.
Options for creating the external location
The process that Databricks recommends for creating an external location in Unity Catalog depends on your situation:
If you do not have a storage credential created in Unity Catalog, you can create the external location using an AWS CloudFormation template that creates an IAM role and storage credential for you. Follow the instructions in Create an external location for an S3 bucket using an AWS CloudFormation template.
If you already have a storage credential created in Unity Catalog, create the external location manually. Follow the instructions in Create an external location manually using Catalog Explorer or Create an external location using SQL.
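If you already have a storage credential and prefer SQL, the statement looks like the following sketch; the location name, S3 path, and credential name are placeholders for your own values.
-- Sketch only: replace the location name, S3 path, and storage credential name with your own values.
CREATE EXTERNAL LOCATION IF NOT EXISTS glue_hms_data
URL 's3://my-hms-bucket/warehouse'
WITH (STORAGE CREDENTIAL my_storage_credential)
COMMENT 'Data for tables registered in the AWS Glue Hive metastore';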
Enable fallback mode on external locations
As soon as you create an external location in Unity Catalog, access to the path represented by that external location is enforced by Unity Catalog permissions when you run queries on Unity Catalog-enabled compute (single user, shared, or SQL warehouse). This can interrupt existing workloads that don’t have the correct Unity Catalog permissions to access the path.
When an external location is in fallback mode, the system first checks the querying principal’s Unity Catalog permissions on the location, and if that doesn’t succeed, falls back to using existing cluster- or notebook-scoped credentials, such as instance profiles or Apache Spark configuration properties, so that your existing workloads continue to run uninterrupted.
Fallback mode is convenient when you are in the process of migrating your legacy workload. Once you’ve updated your workloads to run successfully using Unity Catalog permissions, you should disable fallback mode to prevent legacy cluster-scoped credentials from being used to bypass Unity Catalog data governance.
You can enable fallback mode using Catalog Explorer or the Unity Catalog external locations REST API.
Permissions required: Owner of the external location.
In your Databricks workspace, click Catalog.
On the Quick access page, click External data >.
Select the external location you want to update.
Turn on the Fallback mode toggle and click Enable to confirm.
The following curl examples show how to enable fallback mode when you create an external location and when you update an existing external location.
Creating a new external location:
curl -X POST -H 'Authorization: Bearer <token>' \
  https://<workspace-URL>/api/2.1/unity-catalog/external-locations \
  --data '{
    "name": "fallback_mode_enabled_external_location",
    "url": "s3://external_location_bucket/url",
    "credential_name": "external_location_credential",
    "fallback": true,
    "skip_validation": true
  }'
Updating an external location:
curl -X PATCH \
  -H 'Authorization: Bearer <token>' \
  -H 'Content-Type: application/json' \
  https://<workspace-URL>/api/2.1/unity-catalog/external-locations/<external-location-name> \
  --data '{
    "comment": "fallback mode enabled",
    "fallback": true
  }'
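Disabling fallback mode later, once your workloads run successfully with Unity Catalog permissions, uses the same PATCH endpoint with the flag set to false, for example:
curl -X PATCH \
  -H 'Authorization: Bearer <token>' \
  -H 'Content-Type: application/json' \
  https://<workspace-URL>/api/2.1/unity-catalog/external-locations/<external-location-name> \
  --data '{"fallback": false}'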
Step 3: Create a federated catalog
In this step, you use the connection that you created in step 1 to create a federated catalog in Unity Catalog that points to the external location you created in step 2. A federated catalog is a type of foreign catalog, which is a securable object in Unity Catalog that mirrors a database or catalog in an external data system, enabling you to perform queries on that data in your Databricks workspace, with access managed by Unity Catalog. In this case, the mirrored catalog is your data registered in a Hive metastore.
Any time a user or workflow interacts with the federated catalog, metadata is synced from the Hive metastore.
Requirements
Permission requirements:
To create the foreign catalog:
The CREATE CATALOG privilege on your Unity Catalog metastore.
Either ownership of the connection or the CREATE FOREIGN CATALOG privilege on the connection (a sample grant appears after the requirements list below).
To enter authorized paths for the foreign catalog, the CREATE FOREIGN SECURABLE privilege on an external location that covers those paths. The owner of the external location has this privilege by default.
To work with the foreign catalog:
Ownership of the catalog or the USE CATALOG privilege on the catalog.
Compute requirements:
To create the catalog using Catalog Explorer: no compute required.
To create the catalog using SQL: Databricks Runtime 13.3 LTS or above.
To work with the catalog: a shared cluster on Databricks Runtime 13.3 LTS, 14.3 LTS, 15.1 or above.
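As an example of the connection-level requirement above, a connection owner or metastore admin could grant the CREATE FOREIGN CATALOG privilege using SQL along these lines; the group name shown is a placeholder.
-- Sketch only: <connection-name> is the connection from Step 1; the group name is a placeholder.
GRANT CREATE FOREIGN CATALOG ON CONNECTION <connection-name> TO `data-engineers`;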
Create the foreign catalog
To create a foreign catalog, you can use Catalog Explorer or the CREATE FOREIGN CATALOG
SQL command in a Databricks notebook or the SQL query editor.
Note
You can also use the Unity Catalog API. See Create a catalog in the Databricks REST API reference.
In your Databricks workspace, click Catalog to open Catalog Explorer.
On the Quick access page, click the Add data button and select Add a catalog.
Enter a Catalog name and select a catalog Type of Foreign.
Select the Connection that you created in Step 1 from the drop-down.
In the Authorized paths field, enter paths to the cloud storage locations that you defined as external locations in Step 2. For example, s3://demo,s3://depts/finance.
Authorized paths are an added layer of security for federated catalogs only. See What are authorized paths?.
Click Create.
(Optional) Click Configure to open a wizard that walks you through granting permissions on the catalog and adding tags. You can also perform these steps later.
See Manage privileges in Unity Catalog and Apply tags to Unity Catalog securable objects.
(Optional) Bind the catalog to specific workspaces.
By default, catalogs can be accessed from any workspace attached to the Unity Catalog metastore (restricted by user privileges). If you want to allow access only from specific workspaces, go to the Workspaces tab and assign workspaces. See Limit catalog access to specific workspaces.
Populate the federated catalog with the Hive metastore metadata.
Any time a user or workflow interacts with the federated catalog, metadata is synced from the Hive metastore. The first interaction populates the catalog in Unity Catalog and makes its contents visible in the Catalog Explorer UI. You can populate the catalog by selecting and starting a supported compute resource in Catalog Explorer. You must be the catalog owner (which you are by virtue of creating the catalog) or a user with the USE CATALOG privilege.
Run the following SQL command in a notebook or the SQL query editor. Items in brackets are optional. Replace the placeholder values:
<catalog-name>: Name for the catalog in Databricks.
<connection-name>: The name of the connection object that you created in Step 1.
<path1>,<path2>: Paths to the cloud storage locations that you defined as external locations in Step 2. For example, s3://demo,s3://depts/finance. Authorized paths are an added layer of security for federated catalogs only. See What are authorized paths?.
CREATE FOREIGN CATALOG [IF NOT EXISTS] <catalog-name> USING CONNECTION <connection-name> OPTIONS (authorized_paths '<path1>,<path2>');
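Because any interaction populates the catalog, a simple query against a table registered in your Hive metastore both triggers the metadata sync and confirms that federation is working. The catalog, schema, and table names below are placeholders.
-- Sketch only: replace the three-level name with a table registered in your Glue Hive metastore.
SELECT * FROM <catalog-name>.<schema-name>.<table-name> LIMIT 10;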
See also Manage and work with foreign catalogs.