Using AWS Glue Data Catalog as the Metastore for Databricks Runtime

You can configure Databricks Runtime to use the AWS Glue Data Catalog as its metastore. This can serve as a drop-in replacement for a Hive metastore, with some limitations. It also enables multiple Databricks workspaces to share the same metastore.

This topic explains how to set up IAM roles and use them in Databricks to securely access a Glue Data Catalog.

Each AWS account owns a single catalog in an AWS region whose catalog ID is the same as the AWS account ID.

Note

  • Using the Glue Catalog as the metastore can potentially enable a shared metastore across AWS services, applications, or AWS accounts.
  • If you created tables using Amazon Athena or Amazon Redshift Spectrum before August 14, 2017, databases and tables are stored in an Athena-managed catalog, which is separate from the AWS Glue Data Catalog. To integrate Databricks Runtime with these tables, you must upgrade to the AWS Glue Data Catalog. For more information, see Upgrading to the AWS Glue Data Catalog in the Amazon Athena User Guide.

Requirements

  • Databricks Runtime 5.4 and above.
  • AWS administrator access to IAM roles and policies in the AWS account of the Databricks deployment and the AWS account of the Glue Data Catalog.
  • Target Glue Data Catalog. This catalog must belong to the same AWS account as the Databricks deployment or there must be a cross-account access policy that allows access to this catalog from the AWS account of the Databricks deployment.

Configure Glue Data Catalog as the metastore

To enable Glue Catalog integration, set the Spark configuration spark.databricks.hive.metastore.glueCatalog.enabled to true. This configuration is disabled by default; that is, the default is to use the Databricks-hosted Hive metastore, or another external metastore if one is configured.

For interactive or job clusters, set this configuration in the cluster's Spark configuration before cluster startup.

Important

This configuration option cannot be modified in a running cluster.

When running spark-submit jobs, set this configuration either in the spark-submit parameters, using --conf spark.databricks.hive.metastore.glueCatalog.enabled=true, or in code before creating the SparkSession or SparkContext. For example:

from pyspark.sql import SparkSession

# Set the Glue configuration here directly instead of using the --conf option in spark-submit
spark = (SparkSession.builder
  .appName("ExamplePySparkSubmitTask")
  .config("spark.databricks.hive.metastore.glueCatalog.enabled", "true")
  .enableHiveSupport()
  .getOrCreate())
print(spark.sparkContext.getConf().get("spark.databricks.hive.metastore.glueCatalog.enabled"))
spark.sql("show databases").show()
spark.stop()

Configure Glue Catalog access by completing the steps that apply to your deployment, depending on whether the target Glue Data Catalog is in the same AWS account and region as the Databricks deployment:

  • Same AWS account and region: Steps 1, 3-5
  • Cross-account: Steps 1-6
  • Cross-region: Steps 1, 3-6

Step 1: Create an IAM role and policy to access a Glue Data Catalog

  1. In the AWS console, go to the IAM service.

  2. Click the Roles tab in the sidebar.

  3. Click Create role.

    1. Under Select type of trusted entity, select AWS service.

    2. Click the EC2 service.

    3. Under Select your use case, click EC2.

      ../../_images/select-service.png
    4. Click Next: Permissions and click Next: Review.

    5. In the Role name field, type a role name.

    6. Click Create role. The list of roles displays.

  4. In the role list, click the role.

  5. Add an inline policy to the Glue Catalog.

    1. In the Permissions tab, click Inline policy.

    2. Click the JSON tab.

    3. Copy and paste this policy into the tab.

      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Sid": "GrantCatalogAccessToGlue",
            "Effect": "Allow",
            "Action": [
              "glue:BatchCreatePartition",
              "glue:BatchDeletePartition",
              "glue:BatchGetPartition",
              "glue:CreateDatabase",
              "glue:CreateTable",
              "glue:CreateUserDefinedFunction",
              "glue:DeleteDatabase",
              "glue:DeletePartition",
              "glue:DeleteTable",
              "glue:DeleteUserDefinedFunction",
              "glue:GetDatabase",
              "glue:GetDatabases",
              "glue:GetPartition",
              "glue:GetPartitions",
              "glue:GetTable",
              "glue:GetTables",
              "glue:GetUserDefinedFunction",
              "glue:GetUserDefinedFunctions",
              "glue:UpdateDatabase",
              "glue:UpdatePartition",
              "glue:UpdateTable",
              "glue:UpdateUserDefinedFunction"
            ],
            "Resource": [
              "*"
            ]
          }
        ]
      }
      

For fine-grained configuration of allowed resources (catalog, database, table, userDefinedFunction), refer to Specifying AWS Glue Resource ARNs.

If the list of allowed actions in the policy above is insufficient, contact Databricks Support with the error information. The simplest workaround is to use a policy that gives full access to Glue:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GrantFullAccessToGlue",
      "Effect": "Allow",
      "Action": [
        "glue:*"
      ],
      "Resource": "*"
    }
  ]
}

Step 2: Create a policy for the target Glue Catalog

Skip this step if the target Glue Catalog is in the same AWS account as the one used for the Databricks deployment.

Log in to the AWS account of the target Glue Catalog and go to the Glue Console. In Settings, paste the following policy into the Permissions box, replacing <aws-account-id-databricks>, <iam-role-for-glue-access> (the role from Step 1), <aws-region-target-glue-catalog>, and <aws-account-id-target-glue-catalog> with your values.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Example permission",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-glue-access>"
      },
      "Action": [
        "glue:BatchCreatePartition",
        "glue:BatchDeletePartition",
        "glue:BatchGetPartition",
        "glue:CreateDatabase",
        "glue:CreateTable",
        "glue:CreateUserDefinedFunction",
        "glue:DeleteDatabase",
        "glue:DeletePartition",
        "glue:DeleteTable",
        "glue:DeleteUserDefinedFunction",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetUserDefinedFunction",
        "glue:GetUserDefinedFunctions",
        "glue:UpdateDatabase",
        "glue:UpdatePartition",
        "glue:UpdateTable",
        "glue:UpdateUserDefinedFunction"
      ],
      "Resource": "arn:aws:glue:<aws-region-target-glue-catalog>:<aws-account-id-target-glue-catalog>:*"
    }
  ]
}

Step 3: Look up the IAM role used to create the Databricks deployment

This IAM role is the role you used when setting up the Databricks account.

  1. As the account owner, log in to the Account Console.

  2. Click the AWS Account tab.

  3. Note the role name at the end of the Role ARN, here testco-role.

    ../../_images/iam-role.png

Step 4: Add the Glue Catalog IAM role to the EC2 policy

  1. In the AWS console, go to the IAM service.

  2. Click the Roles tab in the sidebar.

  3. Click the role you noted in Step 3.

  4. On the Permissions tab, click the policy.

  5. Click Edit Policy.

  6. Modify the policy to allow Databricks to pass the IAM role you created in Step 1 to the EC2 instances for the Spark clusters. Here is an example of what the new policy should look like. Replace <iam-role-for-glue-access> with the role you created in Step 1.

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "Stmt1403287045000",
          "Effect": "Allow",
          "Action": [
            "ec2:AssociateDhcpOptions",
            "ec2:AssociateIamInstanceProfile",
            "ec2:AssociateRouteTable",
            "ec2:AttachInternetGateway",
            "ec2:AttachVolume",
            "ec2:AuthorizeSecurityGroupEgress",
            "ec2:AuthorizeSecurityGroupIngress",
            "ec2:CancelSpotInstanceRequests",
            "ec2:CreateDhcpOptions",
            "ec2:CreateInternetGateway",
            "ec2:CreateKeyPair",
            "ec2:CreatePlacementGroup",
            "ec2:CreateRoute",
            "ec2:CreateSecurityGroup",
            "ec2:CreateSubnet",
            "ec2:CreateTags",
            "ec2:CreateVolume",
            "ec2:CreateVpc",
            "ec2:CreateVpcPeeringConnection",
            "ec2:DeleteInternetGateway",
            "ec2:DeleteKeyPair",
            "ec2:DeletePlacementGroup",
            "ec2:DeleteRoute",
            "ec2:DeleteRouteTable",
            "ec2:DeleteSecurityGroup",
            "ec2:DeleteSubnet",
            "ec2:DeleteTags",
            "ec2:DeleteVolume",
            "ec2:DeleteVpc",
            "ec2:DescribeAvailabilityZones",
            "ec2:DescribeIamInstanceProfileAssociations",
            "ec2:DescribeInstanceStatus",
            "ec2:DescribeInstances",
            "ec2:DescribePlacementGroups",
            "ec2:DescribePrefixLists",
            "ec2:DescribeReservedInstancesOfferings",
            "ec2:DescribeRouteTables",
            "ec2:DescribeSecurityGroups",
            "ec2:DescribeSpotInstanceRequests",
            "ec2:DescribeSpotPriceHistory",
            "ec2:DescribeSubnets",
            "ec2:DescribeVolumes",
            "ec2:DescribeVpcs",
            "ec2:DetachInternetGateway",
            "ec2:DisassociateIamInstanceProfile",
            "ec2:ModifyVpcAttribute",
            "ec2:ReplaceIamInstanceProfileAssociation",
            "ec2:RequestSpotInstances",
            "ec2:RevokeSecurityGroupEgress",
            "ec2:RevokeSecurityGroupIngress",
            "ec2:RunInstances",
            "ec2:TerminateInstances"
          ],
          "Resource": [
            "*"
          ]
        },
        {
          "Effect": "Allow",
          "Action": "iam:PassRole",
          "Resource": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-glue-access>"
        }
      ]
    }
    
  7. Click Review policy.

  8. Click Save changes.

Step 5: Add the Glue Catalog IAM role to a Databricks workspace

  1. Go to the Admin Console.

  2. Click the IAM Roles tab.

  3. Click the Add IAM Role button. A dialog displays.

  4. Paste in the Instance Profile ARN from Step 1.

    ../../_images/add-instanceprofile-arn.png

    Databricks validates that this Instance Profile ARN is both syntactically and semantically correct. To validate semantic correctness, Databricks does a dry run by launching a cluster with this IAM role. Any failure in this dry run produces a validation error in the UI.

    Note

    IAM role validation can fail if the role has a tag-enforcement policy attached, preventing you from adding a legitimate IAM role. If validation fails and you still want to add the role to Databricks, use the Instance Profiles API and specify skip_validation (see the sketch at the end of this step).

  5. Click Add.

  6. Optionally specify the users who can launch clusters with the IAM role.

    ../../_images/add-permissions.png
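
If validation fails because of a tag-enforcement policy (see the note above), you can register the instance profile through the Instance Profiles API with validation skipped. The following is a minimal sketch using Python and the requests library; <databricks-instance>, <personal-access-token>, and the instance profile ARN are placeholders for your workspace URL, a personal access token, and the ARN from Step 1.

import requests

# Minimal sketch: register an instance profile and skip Databricks validation.
resp = requests.post(
    "https://<databricks-instance>/api/2.0/instance-profiles/add",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={
        "instance_profile_arn": "arn:aws:iam::<aws-account-id-databricks>:instance-profile/<iam-role-for-glue-access>",
        "skip_validation": True,
    },
)
resp.raise_for_status()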

Step 6: Launch a cluster with the Glue Catalog IAM role

  1. Create a cluster.

  2. In the Databricks Runtime Version drop-down, select Databricks Runtime 5.4 or above.

  3. Click the Instances tab on the cluster creation page.

  4. In the IAM Role drop-down list, select the IAM role.

    ../../_images/create-cluster-with-iamrole.png
  5. Verify that you can access the Glue Catalog, using the following command in a notebook:

    %sql show databases;
    

If the command succeeds, this Databricks Runtime cluster is configured to use Glue.

If the AWS account of the Databricks deployment and the AWS account of the Glue Data Catalog are different, extra cross-account setup is needed.

  1. Set spark.hadoop.hive.metastore.glue.catalogid <aws-account-id-for-glue-catalog> in the Spark configuration.
  2. If the target Glue Catalog is in a different region than the Databricks deployment, also set spark.hadoop.aws.region <aws-region-for-glue-catalog>, as sketched below.
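
A minimal sketch that combines these settings in code, for example in a spark-submit job as in the earlier example; the account ID and region placeholders must be replaced with the values for your target Glue Catalog:

from pyspark.sql import SparkSession

# Minimal sketch for a cross-account (and optionally cross-region) Glue Catalog.
# Replace the placeholders with the target catalog's AWS account ID and region.
spark = (SparkSession.builder
  .appName("CrossAccountGlueExample")
  .config("spark.databricks.hive.metastore.glueCatalog.enabled", "true")
  .config("spark.hadoop.hive.metastore.glue.catalogid", "<aws-account-id-for-glue-catalog>")
  .config("spark.hadoop.aws.region", "<aws-region-for-glue-catalog>")
  .enableHiveSupport()
  .getOrCreate())

spark.sql("show databases").show()

For an interactive or job cluster, set the same key-value pairs in the cluster's Spark configuration instead.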

Limitations

Databricks limitations

  • The default database is created with a location set to a URI that uses the dbfs: (Databricks File System) scheme. This location is not accessible from AWS applications outside Databricks, such as Amazon EMR or Amazon Athena. As a workaround, use the LOCATION clause to specify a bucket location, such as s3://mybucket, when you call CREATE TABLE. Alternatively, create tables in a database other than the default database and set the LOCATION of that database to an S3 location.
  • You cannot dynamically switch between the Glue Catalog and a Hive metastore. You must restart the cluster for new Spark configurations to take effect.
  • The following features are not supported:

Troubleshooting

No IAM Instance Profile attached to the Databricks Runtime cluster

Running any operation that requires metastore lookup will result in an exception message like this:

org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [EnvironmentVariableCredentialsProvider: Unable to load AWS credentials from environment variables (AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY)), SystemPropertiesCredentialsProvider: Unable to load AWS credentials from Java system properties (aws.accessKeyId and aws.secretKey), com.amazonaws.auth.profile.ProfileCredentialsProvider@2245a35d: profile file cannot be null, com.amazonaws.auth.EC2ContainerCredentialsProviderWrapper@52be6b57: The requested metadata is not found at https://169.254.169.254/latest/meta-data/iam/security-credentials/];

Solution: Attach an IAM instance profile that has sufficient permissions to access the desired Glue Catalog.
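
As a quick check from a notebook, you can query the EC2 instance metadata endpoint referenced in the error message to see whether an instance profile is attached to the driver node. This is a minimal sketch and assumes the classic (IMDSv1) metadata endpoint is reachable; instances that enforce IMDSv2 will reject this request.

import requests

# List the IAM role exposed through EC2 instance metadata on the driver node.
# An empty body or an error response indicates that no instance profile is attached.
r = requests.get(
    "http://169.254.169.254/latest/meta-data/iam/security-credentials/",
    timeout=2,
)
print(r.status_code, r.text)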

Insufficient Glue Catalog Permissions

When the Glue Catalog permissions provided by the attached IAM instance profile are insufficient to perform certain metastore operations, running those operations results in an exception message that looks like:

org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: MetaException(message:Unable to verify existence of default database: com.amazonaws.services.glue.model.AccessDeniedException: User: arn:aws:sts::<aws-account-id>:assumed-role/<role-id>/... is not authorized to perform: glue:GetDatabase on resource: arn:aws:glue:<aws-region-for-glue-catalog>:<aws-account-id-for-glue-catalog>:catalog (Service: AWSGlue; Status Code: 400; Error Code: AccessDeniedException; Request ID: <request-id>));

Operations covered by the allowed actions continue to succeed.

Solution: Check that the attached IAM instance profile specifies sufficient permissions.
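
To narrow down which permission is missing, you can call Glue directly from a notebook with boto3, which resolves credentials through the same attached instance profile. A minimal sketch; the region is a placeholder, and the CatalogId argument is only needed for a cross-account catalog.

import boto3

# Call Glue with the credentials of the instance profile attached to the cluster.
glue = boto3.client("glue", region_name="<aws-region-for-glue-catalog>")

# Fails with AccessDeniedException if glue:GetDatabases is not allowed.
# For a cross-account catalog, pass CatalogId="<aws-account-id-for-glue-catalog>".
resp = glue.get_databases()
print([db["Name"] for db in resp["DatabaseList"]])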

Mismatching Glue Catalog ID

By default, a Databricks cluster tries to connect to the Glue Catalog in the same AWS account as the one used for the Databricks deployment.

If the desired target Glue Catalog is in a different AWS account or region from the Databricks deployment, and the spark.hadoop.hive.metastore.glue.catalogid Spark configuration is not set, the cluster will connect to the Glue Catalog in the AWS account of the Databricks deployment instead of the desired one, which may cause confusion.

If the spark.hadoop.hive.metastore.glue.catalogid configuration is set but the cross-account policy in Step 2 was not configured properly, all access to the metastore results in an exception message that looks like:

org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: MetaException(message:Unable to verify existence of default database: com.amazonaws.services.glue.model.AccessDeniedException: User:
arn:aws:sts::<aws-account-id>:assumed-role/<role-id>/... is not authorized to perform: glue:GetDatabase on resource: arn:aws:glue:<aws-region-for-glue-catalog>:<aws-account-id-for-glue-catalog>:catalog (Service: AWSGlue; Status Code: 400; Error Code: AccessDeniedException; Request ID: <request-id>));

Solution: Check that the configurations are aligned as described in Steps 2 and 6 above.
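
To confirm which catalog the cluster is actually pointed at, print the relevant Spark configurations from a notebook. A minimal sketch using the spark session that Databricks provides:

# Print the effective Glue-related Spark configurations on the running cluster.
for key in (
    "spark.databricks.hive.metastore.glueCatalog.enabled",
    "spark.hadoop.hive.metastore.glue.catalogid",
    "spark.hadoop.aws.region",
):
    print(key, "=", spark.conf.get(key, "<not set>"))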

Athena Catalog conflicts with Glue Catalog

When an AWS account has both a Glue Catalog and an Athena Catalog active, and the latter has not yet been upgraded to use the Glue Catalog, they may conflict with each other. For example, if a database named “default” already exists in the Athena Catalog, a database with the same name cannot be created or accessed in the Glue Catalog. This can cause Databricks Runtime to fail to connect to the Glue Catalog, or to fail to create or access some databases, often with a cryptic exception message.

For example, if the “default” database exists in the Athena Catalog but not in the Glue Catalog, an exception occurs with a message like:

AWSCatalogMetastoreClient: Unable to verify existence of default database:
com.amazonaws.services.glue.model.AccessDeniedException: Please migrate your Catalog to enable access to this database (Service: AWSGlue; Status Code: 400; Error Code: AccessDeniedException; Request ID: <request-id>)

Solution: Follow Upgrading to the AWS Glue Data Catalog in the Amazon Athena User Guide, and migrate Athena to use the Glue Catalog instead.

Creating a table in a database with empty LOCATION

Databases in the Glue Catalog can be created from various sources. Databases created by Databricks Runtime have a non-empty LOCATION field by default, but databases created directly in the Glue Console or imported from other sources can have an empty LOCATION field.

When you try to create a table in a database with an empty LOCATION field, an exception occurs with a message like:

IllegalArgumentException: Can not create a Path from an empty string

Solution: Create the database in the Glue Catalog with a valid non-empty path in the LOCATION field, specify the LOCATION when creating the table in SQL, or specify option("path", <some_valid_path>) in the DataFrame API.

When you create a database in the AWS Glue Console, only the name is required; both the “Description” and “Location” are marked as optional. However, the Hive metastore operations depend on “Location”, so you must specify it.
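
A minimal sketch of these workarounds in a notebook; the bucket, database, and table names below are placeholders:

# Create the database with an explicit S3 location so that tables can be created in it.
spark.sql("CREATE DATABASE IF NOT EXISTS my_db LOCATION 's3://my-bucket/my_db/'")

# Or specify the table path explicitly when writing with the DataFrame API.
(spark.range(10)
  .write
  .option("path", "s3://my-bucket/my_db/my_table")
  .saveAsTable("my_db.my_table"))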

Accessing tables and views created in other systems

Accessing tables and views created by other systems, such as Amazon Athena or Presto, may or may not work in Databricks Runtime or Spark and is not supported.

While such access may sometimes work, for example when the table is Hive-compatible, it can also fail with cryptic error messages. For example, accessing a view created by Athena from Databricks Runtime or Spark may throw an exception like:

IllegalArgumentException: Can not create a Path from an empty string

That is because Athena and Presto store view metadata in a different format than what Databricks Runtime and Spark expect.