Work with Unity Catalog and the legacy Hive metastore
This article explains how to use a Unity Catalog metastore with the per-workspace Hive metastore in Databricks.
If your workspace was in service before it was enabled for Unity Catalog, it likely has a Hive metastore that contains data that you want to continue to use. Databricks recommends that you migrate the tables managed by the Hive metastore to the Unity Catalog metastore, but if you choose not to, this article explains how to work with data managed by both metastores.
Query the Hive metastore in Unity Catalog
The Unity Catalog metastore is additive, meaning it can be used with the per-workspace Hive metastore in Databricks. The Hive metastore appears as a top-level catalog called hive_metastore in the three-level namespace.
For example, you can refer to a table called sales_raw in the sales schema in the legacy Hive metastore by using the following notation:
SQL:
SELECT * FROM hive_metastore.sales.sales_raw;

Python:
display(spark.table("hive_metastore.sales.sales_raw"))

R:
library(SparkR)
display(tableToDF("hive_metastore.sales.sales_raw"))

Scala:
display(spark.table("hive_metastore.sales.sales_raw"))
You can also specify the catalog and schema with a USE statement:
SQL:
USE hive_metastore.sales;
SELECT * FROM sales_raw;

Python:
spark.sql("USE hive_metastore.sales")
display(spark.table("sales_raw"))

R:
library(SparkR)
sql("USE hive_metastore.sales")
display(tableToDF("sales_raw"))

Scala:
spark.sql("USE hive_metastore.sales")
display(spark.table("sales_raw"))
Access control in Unity Catalog and the Hive metastore
If you configured table access control on the Hive metastore, Databricks continues to enforce those access controls for data in the hive_metastore catalog for clusters running in shared access mode. The Unity Catalog access model differs slightly from legacy access controls; for example, Unity Catalog has no DENY statements. The Hive metastore is a workspace-level object, so permissions defined within the hive_metastore catalog always refer to the local users and groups in the workspace. See Differences from table access control.
Differences from table access control
The access control model in Unity Catalog has the following key differences from table access control in the legacy per-workspace Hive metastore:
- Account groups: Access control policies in Unity Catalog are applied to account groups, while access control policies for the Hive metastore are applied to workspace-local groups. See Special considerations for groups.
- USE CATALOG and USE SCHEMA permissions are required on the catalog and schema for all operations on objects inside the catalog or schema: Regardless of a principal's privileges on a table, the principal must also have the USE CATALOG privilege on its parent catalog to access the schema and the USE SCHEMA privilege to access objects within the schema. With workspace-level table access controls, on the other hand, granting USAGE on the root catalog automatically grants USAGE on all databases, but USAGE on the root catalog is not required. A SQL sketch of the required grants follows this list.
- Views: In Unity Catalog, the owner of a view does not need to be an owner of the view's referenced tables and views. Having the SELECT privilege is sufficient, along with USE SCHEMA on the view's parent schema and USE CATALOG on the parent catalog. With workspace-level table access controls, a view's owner needs to be an owner of all referenced tables and views.
- No support for ALL FILES or ANONYMOUS FUNCTION permissions: In Unity Catalog, there is no concept of an ALL FILES or ANONYMOUS FUNCTION permission. These legacy permissions could be used to circumvent access control restrictions by allowing an unprivileged user to run privileged code.
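For example, the following SQL sketch shows the grants a user typically needs before querying a table in Unity Catalog. The analysts group name is a hypothetical example; the main catalog and shared_sales schema reuse names from the join example later in this article.

-- Grant access to a Unity Catalog table.
-- All three privileges are required; SELECT alone is not enough.
-- The group name `analysts` is hypothetical.
GRANT USE CATALOG ON CATALOG main TO `analysts`;
GRANT USE SCHEMA ON SCHEMA main.shared_sales TO `analysts`;
GRANT SELECT ON TABLE main.shared_sales.sales_historical TO `analysts`;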
Joins between Unity Catalog and Hive metastore objects
By using three-level namespace notation, you can join data in a Unity Catalog metastore with data in the legacy Hive metastore.
Note
A join with data in the legacy Hive metastore works only in the workspace where that data resides. Trying to run such a join in another workspace results in an error. Databricks recommends that you upgrade legacy tables and views to Unity Catalog.
The following example joins results from the sales_current table in the legacy Hive metastore with the sales_historical table in the Unity Catalog metastore when the order_id fields are equal.
SQL:
SELECT * FROM hive_metastore.sales.sales_current
JOIN main.shared_sales.sales_historical
ON hive_metastore.sales.sales_current.order_id = main.shared_sales.sales_historical.order_id;

Python:
dfCurrent = spark.table("hive_metastore.sales.sales_current")
dfHistorical = spark.table("main.shared_sales.sales_historical")
display(dfCurrent.join(
  other = dfHistorical,
  on = dfCurrent.order_id == dfHistorical.order_id
))

R:
library(SparkR)
dfCurrent <- tableToDF("hive_metastore.sales.sales_current")
dfHistorical <- tableToDF("main.shared_sales.sales_historical")
display(join(
  x = dfCurrent,
  y = dfHistorical,
  joinExpr = dfCurrent$order_id == dfHistorical$order_id))

Scala:
val dfCurrent = spark.table("hive_metastore.sales.sales_current")
val dfHistorical = spark.table("main.shared_sales.sales_historical")
display(dfCurrent.join(
  right = dfHistorical,
  joinExprs = dfCurrent("order_id") === dfHistorical("order_id")
))
Default catalog
If you omit the top-level catalog name when you perform data operations, and there is no USE CATALOG statement, the default catalog is assumed. If you supply the catalog name another way, such as by setting it in a JDBC connection string, that catalog is assumed instead of the default catalog when you omit the catalog name in data operations.
By default, the default catalog is hive_metastore. You can configure a different default catalog.
If your Databricks workspace used the per-workspace Hive metastore before the workspace was enabled for Unity Catalog, Databricks recommends that you keep the default catalog value as hive_metastore so that your existing code can operate on current Hive metastore data without any change.
Warning
Changing the default catalog can break existing data operations that depend on it.
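To see how the default catalog affects name resolution, consider the following SQL sketch. It assumes the default catalog is still hive_metastore and reuses the sales schema from the earlier examples; current_catalog() returns the catalog that unqualified names resolve to.

-- Returns the catalog that unqualified object names resolve to,
-- for example hive_metastore.
SELECT current_catalog();

-- With hive_metastore as the default catalog, these two queries
-- are equivalent:
SELECT * FROM sales.sales_raw;
SELECT * FROM hive_metastore.sales.sales_raw;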
Switch default catalog
To configure a different default catalog for a workspace, you can use either of these approaches:
- Use the Account API to specify a different default_catalog_name when you assign a metastore to a workspace (or update the metastore assignment). Use POST /accounts/{account_id}/workspaces/{workspace_id}/metastores/{metastore_id} or PUT /accounts/{account_id}/workspaces/{workspace_id}/metastores/{metastore_id}.
- Run this Databricks CLI command:
databricks unity-catalog metastores assign --workspace-id 1234567890123456 \
  --metastore-id 12a345b6-7890-1cd2-3456-e789f0a12b34 \
  --default-catalog-name my_catalog
All SQL warehouses and clusters will use this catalog as the default.
You can also override the default catalog for a specific cluster by setting the following Spark configuration on the cluster, replacing my_catalog with your catalog name. This approach is not available for SQL warehouses:
spark.databricks.sql.initial.catalog.name my_catalog
For instructions, see Spark configuration.
Get the current default catalog
To get the current default catalog, run this Databricks CLI command, replacing the metastore ID with yours:
databricks unity-catalog metastores get --id 12a345b6-7890-1cd2-3456-e789f0a12b34
To get the metastore ID, run databricks unity-catalog metastores get-summary. For an example, see Get summarized information about the current metastore.
Cluster-scoped data access permissions
When you use the Hive metastore alongside Unity Catalog, the data access credentials associated with the cluster are used to access Hive metastore data, but not data registered in Unity Catalog.
If users access paths that are outside Unity Catalog (such as a path not registered as a table or external location), the access credentials assigned to the cluster are used.
See Connect to Amazon S3.
Upgrade legacy tables to Unity Catalog
Tables in the Hive metastore do not benefit from the full set of security and governance features that Unity Catalog introduces, such as built-in auditing and access control. Databricks recommends that you upgrade your legacy tables by adding them to Unity Catalog.
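As a minimal sketch of one upgrade path: for external tables, the SYNC command can register a Hive metastore table in Unity Catalog. The catalog, schema, and table names below are hypothetical, and DRY RUN previews the operation without applying it.

-- Preview the upgrade of an external table without making changes.
-- Names here are hypothetical examples.
SYNC TABLE main.sales.sales_raw FROM hive_metastore.sales.sales_raw DRY RUN;

-- Register the external Hive metastore table in Unity Catalog.
SYNC TABLE main.sales.sales_raw FROM hive_metastore.sales.sales_raw;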