External Apache Hive metastore (legacy)
Important
This documentation has been retired and might not be updated.
Note
Using external metastores is a legacy data governance model. Databricks recommends that you upgrade to Unity Catalog. Unity Catalog simplifies security and governance of your data by providing a central place to administer and audit data access across multiple workspaces in your account. See What is Unity Catalog?.
This article describes how to set up Databricks clusters to connect to existing external Apache Hive metastores. It provides information about metastore deployment modes, recommended network setup, and cluster configuration requirements, followed by instructions for configuring clusters to connect to an external metastore. For Hive library versions included in Databricks Runtime, see the relevant Databricks Runtime version release notes.
Important
If you use Azure Database for MySQL as an external metastore, you must change the value of the
lower_case_table_names
property from 1 (the default) to 2 in the server-side database configuration. For details, see Identifier Case Sensitivity.If you use a read-only metastore database, Databricks strongly recommends that you set
spark.databricks.delta.catalog.update.enabled
tofalse
on your clusters for better performance.
Hive metastore deployment modes
In a production environment, you can deploy a Hive metastore in two modes: local and remote.
Local mode
The metastore client running inside a cluster connects to the underlying metastore database directly via JDBC.
Remote mode
Instead of connecting to the underlying database directly, the metastore client connects to a separate metastore service via the Thrift protocol. The metastore service connects to the underlying database. When running a metastore in remote mode, DBFS is not supported.
For more details about these deployment modes, see the Hive documentation.
Note
The examples in this document use MySQL as the underlying metastore database.
Network setup
Databricks clusters run inside a virtual private cloud (VPC). We recommend that you set up the external Hive metastore inside a new VPC and then peer these two VPCs to make clusters connect to the Hive metastore using a private IP address. VPC peering provides detailed instructions about how to peer the VPC used by Databricks clusters and the VPC where the metastore lives. After peering the VPCs, you can test network connectivity from a cluster to the metastore VPC by running the following command inside a notebook:
%sh
nc -vz <DNS name or private IP> <port>
where
<DNS name or private IP>
is the DNS name or the private IP address of the MySQL database (for local mode) or the metastore service (for remote mode). If you use a DNS name here, make sure that the resolved IP address is a private one.<port>
is the port of the MySQL database or the port of the metastore service.
Cluster configurations
You must set three sets of configuration options to connect a cluster to an external metastore:
Spark options configure Spark with the Hive metastore version and the JARs for the metastore client.
Hive options configure the metastore client to connect to the external metastore.
An optional set of Hadoop options configure file system options.
Spark configuration options
Set spark.sql.hive.metastore.version
to the version of your Hive metastore and spark.sql.hive.metastore.jars
as follows:
Hive 0.13: do not set
spark.sql.hive.metastore.jars
.Note
Hive 1.2.0 and 1.2.1 are not the built-in metastore on Databricks Runtime 7.0 and above. If you want to use Hive 1.2.0 or 1.2.1 with Databricks Runtime 7.0 and above, follow the procedure described in Download the metastore jars and point to them.
Hive 2.3.7 (Databricks Runtime 7.0 - 9.x) or Hive 2.3.9 (Databricks Runtime 10.0 and above): set
spark.sql.hive.metastore.jars
tobuiltin
.For all other Hive versions, Databricks recommends that you download the metastore JARs and set the configuration
spark.sql.hive.metastore.jars
to point to the downloaded JARs using the procedure described in Download the metastore jars and point to them.
Download the metastore jars and point to them
Create a cluster with
spark.sql.hive.metastore.jars
set tomaven
andspark.sql.hive.metastore.version
to match the version of your metastore.When the cluster is running, search the driver log and find a line like the following:
17/11/18 22:41:19 INFO IsolatedClientLoader: Downloaded metastore jars to <path>
The directory
<path>
is the location of downloaded JARs in the driver node of the cluster.Alternatively you can run the following code in a Scala notebook to print the location of the JARs:
import com.typesafe.config.ConfigFactory val path = ConfigFactory.load().getString("java.io.tmpdir") println(s"\nHive JARs are downloaded to the path: $path \n")
Run
%sh cp -r <path> /dbfs/hive_metastore_jar
(replacing<path>
with your cluster’s info) to copy this directory to a directory in DBFS root calledhive_metastore_jar
through the DBFS client in the driver node.Create an init script that copies
/dbfs/hive_metastore_jar
to the local filesystem of the node, making sure to make the init script sleep a few seconds before it accesses the DBFS client. This ensures that the client is ready.Set
spark.sql.hive.metastore.jars
to use this directory. If your init script copies/dbfs/hive_metastore_jar
to/databricks/hive_metastore_jars/
, setspark.sql.hive.metastore.jars
to/databricks/hive_metastore_jars/*
. The location must include the trailing/*
.Restart the cluster.
Hive configuration options
This section describes options specific to Hive.
Configuration options for local mode
To connect to an external metastore using local mode, set the following Hive configuration options:
# JDBC connect string for a JDBC metastore
javax.jdo.option.ConnectionURL jdbc:mysql://<metastore-host>:<metastore-port>/<metastore-db>
# Username to use against metastore database
javax.jdo.option.ConnectionUserName <mysql-username>
# Password to use against metastore database
javax.jdo.option.ConnectionPassword <mysql-password>
# Driver class name for a JDBC metastore (Runtime 3.4 and later)
javax.jdo.option.ConnectionDriverName org.mariadb.jdbc.Driver
# Driver class name for a JDBC metastore (prior to Runtime 3.4)
# javax.jdo.option.ConnectionDriverName com.mysql.jdbc.Driver
where
<metastore-host>
and<metastore-port>
are the host and listening port of your MySQL instance.<metastore-db>
is the name of the MySQL database that holds all of the metastore tables.<mysql-username>
and<mysql-password>
specify the username and password of your MySQL account that has read/write access to<metastore-db>
.
Note
Use the MariaDB driver to communicate with MySQL databases.
For production environments, we recommend that you set
hive.metastore.schema.verification
totrue
. This prevents Hive metastore client from implicitly modifying the metastore database schema when the metastore client version does not match the metastore database version. When enabling this setting for metastore client versions lower than Hive 1.2.0, make sure that the metastore client has the write permission to the metastore database (to prevent the issue described in HIVE-9749).For Hive metastore 1.2.0 and higher, set
hive.metastore.schema.verification.record.version
totrue
to enablehive.metastore.schema.verification
.For Hive metastore 2.1.1 and higher, set
hive.metastore.schema.verification.record.version
totrue
as it is set tofalse
by default.
Configuration options for remote mode
To connect to an external metastore using remote mode, set the following Hive configuration option:
# Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.
hive.metastore.uris thrift://<metastore-host>:<metastore-port>
where <metastore-host>
and <metastore-port>
are the listening host and port of your Hive metastore service.
File system options
If you want to use an instance profile and set AssumeRole, you must set:
fs.s3a.credentialsType
toAssumeRole
fs.s3a.stsAssumeRole.arn
to the Amazon Resource Name (ARN) of the role to assume
Set up an external metastore using the UI
To set up an external metastore using the Databricks UI:
Click the Clusters button on the sidebar.
Click Create Cluster.
Enter the following Spark configuration options:
Local mode
# Hive specific configuration options. # spark.hadoop prefix is added to make sure these Hive specific options will propagate to the metastore client. spark.hadoop.javax.jdo.option.ConnectionURL jdbc:mysql://<mysql-host>:<mysql-port>/<metastore-db> # Driver class name for a JDBC metastore (Runtime 3.4 and later) spark.hadoop.javax.jdo.option.ConnectionDriverName org.mariadb.jdbc.Driver # Driver class name for a JDBC metastore (prior to Runtime 3.4) # spark.hadoop.javax.jdo.option.ConnectionDriverName com.mysql.jdbc.Driver spark.hadoop.javax.jdo.option.ConnectionUserName <mysql-username> spark.hadoop.javax.jdo.option.ConnectionPassword <mysql-password> # Spark specific configuration options spark.sql.hive.metastore.version <hive-version> # Skip this one if <hive-version> is 0.13.x. spark.sql.hive.metastore.jars <hive-jar-source> # If you need to use AssumeRole, uncomment the following settings. # spark.hadoop.fs.s3a.credentialsType AssumeRole # spark.hadoop.fs.s3a.stsAssumeRole.arn <sts-arn>
Remote mode
# Hive specific configuration option # spark.hadoop prefix is added to make sure these Hive specific options will propagate to the metastore client. spark.hadoop.hive.metastore.uris thrift://<metastore-host>:<metastore-port> # Spark specific configuration options spark.sql.hive.metastore.version <hive-version> # Skip this one if <hive-version> is 0.13.x. spark.sql.hive.metastore.jars <hive-jar-source> # If you need to use AssumeRole, uncomment the following settings. # spark.hadoop.fs.s3a.credentialsType AssumeRole # spark.hadoop.fs.s3a.stsAssumeRole.arn <sts-arn>
Continue your cluster configuration, following the instructions in Compute configuration reference.
Click Create Cluster to create the cluster.
Set up an external metastore using an init script
Init scripts let you connect to an existing Hive metastore without manually setting required configurations.
Local mode
Create the base directory you want to store the init script in if it does not exist. The following example uses
dbfs:/databricks/scripts
.Run the following snippet in a notebook. The snippet creates the init script
/databricks/scripts/external-metastore.sh
in Databricks File System (DBFS). Alternatively, you can use the DBFS REST API put operation to create the init script. This init script writes required configuration options to a configuration file named00-custom-spark.conf
in a JSON-like format under/databricks/driver/conf/
inside every node of the cluster. Databricks provides default Spark configurations in the/databricks/driver/conf/spark-branch.conf
file. Configuration files in the/databricks/driver/conf
directory apply in reverse alphabetical order. If you want to change the name of the00-custom-spark.conf
file, make sure that it continues to apply before thespark-branch.conf
file.
dbutils.fs.put(
"/databricks/scripts/external-metastore.sh",
"""#!/bin/sh
|# Loads environment variables to determine the correct JDBC driver to use.
|source /etc/environment
|# Quoting the label (i.e. EOF) with single quotes to disable variable interpolation.
|cat << 'EOF' > /databricks/driver/conf/00-custom-spark.conf
|[driver] {
| # Hive specific configuration options for metastores in local mode.
| # spark.hadoop prefix is added to make sure these Hive specific options will propagate to the metastore client.
| "spark.hadoop.javax.jdo.option.ConnectionURL" = "jdbc:mysql://<mysql-host>:<mysql-port>/<metastore-db>"
| "spark.hadoop.javax.jdo.option.ConnectionUserName" = "<mysql-username>"
| "spark.hadoop.javax.jdo.option.ConnectionPassword" = "<mysql-password>"
|
| # Spark specific configuration options
| "spark.sql.hive.metastore.version" = "<hive-version>"
| # Skip this one if <hive-version> is 0.13.x.
| "spark.sql.hive.metastore.jars" = "<hive-jar-source>"
|
| # If you need to use AssumeRole, uncomment the following settings.
| # "spark.hadoop.fs.s3a.credentialsType" = "AssumeRole"
| # "spark.hadoop.fs.s3a.stsAssumeRole.arn" = "<sts-arn>"
|EOF
|
|case "$DATABRICKS_RUNTIME_VERSION" in
| "")
| DRIVER="com.mysql.jdbc.Driver"
| ;;
| *)
| DRIVER="org.mariadb.jdbc.Driver"
| ;;
|esac
|# Add the JDBC driver separately since must use variable expansion to choose the correct
|# driver version.
|cat << EOF >> /databricks/driver/conf/00-custom-spark.conf
| "spark.hadoop.javax.jdo.option.ConnectionDriverName" = "$DRIVER"
|}
|EOF
|""".stripMargin,
overwrite = true
)
contents = """#!/bin/sh
# Loads environment variables to determine the correct JDBC driver to use.
source /etc/environment
# Quoting the label (i.e. EOF) with single quotes to disable variable interpolation.
cat << 'EOF' > /databricks/driver/conf/00-custom-spark.conf
[driver] {
# Hive specific configuration options for metastores in local mode.
# spark.hadoop prefix is added to make sure these Hive specific options will propagate to the metastore client.
"spark.hadoop.javax.jdo.option.ConnectionURL" = "jdbc:mysql://<mysql-host>:<mysql-port>/<metastore-db>"
"spark.hadoop.javax.jdo.option.ConnectionUserName" = "<mysql-username>"
"spark.hadoop.javax.jdo.option.ConnectionPassword" = "<mysql-password>"
# Spark specific configuration options
"spark.sql.hive.metastore.version" = "<hive-version>"
# Skip this one if <hive-version> is 0.13.x.
"spark.sql.hive.metastore.jars" = "<hive-jar-source>"
# If you need to use AssumeRole, uncomment the following settings.
# "spark.hadoop.fs.s3a.credentialsType" = "AssumeRole"
# "spark.hadoop.fs.s3a.stsAssumeRole.arn" = "<sts-arn>"
EOF
case "$DATABRICKS_RUNTIME_VERSION" in
"")
DRIVER="com.mysql.jdbc.Driver"
;;
*)
DRIVER="org.mariadb.jdbc.Driver"
;;
esac
# Add the JDBC driver separately since must use variable expansion to choose the correct
# driver version.
cat << EOF >> /databricks/driver/conf/00-custom-spark.conf
"spark.hadoop.javax.jdo.option.ConnectionDriverName" = "$DRIVER"
}
EOF"""
dbutils.fs.put(
file = "/databricks/scripts/external-metastore.sh",
contents = contents,
overwrite = True
)
Configure your cluster with the init script.
Restart the cluster.
Remote mode
Create the base directory you want to store the init script in if it does not exist. The following example uses
dbfs:/databricks/scripts
.Run the following snippet in a notebook:
dbutils.fs.put( "/databricks/scripts/external-metastore.sh", """#!/bin/sh | |# Quoting the label (i.e. EOF) with single quotes to disable variable interpolation. |cat << 'EOF' > /databricks/driver/conf/00-custom-spark.conf |[driver] { | # Hive specific configuration options for metastores in remote mode. | # spark.hadoop prefix is added to make sure these Hive specific options will propagate to the metastore client. | "spark.hadoop.hive.metastore.uris" = "thrift://<metastore-host>:<metastore-port>" | | # Spark specific configuration options | "spark.sql.hive.metastore.version" = "<hive-version>" | # Skip this one if <hive-version> is 0.13.x. | "spark.sql.hive.metastore.jars" = "<hive-jar-source>" | | # If you need to use AssumeRole, uncomment the following settings. | # "spark.hadoop.fs.s3a.credentialsType" = "AssumeRole" | # "spark.hadoop.fs.s3a.stsAssumeRole.arn" = "<sts-arn>" |} |EOF |""".stripMargin, overwrite = true )
contents = """#!/bin/sh # Quoting the label (i.e. EOF) with single quotes to disable variable interpolation. cat << 'EOF' > /databricks/driver/conf/00-custom-spark.conf [driver] { # Hive specific configuration options for metastores in remote mode. # spark.hadoop prefix is added to make sure these Hive specific options will propagate to the metastore client. "spark.hadoop.hive.metastore.uris" = "thrift://<metastore-host>:<metastore-port>" # Spark specific configuration options "spark.sql.hive.metastore.version" = "<hive-version>" # Skip this one if <hive-version> is 0.13.x. "spark.sql.hive.metastore.jars" = "<hive-jar-source>" # If you need to use AssumeRole, uncomment the following settings. # "spark.hadoop.fs.s3a.credentialsType" = "AssumeRole" # "spark.hadoop.fs.s3a.stsAssumeRole.arn" = "<sts-arn>" } EOF""" dbutils.fs.put( file = "/databricks/scripts/external-metastore.sh", contents = contents, overwrite = True )
Configure your cluster with the init script.
Restart the cluster.
Troubleshooting
Clusters do not start (due to incorrect init script settings)
If an init script for setting up the external metastore causes cluster creation to fail, configure the init script to log, and debug the init script using the logs.
Error in SQL statement: InvocationTargetException
Error message pattern in the full exception stack trace:
Caused by: javax.jdo.JDOFatalDataStoreException: Unable to open a test connection to the given database. JDBC url = [...]
External metastore JDBC connection information is misconfigured. Verify the configured hostname, port, username, password, and JDBC driver class name. Also, make sure that the username has the right privilege to access the metastore database.
Error message pattern in the full exception stack trace:
Required table missing : "`DBS`" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. [...]
External metastore database not properly initialized. Verify that you created the metastore database and put the correct database name in the JDBC connection string. Then, start a new cluster with the following two Spark configuration options:
datanucleus.schema.autoCreateTables true datanucleus.fixedDatastore false
In this way, the Hive client library will try to create and initialize tables in the metastore database automatically when it tries to access them but finds them absent.
Error in SQL statement: AnalysisException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetastoreClient
Error message in the full exception stacktrace:
The specified datastore driver (driver name) was not found in the CLASSPATH
The cluster is configured to use an incorrect JDBC driver.
This error can occur if a cluster using Runtime 3.4 or later is configured to use the MySQL rather than the MariaDB driver.
Setting datanucleus.autoCreateSchema to true doesn’t work as expected
By default, Databricks also sets datanucleus.fixedDatastore
to true
, which prevents any accidental structural changes to the metastore databases. Therefore, the Hive client library cannot create metastore tables even if you set datanucleus.autoCreateSchema
to true
. This strategy is, in general, safer for production environments since it prevents the metastore database to be accidentally upgraded.
If you do want to use datanucleus.autoCreateSchema
to help initialize the metastore database, make sure you set datanucleus.fixedDatastore
to false
. Also, you may want to flip both flags after initializing the metastore database to provide better protection to your production environment.
com.amazonaws.AmazonClientException: Couldn’t initialize a SAX driver to create an XMLReader
This exception may be thrown if the version of the cluster is 2.1.1-db5. This issue has been fixed in 2.1.1-db6. For 2.1.1-db5, you can fix this issue by setting the following JVM properties as part of the settings of spark.driver.extraJavaOptions
and spark.executor.extraJavaOptions
:
-Djavax.xml.datatype.DatatypeFactory=com.sun.org.apache.xerces.internal.jaxp.datatype.DatatypeFactoryImpl
-Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl
-Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl
-Djavax.xml.validation.SchemaFactory:https://www.w3.org/2001/XMLSchema=com.sun.org.apache.xerces.internal.jaxp.validation.XMLSchemaFactory
-Dorg.xml.sax.driver=com.sun.org.apache.xerces.internal.parsers.SAXParser
-Dorg.w3c.dom.DOMImplementationSourceList=com.sun.org.apache.xerces.internal.dom.DOMXSImplementationSourceImpl