External Hive Metastore

Every Databricks deployment has a central Hive metastore accessible by all clusters to persist table metadata. Instead of using the Databricks Hive metastore, you have the option to use your existing external Hive metastore instance.

This topic describes how to set up Databricks clusters that connect to existing Hive metastores. We provide information about metastore deployment modes, recommended network setup, and cluster configuration requirements, followed by instructions for configuring clusters to connect to an external metastore. The following table summarizes which Hive metastore versions are supported in each version of Databricks Runtime.

Databricks Runtime Version                Hive 0.13 - 1.2.1   Hive 2.0   Hive 2.1   Hive 2.2   Hive 2.3
5.x                                       Yes                 Yes        Yes        Yes        Yes
4.x                                       Yes                 Yes        Yes        Yes        Yes
3.x                                       Yes                 Yes        Yes        No         No
2.1.1-db4 and higher versions of 2.1.x    Yes                 Yes        Yes        No         No
Lower than 2.1.1-db4                      Yes                 No         No         No         No

Important

SQL Server does not work as the underlying metastore database for Hive 2.0 to 2.2.

Hive metastore deployment modes

In a production environment, you can deploy a Hive metastore in two different modes:

  • Local mode: The metastore client running inside a cluster connects to the underlying metastore database directly via JDBC.
  • Remote mode: Instead of connecting to the underlying database directly, the metastore client connects to a separate metastore service via the Thrift protocol. The metastore service connects to the underlying database. When running a metastore in remote mode, DBFS is not supported.

For more details about these deployment modes, see the Hive documentation.

Note

The examples in this document use MySQL as the underlying metastore database.

Network setup

Databricks clusters live inside their own VPC. We recommend that you set up the external Hive metastore inside a new VPC and then peer the two VPCs so that clusters connect to the Hive metastore using a private IP address. VPC Peering provides detailed instructions for peering the VPC used by Databricks clusters with the VPC where the metastore lives.

After peering the VPCs, you can test network connectivity from a cluster to the metastore VPC by running the following command inside a notebook:

%sh
nc -vz <DNS name or private IP> <port>

where

  • <DNS name or private IP> is the DNS name or the private IP address of the MySQL database (for local mode) or the metastore service (for remote mode). If you use a DNS name here, make sure that the resolved IP address is a private one.
  • <port> is the port of the MySQL database or the port of the metastore service.
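
For example, if the metastore database were a MySQL instance at a hypothetical private DNS name mysql-metastore.internal listening on the default MySQL port 3306, the check would be:

%sh
nc -vz mysql-metastore.internal 3306

A successful check reports that the connection succeeded; a timeout or connection refused message usually points to a peering, routing, or security group problem.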

Cluster configurations

You must set three sets of configuration options to connect a cluster to an external metastore: Spark-specific options, Hive-specific options, and file system options.

Spark-specific options

Set spark.sql.hive.metastore.version to the version of your Hive metastore and spark.sql.hive.metastore.jars as follows:

  • Hive 0.13: do not set spark.sql.hive.metastore.jars.

  • Hive 1.2.0 or 1.2.1: set spark.sql.hive.metastore.jars to builtin.

  • For all other Hive versions, Databricks recommends that you download the metastore JARs and set the configuration spark.sql.hive.metastore.jars to point to the downloaded JARs using the following procedure:

    1. Create a cluster with spark.sql.hive.metastore.jars set to maven and spark.sql.hive.metastore.version set to match the version of your metastore.

    2. When the cluster is running, search the driver log and find a line like the following:

      17/11/18 22:41:19 INFO IsolatedClientLoader: Downloaded metastore jars to <path>
      

      The directory <path> is the location of downloaded JARs in the driver node of the cluster.

      Alternatively you can run the following code in a Scala notebook to print the location of the JARs:

      import com.typesafe.config.ConfigFactory
      val path = ConfigFactory.load().getString("java.io.tmpdir")
      
      println(s"\nHive JARs are downloaded to the path: $path \n")
      
    3. Run %sh cp -r <path> /dbfs/hive_metastore_jar (replacing <path> with the directory found in step 2) to copy this directory to a directory in DBFS called hive_metastore_jar through the Fuse client in the driver node.

    4. Create an init script that copies /dbfs/hive_metastore_jar to the local filesystem of the node, making sure that the init script sleeps a few seconds before it accesses the DBFS Fuse client so that the client is ready. A sketch of such a script appears after this list.

    5. Set spark.sql.hive.metastore.jars to use this directory. If your init script copies /dbfs/hive_metastore_jar to /databricks/hive_metastore_jars/, set spark.sql.hive.metastore.jars to /databricks/hive_metastore_jars/*. The location must include the trailing /*.

    6. Restart the cluster.
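
The following is a minimal sketch of the init script from step 4, assuming the JARs were copied to /dbfs/hive_metastore_jar in step 3 and that /databricks/hive_metastore_jars/ is the target directory on each node (you can store the script under /databricks/init/<cluster-name>/ in DBFS, as described later in this topic):

#!/bin/sh
# Sketch only: wait briefly so the DBFS Fuse client is ready before reading from /dbfs.
sleep 10
# Copy the downloaded metastore JARs to the local filesystem of the node.
mkdir -p /databricks/hive_metastore_jars
cp -r /dbfs/hive_metastore_jar/. /databricks/hive_metastore_jars/

With this layout, step 5 corresponds to setting spark.sql.hive.metastore.jars to /databricks/hive_metastore_jars/*.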

Hive-specific options

This section describes options specific to Hive.

Configuration options for local mode

When you connect to an external metastore using local mode, you must set the following Hive configuration options:

# JDBC connect string for a JDBC metastore
javax.jdo.option.ConnectionURL jdbc:mysql://<metastore-host>:<metastore-port>/<metastore-db>

# Username to use against metastore database
javax.jdo.option.ConnectionUserName <mysql-username>

# Password to use against metastore database
javax.jdo.option.ConnectionPassword <mysql-password>

# Driver class name for a JDBC metastore (Runtime 3.4 and later)
javax.jdo.option.ConnectionDriverName org.mariadb.jdbc.Driver

# Driver class name for a JDBC metastore (prior to Runtime 3.4)
# javax.jdo.option.ConnectionDriverName com.mysql.jdbc.Driver

where

  • <metastore-host> and <metastore-port> are the host and listening port of your MySQL instance.
  • <metastore-db> is the name of the MySQL database that holds all of the metastore tables.
  • <mysql-username> and <mysql-password> specify the username and password of your MySQL account that has read/write access to <metastore-db>.
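
For example, for a hypothetical MySQL instance at metastore-mysql.internal listening on the default port 3306, with a metastore database named hive_metastore, the connection string would be:

javax.jdo.option.ConnectionURL jdbc:mysql://metastore-mysql.internal:3306/hive_metastore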

Note

Beginning with Databricks Runtime 3.4, the MySQL JDBC driver is no longer included by default. Use the MariaDB driver to communicate with MySQL databases.

Note

For production environments, we recommend that you set hive.metastore.schema.verification to true. This prevents the Hive metastore client from implicitly modifying the metastore database schema when the metastore client version does not match the metastore database version. When enabling this setting for metastore client versions lower than Hive 1.2.0, make sure that the metastore client has write permission on the metastore database (to prevent the issue described in HIVE-9749).

  • For Hive metastore 1.2.0 and higher, hive.metastore.schema.verification.record.version must be set to true in order to enable hive.metastore.schema.verification.
  • For Hive metastore 2.1.1 and higher, hive.metastore.schema.verification.record.version must be explicitly enabled as it is set to false by default.
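
For example, to turn on schema verification for a Hive 2.1.1 or later metastore client through the cluster's Spark Config, you could set both options as follows (a sketch; the spark.hadoop prefix propagates them to the metastore client, as shown in the sections below):

spark.hadoop.hive.metastore.schema.verification true
spark.hadoop.hive.metastore.schema.verification.record.version true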

Configuration options for remote mode

When you connect to an external metastore using remote mode, you must set the following Hive configuration option:

# Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.
hive.metastore.uris thrift://<metastore-host>:<metastore-port>

where <metastore-host> and <metastore-port> are the listening host and port of your Hive metastore service.
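
For example, for a hypothetical metastore service reachable at metastore-svc.internal on the standard metastore Thrift port 9083:

hive.metastore.uris thrift://metastore-svc.internal:9083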

File system options

To ensure that clusters in Databricks can access data from the tables in the external metastore, check the file system schemes of the table paths. If any table has a path that uses s3 as the scheme, set fs.s3.impl to com.databricks.s3a.S3AFileSystem.

Also, if you want to use an IAM role with AssumeRole, you must set:

  • fs.s3a.impl, fs.s3n.impl, and fs.s3.impl to com.databricks.s3a.S3AFileSystem
  • fs.s3a.credentialsType to AssumeRole
  • fs.s3a.stsAssumeRole.arn to the Amazon Resource Name (ARN) of the role to assume
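
When you set these options through a cluster's Spark Config, add the spark.hadoop prefix so that they propagate to the Hadoop configuration. A combined sketch (replace <sts-arn> with the ARN of the role to assume):

spark.hadoop.fs.s3.impl com.databricks.s3a.S3AFileSystem
spark.hadoop.fs.s3n.impl com.databricks.s3a.S3AFileSystem
spark.hadoop.fs.s3a.impl com.databricks.s3a.S3AFileSystem
spark.hadoop.fs.s3a.credentialsType AssumeRole
spark.hadoop.fs.s3a.stsAssumeRole.arn <sts-arn>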

Set up an external metastore using the web UI

To set up an external metastore using the Databricks web UI:

  1. Click the Clusters button on the sidebar.

  2. Click Create Cluster.

  3. Click Show advanced settings, and navigate to the Spark tab.

  4. Enter the following Spark configuration options:

    • If the external metastore is deployed in the local mode, set the following configurations under Spark Config.

      # Hive specific configuration options.
      # spark.hadoop prefix is added to make sure these Hive specific options will propagate to the metastore client.
      spark.hadoop.javax.jdo.option.ConnectionURL jdbc:mysql://<mysql-host>:<mysql-port>/<metastore-db>
      
      # Driver class name for a JDBC metastore (Runtime 3.4 and later)
      spark.hadoop.javax.jdo.option.ConnectionDriverName org.mariadb.jdbc.Driver
      
      # Driver class name for a JDBC metastore (prior to Runtime 3.4)
      # spark.hadoop.javax.jdo.option.ConnectionDriverName com.mysql.jdbc.Driver
      
      spark.hadoop.javax.jdo.option.ConnectionUserName <mysql-username>
      spark.hadoop.javax.jdo.option.ConnectionPassword <mysql-password>
      
      # Spark specific configuration options
      spark.sql.hive.metastore.version <hive-version>
      # Skip this one if <hive-version> is 0.13.x.
      spark.sql.hive.metastore.jars <hive-jar-source>
      
      # If any of your tables or databases use s3 as the file system scheme,
      # uncomment the next line to set the s3:// URL scheme to the S3A file system.
      # spark.hadoop prefix is added to make sure these file system options
      # propagate to the metastore client and Hadoop configuration.
      # spark.hadoop.fs.s3.impl com.databricks.s3a.S3AFileSystem
      
      # If you need to use AssumeRole, uncomment the following settings.
      # spark.hadoop.fs.s3a.impl com.databricks.s3a.S3AFileSystem
      # spark.hadoop.fs.s3n.impl com.databricks.s3a.S3AFileSystem
      # spark.hadoop.fs.s3a.credentialsType AssumeRole
      # spark.hadoop.fs.s3a.stsAssumeRole.arn <sts-arn>
      
    • If the external metastore is deployed in the remote mode, set the following configurations under Spark Config.

      # Hive specific configuration option
      # spark.hadoop prefix is added to make sure these Hive specific options will propagate to the metastore client.
      spark.hadoop.hive.metastore.uris thrift://<metastore-host>:<metastore-port>
      
      # Spark specific configuration options
      spark.sql.hive.metastore.version <hive-version>
      # Skip this one if <hive-version> is 0.13.x.
      spark.sql.hive.metastore.jars <hive-jar-source>
      
      # If any of your tables or databases use s3 as the file system scheme,
      # uncomment the next line to set the s3:// URL scheme to the S3A file system.
      # spark.hadoop prefix is added to make sure these file system options
      # propagate to the metastore client and Hadoop configuration.
      # spark.hadoop.fs.s3.impl com.databricks.s3a.S3AFileSystem
      
      # If you need to use AssumeRole, uncomment the following settings.
      # spark.hadoop.fs.s3a.impl com.databricks.s3a.S3AFileSystem
      # spark.hadoop.fs.s3n.impl com.databricks.s3a.S3AFileSystem
      # spark.hadoop.fs.s3a.credentialsType AssumeRole
      # spark.hadoop.fs.s3a.stsAssumeRole.arn <sts-arn>
      
  5. Continue your cluster configuration, following the instructions in Cluster Configurations.

  6. Click Create Cluster to create the cluster.

Set up an external metastore using an init script

Cluster Node Initialization Scripts can be a convenient alternative when you want your clusters to connect to your existing Hive metastore without explicitly setting required configurations.

To set up an external metastore in the local mode using an init script, open a notebook and execute the following snippet. The snippet adds an init script named external-metastore.sh (the name is not mandatory) to /databricks/init/<cluster-name>/ in Databricks File System (DBFS). Alternatively, you can use the DBFS REST API put operation to create the init script. Whenever a cluster with the name <cluster-name> starts, the init script writes the required configuration options to a configuration file named 00-custom-spark.conf, in a JSON-like format, under /databricks/driver/conf/ on every node of the cluster.

Databricks provides default Spark configurations in the /databricks/driver/conf/spark-branch.conf file, and configuration files in the /databricks/driver/conf directory apply in reverse alphabetical order. If you change the name of the 00-custom-spark.conf file, make sure that it continues to apply before the spark-branch.conf file.

Note

If you want to set up all of your clusters automatically to access the external metastore, you can use a global init script. Simply change the location of the init script from /databricks/init/<cluster-name>/external-metastore.sh to /databricks/init/external-metastore.sh.

%scala

dbutils.fs.put(
    "/databricks/init/<cluster-name>/external-metastore.sh",
    """#!/bin/sh
      |# Loads environment variables to determine the correct JDBC driver to use.
      |source /etc/environment
      |# Quoting the label (i.e. EOF) with single quotes to disable variable interpolation.
      |cat << 'EOF' > /databricks/driver/conf/00-custom-spark.conf
      |[driver] {
      |    # Hive specific configuration options for metastores in the local mode.
      |    # spark.hadoop prefix is added to make sure these Hive specific options will propagate to the metastore client.
      |    "spark.hadoop.javax.jdo.option.ConnectionURL" = "jdbc:mysql://<mysql-host>:<mysql-port>/<metastore-db>"
      |    "spark.hadoop.javax.jdo.option.ConnectionUserName" = "<mysql-username>"
      |    "spark.hadoop.javax.jdo.option.ConnectionPassword" = "<mysql-password>"
      |
      |    # Spark specific configuration options
      |    "spark.sql.hive.metastore.version" = "<hive-version>"
      |    # Skip this one if <hive-version> is 0.13.x.
      |    "spark.sql.hive.metastore.jars" = "<hive-jar-source>"
      |
      |    # If any of your tables or databases use s3 as the file system scheme,
      |    # uncomment the next line to set the s3:// URL scheme to the S3A file system.
      |    # spark.hadoop prefix is added to make sure these file system options will
      |    # propagate to the metastore client and Hadoop configuration.
      |    # "spark.hadoop.fs.s3.impl" = "com.databricks.s3a.S3AFileSystem"
      |
      |    # If you need to use AssumeRole, uncomment the following settings.
      |    # "spark.hadoop.fs.s3a.impl" = "com.databricks.s3a.S3AFileSystem"
      |    # "spark.hadoop.fs.s3n.impl" = "com.databricks.s3a.S3AFileSystem"
      |    # "spark.hadoop.fs.s3a.credentialsType" = "AssumeRole"
      |    # "spark.hadoop.fs.s3a.stsAssumeRole.arn" = "<sts-arn>"
      |EOF
      |
      |case "$DATABRICKS_RUNTIME_VERSION" in
      |  "")
      |     DRIVER="com.mysql.jdbc.Driver"
      |     ;;
      |  *)
      |     DRIVER="org.mariadb.jdbc.Driver"
      |     ;;
      |esac
      |# Add the JDBC driver separately since we must use variable expansion to choose the correct
      |# driver version.
      |cat << EOF >> /databricks/driver/conf/00-custom-spark.conf
      |    "spark.hadoop.javax.jdo.option.ConnectionDriverName" = "$DRIVER"
      |}
      |EOF
      |""".stripMargin,
    overwrite = true
)
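
To confirm what was written, you can print the script back through the DBFS Fuse client, for example:

%sh
cat /dbfs/databricks/init/<cluster-name>/external-metastore.sh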

To access an external metastore in the remote mode using an init script, run the following command in a notebook:

%scala

dbutils.fs.put(
    "/databricks/init/<cluster-name>/external-metastore.sh",
    """#!/bin/sh
      |
      |# Quoting the label (i.e. EOF) with single quotes to disable variable interpolation.
      |cat << 'EOF' > /databricks/driver/conf/00-custom-spark.conf
      |[driver] {
      |    # Hive specific configuration options for metastores in the remote mode.
      |    # spark.hadoop prefix is added to make sure these Hive specific options will propagate to the metastore client.
      |    "spark.hadoop.hive.metastore.uris" = "thrift://<metastore-host>:<metastore-port>"
      |
      |    # Spark specific configuration options
      |    "spark.sql.hive.metastore.version" = "<hive-version>"
      |    # Skip this one if <hive-version> is 0.13.x.
      |    "spark.sql.hive.metastore.jars" = "<hive-jar-source>"
      |
      |    # If any of your tables or databases use s3 as the file system scheme,
      |    # uncomment the next line to set the s3:// URL scheme to the S3A file system.
      |    # spark.hadoop prefix is added to make sure these file system options will
      |    # propagate to the metastore client and Hadoop configuration.
      |    # "spark.hadoop.fs.s3.impl" = "com.databricks.s3a.S3AFileSystem"
      |
      |    # If you need to use AssumeRole, uncomment the following settings.
      |    # "spark.hadoop.fs.s3a.impl" = "com.databricks.s3a.S3AFileSystem"
      |    # "spark.hadoop.fs.s3n.impl" = "com.databricks.s3a.S3AFileSystem"
      |    # "spark.hadoop.fs.s3a.credentialsType" = "AssumeRole"
      |    # "spark.hadoop.fs.s3a.stsAssumeRole.arn" = "<sts-arn>"
      |}
      |EOF
      |""".stripMargin,
    overwrite = true
)

Troubleshooting

Clusters do not start (due to incorrect init script settings)
If a global init script for setting up the external metastore causes cluster creation to fail, you can convert the global init script to a cluster-scoped init script and then debug it using that single cluster, as sketched below.
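
For example, assuming the global init script is at /databricks/init/external-metastore.sh in DBFS, you could move it under the directory of a single test cluster through the DBFS Fuse client (or do the equivalent with the DBFS REST API), where <cluster-name> is the cluster you will use for debugging:

%sh
mkdir -p /dbfs/databricks/init/<cluster-name>
cp /dbfs/databricks/init/external-metastore.sh /dbfs/databricks/init/<cluster-name>/external-metastore.sh
rm /dbfs/databricks/init/external-metastore.sh
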
Error in SQL statement: InvocationTargetException

This exception usually appears in the following cases:

  • External metastore JDBC connection information misconfigured.

    If you see the following error message pattern in the full exception stack trace, then you probably hit this case:

    Caused by: javax.jdo.JDOFatalDataStoreException: Unable to open a test connection to the given database. JDBC url = [...]
    

    Verify the configured hostname, port, username, password, and JDBC driver class name. Also, make sure that the username has the right privilege to access the metastore database.

  • External metastore database not properly initialized.

    If you see the following error message pattern in the full exception stack trace, then you probably hit this case:

    Required table missing : "`DBS`" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. [...]
    

    Verify that you created the metastore database and put the correct database name in the JDBC connection string. Then, start a new cluster with the following two extra Spark configuration options:

    datanucleus.autoCreateSchema true
    datanucleus.fixedDatastore false
    

    In this way, the Hive client library will try to create and initialize tables in the metastore database automatically when it accesses them and finds them absent.

Error in SQL statement: AnalysisException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetastoreClient

If you find the error The specified datastore driver (driver name) was not found in the CLASSPATH in the full exception stacktrace, the cluster is configured to use an incorrect JDBC driver.

This error can occur if a cluster running Databricks Runtime 3.4 or later is configured to use the MySQL driver rather than the MariaDB driver.
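
If so, one straightforward fix on Databricks Runtime 3.4 or later is to point the metastore connection at the MariaDB driver in the cluster's Spark Config, using the same option shown earlier in this topic:

spark.hadoop.javax.jdo.option.ConnectionDriverName org.mariadb.jdbc.Driver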

Setting datanucleus.autoCreateSchema to true doesn’t work as expected

By default, Databricks also sets datanucleus.fixedDatastore to true, which prevents any accidental structural changes to the metastore database. Therefore, the Hive client library cannot create metastore tables even if you set datanucleus.autoCreateSchema to true. This strategy is, in general, safer for production environments because it prevents the metastore database from being upgraded accidentally.

If you do want to use datanucleus.autoCreateSchema to help initialize the metastore database, make sure that you set datanucleus.fixedDatastore to false. You may also want to flip both flags back after initializing the metastore database, to provide better protection for your production environment.

com.amazonaws.AmazonClientException: Couldn’t initialize a SAX driver to create an XMLReader

This exception may be thrown if the cluster runs version 2.1.1-db5. The issue is fixed in 2.1.1-db6. On 2.1.1-db5, you can work around it by setting the following JVM properties in both spark.driver.extraJavaOptions and spark.executor.extraJavaOptions:

-Djavax.xml.datatype.DatatypeFactory=com.sun.org.apache.xerces.internal.jaxp.datatype.DatatypeFactoryImpl
-Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl
-Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl
-Djavax.xml.validation.SchemaFactory:http://www.w3.org/2001/XMLSchema=com.sun.org.apache.xerces.internal.jaxp.validation.XMLSchemaFactory
-Dorg.xml.sax.driver=com.sun.org.apache.xerces.internal.parsers.SAXParser
-Dorg.w3c.dom.DOMImplementationSourceList=com.sun.org.apache.xerces.internal.dom.DOMXSImplementationSourceImpl