
Use Lakeflow Declarative Pipelines with legacy Hive metastore

This article details configurations and caveats specific to Lakeflow Declarative Pipelines configured to publish data to the legacy Hive metastore. Databricks recommends using Unity Catalog for all new pipelines. See Use Unity Catalog with your Lakeflow Declarative Pipelines.

note

This article discusses functionality for the current default publishing mode for pipelines. Pipelines created before February 5, 2025, might use the legacy publishing mode and LIVE virtual schema. See LIVE schema (legacy).

How to query streaming tables and materialized views in the legacy Hive metastore

After an update is complete, you can view the schema and tables, query the data, or use the data in downstream applications.

Once published, Lakeflow Declarative Pipelines tables can be queried from any environment with access to the target schema. This includes Databricks SQL, notebooks, and other Lakeflow Declarative Pipelines.
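For example, after a pipeline publishes tables to a target schema, you can query a published streaming table with standard SQL from Databricks SQL or a notebook. The schema and table names in this sketch are placeholders; substitute the target schema and table names from your pipeline:

SQL
SELECT * FROM my_target_schema.my_streaming_table
LIMIT 10;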

important

When you create a target configuration, only tables and associated metadata are published. Views are not published to the metastore.

Configure a pipeline to publish to Hive metastore

When creating a new pipeline, you can specify Hive metastore under the Storage options to publish to the legacy Hive metastore. You must specify a default target schema when publishing to Hive metastore. See Configure Lakeflow Declarative Pipelines.
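If you create or edit the pipeline with the Lakeflow Declarative Pipelines API instead of the UI, the default target schema is typically set with the target field in the pipeline settings JSON. The pipeline and schema names in the following sketch are placeholders:

JSON
{
  "name": "my_pipeline",
  "target": "my_schema"
}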

Specify a storage location

You can specify a storage location for a pipeline that publishes to the Hive metastore. The primary motivation for specifying a location is to control the object storage location for data written by your pipeline. Databricks recommends always specifying a storage location to avoid writing to the DBFS root.
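In the pipeline settings JSON, the storage location typically corresponds to the storage field, alongside the default target schema. The path and names in this sketch are placeholders for illustration:

JSON
{
  "name": "my_pipeline",
  "storage": "/mnt/lakeflow/my_pipeline",
  "target": "my_schema"
}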

Because Lakeflow Declarative Pipelines fully manages all tables, data, checkpoints, and metadata for a pipeline, most interaction with Lakeflow Declarative Pipelines datasets happens through tables registered to the Hive metastore or Unity Catalog.

Cloud storage configuration

To access a bucket in Google Cloud Storage (GCS), you must create a service account with access to that GCS bucket and add that service account to the cluster configurations. For more information about creating a Google Cloud service account, see Connect to Google Cloud Storage. You can add the service account configuration when you create or edit a pipeline with the Lakeflow Declarative Pipelines API or in the Lakeflow Declarative Pipelines UI:

  1. On the Pipeline details page for your pipeline, click the Settings button. The Pipeline settings page appears.
  2. Click the JSON button.
  3. Enter the service account configuration in the gcp_attributes.google_service_account field in the cluster configuration:
JSON
{
  "clusters": [
    {
      "gcp_attributes": {
        "google_service_account": "test-gcs-doc@databricks-dev.iam.gserviceaccount.com"
      }
    }
  ]
}

Work with the event log for Hive metastore pipelines

If your pipeline publishes tables to the Hive metastore, the event log is stored in /system/events under the storage location. For example, if you have configured your pipeline storage setting as /Users/username/data, the event log is stored in the /Users/username/data/system/events path in DBFS.

If you have not configured the storage setting, the default event log location is /pipelines/<pipeline-id>/system/events in DBFS. For example, if the ID of your pipeline is 91de5e48-35ed-11ec-8d3d-0242ac130003, the storage location is /pipelines/91de5e48-35ed-11ec-8d3d-0242ac130003/system/events.

You can create a view to simplify querying the event log. The following example creates a temporary view called event_log_raw. This view is used in the example event log queries included in this article:

SQL
CREATE OR REPLACE TEMP VIEW event_log_raw
AS SELECT * FROM delta.`<event-log-path>`;

Replace <event-log-path> with the event log location.
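For example, if your pipeline storage setting is /Users/username/data, as in the earlier example, the view definition is:

SQL
CREATE OR REPLACE TEMP VIEW event_log_raw
AS SELECT * FROM delta.`/Users/username/data/system/events`;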

Each instance of a pipeline run is called an update. You often want to extract information for the most recent update. Run the following query to find the identifier for the most recent update and save it in the latest_update temporary view. This view is used in the example event log queries included in this article:

SQL
CREATE OR REPLACE TEMP VIEW latest_update AS
SELECT origin.update_id AS id
FROM event_log_raw
WHERE event_type = 'create_update'
ORDER BY timestamp DESC
LIMIT 1;

You can query the event log in a Databricks notebook or the SQL editor. Use either to run the example event log queries.
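For example, the following query combines both views to list the events recorded for the most recent update. It assumes the standard event log columns timestamp, event_type, message, and origin.update_id:

SQL
SELECT timestamp, event_type, message
FROM event_log_raw
WHERE origin.update_id = (SELECT id FROM latest_update)
ORDER BY timestamp;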

Example pipeline source code notebooks for workspaces without Unity Catalog

You can import the following notebooks into a Databricks workspace without Unity Catalog enabled and use them to deploy Lakeflow Declarative Pipelines. Import the notebook for your chosen language and specify the path in the Source code field when configuring a pipeline with the Hive metastore storage option. See Configure Lakeflow Declarative Pipelines.
