
Create Lakeflow Declarative Pipelines with dlt-meta

This article introduces dlt-meta, a Databricks Labs project that provides tools to generate Lakeflow Declarative Pipelines from metadata that you maintain.

note

The open source dlt-meta project, like all projects in the databrickslabs GitHub account, exists for exploration purposes only. Databricks does not support it or provide service-level agreements (SLAs) for it. Do not submit Databricks support tickets for issues related to this project. Instead, file a GitHub issue, which will be reviewed as time permits.

What is dlt-meta?

Lakeflow Declarative Pipelines lets you declaratively specify a table, and it generates a flow in a pipeline that both creates the table and keeps it up to date as the source data changes. However, if your organization has hundreds of tables, generating and managing these pipelines is time-consuming and can lead to inconsistent practices.
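
For context, a single table in a pipeline is typically declared in Python (or SQL). The following is a minimal sketch using the dlt Python API; the table name and source path are illustrative, not part of dlt-meta:

```python
import dlt

# `spark` is provided by the pipeline runtime in a Lakeflow Declarative Pipelines notebook.
@dlt.table(
    name="customers_bronze",
    comment="Raw customer records ingested incrementally with Auto Loader."
)
def customers_bronze():
    # Incrementally load new files from cloud storage; the path is a placeholder.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/raw/customers")
    )
```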

The dlt-meta project is a metadata-driven metaprogramming framework designed to work with Lakeflow Declarative Pipelines. It automates bronze and silver data pipelines by using metadata recorded in a set of JSON and YAML files. The dlt-meta engine uses Python to dynamically generate Lakeflow Declarative Pipelines code for the flows described in your metadata. You maintain the metadata about your pipelines, and dlt-meta generates your pipelines.
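
Conceptually, the pattern resembles the following simplified sketch. This is not the dlt-meta implementation: dlt-meta reads its specs from your onboarding JSON/YAML files and the compiled DataflowSpec, while the inline Python list, field names, and paths below are purely illustrative.

```python
import dlt

# Illustrative table specs; dlt-meta reads these from onboarding JSON/YAML files instead.
bronze_specs = [
    {
        "name": "customers_bronze",
        "path": "/Volumes/main/raw/customers",   # hypothetical source location
        "format": "json",
        "rules": {"valid_id": "customer_id IS NOT NULL"},
    },
    {
        "name": "orders_bronze",
        "path": "/Volumes/main/raw/orders",
        "format": "csv",
        "rules": {"valid_order": "order_id IS NOT NULL"},
    },
]

def generate_bronze_table(spec):
    # A function wrapper captures each spec so every generated table gets its own definition.
    @dlt.table(name=spec["name"], comment=f"Bronze table generated from metadata: {spec['name']}")
    @dlt.expect_all_or_drop(spec["rules"])
    def bronze():
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", spec["format"])
            .load(spec["path"])
        )

for spec in bronze_specs:
    generate_bronze_table(spec)
```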

With your logic centralized in one place (the metadata), your pipelines are faster to build, reusable, and easier to maintain.

Benefits of dlt-meta

There are two main use cases for dlt-meta:

  • Ingest and clean a large number of tables with minimal effort.
  • Enforce data engineering standards across multiple pipelines and users.

The benefits of using a metadata-driven approach include:

  • Metadata can be maintained without knowledge of Python or SQL.
  • Maintaining metadata, rather than code, requires less overhead and reduces errors.
  • Because dlt-meta generates the code, pipelines and published tables stay consistent, with less custom code to maintain.
  • You can easily group tables into pipelines within the metadata, generating only the number of pipelines needed to update your data efficiently.

How does it work?

The following image shows an overview of the dlt-meta system:

(Image: dlt-meta overview)

  1. You create the metadata files as input to dlt-meta, specifying your source files and outputs, quality rules, and required processing.
  2. The dlt-meta engine compiles the onboarding files into a data flow specification, called a DataflowSpec, and stores it for later use.
  3. The dlt-meta engine uses the DataflowSpec to create pipelines that generate your bronze tables. These pipelines use your metadata to read the source data and apply data expectations that match your quality rules.
  4. The dlt-meta engine then uses the DataflowSpec to create additional pipelines that generate your silver tables. These pipelines use your metadata to apply the appropriate transformations and other processing for your system (see the sketch below).

You run the pipelines generated by dlt-meta to keep the output current as your source data is updated.
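
As a rough illustration of steps 3 and 4, a silver flow can be generated from a spec in the same way as the bronze example above: stream from the bronze table, apply the transformations recorded in the metadata, and enforce the quality rules as expectations. The spec fields below are illustrative and are not the actual DataflowSpec schema.

```python
import dlt

# Hypothetical silver spec; the real DataflowSpec schema is defined by dlt-meta.
silver_spec = {
    "source": "customers_bronze",
    "target": "customers_silver",
    "select_exp": [
        "customer_id",
        "lower(email) AS email",
        "CAST(signup_date AS DATE) AS signup_date",
    ],
    "rules": {"valid_email": "email IS NOT NULL"},
}

def generate_silver_table(spec):
    @dlt.table(name=spec["target"], comment=f"Silver table generated from metadata: {spec['target']}")
    @dlt.expect_all_or_drop(spec["rules"])
    def silver():
        # Stream from the bronze table and apply the transformations declared in the metadata.
        return dlt.read_stream(spec["source"]).selectExpr(*spec["select_exp"])

generate_silver_table(silver_spec)
```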

How do I get started?

To use dlt-meta, you must:

  • Deploy and configure the dlt-meta solution.
  • Prepare the metadata for your bronze and silver layer tables.
  • Create a job to onboard the metadata.
  • Use the metadata to create pipelines for your tables.

The dlt-meta documentation on GitHub includes a tutorial to help you get started with this process. For more information, see Getting started with dlt-meta.
