Develop and debug ETL pipelines with the multi-file editor in DLT

Beta

This feature is in Beta for the Premium Plan. For all other plans, this feature is in Private Preview. To try it in Private Preview, contact your Databricks account team.

This article describes using the multi-file editor in DLT to develop and debug ETL (extract, transform, and load) pipelines. The multi-file editor shows a pipeline as a set of files in the pipeline assets browser. You can edit the files and control the configuration of the pipeline and which files to include in one location.

For the default development experience using a single notebook in DLT, see Develop and debug ETL pipelines with a notebook in DLT.

Overview of the multi-file editor

The ETL pipeline multi-file editor has the following features:

  1. Pipeline asset browser: Create, delete, rename, and organize pipeline assets.
  2. Multi-file code editor with tabs: Work across multiple code files associated with a pipeline.
  3. Pipeline-specific toolbar: Enables pipeline configuration and has pipeline-level run actions.
  4. Interactive directed acyclic graph (DAG): Get an overview of your tables, open the data previews bottom bar, and perform other table-related actions.
  5. Data preview: Inspect the data of your streaming tables and materialized views.
  6. Table-level execution insights: Get execution insights for all tables or a single table in a pipeline. The insights refer to the latest pipeline run.
  7. Error tray: Summarizes errors across all files in the pipeline and lets you navigate to the place where each error occurred in a specific file. It complements code-affixed error indicators.
  8. Selective execution: The code editor has features for step-by-step development, such as the ability to refresh only the tables in the current file using the Run file action, or to refresh a single table.
  9. Default pipeline folder structure: New pipelines include a predefined folder structure and sample code that you can use as a starting point for your pipeline.
  10. Simplified pipeline creation: Provide a name, catalog, and schema where tables should be created by default, and a pipeline is created using default settings. You can later adjust Settings from the pipeline editor toolbar.

DLT multi-file editor

Enable the multi-file editor

note

If you use the feature in Private Preview, you must first enable Pipelines multi-file developer experience. See Manage Databricks Previews for more information.

You can enable the ETL pipeline multi-file editor in multiple ways:

  • When you create a new ETL pipeline, enable the multi-file editor in DLT with the ETL Pipeline editor toggle.

    DLT multi-file editor toggle on

    The first time you enable the multi-file editor, the pipeline's advanced settings page is used. The next time you create a new pipeline, the simplified pipeline creation window is used.

  • For an existing pipeline, open a notebook used in a pipeline and enable the ETL Pipeline editor toggle in the header. You can also go to the pipeline monitoring page and click Settings to enable the multi-file editor.

After you have enabled the ETL Pipeline editor toggle, all ETL pipelines will use the multi-file editor by default. You can turn the ETL pipeline multi-file editor on and off from the editor.

Alternatively, you can enable the multi-file editor from user settings:

  1. Click your user badge in the upper-right area of your workspace and then click Settings and Developer.
  2. Enable Tabs for notebooks and files.
  3. Enable ETL Pipeline multi-file editor.

Create a new ETL pipeline

To create a new ETL pipeline using the multi-file editor, follow these steps:

  1. At the top of the sidebar, click New and ETL pipeline.

    Create a new ETL pipeline

  2. In Name, type a unique name for your pipeline.

  3. Select an existing Default catalog, and an existing or new Default schema.

    The default catalog and the default schema are where datasets are read from or written to. See Database objects in Databricks for more information.

  4. Select Python or SQL as the Language for sample code.

    You can have both SQL and Python source code files in your ETL pipeline. The Language for sample code only determines the language of the sample code included in your pipeline by default, as shown in the sketch after these steps.

  5. Click Create.

    Create a new ETL pipeline

The ETL pipeline is created with default settings. You can adjust these settings from the pipeline toolbar, or select Create advanced pipeline to provide your preferred settings. See Configure a DLT pipeline for more information.
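
If you choose Python as the Language for sample code, the generated files typically use the DLT Python decorator API. The following is a minimal sketch of that shape, not the exact generated sample; the table name and the samples.nyctaxi.trips source are placeholders chosen for illustration:

Python
import dlt

# `spark` is provided by the pipeline runtime in DLT source code files.
@dlt.table(comment="Hypothetical example table; the generated sample code differs.")
def taxi_trips_sample():
    # A materialized view over a Databricks sample dataset.
    return spark.read.table("samples.nyctaxi.trips")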

Alternatively, you can create an ETL pipeline from the workspace browser:

  1. Click Workspace in the left side panel.
  2. Click Create in the upper-right corner and click ETL pipeline.

You can also create an ETL pipeline from the jobs and pipelines page:

  1. Click Jobs in the left side panel.
  2. Click the Jobs & pipelines tab.
  3. Click Create in the upper-right corner and click ETL pipeline.

Open an existing ETL pipeline

To open an existing ETL pipeline in the multi-file editor, follow these steps:

  1. Click Workspace in the side panel.
  2. Navigate to a folder with source code files for your pipeline.
  3. Click the source code file to open the pipeline in the editor.

Open an existing ETL pipeline

You can also open an existing ETL pipeline in the following ways:

  • From the workspace browser sidebar shown alongside editors, open a file configured as source code for a pipeline.
  • On the Recents page on the left sidebar, open a pipeline or a file configured as the source code for a pipeline.
  • In the pipeline monitoring page, click Edit pipeline.
  • On the Job Runs page in the left sidebar, click the Jobs & pipelines tab, then click Kebab menu and Edit pipeline.
  • When you create a new job and add a pipeline task, you can click open in new tab New Tab Icon when you choose a pipeline under Pipeline.

Pipeline assets browser

The multi-file pipeline editor has a special mode for the workspace browser sidebar, called the pipeline assets browser, which focuses the panel on the pipeline by default. It has two tabs:

  • Pipeline: This is where you can find all files associated with the pipeline. You can create, delete, rename, and organize them into folders.
  • All files: All other workspace assets are available here.

Pipeline asset browser

You can have the following types of files in your pipeline:

  • Source code files: These files are part of the pipeline's source code definition, which can be seen in Settings. Databricks recommends always storing source code files inside the pipeline root folder; otherwise, they will be shown in an external file section at the bottom of the browser and have a less rich feature set.
  • Non-source code files: These files are stored inside the pipeline root folder but are not part of the pipeline source code definition.
important

You must use the pipeline assets browser under the Pipeline tab to manage files and folders for your pipeline. This will update the pipeline settings correctly. Moving or renaming files and folders from your workspace browser or the All files tab will break the pipeline configuration, and you must then resolve this manually in Settings.

Root folder

The pipeline assets browser is anchored in a pipeline root folder. When you create a new pipeline, the pipeline root folder is created in your user home folder and given the same name as the pipeline.

You can change the root folder in the pipeline assets browser. This is useful if you want to use a Git folder for your pipeline.

  1. Click Kebab menu for the root folder.
  2. Click Configure new root folder.
  3. Under Pipeline root folder click Folder Icon and choose another folder as the pipeline root folder.
  4. Click Save.

Change pipeline root folder

In the Kebab menu for the root folder, you can also click Rename root folder to rename the folder, or click Move root folder to move it, for example, into a Git folder.

You can also change the pipeline root folder in settings:

  1. Click Settings.
  2. Under Code assets click Configure paths.
  3. Click Folder Icon to change the folder under Pipeline root folder.
  4. Click Save.
note

If you change the pipeline root folder, the file list displayed by the pipeline assets browser will be affected, as the files in the previous root folder will now be shown as external files.

Existing pipeline with no root folder

An existing pipeline created in the default development experience using a single notebook in DLT won't have a root folder configured. Follow these steps to configure the root folder for your existing pipeline:

  1. In the pipeline assets browser, click Configure.
  2. Click Folder Icon to select the root folder under Pipeline root folder.
  3. Click Save.

No pipeline root folder

Default folder structure

When you create a new pipeline, a default folder structure is created. This is the recommended structure for organizing your pipeline source and non-source code files, as described below.

A small number of sample code files are created in this folder structure.

The default folders and the recommended types of files for each are:

  • <pipeline_root_folder>: Root folder that contains all folders and files for your pipeline.
  • explorations: Non-source code files, such as notebooks, queries, and code files used for exploratory data analysis.
  • transformations: Source code files, such as Python or SQL code files with table definitions.
  • utilities: Non-source code files with Python modules that can be imported from other code files. If you choose SQL as your language for sample code, this folder is not created.

You can rename the folders or change the structure to fit your workflow. To add a new source code folder, follow these steps:

  1. Click Add in the pipeline assets browser.
  2. Click Create pipeline source code folder.
  3. Enter a folder name and click Create.

Source code files

Source code files are part of the pipeline's source code definition. When you run the pipeline, these files are evaluated. Files and folders that are part of the source code definition have a special icon with a mini Pipeline icon superimposed.

To add a new source code file, follow these steps:

  1. Click Add in the pipeline assets browser.
  2. Click Transformation or Data source.
  3. Enter a Name for the file and select Python or SQL as the Language.
  4. Click Create.

You can also click Kebab menu for any folder in the pipeline assets browser to add a source code file.

A transformations folder for source code is created by default when you create a new pipeline. When you add a data source file, a data_sources folder is created if it doesn't exist.

The source code folders are as follows:

  • transformations: The recommended location for pipeline source code, such as Python or SQL code files with pipeline table definitions. This folder is created by default when you create a new pipeline.
  • data_sources: The recommended location for code related to reading source datasets, such as creating views or loading data from cloud files. It can be combined with Spark configuration parameters or Databricks Asset Bundles (DABs) to change where a pipeline reads from depending on various conditions. For example, DABs could read different source data across dev and prod environments using a folder path convention of sources/dev/ and sources/prod/, respectively. This folder is not created by default. If it doesn't exist, it is created when you click Add and Data source in the pipeline assets browser to create a data source file.
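
As a sketch of the data_sources pattern described above, the following hypothetical file reads raw files with Auto Loader from a path supplied as a Spark configuration parameter. The my_pipeline.source_path key, the default path, and the view name are assumptions for illustration:

Python
import dlt

# Hypothetical configuration key; set it in the pipeline configuration
# (or per target with Databricks Asset Bundles) to switch source locations.
SOURCE_PATH = spark.conf.get("my_pipeline.source_path", "/Volumes/dev/raw/events/")

@dlt.view(comment="Raw events read from cloud files; the path varies by environment.")
def raw_events():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load(SOURCE_PATH)
    )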

Non-source code files

Non-source code files are stored inside the pipeline root folder but are not part of the pipeline source code definition. These files are not evaluated when you run the pipeline. Non-source code files cannot be external files.

You can use this for files related to your work on the pipeline that you'd like to store together with the source code. For example:

  • Notebooks that you use for ad hoc explorations executed on non-DLT compute outside the lifecycle of a pipeline.
  • Python modules that are not to be evaluated with your source code unless you explicitly import these modules inside your source code files.

To add a new non-source code file, follow these steps:

  1. Click Add in the pipeline assets browser.
  2. Click Exploration or Utility.
  3. Enter a Name for the file.
  4. Click Create.

You can also click Kebab menu for the pipeline root folder or a non-source code file to add non-source code files to the folder.

When you create a new pipeline, the following folders for non-source code files are created by default:

  • explorations: The recommended location for notebooks, queries, dashboards, and other files that you run on non-DLT compute, as you would normally do outside of a pipeline's execution lifecycle.

    Important: Do not add these files as source code for the pipeline. The pipeline could fail because these files typically contain arbitrary non-DLT code.

  • utilities: The recommended location for Python modules that can be imported from other files via direct imports expressed as from <filename> import, as long as their parent folder is under the root folder.

You can also import Python modules located outside the root folder, but in that case, you must append the folder path to sys.path in your Python code:

Python
import sys, os

# Make the utilities folder outside the pipeline root importable.
sys.path.append(os.path.abspath('<alternate_path_for_utilities>/utilities'))
from utils import *
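
When the utilities folder is under the pipeline root folder, the sys.path step is not needed and you can import by file name directly. A minimal sketch, assuming a hypothetical utilities/utils.py that defines a clean_orders helper:

Python
import dlt

# Direct import by file name; works because utilities/ sits under the pipeline root folder.
from utils import clean_orders  # clean_orders is a hypothetical helper function

@dlt.table(comment="Applies a helper imported from the utilities folder.")
def orders_cleaned():
    return clean_orders(spark.read.table("samples.nyctaxi.trips"))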

External files

The pipeline browser's External files section shows source code files outside the root folder.

To move an external file to the root folder, such as the transformations folder, follow these steps:

  1. Click Kebab menu for the file in the assets browser and click Move.
  2. Choose the folder to which you want to move the file and click Move.

Files associated with multiple pipelines

A badge is shown in the file's header if a file is associated with more than one pipeline. It has a count of associated pipelines and allows switching to the other ones.

All files section

In addition to the Pipeline section, there is an All files section, where you can open any file in your workspace. Here you can:

  • Open files outside the root folder in a tab without leaving the multi-file editor.
  • Navigate to another pipeline's source code files and open them to switch the editor's focus to that pipeline.
  • Move files to the pipeline’s root folder.
  • Include files outside the root folder in the pipeline source code definition.

Run pipeline code

You have three options to run your pipeline code:

  1. Run all source code files in the pipeline: Click Run pipeline or Run pipeline with full table refresh to run all table definitions in all files defined as pipeline source code.

    Run pipeline

    You can also click Dry run to validate the pipeline without updating any data.

  2. Run the code in a single file: Click Run file or Run file with full table refresh to run all table definitions in the current file.

    Run file

  3. Run the code for a single table: Click Run table DLT Run Table Icon for a table definition in a source code file, and click Refresh table or Full refresh table.

    Run table

Directed acyclic graph (DAG)

After you have run or validated all source code files in the pipeline, you will see a directed acyclic graph (DAG) that shows the dependencies between the tables. Each node has a different state along the pipeline lifecycle, such as validated, running, or error.

Directed acyclic graph (DAG)

You can toggle the graph on and off by clicking the graph icon in the right side panel. You can also change the orientation to vertical or horizontal.

Clicking on a node will show the data preview and table definition. When you edit a file, the tables defined in that file are highlighted in the graph.
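
As an illustration of how edges in the graph arise, in Python source code a dependency between two tables is created when one reads the other through dlt.read; the table names below are hypothetical:

Python
import dlt

@dlt.table(comment="Hypothetical upstream table.")
def bronze_trips():
    return spark.read.table("samples.nyctaxi.trips")

@dlt.table(comment="Hypothetical downstream table.")
def silver_trips():
    # Reading bronze_trips through dlt.read creates the edge shown in the DAG.
    return dlt.read("bronze_trips").where("trip_distance > 0")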

Data previews

The data preview section shows sample data for a selected table.

You will see a preview of the table's data when you click a node in the directed acyclic graph (DAG).

If no table is selected, go to the Tables section and click View data preview DLT View Data Preview Icon. If a table is already selected, click All tables to return to the full list of tables.

Execution insights

You can see table execution insights for the latest pipeline update in the panels at the bottom of the editor.

The panels are as follows:

  • Tables: Lists all tables with their statuses and metrics. If you select one table, you will see the metrics and performance for that table and a tab for the data preview.

  • Performance: Query history and profiles for all flows in this pipeline. You can access execution metrics and detailed query plans during and after execution. See Access query history for DLT pipelines for more information.

  • Issues: Simplified errors and warnings view. You can navigate to the event log from this tray, click an entry to see more details, and then navigate to the place in the code where the error occurred. If the error is in a file other than the one currently displayed, you are redirected to the file where the error is.

    Click View details to see the corresponding event log entry for complete details. Click View logs to see the complete event log.

    Code-affixed error indicators are shown for errors associated with a specific part of the code. To get more details, click the error icon or hover over the red line. A pop-up with more information appears. You can then click Quick fix to reveal a set of actions to troubleshoot the error.

  • Event log: All events triggered during the last pipeline run. To navigate to this panel, click View logs, or click Open in logs on any code-affixed error.

Pipeline settings

To access the pipeline settings panel, click Settings in the toolbar or click Gear icon in the mini card on the pipeline assets browser.

Pipeline settings

Limitations and known issues

See the following limitations and known issues for the ETL pipeline multi-file editor in DLT:

  1. The workspace browser sidebar will not focus on the pipeline if you start by opening a file in the explorations folder or a notebook, as these files or notebooks are not part of the pipeline source code definition.
    1. To enter the pipeline focus mode in the workspace browser, open a file associated with the pipeline.
  2. Data previews are not supported for regular views.
  3. Notifications cannot be defined on the editor page. Use the legacy settings page link under the Advanced settings section.
  4. Multi-table refreshes can only be performed from the pipeline monitoring page. Use the mini-card in the pipeline browser to navigate to that page.
  5. Run table DLT Run Table Icon can appear at an incorrect position due to line wrapping in your code.
  6. %pip install is not supported in files (the default asset type in the new editor). You can run %pip install from a notebook that is part of the pipeline's source code definition. Use Settings to add a notebook.

FAQ

  1. Why use files and not notebooks for source code?

    Notebooks' cell-based execution model is not compatible with DLT, so we had to turn off features or change their behavior, which led to confusion.

    In the ETL pipeline multi-file editor, the file editor is used as a foundation for a first-class editor for DLT. Features are targeted explicitly to DLT, like Run table DLT Run Table Icon, rather than overloading familiar features with different behavior.

  2. Can I still use notebooks as source code?

    Yes, you can. However, some features, such as Run table DLT Run Table Icon or Run file, will not be present.

    If you have an existing pipeline using notebooks, it will still work in the new editor. However, Databricks recommends switching to files for new pipelines.

  3. How can I add existing code to a newly created pipeline?

    You can add existing source code files to a new pipeline. To add a folder with existing files, follow these steps:

    1. Click Settings.
    2. Under Source code click Configure paths.
    3. Click Add path and choose the folder for the existing files.
    4. Click Save.

    You can also add individual files:

    1. Click All files in the pipeline assets browser.
    2. Navigate to your file, click Kebab menu, and click Include in pipeline.

    Consider moving these files to the pipeline root folder. If left outside the pipeline root folder, they will be shown in the External files section.

  4. Can I manage the pipeline source code in Git?

    You can move the root folder to a Git folder in the pipeline assets browser:

    1. Click Kebab menu for the root folder.
    2. Click Move root folder.
    3. Choose a new location for your root folder and click Move.

    See the Root folder section for more information.

    After the move, you will see the familiar Git icon next to your root folder’s name.

    important

    To move the pipeline root folder, use the pipeline assets browser and the above steps. Moving it any other way will break the pipeline configurations, and you must manually configure the correct folder path in Settings.

  5. Can I have multiple pipelines in the same root folder?

    You can, but Databricks recommends having only a single pipeline per root folder.

  6. When should I run a dry run?

    Click Dry run to check your code without updating the tables.

  7. When should I use temporary views, and when should I use materialized views in my code?

    Use temporary views when you do not want to materialize the data. For example, a temporary view can be an intermediate step in a sequence of transformations that prepares the data before it is materialized in a streaming table or materialized view registered in the catalog.
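
    A minimal Python sketch of this pattern, assuming the samples.nyctaxi.trips sample dataset and hypothetical dataset names; the temporary view holds the intermediate preparation step, and only the materialized view is registered in the catalog:

    Python
    import dlt
    from pyspark.sql.functions import col, to_date

    @dlt.view(comment="Temporary intermediate step; nothing is materialized here.")
    def trips_prepared():
        return (
            spark.read.table("samples.nyctaxi.trips")
            .where(col("trip_distance") > 0)
        )

    @dlt.table(comment="Materialized view registered in the catalog.")
    def trips_by_day():
        return (
            dlt.read("trips_prepared")
            .groupBy(to_date(col("tpep_pickup_datetime")).alias("pickup_date"))
            .count()
        )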