
Import Python modules from Git folders or workspace files

You can store Python code in Databricks Git folders or in workspace files and then import that Python code into your Lakeflow Declarative Pipelines. For more information about working with modules in Git folders or workspace files, see Work with Python and R modules.

To import a Python file, you have multiple options:

  • Include the Python module in your pipeline as a utility file. This works best if the module is specific to the pipeline.
  • Add a shared module to the environment of any pipeline that needs to use it.
  • Import a module from your workspace directly into your Python source code with an import statement.

Include a Python module in a pipeline

You can create a Python module as part of your pipeline. Your pipeline root folder is automatically appended to sys.path, which allows you to reference the module directly in your pipeline Python source code.

The following example demonstrates creating a Python module under your pipeline root folder and referencing it from a Python source file in your pipeline:

  1. Open your pipeline in the pipeline editor.

  2. From the pipeline asset browser on the left, click Add (the plus icon), then choose Utility from the menu.

  3. Enter my_utils.py for the Name.

  4. Leave the default path, and click Create.

    This creates the my_utils.py file in the utilities folder of your pipeline, creating that folder if it doesn't exist. The files in this folder are not added to your pipeline source by default, but they are available to be called from the .py files that are part of your source code.

    By default, the utility file has a sample function called distance_km() that takes a distance in miles and converts it to kilometers.

  5. In a Python source file in your transformations folder (you can create one by clicking Add (the plus icon) and then selecting Transformation from the menu), add the following code:

    Python
    from utilities import my_utils

You can now call functions in my_utils from that Python file. You must add the import statement to any Python file that needs to call functions in the module.
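
For illustration, a minimal sketch of a utility module and a transformation that calls it might look like the following. This is only a sketch: the exact contents of the generated sample file may differ, and the file names, comment text, and the samples.nyctaxi.trips source table are assumptions used here for the example.

Python
# utilities/my_utils.py (sketch; the generated sample file may differ)
MILES_TO_KM = 1.609344

def distance_km(distance_miles):
    """Convert a distance in miles to kilometers."""
    return distance_miles * MILES_TO_KM

Python
# transformations/trips_km.py (hypothetical transformation that calls the utility)
import dlt
from pyspark.sql.functions import col

from utilities import my_utils

@dlt.table(comment="Trip data with the distance converted to kilometers.")
def trips_km():
    # samples.nyctaxi.trips is used only as an example source table
    return (
        spark.read.table("samples.nyctaxi.trips")
        .withColumn("trip_distance_km", my_utils.distance_km(col("trip_distance")))
    )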

Add a Python module to your pipeline environment

If you want to share a Python module across multiple pipelines, you can save the module anywhere in your workspace files, and reference it from the environment of any pipeline that needs to use it. You can reference Python modules that are:

  • Individual Python (.py) files.
  • Python projects packaged as Python wheel (.whl) files.
  • Unpackaged Python projects with a pyproject.toml file that defines the project name and version.

The following example shows how to add a dependency to a pipeline.

  1. Open your pipeline in the pipeline editor.

  2. Click Settings (the gear icon) in the top bar.

  3. In the Pipeline settings slide-out, under Pipeline environment, click Edit environment (the pencil icon).

  4. Add a dependency. For example, to add a file stored in a Unity Catalog volume, you might add /Volumes/libraries/path/to/python_files/file.py. For a Python wheel stored in a Git folder, your path might look like /Workspace/libraries/path/to/wheel_files/file.whl.

    If the file is in the root folder of the pipeline, you can add it with no path or with a relative path.

note

You can also add a path to a shared folder as a dependency so that import statements in your code can find the modules you want to import. For example, -e /Workspace/Users/<user_name>/path/to/add/.
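
For example, assuming a pipeline environment that lists a wheel file and an editable folder path similar to the examples above, your pipeline source code could then import from them directly. The package, module, and function names below are placeholders, not real libraries:

Python
# Assumes the pipeline environment includes dependencies such as (hypothetical paths):
#   /Workspace/libraries/path/to/wheel_files/file.whl
#   -e /Workspace/Users/<user_name>/path/to/add/

from my_package import compute_metrics   # placeholder: a function provided by the wheel
import shared_utils                       # placeholder: a .py module in the -e folder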

Import a Python module using the import statement

You can also directly reference a workspace file in your Python source code.

  • If the file is in the utilities folder of your pipeline, you can reference it without a path:

    Python
    from utilities import my_module
  • If the file is anywhere else, you can import it by first appending the module's path to sys.path:

    Python
    import sys, os
    sys.path.append(os.path.abspath('<module-path>'))

    from my_module import *
  • You can also append the path to sys.path for all pipeline source files by adding it to the pipeline environment, as described in the previous section. A combined sketch of these options follows this list.
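
Putting these options together, a pipeline source file might look like the following sketch. The module names other than utilities/my_module are placeholders, and <module-path> stands for the folder you want to import from:

Python
import sys, os

# Option 1: a module in the pipeline's utilities folder needs no path.
from utilities import my_module

# Option 2: for a module stored elsewhere, append its folder to sys.path first.
sys.path.append(os.path.abspath('<module-path>'))  # replace <module-path> with your folder
from shared_module import *  # placeholder module name

# Option 3: if the folder is already listed in the pipeline environment
# (for example, -e /Workspace/Users/<user_name>/path/to/add/), you can
# import directly without modifying sys.path.
import another_module  # placeholder module name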

Example of importing queries as Python modules

The following example demonstrates importing dataset queries as Python modules from workspace files. Although this example describes using workspace files to store the pipeline source code, you can use it with source code stored in a Git folder.

To run this example, use the following steps:

  1. Click Workspace in the sidebar of your Databricks workspace to open the workspace browser.

  2. Use the workspace browser to select a directory for the Python modules.

  3. Click the kebab menu in the rightmost column of the selected directory and click Create > File.

  4. Enter a name for the file, for example, clickstream_raw_module.py. The file editor opens. To create a module to read source data into a table, enter the following in the editor window:

    Python
    from dlt import *

    json_path = "/databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed-json/2015_2_clickstream.json"

    def create_clickstream_raw_table(spark):
        @table
        def clickstream_raw():
            return (
                spark.read.json(json_path)
            )
  5. To create a module that creates a new table containing prepared data, create a new file in the same directory, enter a name for the file, for example, clickstream_prepared_module.py, and enter the following in the new editor window:

    Python
    from clickstream_raw_module import *
    from dlt import read
    from pyspark.sql.functions import *
    from pyspark.sql.types import *

    def create_clickstream_prepared_table(spark):
        create_clickstream_raw_table(spark)

        @table
        @expect("valid_current_page_title", "current_page_title IS NOT NULL")
        @expect_or_fail("valid_count", "click_count > 0")
        def clickstream_prepared():
            return (
                read("clickstream_raw")
                .withColumn("click_count", expr("CAST(n AS INT)"))
                .withColumnRenamed("curr_title", "current_page_title")
                .withColumnRenamed("prev_title", "previous_page_title")
                .select("current_page_title", "click_count", "previous_page_title")
            )
  6. Next, create a Python file in your pipeline source. From the pipeline editor, click Add (the plus icon), then choose Transformation.

  7. Name your file and confirm Python is the default language.

  8. Click Create.

  9. Enter the following example code in the new Python file.

    note

    If your source file imports modules or packages from a workspace files path or a Git folders path different from the source file's directory, you must manually append the path to the files using sys.path.append().

    If you are importing a file from a Git folder, you must prepend /Workspace/ to the path. For example, sys.path.append('/Workspace/...'). Omitting /Workspace/ from the path results in an error.

    If the modules or packages are stored in the same directory as the source file, you do not need to append the path manually. You also do not need to manually append the path when importing from the root directory of a Git folder, because the root directory is automatically appended to the path.

    Python
    import sys, os
    sys.path.append(os.path.abspath('<module-path>'))

    import dlt
    from clickstream_prepared_module import *
    from pyspark.sql.functions import *
    from pyspark.sql.types import *

    create_clickstream_prepared_table(spark)

    @dlt.table(
        comment="A table containing the top pages linking to the Apache Spark page."
    )
    def top_spark_referrers():
        return (
            spark.read.table("catalog_name.schema_name.clickstream_prepared")
            .filter(expr("current_page_title == 'Apache_Spark'"))
            .withColumnRenamed("previous_page_title", "referrer")
            .sort(desc("click_count"))
            .select("referrer", "click_count")
            .limit(10)
        )

    Replace <module-path> with the path to the directory containing the Python modules to import.

  10. To run the pipeline, click Run pipeline.