Import Python modules from Git folders or workspace files
You can store Python code in Databricks Git folders or in workspace files and then import that Python code into your Lakeflow Declarative Pipelines. For more information about working with modules in Git folders or workspace files, see Work with Python and R modules.
To import a Python file, you have multiple options:
- Include the Python module in your pipeline as a utility file. This works best if the module is specific to the pipeline.
- Add a shared module to the environment of each pipeline that needs to use it.
- Import a module in your workspace directly into your Python source code with an `import` statement.
Include a Python module in a pipeline
You can create a Python module as part of your pipeline. Your pipeline root folder is automatically appended to `sys.path`, so you can reference the module directly in your pipeline Python source code.
The following example demonstrates how to create a Python module under your pipeline root folder and reference it from a Python source file in your pipeline:

- Open your pipeline in the pipeline editor.
- From the pipeline asset browser on the left, click Add, then choose Utility from the menu.
- Enter `my_utils.py` for the Name.
- Leave the default path, and click Create.

  This creates the `my_utils.py` file in the `utilities` folder of your pipeline, creating the `utilities` folder if it doesn't already exist. The files in this folder are not added to your pipeline source by default, but are available to be called from the `.py` files that are part of your source code.

  By default, the utility file has a sample function called `distance_km()` that takes a distance in miles and converts it to kilometers.

- In a Python source file in your transformations folder (you can create one by choosing Add, then selecting Transformation from the menu), add the following code:

  ```python
  from utilities import my_utils
  ```

You can now call functions in `my_utils` from that Python file. Add the `import` statement in every Python file that needs to call functions in the module.
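For example, you might call the sample utility from a dataset definition. The following sketch is illustrative only: it assumes `distance_km()` accepts a numeric distance in miles and returns kilometers, and it uses the `samples.nyctaxi.trips` sample table as a stand-in source. Adjust the names for your own pipeline.

```python
import dlt
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType

from utilities import my_utils

# Wrap the sample utility as a Spark UDF. This assumes distance_km() accepts a
# numeric value in miles and returns the equivalent distance in kilometers.
distance_km_udf = udf(my_utils.distance_km, DoubleType())

@dlt.table(comment="Trips with the distance converted to kilometers.")
def trips_km():
    # samples.nyctaxi.trips is used only as an illustrative source table.
    return (
        spark.read.table("samples.nyctaxi.trips")
        .withColumn("trip_distance_km", distance_km_udf(col("trip_distance")))
    )
```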
Add a Python module to your pipeline environment
If you want to share a Python module across multiple pipelines, you can save the module anywhere in your workspace files, and reference it from the environment of any pipeline that needs to use it. You can reference Python modules that are:
- Individual Python (`.py`) files.
- A Python project packaged as a Python wheel (`.whl`) file.
- An unpackaged Python project with a `pyproject.toml` file (to define the project name and version).
The following example shows how to add a dependency to a pipeline.
- Open your pipeline in the pipeline editor.
- Click Settings in the top bar.
- In the Pipeline settings slide-out, under Pipeline environment, click Edit environment.
- Add a dependency. For example, to add a file stored in a volume, you might add `/Volumes/libraries/path/to/python_files/file.py`. For a Python wheel stored in Git folders, your path might look like `/Workspace/libraries/path/to/wheel_files/file.whl`.

  You can add a file with no path, or with a relative path, if it is in the root folder of the pipeline.

  You can also add the path of a shared folder as a dependency so that `import` statements in your code can find the modules you want to import. For example, `-e /Workspace/Users/<user_name>/path/to/add/` (see the sketch after these steps).
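As a hedged sketch, after you add a shared folder to the pipeline environment (for example, with the `-e /Workspace/Users/<user_name>/path/to/add/` dependency above), a pipeline source file can import modules stored in that folder directly. The module `shared_metrics`, the function `normalize_counts()`, and the source table name below are hypothetical placeholders.

```python
import dlt

# shared_metrics.py is a hypothetical module saved in the folder that was added
# to the pipeline environment; normalize_counts() is a hypothetical helper in it
# that is assumed to take and return a DataFrame.
from shared_metrics import normalize_counts

@dlt.table(comment="Example table built with a helper from a shared module.")
def normalized_counts():
    # my_catalog.my_schema.raw_counts is a placeholder source table name.
    return normalize_counts(spark.read.table("my_catalog.my_schema.raw_counts"))
```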
Import a Python module using the `import` statement
You can also directly reference a workspace file in your Python source code.
- If the file is in the `utilities` folder of your pipeline, you can reference it without a path:

  ```python
  from utilities import my_module
  ```

- If the file is anywhere else, you can import it by first appending the path of the module to `sys.path` (see the sketch after this list):

  ```python
  import sys, os

  sys.path.append(os.path.abspath('<module-path>'))
  from my_module import *
  ```

- You can also append to `sys.path` for all pipeline source files by adding the path to the pipeline environment, as described in the previous section.
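For example, the following sketch appends a hypothetical shared folder to `sys.path` and imports a module from it. The folder and module names are placeholders; note that, as described later on this page, paths inside a Git folder must be prefixed with `/Workspace/`.

```python
import sys, os

# Hypothetical folder of shared modules; replace with your own location.
# For files in a Git folder, the path must start with /Workspace/.
sys.path.append(os.path.abspath("/Workspace/Users/<user_name>/shared_modules"))

# my_shared_module.py is a hypothetical module stored in that folder.
import my_shared_module
```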
Example of importing queries as Python modules
The following example demonstrates importing dataset queries as Python modules from workspace files. Although this example describes using workspace files to store the pipeline source code, you can use it with source code stored in a Git folder.
To run this example, use the following steps:
- Click Workspace in the sidebar of your Databricks workspace to open the workspace browser.
- Use the workspace browser to select a directory for the Python modules.
- Click the kebab menu in the rightmost column of the selected directory, and then click Create > File.
- Enter a name for the file, for example, `clickstream_raw_module.py`. The file editor opens. To create a module to read source data into a table, enter the following in the editor window:

  ```python
  from dlt import *

  json_path = "/databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed-json/2015_2_clickstream.json"

  def create_clickstream_raw_table(spark):
      @table
      def clickstream_raw():
          return (
              spark.read.json(json_path)
          )
  ```
- To create a module that creates a new table containing prepared data, create a new file in the same directory, enter a name for the file, for example, `clickstream_prepared_module.py`, and enter the following in the new editor window:

  ```python
  from clickstream_raw_module import *
  from dlt import read
  from pyspark.sql.functions import *
  from pyspark.sql.types import *

  def create_clickstream_prepared_table(spark):
      create_clickstream_raw_table(spark)

      @table
      @expect("valid_current_page_title", "current_page_title IS NOT NULL")
      @expect_or_fail("valid_count", "click_count > 0")
      def clickstream_prepared():
          return (
              read("clickstream_raw")
              .withColumn("click_count", expr("CAST(n AS INT)"))
              .withColumnRenamed("curr_title", "current_page_title")
              .withColumnRenamed("prev_title", "previous_page_title")
              .select("current_page_title", "click_count", "previous_page_title")
          )
  ```
- Next, create a Python file in your pipeline source. From the pipeline editor, choose Add, then Transformation.
- Name your file and confirm that Python is the default language.
- Click Create.
- Enter the following example code in the new file.

  Note: If your source file imports modules or packages from a workspace files path or a Git folders path different from the source file's directory, you must manually append the path to the files using `sys.path.append()`.

  If you are importing a file from a Git folder, you must prepend `/Workspace/` to the path. For example, `sys.path.append('/Workspace/...')`. Omitting `/Workspace/` from the path results in an error.

  If the modules or packages are stored in the same directory as the source file, you do not need to append the path manually. You also do not need to manually append the path when importing from the root directory of a Git folder, because the root directory is automatically appended to the path.
  ```python
  import sys, os

  sys.path.append(os.path.abspath('<module-path>'))

  import dlt
  from clickstream_prepared_module import *
  from pyspark.sql.functions import *
  from pyspark.sql.types import *

  create_clickstream_prepared_table(spark)

  @dlt.table(
      comment="A table containing the top pages linking to the Apache Spark page."
  )
  def top_spark_referrers():
      return (
          spark.read.table("catalog_name.schema_name.clickstream_prepared")
          .filter(expr("current_page_title == 'Apache_Spark'"))
          .withColumnRenamed("previous_page_title", "referrer")
          .sort(desc("click_count"))
          .select("referrer", "click_count")
          .limit(10)
      )
  ```

  Replace `<module-path>` with the path to the directory containing the Python modules to import.

- To run the pipeline, click Run pipeline.