Use a Python wheel file in Lakeflow Jobs
A Python wheel file is a standard way to package and distribute the files required to run a Python application. Using the Python wheel task, you can ensure fast and reliable installation of Python code in your jobs. This article provides an example of creating a Python wheel file and a job that runs the application packaged in the Python wheel file. In this example, you will:
- Create the Python files defining an example application.
- Bundle the example files into a Python wheel file.
- Create a job to run the Python wheel file.
- Run the job and view the results.
Before you begin
You need the following to complete this example:
- Python 3
- The Python `wheel` and `setuptools` packages. You can use `pip` to install these packages. For example, you can run the following command to install them:

  ```bash
  pip install wheel setuptools
  ```
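Optionally, you can verify that both packages are available to the Python interpreter you plan to build with. This is a minimal, optional check:

```python
# Optional check: confirm wheel and setuptools are importable and print their versions.
import setuptools
import wheel

print("setuptools", setuptools.__version__)
print("wheel", wheel.__version__)
```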
Step 1: Create a local directory for the example
Create a local directory to hold the example code and generated artifacts, for example, `databricks_wheel_test`.
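For orientation, after you complete steps 2 through 4, the directory will contain roughly the following layout (the wheel filename depends on the package name and version you set):

```text
databricks_wheel_test/
├── my_test_code/
│   ├── __init__.py
│   └── __main__.py
├── setup.py
└── dist/
    └── my_test_package-0.0.1-py3-none-any.whl
```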
Step 2: Create the example Python script
The following Python example is a simple script that reads input arguments and prints out those arguments. Copy this script and save it to a path called `my_test_code/__main__.py` in the directory you created in the previous step.
"""
The entry point of the Python Wheel
"""
import sys
def main():
# This method will print the provided arguments
print('Hello from my func')
print('Got arguments:')
print(sys.argv)
if __name__ == '__main__':
main()
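Optionally, you can exercise `main` locally before packaging. The following is a minimal sketch that simulates command-line arguments; run it from the directory you created in step 1 so that `my_test_code` is importable:

```python
# Minimal local check of the entry point function (run from the project root).
import sys

from my_test_code.__main__ import main

# Simulate the command-line arguments the job will pass; the first element
# stands in for the script name.
sys.argv = ["__main__.py", "first argument", "first value"]
main()
```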
Step 3: Create a metadata file for the package
The following file contains metadata describing the package. Save this to a path called `my_test_code/__init__.py` in the directory you created in step 1.
```python
__version__ = "0.0.1"
__author__ = "Databricks"
```
Step 4: Create the Python wheel file
Converting the Python artifacts into a Python wheel file requires specifying package metadata such as the package name and entry points. The following script defines this metadata.
The `entry_points` defined in this script are used to run the package in the Databricks workflow. In each value in `entry_points`, the value before `=` (in this example, `run`) is the name of the entry point and is used to configure the Python wheel task.
- Save this script in a file named `setup.py` in the root of the directory you created in step 1:

  ```python
  from setuptools import setup, find_packages

  import my_test_code

  setup(
      name='my_test_package',
      version=my_test_code.__version__,
      author=my_test_code.__author__,
      url='https://databricks.com',
      author_email='john.doe@databricks.com',
      description='my test wheel',
      packages=find_packages(include=['my_test_code']),
      entry_points={
          'group_1': 'run=my_test_code.__main__:main'
      },
      install_requires=[
          'setuptools'
      ]
  )
  ```
- Change into the directory you created in step 1, and run the following command to package your code into the Python wheel distribution:

  ```bash
  python3 setup.py bdist_wheel
  ```

  This command creates the Python wheel file and saves it to `dist/my_test_package-0.0.1-py3-none-any.whl` in your directory.
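If you want to confirm that the entry point was recorded, you can inspect the wheel's metadata. A wheel file is a ZIP archive, and setuptools writes the entry points to an `entry_points.txt` file in the `.dist-info` directory. The following is a minimal sketch, assuming the filename above:

```python
# Print the entry points recorded in the built wheel (a wheel is a ZIP archive).
import zipfile

wheel_path = "dist/my_test_package-0.0.1-py3-none-any.whl"  # adjust if your version differs

with zipfile.ZipFile(wheel_path) as whl:
    for name in whl.namelist():
        if name.endswith("entry_points.txt"):
            print(whl.read(name).decode())
```

The output should include a `[group_1]` section containing `run = my_test_code.__main__:main`.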
Step 5: Create a job to run the Python wheel file
- In your workspace, click Jobs & Pipelines in the sidebar.
- Click Create, then Job. The Tasks tab displays with the empty task pane.

  Note: If the Lakeflow Jobs UI is ON, click the Python wheel tile to configure the first task. If the Python wheel tile is not available, click Add another task type and search for Python wheel.
- Optionally, replace the name of the job, which defaults to `New Job <date-time>`, with your job name.
- In Task name, enter a name for the task.
- If necessary, select Python wheel from the Type drop-down menu.
- In Package name, enter `my_test_package`. The Package Name value is the name of the Python package to import. In this example, the package name is the value assigned to the `name` parameter in `setup.py`.
- In Entry point, enter `run`. The entry point is one of the values specified in the `entry_points` collection in the `setup.py` script. In this example, `run` is the only entry point defined.
- In Compute, select an existing job cluster or Add new job cluster.
- Specify your Python wheel file:
  - In the Environment and Libraries drop-down, click the edit icon next to Default to edit it. Alternatively, click Add new environment to configure a new environment.
  - In the Configure environment dialog, click Add dependency.
  - Click the folder icon to open the file browser. Drag and drop the wheel file that you created in step 4 into the Select a dependency dialog.
  - Click Confirm.
- In Parameters, select Positional arguments or Keyword arguments to enter the key and the value of each parameter. Both positional and keyword arguments are passed to the Python wheel task as command-line arguments.
  - To enter positional arguments, enter parameters as a JSON-formatted array of strings, for example: `["first argument","first value","second argument","second value"]`.
  - To enter keyword arguments, click + Add and enter a key and value. Click + Add again to enter more arguments.
- Click Create task.
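For reference, the task you just configured corresponds roughly to the following settings, shown here as a Python dict whose field names follow the Jobs API's `python_wheel_task`. The task key is a hypothetical example, and the exact JSON for your job may differ, for example in how the environment's wheel dependency is recorded:

```python
# Approximate shape of the configured task; treat this as an illustration, not exact job JSON.
task_settings = {
    "task_key": "python_wheel_task",  # hypothetical task name
    "python_wheel_task": {
        "package_name": "my_test_package",  # the name value from setup.py
        "entry_point": "run",               # the entry point name defined in entry_points
        "parameters": ["first argument", "first value", "second argument", "second value"],
        # For keyword arguments, a named_parameters mapping is used instead of parameters.
    },
}
```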
Step 6: Run the job and view the job run details
Click Run Now to run the workflow. To view details for the run, click the Runs tab, then click the link in the Start time column for the run in the job runs view.
When the run completes, the output displays in the Output panel, including the arguments passed to the task.
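Given the script from step 2 and the example positional arguments above, the output should look roughly like the following (the first element of `sys.argv` depends on how the runtime invokes the entry point, so it is shown as a placeholder):

```text
Hello from my func
Got arguments:
['<entry-point>', 'first argument', 'first value', 'second argument', 'second value']
```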
Next steps
To learn more about creating and running jobs, see Lakeflow Jobs.