Create your first workflow with Lakeflow Jobs

This article demonstrates using Lakeflow Jobs to orchestrate tasks to read and process a sample dataset. In this quickstart, you:

Create a new notebook and add code to retrieve a sample dataset containing popular baby names by year.
Save the sample dataset to Unity Catalog.
Create a new notebook and add code to read the dataset from Unity Catalog, filter it by year, and display the results.
Create a new job and configure two tasks using the notebooks.
Run the job and view the results.

Requirements

If your workspace is Unity Catalog-enabled and Serverless Jobs is enabled, by default, the job runs on Serverless compute. You do not need cluster creation permission to run your job with Serverless compute.

Otherwise, you must have cluster creation permission to create job compute or permissions to all-purpose compute resources.

You must have a volume in Unity Catalog. This article uses an example volume named my-volume in a schema named default within a catalog named main. You must have the following permissions in Unity Catalog:

READ VOLUME and WRITE VOLUME, or ALL PRIVILEGES, for the my-volume volume.
USE SCHEMA or ALL PRIVILEGES for the default schema.
USE CATALOG or ALL PRIVILEGES for the main catalog.

To set these permissions, see your Databricks administrator or Unity Catalog privileges and securable objects.

Create the notebooks

Retrieve and save data

To create a notebook to retrieve the sample dataset and save it to Unity Catalog:

Go to your Databricks landing page and click New in the sidebar and select Notebook. Databricks creates and opens a new, blank notebook in your default folder. The default language is the language you most recently used, and the notebook is automatically attached to the compute resource that you most recently used.
If necessary, change the default language to Python.

Copy the following Python code and paste it into the first cell of the notebook.

Python
import requests

response = requests.get('https://health.data.ny.gov/api/views/jxy9-yhdk/rows.csv')
csvfile = response.content.decode('utf-8')
dbutils.fs.put("/Volumes/main/default/my-volume/babynames.csv", csvfile, True)

Read and display filtered data

To create a notebook to read and present the data for filtering:

Go to your Databricks landing page and click New in the sidebar and select Notebook. Databricks creates and opens a new, blank notebook in your default folder. The default language is the language you most recently used, and the notebook is automatically attached to the compute resource that you most recently used.
If necessary, change the default language to Python.

Copy the following Python code and paste it into the first cell of the notebook.

Python
babynames = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/Volumes/main/default/my-volume/babynames.csv")
babynames.createOrReplaceTempView("babynames_table")
years = spark.sql("select distinct(Year) from babynames_table").toPandas()['Year'].tolist()
years.sort()
dbutils.widgets.dropdown("year", "2014", [str(x) for x in years])
display(babynames.filter(babynames.Year == dbutils.widgets.get("year")))

Create a job

In your workspace, click Jobs & Pipelines in the sidebar.
Click Create, then Job.

The Tasks tab displays with the create task dialog.
Replace Add a name for your job… with your job name.
In the Task name field, enter a name for the task; for example, retrieve-baby-names.
In the Type drop-down menu, select Notebook.
Use the file browser to find the first notebook you created, click the notebook name, and click Confirm.
Click Create task.
Click below the task you just created to add another task.
In the Task name field, enter a name for the task; for example, filter-baby-names.
In the Type drop-down menu, select Notebook.
Use the file browser to find the second notebook you created, click the notebook name, and click Confirm.
Click Add under Parameters. In the Key field, enter year. In the Value field, enter 2014.
Click Create task.

Run the job

To run the job immediately, click in the upper right corner. You can also run the job by clicking the Runs tab and clicking Run now in the Active Runs table.

View run details

Click the Runs tab and click the link for the run in the Active Runs table or in the Completed Runs (past 60 days) table.
Click either task to see the output and details. For example, click the filter-baby-names task to view the output and run details for the filter task:

Run with different parameters

To re-run the job and filter baby names for a different year:

Click next to Run now and select Run now with different parameters or click Run now with different parameters in the Active Runs table.
In the Value field, enter 2015.
Click Run.

Requirements​

Create the notebooks​

Retrieve and save data​

Read and display filtered data​

Create a job​

Run the job​

View run details​

Run with different parameters​