Set up a recurring query with backfill support using Lakeflow Jobs
A common scenario is a query that is run on a regular schedule with a job orchestrating it. For example, at the end of each day, a query is run to update a system based on that day's changes to source datasets. This tutorial takes you through creating a query with parameters that identify the time frame to import data for, and then creating a job to schedule that query to run daily.
The query and parameters created in this tutorial match best practices, and are set up to allow you to run a backfill job later, if needed.
Prerequisite
To complete this tutorial, you must have access to the system tables in Databricks.
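If you are not sure whether you have access, an optional check (not part of the tutorial steps) is to query one of the billing system tables used later in this tutorial:

```sql
-- Optional access check: returns a row count if you can read the billing
-- system tables, and fails with a permissions error otherwise.
SELECT COUNT(*) AS usage_rows FROM system.billing.usage;
```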
Step 1: Create a query
For this tutorial, you create a query that uses parameters to describe which data to pull. For example, this tutorial uses billing data from the system tables to calculate daily Databricks costs.
The query uses two parameters:
| Parameter | Use |
|---|---|
| `data_interval_end` | The date that the job is run (for the recurring schedule), which is the end of the time range to process. Or, for backfill jobs, the end date of the data to backfill. |
| `lookback_days` | How many days' worth of data to query. The query looks back this many days from the `data_interval_end` date. |
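Together, these parameters define a half-open date window. As a minimal sketch with illustrative values, a `data_interval_end` of `2025-10-02` and a `lookback_days` of `1` select exactly one day of data:

```sql
-- Illustrative only: with data_interval_end = '2025-10-02' and lookback_days = 1,
-- the query processes the half-open range [2025-10-01, 2025-10-02).
SELECT date_add(DATE'2025-10-02', -1) AS window_start_inclusive,
       DATE'2025-10-02'               AS window_end_exclusive;
```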
Follow these steps to create a query:
- From your workspace, click New, then Notebook to create a new notebook.
- The name defaults to `Untitled Notebook <date-time>`. Click the name at the top of the notebook, and give it a descriptive name, such as `Query billing with parameters tutorial`.
- At the top of the notebook editor, choose SQL from the language selector.
- In the first cell, add the following code. Replace `<catalog>` and `<schema>` with a catalog and schema that you want to use and to which you have access.

  ```sql
  USE CATALOG <catalog>;
  USE SCHEMA <schema>;

  CREATE TABLE IF NOT EXISTS tutorial_databricks_product_spend (
    billing_origin_product STRING,
    usage_date DATE,
    total_dollar_cost DECIMAL(12, 2)
  );

  -- Process the last N days specified by :lookback_days ending on :data_interval_end
  INSERT INTO TABLE tutorial_databricks_product_spend
  REPLACE WHERE
    usage_date >= date_add(:data_interval_end, -CAST(:lookback_days AS INT))
    AND usage_date < :data_interval_end
  SELECT
    usage.billing_origin_product,
    usage.usage_date,
    SUM(usage.usage_quantity * list_prices.pricing.effective_list.default) AS total_dollar_cost
  FROM
    system.billing.usage AS usage
    JOIN system.billing.list_prices AS list_prices
      ON usage.sku_name = list_prices.sku_name
      AND usage.usage_end_time >= list_prices.price_start_time
      AND (
        list_prices.price_end_time IS NULL
        OR usage.usage_end_time < list_prices.price_end_time
      )
  WHERE
    usage.usage_date >= date_add(:data_interval_end, -CAST(:lookback_days AS INT))
    AND usage.usage_date < :data_interval_end
  GROUP BY
    usage.billing_origin_product,
    usage.usage_date
  ```
- Add the two parameters by clicking Edit, Add parameter. The parameters should have the following names and default values:

  | Name | Default value |
  |---|---|
  | `lookback_days` | `1` |
  | `data_interval_end` | `<none>`. This parameter is always required. |
To learn about parameters and how to access them in different task types or from Python, see Access parameter values from a task.
Your query is now ready. The query reads a full day of data from the system tables, and then replaces the existing data in the destination table using `REPLACE WHERE`. Because data is replaced rather than appended, the query does no harm if a day is run a second time. This also lets you re-run a day in case of a processing error or late-arriving data.
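As an optional sanity check (not one of the tutorial steps), the following query confirms this idempotent behavior: even after running the same day twice, it should return no rows, because there is still at most one row per product per day.

```sql
-- Optional sanity check: returns no rows when each (product, day) pair
-- appears at most once, even after re-running the same date range.
SELECT billing_origin_product, usage_date, COUNT(*) AS row_count
FROM tutorial_databricks_product_spend
GROUP BY billing_origin_product, usage_date
HAVING COUNT(*) > 1;
```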
You can test the query by following these steps:
- Provide a value for `data_interval_end` above the cells of the notebook, in the form `yyyy-mm-dd`, for example, `2025-10-02`.
- Optionally, click Connect and choose a compute resource to use.
- Click Run all.
- After the run completes, you can view the table that was created by opening the Catalog from the left menu and then choosing the catalog and schema that you set in the query, or by running the query sketched below.
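As a quick alternative to browsing the Catalog, this minimal sketch (assuming the catalog and schema from the first cell are still your defaults) lists what the test run wrote:

```sql
-- Hypothetical inspection query: list the rows written by the test run,
-- most expensive products first within each day.
SELECT billing_origin_product, usage_date, total_dollar_cost
FROM tutorial_databricks_product_spend
ORDER BY usage_date, total_dollar_cost DESC;
```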
Next, create a recurring schedule for the query.
Step 2: Create a job to schedule the query
- In your workspace, click Jobs & Pipelines in the sidebar.
- Click Create, then Job. The Tasks tab displays with the empty task pane.

  Note: If the Lakeflow Jobs UI is ON, click the Notebook tile to configure the first task. If the Notebook tile is not available, click Add another task type and search for Notebook.

- (Optional) Replace the name of the job, which defaults to `New Job <date-time>`, with your job name.
- In the Task name field, enter a name for the task; for example, `tutorial-databricks-spend`.
- If necessary, select Notebook from the Type drop-down menu.
- In the Source drop-down menu, select Workspace, which allows you to use a notebook you saved previously.
- For Path, use the file browser to find the first notebook you created, click the notebook name, and click Confirm.
- Click Add in the Parameters section. Add the `lookback_days` parameter, with a Value of `1`.
- Click Add in the Parameters section. Add the `data_interval_end` parameter. Click { } next to the Value to see a list of parameterized values. Select `{{job.trigger.time.iso_date}}` from the list to insert it as the value. This passes the date for which the job run was triggered as a parameter.

  Note: You could also use `{{job.trigger.time.iso_datetime}}` to pass the time, if your query looked back over a smaller window, such as one hour instead of one day. Either option works in this query, but `iso_date` makes the intent of the parameter clear.
- Click Create task.
- On the details panel to the right of the task, under Schedules & Triggers, click Add trigger.
- Under Trigger type, choose Scheduled.
- Keep the defaults of an active trigger that runs once per day.
- Click Save.
Your job now runs your query daily. By default, it runs at the same time of day at which you created the trigger. You can edit the trigger and select the Advanced trigger type to set a specific time.
If you do not want to incur the cost of running this tutorial job daily, you can click Pause under the schedule you just created. This keeps the schedule, but the job won't run until you unpause it. You can still run it manually at any time.
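If you want to confirm that the scheduled runs are landing data (an optional check, not part of the tutorial steps), a query like the following lists the most recently loaded days and their totals:

```sql
-- Optional monitoring sketch: show the last seven days loaded into the table,
-- with the total cost recorded for each day.
SELECT usage_date, SUM(total_dollar_cost) AS daily_total
FROM tutorial_databricks_product_spend
GROUP BY usage_date
ORDER BY usage_date DESC
LIMIT 7;
```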
Next, run a backfill to load older data into the table.
Step 3: Run a backfill for older data
You can run a backfill to fill in older data, for example, if you want the last week's data populated in your table. The following instructions create 7 backfill runs, one to process each of the last 7 days of data.
- Click the down arrow next to Run now at the top of the page.
- Select Run backfill from the drop-down that appears. This opens the Run backfill dialog.
- Change the date range to the range that you want to backfill. Choose the Start as 7 days prior, at 12:00 AM, and the End as today, also at 12:00 AM. For example, you might choose `09/14/2025, 12:00 AM` as your start time and `09/21/2025, 12:00 AM` as your end time.
- Select Every 1 Day for the time interval of each backfill.
- Under Job parameters, your existing parameters are shown with keys and values. Confirm that the `data_interval_end` parameter is set to `{{backfill.iso_datetime}}` and `lookback_days` is `1`.
- Click Run to start the backfill runs. This triggers 7 backfill runs, one for each day.
Backfills can run in parallel or sequentially, depending on your job setup. For more details about backfills, see Backfill jobs.
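To confirm that the backfill populated the table (an optional check using the example date range above), you can summarize the rows written for each backfilled day:

```sql
-- Optional check, using the example backfill range above: each backfilled day
-- should now appear with its per-product row count and daily total.
SELECT usage_date,
       COUNT(*) AS product_rows,
       SUM(total_dollar_cost) AS daily_total
FROM tutorial_databricks_product_spend
WHERE usage_date >= DATE'2025-09-14'
  AND usage_date < DATE'2025-09-21'
GROUP BY usage_date
ORDER BY usage_date;
```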