Set up a recurring query with backfill support using Lakeflow Jobs

A common scenario is a query that is run on a regular schedule with a job orchestrating it. For example, at the end of each day, a query is run to update a system based on that day's changes to source datasets. This tutorial takes you through creating a query with parameters that identify the time frame to import data for, and then creating a job to schedule that query to run daily.

The query and parameters created in this tutorial match best practices, and are set up to allow you to run a backfill job later, if needed.

Prerequisite

To complete this tutorial, you must have access to the system tables in Databricks.

Step 1: Create a query

In this tutorial, you create a query that uses parameters to describe which data to pull. The query uses billing data from the system tables to calculate daily Databricks costs.
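If you have not worked with the billing system table before, an optional exploratory query such as the following (assuming you have the system table access described in the prerequisite) previews the columns that this tutorial's query relies on:

    SQL
    -- Optional: preview the billing columns used later in the tutorial.
    SELECT usage_date, billing_origin_product, sku_name, usage_quantity, usage_end_time
    FROM system.billing.usage
    ORDER BY usage_date DESC
    LIMIT 10;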

The query uses two parameters:

data_interval_end: The end of the time range to process. For the recurring schedule, this is the date that the job runs. For backfill jobs, it is the end date of the data to backfill.

lookback_days: How many days' worth of data to query. The query looks back from data_interval_end, which is typically the date or time that the query runs, so you look back rather than forward.
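Together, the two parameters define a half-open window that ends at data_interval_end. The following sketch, using hypothetical parameter values, shows the window that a run with data_interval_end = 2025-10-02 and lookback_days = 1 would process:

    SQL
    -- With data_interval_end = 2025-10-02 and lookback_days = 1, the processed window is
    -- 2025-10-01 (inclusive) through 2025-10-02 (exclusive).
    SELECT
      date_add(DATE'2025-10-02', -1) AS window_start, -- inclusive lower bound
      DATE'2025-10-02' AS window_end;                 -- exclusive upper bound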

Follow these steps to create a query:

  1. From your workspace, click New, then Notebook to create a new notebook.

  2. The name defaults to Untitled Notebook <date-time>. Click the name at the top of the notebook, and give it a descriptive name, such as Query billing with parameters tutorial.

  3. At the top of the notebook editor, choose SQL from the language selector.

  4. In the first cell, add the following code. Replace <catalog> and <schema> with a catalog and schema that you want to use and to which you have access.

    SQL
    USE CATALOG <catalog>;
    USE SCHEMA <schema>;

    CREATE TABLE IF NOT EXISTS tutorial_databricks_product_spend (
      billing_origin_product STRING,
      usage_date DATE,
      total_dollar_cost DECIMAL(12, 2)
    );

    -- Process the last N days specified by :lookback_days ending on :data_interval_end
    INSERT INTO TABLE tutorial_databricks_product_spend
    REPLACE WHERE
      usage_date >= date_add(:data_interval_end, -CAST(:lookback_days AS INT))
      AND usage_date < :data_interval_end
    SELECT
      usage.billing_origin_product,
      usage.usage_date,
      SUM(usage.usage_quantity * list_prices.pricing.effective_list.default) AS total_dollar_cost
    FROM
      system.billing.usage AS usage
      JOIN system.billing.list_prices AS list_prices
        ON usage.sku_name = list_prices.sku_name
        AND usage.usage_end_time >= list_prices.price_start_time
        AND (
          list_prices.price_end_time IS NULL
          OR usage.usage_end_time < list_prices.price_end_time
        )
    WHERE
      usage.usage_date >= date_add(:data_interval_end, -CAST(:lookback_days AS INT))
      AND usage.usage_date < :data_interval_end
    GROUP BY
      usage.billing_origin_product,
      usage.usage_date
  5. Add the two parameters by clicking Edit, then Add parameter. The parameters should have the following names and default values:

    lookback_days: default value 1.

    data_interval_end: no default value (<none>). This parameter is always required.

    To learn about parameters and how to access them in different task types or from Python, see Access parameter values from a task.

Your query is now ready. It reads a full day of data from the system tables and replaces the existing data in the destination table using REPLACE WHERE. Because the query replaces data instead of appending it, running the same day a second time does no harm. This lets you safely re-run a day after a processing error or to pick up late-arriving data.
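One optional way to confirm this behavior (not part of the tutorial steps) is to run the notebook twice with the same data_interval_end and then check that no product and day combination appears more than once:

    SQL
    -- Expect zero rows: REPLACE WHERE overwrites each day's rows instead of appending to them.
    SELECT usage_date, billing_origin_product, COUNT(*) AS row_count
    FROM tutorial_databricks_product_spend
    GROUP BY usage_date, billing_origin_product
    HAVING COUNT(*) > 1;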

You can test the query by following these steps:

  1. Provide a value for data_interval_end above the cells of the notebook, in the form yyyy-mm-dd, for example, 2025-10-02.
  2. Optionally, click Connect and choose a compute resource to use.
  3. Click Run all.
  4. After the run completes, you can view the table that was created by opening Catalog from the left menu, then browsing to the catalog and schema that you set in the query.
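You can also inspect the results directly with a query such as the following, which uses the table created in step 1:

    SQL
    -- View the daily cost per product that the query wrote.
    SELECT billing_origin_product, usage_date, total_dollar_cost
    FROM tutorial_databricks_product_spend
    ORDER BY usage_date, total_dollar_cost DESC;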

Next, create a recurring schedule for the query.

Step 2: Create a job to schedule the query

  1. In your workspace, click Jobs & Pipelines in the sidebar.

  2. Click Create, then Job. The Tasks tab displays with the empty task pane.

    note

    If the Lakeflow Jobs UI is ON, click the Notebook tile to configure the first task. If the Notebook tile is not available, click Add another task type and search for Notebook.

  3. (Optional) Replace the name of the job, which defaults to New Job <date-time>, with your job name.

  4. In the Task name field, enter a name for the task; for example, tutorial-databricks-spend.

  5. If necessary, select Notebook from the Type drop-down menu.

  6. In the Source drop-down menu, select Workspace, which allows you to use a notebook you saved previously.

  7. For Path, use the file browser to find the first notebook you created, click the notebook name, and click Confirm.

  8. Click Add in the Parameters section. Add the lookback_days parameter, with a Value of 1.

  9. Click Add in the Parameters section. Add the data_interval_end parameter. Click { } next to the Value field to see a list of parameterized values. Select {{job.trigger.time.iso_date}} from the list to insert it as the value.

    This passes the date that the job run was triggered as the parameter value.

    note

    You could also use {{job.trigger.time.iso_datetime}} to pass the full timestamp, which is useful for a query that looks back over a shorter window, such as one hour instead of one day. Either option works for this tutorial's query, but iso_date makes the intent of the parameter clearer.

  10. Click Create task.

  11. On the details panel to the right of the task, under Schedules & Triggers, click Add trigger.

  12. Under Trigger type choose Scheduled.

  13. Keep the defaults of an active trigger that runs once per day.

  14. Click Save.

Your job now runs your query daily. By default, it runs at the same time of day that you created the trigger. To run at a specific time instead, edit the trigger and select the Advanced trigger type.
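After a scheduled run completes, a quick spot check (a sketch, assuming the default lookback_days of 1) is to confirm that the newest usage_date in the table is the day before the trigger date, because the window's end date is exclusive:

    SQL
    -- With lookback_days = 1 and an exclusive end date, the latest loaded day
    -- should be the day before the job's trigger date.
    SELECT MAX(usage_date) AS latest_day_loaded
    FROM tutorial_databricks_product_spend;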

note

If you do not want to incur the charges of running this tutorial job daily, you can click Pause under the schedule you just created. This keeps the schedule, but the job won't run until you unpause it. You can still run it manually at any time.

Next, run a backfill to load older data into the table.

Step 3: Run a backfill for older data

You can run a backfill to fill in older data, for example, if you want the last week's data populated in your table. The following instructions create 7 backfill runs, one to process each of the last 7 days of data.

  1. Click the down arrow next to Run now at the top of the page.

  2. Select Run backfill from the drop-down menu that appears. This opens the Run backfill dialog.

  3. Change the date range to the range that you want to backfill. Choose the Start as 7 days prior, at 12:00 AM, and the End as today, also at 12:00 AM. For example, you might choose 09/14/2025, 12:00 AM as your start time and 09/21/2025, 12:00 AM as your end time.

  4. Select Every 1 Day for the time interval of each backfill.

  5. Under Job parameters, your existing parameters are shown with keys and values. Confirm that the data_interval_end parameter is set to {{backfill.iso_datetime}} and lookback_days is 1.

  6. Click Run to start the backfill runs. This triggers 7 runs for backfilling, one for each day.

Backfills can run in parallel or sequentially, depending on your job setup. For more details about backfills, see Backfill jobs.
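After the backfill runs finish, a query like the following (a sketch, assuming the backfill range was the 7 days ending today) confirms that each day in the range is populated:

    SQL
    -- Expect one row per backfilled day, each written by its own backfill run.
    SELECT usage_date, SUM(total_dollar_cost) AS daily_total
    FROM tutorial_databricks_product_spend
    WHERE usage_date >= date_add(current_date(), -7)
      AND usage_date < current_date()
    GROUP BY usage_date
    ORDER BY usage_date;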