Jobs quickstart

Preview

This article discusses the orchestration of multiple tasks using Databricks jobs, a feature that is in Public Preview. For information about how to create, run, and manage single-task jobs using the generally available jobs interface, see Jobs.

This article demonstrates a Databricks job that orchestrates tasks to read and process a sample dataset. In this quickstart, you:

  1. Create a new notebook and add code to retrieve a sample dataset containing popular baby names by year.
  2. Save the sample dataset to DBFS.
  3. Create a new notebook and add code to read the dataset from DBFS, filter it by year, and display the results.
  4. Create a new job and configure two tasks using the notebooks.
  5. Run the job and view the results.

Requirements

The following are required to complete this quickstart:

Create the notebooks

Retrieve and save data

To create a notebook to retrieve the sample dataset and save it to DBFS:

  1. Go to your Databricks landing page and select Create Blank Notebook, or click Create in the sidebar and select Notebook from the menu. The Create Notebook dialog appears.

  2. In the Create Notebook dialog, give your notebook a name; for example, Retrieve baby names. Select Python from the Default Language dropdown menu. You can leave Cluster set to the default value. You configure the cluster when you create a task using this notebook.

  3. Click Create.

  4. Copy the following Python code and paste it into the first cell of the notebook.

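    import requests
    
    # Download the sample CSV of baby names.
    response = requests.get('http://health.data.ny.gov/api/views/myeu-hzra/rows.csv')
    response.raise_for_status()  # Stop here if the download failed.
    csvfile = response.content.decode('utf-8')
    # Write the CSV text to DBFS; the final argument (True) overwrites an existing file.
    dbutils.fs.put("dbfs:/FileStore/babynames.csv", csvfile, True)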
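
Before saving the downloaded text, it can help to confirm that it really is CSV with the column the second notebook relies on. The following is a minimal sketch in plain Python, not part of the quickstart itself: the helper name check_csv_text and the sample text are hypothetical, and only the Year column is taken from the notebook code in this article.

```python
import csv
import io


def check_csv_text(csv_text, required_columns):
    """Return the number of data rows, or raise ValueError if the text is unusable."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        raise ValueError("empty response body")
    missing = [name for name in required_columns if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # Count the remaining rows, skipping blank lines.
    return sum(1 for row in reader if row)


# Hypothetical sample standing in for the decoded response body (csvfile).
sample = "Year,First Name,Count\n2014,EMMA,10\n2015,LIAM,12\n"
row_count = check_csv_text(sample, ["Year"])
print(row_count)  # prints 2
```

In the notebook, a call such as check_csv_text(csvfile, ["Year"]) could run just before dbutils.fs.put, so that a failed or unexpected download does not overwrite a good file.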
    

Read and display filtered data

To create a notebook that reads the saved data, filters it by year, and displays the results:

  1. Go to your Databricks landing page and select Create Blank Notebook, or click Create in the sidebar and select Notebook from the menu. The Create Notebook dialog appears.

  2. In the Create Notebook dialog, give your notebook a name; for example, Filter baby names. Select Python from the Default Language dropdown menu. You can leave Cluster set to the default value. You configure the cluster when you create a task using this notebook.

  3. Click Create.

  4. Copy the following Python code and paste it into the first cell of the notebook.

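    # Read the CSV saved by the first notebook, inferring column types from the data.
    babynames = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("dbfs:/FileStore/babynames.csv")
    babynames.createOrReplaceTempView("babynames_table")
    # Collect the distinct years to populate the dropdown widget.
    years = spark.sql("select distinct(Year) from babynames_table").rdd.map(lambda row : row[0]).collect()
    years.sort()
    dbutils.widgets.dropdown("year", "2014", [str(x) for x in years])
    # Display only the rows for the year selected in the widget or passed as a job parameter.
    display(babynames.filter(babynames.Year == dbutils.widgets.get("year")))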
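
The notebook code depends on Spark and dbutils, so it runs only on Databricks. To see the same logic in isolation, the following plain-Python sketch builds the sorted list of year choices and filters rows by a selected year. The sample rows and the function names are hypothetical; only the Year column is taken from the notebook code.

```python
import csv
import io

# Hypothetical sample rows; only the Year column is confirmed by the notebook code.
SAMPLE_CSV = """Year,First Name,Count
2013,OLIVIA,9
2014,EMMA,10
2014,NOAH,11
2015,LIAM,12
"""


def distinct_years(rows):
    """Sorted distinct Year values as strings, like the dropdown widget choices."""
    return [str(year) for year in sorted({int(row["Year"]) for row in rows})]


def filter_by_year(rows, year):
    """Keep rows whose Year matches the selected value, which arrives as a string."""
    return [row for row in rows if row["Year"] == year]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
choices = distinct_years(rows)
selected = filter_by_year(rows, "2014")
print(choices)                                  # ['2013', '2014', '2015']
print([row["First Name"] for row in selected])  # ['EMMA', 'NOAH']
```

As in the notebook, the selected year is a string, because widget values and job parameters are passed as text.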
    

Create a job

  1. Click Jobs in the sidebar.

  2. Click Create Job.

    The Tasks tab displays with the create task dialog.

  3. Replace Add a name for your job… with your job name.

  4. In the Task name field, enter a name for the task; for example, retrieve-baby-names.

  5. In the Type drop-down, select Notebook.

  6. Use the file browser to find the first notebook you created, click the notebook name, and click Confirm.

  7. Click Create task.

  8. Click the add task button below the task you just created to add another task.

  9. In the Task name field, enter a name for the task; for example, filter-baby-names.

  10. In the Type drop-down, select Notebook.

  11. Use the file browser to find the second notebook you created, click the notebook name, and click Confirm.

  12. Click Add under Parameters. In the Key field, enter year. In the Value field, enter 2014.

  13. Click Create task.
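
The same two-task job can also be described as data. The following sketch builds a request body in the shape used by the Jobs API 2.1 create endpoint (tasks, task_key, notebook_task, base_parameters, depends_on) and prints it as JSON; it does not send anything. The job name, notebook paths, and cluster ID are hypothetical placeholders, and the depends_on entry is an assumption that the filter task should wait for the retrieve task, since it reads the file the first task saves.

```python
import json


def notebook_task(task_key, notebook_path, cluster_id, parameters=None, depends_on=()):
    """Build one notebook task entry for a multi-task job definition."""
    task = {
        "task_key": task_key,
        "existing_cluster_id": cluster_id,
        "notebook_task": {"notebook_path": notebook_path},
    }
    if parameters:
        # Parameters reach the notebook through widgets of the same name.
        task["notebook_task"]["base_parameters"] = dict(parameters)
    if depends_on:
        task["depends_on"] = [{"task_key": key} for key in depends_on]
    return task


CLUSTER_ID = "0000-000000-example"  # hypothetical cluster ID

job_settings = {
    "name": "baby-names-quickstart",  # hypothetical job name
    "tasks": [
        notebook_task(
            "retrieve-baby-names",
            "/Users/someone@example.com/Retrieve baby names",  # hypothetical path
            CLUSTER_ID,
        ),
        notebook_task(
            "filter-baby-names",
            "/Users/someone@example.com/Filter baby names",  # hypothetical path
            CLUSTER_ID,
            parameters={"year": "2014"},
            depends_on=["retrieve-baby-names"],
        ),
    ],
}

print(json.dumps(job_settings, indent=2))
```

Keeping the definition as data like this makes it easy to review or version alongside the notebooks. Check the field names against the Jobs API reference for your workspace before relying on them.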

Run the job

To run the job immediately, click Run Now in the upper right corner. You can also run the job by clicking the Runs tab and clicking Run Now in the Active Runs table.

View run details

  1. Click the Runs tab and click View Details in the Active Runs table or in the Completed Runs (past 60 days) table.

  2. Click either task to see the output and details. For example, click the filter-baby-names task to view the status and output for the filter task.

Run with different parameters

To re-run the job and filter baby names for a different year:

  1. Click the down caret next to Run Now and select Run Now with Different Parameters, or click Run Now with Different Parameters in the Active Runs table.
  2. In the Value field, enter 2015.
  3. Click Run.
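
Running with a different parameter amounts to overriding the year value for a single run. As a rough sketch, the following builds a request body in the shape used by the Jobs API 2.1 run-now endpoint, where notebook_params overrides the task's base parameters. The job ID 123 and the helper name run_now_body are hypothetical, and nothing is sent to a workspace.

```python
import json


def run_now_body(job_id, year):
    """Build a run-now request body that overrides the year parameter."""
    # Notebook parameters are strings, so the year is converted explicitly.
    return {"job_id": job_id, "notebook_params": {"year": str(year)}}


body = run_now_body(job_id=123, year=2015)  # 123 is a hypothetical job ID
print(json.dumps(body))  # {"job_id": 123, "notebook_params": {"year": "2015"}}
```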