Using Auto Loader in Structured Streaming applications

Databricks recommends using Auto Loader in all Structured Streaming applications that ingest data from cloud object storage.

Benefits over Apache Spark FileStreamSource

In Apache Spark, you can read files incrementally using spark.readStream.format(fileFormat).load(directory). Auto Loader provides the following benefits over the file source (a brief comparison sketch follows the list):

  • Scalability: Auto Loader can discover billions of files efficiently. Backfills can be performed asynchronously to avoid wasting any compute resources.

  • Performance: The cost of discovering files with Auto Loader scales with the number of files that are being ingested instead of the number of directories that the files may land in. See Optimized directory listing.

  • Schema inference and evolution support: Auto Loader can detect schema drifts, notify you when schema changes happen, and rescue data that would have been otherwise ignored or lost. See Schema inference.

  • Cost: Auto Loader uses native cloud APIs to get lists of files that exist in storage. In addition, Auto Loader’s file notification mode can help reduce your cloud costs further by avoiding directory listing altogether. Auto Loader can automatically set up file notification services on storage to make file discovery much cheaper.
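
For a rough side-by-side comparison, the following Python sketch shows a plain Structured Streaming file source read next to the equivalent Auto Loader read. The input_path and schema_path variables are placeholders for this illustration; the Auto Loader options shown (cloudFiles.format, cloudFiles.schemaLocation, cloudFiles.useNotifications) correspond to the schema inference and file notification capabilities described above.

    # Apache Spark file source: incremental reads, but the schema must be
    # supplied up front and file discovery relies on directory listing.
    df_file_source = (spark.readStream
      .format("csv")
      .option("header", "true")
      .schema("city string, year int, population long")
      .load(input_path))  # input_path is a placeholder directory

    # Auto Loader: the cloudFiles source can infer and evolve the schema
    # (cloudFiles.schemaLocation) and can use file notification mode
    # (cloudFiles.useNotifications) instead of directory listing.
    df_auto_loader = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("header", "true")
      .option("cloudFiles.schemaLocation", schema_path)  # schema_path is a placeholder
      # .option("cloudFiles.useNotifications", "true")   # optional: file notification mode
      .load(input_path))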

Quickstart

The following code example demonstrates how Auto Loader detects new data files as they arrive in cloud storage. You can run the example code from within a notebook attached to a Databricks cluster.

  1. Create the file upload directory, for example:

    Python:

    user_dir = '<my-name>@<my-organization.com>'
    upload_path = "/FileStore/shared-uploads/" + user_dir + "/population_data_upload"
    
    dbutils.fs.mkdirs(upload_path)
    
    Scala:

    val user_dir = "<my-name>@<my-organization.com>"
    val upload_path = "/FileStore/shared-uploads/" + user_dir + "/population_data_upload"
    
    dbutils.fs.mkdirs(upload_path)
    
  2. Create the following sample CSV files, and then upload them to the file upload directory by using the DBFS file browser:

    WA.csv:

    city,year,population
    Seattle metro,2019,3406000
    Seattle metro,2020,3433000
    

    OR.csv:

    city,year,population
    Portland metro,2019,2127000
    Portland metro,2020,2151000
    
  3. Run the following code to start Auto Loader.

    Python:

    checkpoint_path = '/tmp/delta/population_data/_checkpoints'
    write_path = '/tmp/delta/population_data'
    
    # Set up the stream to begin reading incoming files from the
    # upload_path location.
    df = spark.readStream.format('cloudFiles') \
      .option('cloudFiles.format', 'csv') \
      .option('header', 'true') \
      .schema('city string, year int, population long') \
      .load(upload_path)
    
    # Start the stream.
    # Use the checkpoint_path location to keep a record of all files that
    # have already been uploaded to the upload_path location.
    # For those that have been uploaded since the last check,
    # write the newly-uploaded files' data to the write_path location.
    df.writeStream.format('delta') \
      .option('checkpointLocation', checkpoint_path) \
      .start(write_path)
    
    Scala:

    val checkpoint_path = "/tmp/delta/population_data/_checkpoints"
    val write_path = "/tmp/delta/population_data"
    
    // Set up the stream to begin reading incoming files from the
    // upload_path location.
    val df = spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("header", "true")
      .schema("city string, year int, population long")
      .load(upload_path)
    
    // Start the stream.
    // Use the checkpoint_path location to keep a record of all files that
    // have already been uploaded to the upload_path location.
    // For those that have been uploaded since the last check,
    // write the newly-uploaded files' data to the write_path location.
    df.writeStream.format("delta")
      .option("checkpointLocation", checkpoint_path)
      .start(write_path)
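
    Optionally, you can keep a handle on the running stream. The following Python variation (assuming the same df, checkpoint_path, and write_path as above) captures the StreamingQuery object returned by start() so you can inspect or stop the stream later:

    # Keep a reference to the StreamingQuery returned by start().
    query = df.writeStream.format('delta') \
      .option('checkpointLocation', checkpoint_path) \
      .start(write_path)

    # query.status reports whether the stream is processing or waiting for
    # new files; query.stop() stops it when you are done.
    print(query.status)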
    
  4. With the code from step 3 still running, run the following code to query the data in the write directory:

    Python:

    df_population = spark.read.format('delta').load(write_path)
    
    display(df_population)
    
    '''
    Result:
    +----------------+------+------------+
    | city           | year | population |
    +================+======+============+
    | Seattle metro  | 2019 | 3406000    |
    +----------------+------+------------+
    | Seattle metro  | 2020 | 3433000    |
    +----------------+------+------------+
    | Portland metro | 2019 | 2127000    |
    +----------------+------+------------+
    | Portland metro | 2020 | 2151000    |
    +----------------+------+------------+
    '''
    
    Scala:

    val df_population = spark.read.format("delta").load(write_path)
    
    display(df_population)
    
    /* Result:
    +----------------+------+------------+
    | city           | year | population |
    +================+======+============+
    | Seattle metro  | 2019 | 3406000    |
    +----------------+------+------------+
    | Seattle metro  | 2020 | 3433000    |
    +----------------+------+------------+
    | Portland metro | 2019 | 2127000    |
    +----------------+------+------------+
    | Portland metro | 2020 | 2151000    |
    +----------------+------+------------+
    */
    
  5. With the code from step 3 still running, create the following additional CSV files, and then upload them to the upload directory by using the DBFS file browser:

    ID.csv:

    city,year,population
    Boise,2019,438000
    Boise,2020,447000
    

    MT.csv:

    city,year,population
    Helena,2019,81653
    Helena,2020,82590
    

    Misc.csv:

    city,year,population
    Seattle metro,2021,3461000
    Portland metro,2021,2174000
    Boise,2021,455000
    Helena,2021,81653
    
  6. With the code from step 3 still running, run the following code to query the data in the write directory again. The results now include the original rows as well as the new rows from the files that Auto Loader has detected in the upload directory and written to the write directory:

    Python:

    df_population = spark.read.format('delta').load(write_path)
    
    display(df_population)
    
    '''
    Result:
    +----------------+------+------------+
    | city           | year | population |
    +================+======+============+
    | Seattle metro  | 2019 | 3406000    |
    +----------------+------+------------+
    | Seattle metro  | 2020 | 3433000    |
    +----------------+------+------------+
    | Helena         | 2019 | 81653      |
    +----------------+------+------------+
    | Helena         | 2020 | 82590      |
    +----------------+------+------------+
    | Boise          | 2019 | 438000     |
    +----------------+------+------------+
    | Boise          | 2020 | 447000     |
    +----------------+------+------------+
    | Portland metro | 2019 | 2127000    |
    +----------------+------+------------+
    | Portland metro | 2020 | 2151000    |
    +----------------+------+------------+
    | Seattle metro  | 2021 | 3461000    |
    +----------------+------+------------+
    | Portland metro | 2021 | 2174000    |
    +----------------+------+------------+
    | Boise          | 2021 | 455000     |
    +----------------+------+------------+
    | Helena         | 2021 | 81653      |
    +----------------+------+------------+
    '''
    
    Scala:

    val df_population = spark.read.format("delta").load(write_path)
    
    display(df_population)
    
    /* Result:
    +----------------+------+------------+
    | city           | year | population |
    +================+======+============+
    | Seattle metro  | 2019 | 3406000    |
    +----------------+------+------------+
    | Seattle metro  | 2020 | 3433000    |
    +----------------+------+------------+
    | Helena         | 2019 | 81653      |
    +----------------+------+------------+
    | Helena         | 2020 | 82590      |
    +----------------+------+------------+
    | Boise          | 2019 | 438000     |
    +----------------+------+------------+
    | Boise          | 2020 | 447000     |
    +----------------+------+------------+
    | Portland metro | 2019 | 2127000    |
    +----------------+------+------------+
    | Portland metro | 2020 | 2151000    |
    +----------------+------+------------+
    | Seattle metro  | 2021 | 3461000    |
    +----------------+------+------------+
    | Portland metro | 2021 | 2174000    |
    +----------------+------+------------+
    | Boise          | 2021 | 455000     |
    +----------------+------+------------+
    | Helena         | 2021 | 81653      |
    +----------------+------+------------+
    */
    
  7. To clean up, cancel the running code from step 3 (or stop the stream programmatically, as sketched after the cleanup code), and then run the following code to delete the upload, checkpoint, and write directories:

    Python:

    dbutils.fs.rm(write_path, True)
    dbutils.fs.rm(upload_path, True)
    
    Scala:

    dbutils.fs.rm(write_path, true)
    dbutils.fs.rm(upload_path, true)
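
    If you prefer to stop the stream from code rather than cancelling the notebook cell, a minimal Python sketch using the standard spark.streams.active API is:

    # Stop every active streaming query in this Spark session before deleting
    # the checkpoint and write directories.
    for stream in spark.streams.active:
        stream.stop()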
    

See also Tutorial: Continuously ingest data into Delta Lake with Auto Loader.