Tutorial: Run a real-time streaming workload

Real-time mode enables ultra-low latency streaming with end-to-end latency as low as five milliseconds, making it ideal for operational workloads like fraud detection and real-time personalization. This tutorial guides you through setting up your first real-time streaming query using a simple example.

For conceptual information about real-time mode, when to use it, and supported features, see Real-time mode in Structured Streaming. For configuration requirements, see Set up real-time mode.

Requirements

Before you begin, ensure you have permissions to create a classic compute cluster that uses the configuration specified in Set up real-time mode. Alternatively, contact your workspace administrator to create a real-time mode cluster for you.

Step 1: Create a notebook

Notebooks provide an interactive environment for developing and testing streaming queries. You use this notebook to write your real-time query and see the results update continuously.

To create a notebook:

1. Click New in the sidebar, then click Notebook.
2. In the compute drop-down menu, select your real-time mode cluster.
3. Select Python or Scala as the default language.

Step 2: Run a real-time mode query

Copy and paste the following code into a notebook cell and run it. This example uses a rate source, which generates rows at a specified rate, and displays the results in real time.

Note: The display function with the realTime trigger is available in Databricks Runtime 17.1 and above.

Python
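# Build a streaming DataFrame from the built-in rate source.
inputDF = (
  spark
  .readStream
  .format("rate")
  .option("numPartitions", 2)  # partitions for the generated rows
  .option("rowsPerSecond", 1)  # rows generated per second
  .load()
)
# Show the stream in the notebook with the real-time trigger and update output mode.
display(inputDF, realTime="5 minutes", outputMode="update")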

Scala
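import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.streaming.OutputMode

// Build a streaming DataFrame from the built-in rate source.
val inputDF = spark
  .readStream
  .format("rate")
  .option("numPartitions", 2) // partitions for the generated rows
  .option("rowsPerSecond", 1) // rows generated per second
  .load()
// Show the stream in the notebook with the real-time trigger and update output mode.
display(inputDF, trigger=Trigger.RealTime(), outputMode=OutputMode.Update())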

After running the code, you see a table that updates in real time as new rows are generated. The table displays a timestamp column and a value column that increments with each row.

Understanding the code

The code above demonstrates the essential components of a real-time streaming query. The following tables explain the key parameters and what they control:

Python

Parameter	Description
`format("rate")`	Uses the rate source, a built-in source that generates rows at a configurable rate. This is useful for testing without external dependencies.
`numPartitions`	Sets the number of partitions for the generated data.
`rowsPerSecond`	Controls how many rows are generated per second.
`realTime="5 minutes"`	Enables real-time mode. The interval specifies how often the query checkpoints progress. Longer intervals mean less frequent checkpointing but potentially longer recovery times after failures.
`outputMode="update"`	Real-time mode requires update output mode.

Scala
Parameter	Description
`format("rate")`	Uses the rate source, a built-in source that generates rows at a configurable rate. This is useful for testing without external dependencies.
`numPartitions`	Sets the number of partitions for the generated data.
`rowsPerSecond`	Controls how many rows are generated per second.
`Trigger.RealTime()`	Enables real-time mode with the default checkpoint interval. You can also specify an interval, for example `Trigger.RealTime("5 minutes")`.
`OutputMode.Update()`	Real-time mode requires update output mode.
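
To make the checkpoint interval trade-off concrete, the following plain-Python sketch estimates how many rows the rate source generates between two checkpoints. It does not use Spark. The helpers `interval_to_seconds` and `rows_per_checkpoint` are hypothetical names introduced here for illustration, and the parser only handles simple strings such as "5 minutes"; it is not the parser Spark uses.

```python
import re

# Seconds per unit for the simple interval strings this sketch accepts.
_UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600}


def interval_to_seconds(interval: str) -> int:
    """Convert a string like '5 minutes' to seconds (illustrative helper only)."""
    match = re.fullmatch(r"\s*(\d+)\s+(second|minute|hour)s?\s*", interval.lower())
    if match is None:
        raise ValueError(f"unsupported interval: {interval!r}")
    count, unit = match.groups()
    return int(count) * _UNIT_SECONDS[unit]


def rows_per_checkpoint(interval: str, rows_per_second: int) -> int:
    """Rows generated during one checkpoint interval at a fixed rate."""
    return interval_to_seconds(interval) * rows_per_second


# Values from the tutorial: a 5 minute interval at 1 row per second.
print(rows_per_checkpoint("5 minutes", 1))  # 300
```

Assuming recovery resumes from the most recent checkpoint, this number gives a rough sense of how much input may need to be replayed after a failure, which is why longer intervals can mean longer recovery times.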

Step 3: Validate results

When you run the query, the display function creates a table that updates in real time as the rate source generates new rows. Each row contains:

A timestamp for when the row was generated by the rate source.
A monotonically increasing counter that increments with each new row.

The table updates continuously with minimal latency, demonstrating how real-time mode processes data as soon as it becomes available. This is the core benefit of real-time mode: the ability to see and act on data immediately rather than waiting for batch processing.
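
If you want to check the displayed rows against what you expect, the row shape described above can be mimicked in plain Python. This is a simulation only and does not reflect how Spark implements the rate source. `RateRow` and `simulate_rate_rows` are hypothetical names, and the sketch assumes the counter starts at 0 and that timestamps are evenly spaced.

```python
from datetime import datetime, timedelta, timezone
from itertools import islice
from typing import Iterator, NamedTuple


class RateRow(NamedTuple):
    """One simulated row: a generation timestamp and an increasing counter."""
    timestamp: datetime
    value: int


def simulate_rate_rows(rows_per_second: int, start: datetime) -> Iterator[RateRow]:
    """Yield rows shaped like the tutorial's output (illustrative only)."""
    if rows_per_second <= 0:
        raise ValueError("rows_per_second must be positive")
    step = timedelta(seconds=1) / rows_per_second  # spacing between rows
    value = 0
    while True:
        yield RateRow(start + value * step, value)
        value += 1


# Take the first three simulated rows at 1 row per second.
start = datetime(2025, 1, 1, tzinfo=timezone.utc)
first_rows = list(islice(simulate_rate_rows(1, start), 3))
for row in first_rows:
    print(row.timestamp.isoformat(), row.value)
```

In the notebook, you should see the same pattern: each new row has a later timestamp and a larger value than the row before it.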

Additional resources

Now that you've run your first real-time query, explore these resources to build production streaming applications with Kafka, Kinesis, and other supported sources:
