Skip to main content

Tutorial: Run a real-time streaming workload

Real-time mode enables ultra-low latency streaming with end-to-end latency as low as five milliseconds, making it ideal for operational workloads like fraud detection and real-time personalization. This tutorial guides you through setting up your first real-time streaming query using a simple example.

For conceptual information about real-time mode, when to use it, and supported features, see Real-time mode in Structured Streaming. For configuration requirements, see Set up real-time mode.

Requirements

Before you begin, ensure you have permissions to create a classic compute cluster that uses the configuration specified in Set up real-time mode. Alternatively, contact your workspace administrator to create a real-time mode cluster for you.

Step 1: Create a notebook

Notebooks provide an interactive environment for developing and testing streaming queries. You use this notebook to write your real-time query and see the results update continuously.

To create a notebook:

  1. Click New in the sidebar, then click Notebook icon. Notebook.
  2. In the compute drop-down menu, select your real-time mode cluster.
  3. Select Python or Scala as the default language.

Step 2: Run a real-time mode query

Copy and paste the following code into a notebook cell and run it. This example uses a rate source, which generates rows at a specified rate, and displays the results in real time.

note

The display function with realTime trigger is available in Databricks Runtime 17.1 and above.

Python
inputDF = (
spark
.readStream
.format("rate")
.option("numPartitions", 2)
.option("rowsPerSecond", 1)
.load()
)
display(inputDF, realTime="5 minutes", outputMode="update")

After running the code, you see a table that updates in real time as new rows are generated. The table displays a timestamp column and a value column that increments with each row.

Understanding the code

The code above demonstrates the essential components of a real-time streaming query. The following tables explain the key parameters and what they control:

Parameter

Description

format("rate")

Uses the rate source, a built-in source that generates rows at a configurable rate. This is useful for testing without external dependencies.

numPartitions

Sets the number of partitions for the generated data.

rowsPerSecond

Controls how many rows are generated per second.

realTime="5 minutes"

Enables real-time mode. The interval specifies how often the query checkpoints progress. Longer intervals mean less frequent checkpointing but potentially longer recovery times after failures.

outputMode="update"

Real-time mode requires update output mode.

Step 3: Validate results

When you run the query, the display function creates a table that updates in real time as the rate source generates new rows. Each row contains:

  • A timestamp for when the row was generated by the rate source.
  • A monotonically increasing counter that increments with each new row.

The table updates continuously with minimal latency, demonstrating how real-time mode processes data as soon as it becomes available. This is the core benefit of real-time mode - the ability to see and act on data immediately rather than waiting for batch processing.

Additional resources

Now that you've run your first real-time query, explore these resources to build production streaming applications with Kafka, Kinesis, and other supported sources: