Tutorial: Run a real-time streaming workload
Real-time mode enables ultra-low latency streaming with end-to-end latency as low as five milliseconds, making it ideal for operational workloads like fraud detection and real-time personalization. This tutorial guides you through setting up your first real-time streaming query using a simple example.
For conceptual information about real-time mode, when to use it, and supported features, see Real-time mode in Structured Streaming. For configuration requirements, see Set up real-time mode.
Requirements
Before you begin, ensure you have permissions to create a classic compute cluster that uses the configuration specified in Set up real-time mode. Alternatively, contact your workspace administrator to create a real-time mode cluster for you.
Step 1: Create a notebook
Notebooks provide an interactive environment for developing and testing streaming queries. You use this notebook to write your real-time query and see the results update continuously.
To create a notebook:
- Click New in the sidebar, then click
Notebook.
- In the compute drop-down menu, select your real-time mode cluster.
- Select Python or Scala as the default language.
Step 2: Run a real-time mode query
Copy and paste the following code into a notebook cell and run it. This example uses a rate source, which generates rows at a specified rate, and displays the results in real time.
The display function with realTime trigger is available in Databricks Runtime 17.1 and above.
- Python
- Scala
inputDF = (
spark
.readStream
.format("rate")
.option("numPartitions", 2)
.option("rowsPerSecond", 1)
.load()
)
display(inputDF, realTime="5 minutes", outputMode="update")
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.streaming.OutputMode
val inputDF = spark
.readStream
.format("rate")
.option("numPartitions", 2)
.option("rowsPerSecond", 1)
.load()
display(inputDF, trigger=Trigger.RealTime(), outputMode=OutputMode.Update())
After running the code, you see a table that updates in real time as new rows are generated. The table displays a timestamp column and a value column that increments with each row.
Understanding the code
The code above demonstrates the essential components of a real-time streaming query. The following tables explain the key parameters and what they control:
- Python
- Scala
Parameter | Description |
|---|---|
| Uses the rate source, a built-in source that generates rows at a configurable rate. This is useful for testing without external dependencies. |
| Sets the number of partitions for the generated data. |
| Controls how many rows are generated per second. |
| Enables real-time mode. The interval specifies how often the query checkpoints progress. Longer intervals mean less frequent checkpointing but potentially longer recovery times after failures. |
| Real-time mode requires update output mode. |
Parameter | Description |
|---|---|
| Uses the rate source, a built-in source that generates rows at a configurable rate. This is useful for testing without external dependencies. |
| Sets the number of partitions for the generated data. |
| Controls how many rows are generated per second. |
| Enables real-time mode with the default checkpoint interval. You can also specify an interval, for example |
| Real-time mode requires update output mode. |
Step 3: Validate results
When you run the query, the display function creates a table that updates in real time as the rate source generates new rows. Each row contains:
- A timestamp for when the row was generated by the rate source.
- A monotonically increasing counter that increments with each new row.
The table updates continuously with minimal latency, demonstrating how real-time mode processes data as soon as it becomes available. This is the core benefit of real-time mode - the ability to see and act on data immediately rather than waiting for batch processing.
Additional resources
Now that you've run your first real-time query, explore these resources to build production streaming applications with Kafka, Kinesis, and other supported sources: