Get started with real-time mode
This feature is in Public Preview.
Real-time mode enables ultra-low latency streaming with end-to-end latency as low as five milliseconds, making it ideal for operational workloads like fraud detection and real-time personalization. This tutorial guides you through setting up your first real-time streaming query using a simple example.
For conceptual information about real-time mode, when to use it, and supported features, see Real-time mode in Structured Streaming.
Requirements
- You have permission to create classic compute.
- Databricks Runtime 17.1 or above (required for using the `display` function with real-time mode).
If you don't have classic compute creation privileges, contact your workspace administrator to create a real-time mode cluster for you using the configuration in Step 1.
Step 1: Create classic compute for real-time mode
Real-time mode requires a specific classic compute configuration to achieve ultra-low latency. These settings ensure that tasks run simultaneously across all stages and data is processed continuously as it arrives, rather than in batches.
To create a properly configured classic compute:
- In your Databricks workspace, click Compute in the sidebar.
- Click Create compute.
- Enter a name.
- Select Databricks Runtime 17.1 or above.
- Clear Photon acceleration (real-time mode doesn't support Photon).
- Clear Enable autoscaling (real-time mode requires a fixed cluster size).
- Under Advanced performance, clear Use spot instances (spot instances can cause interruptions).
- Click Advanced options to expand additional settings.
- Under Access mode, select Dedicated (formerly: Single user).
- Under Spark config, add the following configuration:

  ```
  spark.databricks.streaming.realTimeMode.enabled true
  ```

- Click Create compute.
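If you prefer to script cluster creation, the same settings can be expressed as a Clusters API payload (for example, passed to `databricks clusters create --json`). The sketch below is illustrative only: the cluster name, Spark version string, node type, and worker count are placeholders to adapt to your workspace, and the `aws_attributes` block applies only on AWS (other clouds use their own attribute blocks to avoid spot instances).

```json
{
  "cluster_name": "realtime-mode-tutorial",
  "spark_version": "17.1.x-scala2.13",
  "node_type_id": "i3.xlarge",
  "num_workers": 2,
  "data_security_mode": "SINGLE_USER",
  "runtime_engine": "STANDARD",
  "spark_conf": {
    "spark.databricks.streaming.realTimeMode.enabled": "true"
  },
  "aws_attributes": {
    "availability": "ON_DEMAND"
  }
}
```

Note how each UI step maps to a field: a fixed `num_workers` instead of an `autoscale` block, `runtime_engine` set to `STANDARD` (Photon off), and `data_security_mode` set to `SINGLE_USER` for Dedicated access mode.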
Step 2: Create a notebook
Notebooks provide an interactive environment for developing and testing streaming queries. You use this notebook to write your real-time query and see the results update continuously.
To create a notebook:
- Click New in the sidebar, then click Notebook.
- In the compute drop-down menu, select the compute you created in Step 1.
- Select Python or Scala as the default language.
Step 3: Run a real-time mode query
Copy and paste the following code into a notebook cell and run it. This example uses a rate source, which generates rows at a specified rate, and displays the results in real time.
The display function with realTime trigger is available in Databricks Runtime 17.1 and above.
Python:

```python
inputDF = (
    spark
        .readStream
        .format("rate")
        .option("numPartitions", 2)
        .option("rowsPerSecond", 1)
        .load()
)

display(inputDF, realTime="5 minutes", outputMode="update")
```

Scala:

```scala
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.streaming.OutputMode

val inputDF = spark
  .readStream
  .format("rate")
  .option("numPartitions", 2)
  .option("rowsPerSecond", 1)
  .load()

display(inputDF, trigger=Trigger.RealTime(), outputMode=OutputMode.Update())
```
After running the code, you see a table that updates in real time as new rows are generated. The table displays a timestamp column and a value column that increments with each row.
Understanding the code
The code above demonstrates the essential components of a real-time streaming query. The following tables explain the key parameters and what they control:
Python:

| Parameter | Description |
|---|---|
| `format("rate")` | Uses the rate source, a built-in source that generates rows at a configurable rate. This is useful for testing without external dependencies. |
| `numPartitions` | Sets the number of partitions for the generated data. |
| `rowsPerSecond` | Controls how many rows are generated per second. |
| `realTime="5 minutes"` | Enables real-time mode. The interval specifies how often the query checkpoints progress. Longer intervals mean less frequent checkpointing but potentially longer recovery times after failures. |
| `outputMode="update"` | Real-time mode requires update output mode. |

Scala:

| Parameter | Description |
|---|---|
| `format("rate")` | Uses the rate source, a built-in source that generates rows at a configurable rate. This is useful for testing without external dependencies. |
| `numPartitions` | Sets the number of partitions for the generated data. |
| `rowsPerSecond` | Controls how many rows are generated per second. |
| `Trigger.RealTime()` | Enables real-time mode with the default checkpoint interval. You can also specify an interval, for example `Trigger.RealTime("5 minutes")`. |
| `OutputMode.Update()` | Real-time mode requires update output mode. |
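Because the cluster settings are easy to miss, it can help to fail fast when the Spark config from Step 1 is absent. The helper below is an illustrative sketch, not a Databricks API; in a notebook you would pass `spark.conf.get` for the `conf_get` argument.

```python
def check_realtime_mode(conf_get):
    """Raise if the real-time mode Spark config is not set to true.

    conf_get: a callable like spark.conf.get(key, default) on the cluster.
    """
    key = "spark.databricks.streaming.realTimeMode.enabled"
    value = conf_get(key, "false")
    if str(value).lower() != "true":
        raise RuntimeError(
            f"{key} is not set; add '{key} true' to the cluster's "
            "Spark config (see Step 1)."
        )
    return True

# On a real-time mode cluster: check_realtime_mode(spark.conf.get)
```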
What you're seeing
When you run the query, the display function creates a table that updates in real time as the rate source generates new rows. Each row contains:
- timestamp: The time when the row was generated by the rate source
- value: A monotonically increasing counter that increments with each new row
The table updates continuously with minimal latency, demonstrating how real-time mode processes data as soon as it becomes available. This is the core benefit of real-time mode: the ability to see and act on data immediately rather than waiting for batch processing.
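To build intuition for the row shape the rate source emits, here is a plain-Python sketch (no Spark required) that produces the same two fields, a generation timestamp paired with an incrementing counter. The generator name and structure are illustrative only, not how the rate source is implemented.

```python
import itertools
import time
from datetime import datetime, timezone

def rate_rows(rows_per_second=1, limit=5):
    """Yield (timestamp, value) pairs shaped like rate-source rows:
    each row carries its generation time and a monotonically
    increasing counter starting at 0."""
    interval = 1.0 / rows_per_second
    for value in itertools.islice(itertools.count(), limit):
        yield datetime.now(timezone.utc), value
        time.sleep(interval)

# Sped up for illustration; the tutorial's query uses rowsPerSecond=1.
for ts, value in rate_rows(rows_per_second=100, limit=3):
    print(ts.isoformat(), value)
```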
What you've learned
You've successfully set up and run your first real-time streaming query. You now know how to:
- Configure classic compute with the required settings for real-time mode (dedicated cluster, Photon disabled, autoscaling disabled, Spark config)
- Enable real-time processing using the `realTime` trigger
- Use the `display` function for interactive development and testing
- Verify that your query is running in real-time mode by observing continuous updates
You're ready to build production real-time pipelines with Kafka, Kinesis, and other supported sources. To learn more about Structured Streaming, see Structured Streaming concepts.
Next steps
Now that you've run your first real-time query, explore these resources to build production streaming applications:
- Real-time mode examples - Working code examples for Kafka sources and sinks, stateful queries, aggregations, and custom sinks
- Real-time mode reference - Learn about cluster sizing, supported operators, monitoring, and feature limitations
- Stateful streaming applications - Add state management to your streaming queries for deduplication, aggregations, and windowing
- Advanced state management - Use `transformWithState` for custom stateful processing with time-to-live (TTL) and complex logic