Real-time mode reference
This page provides reference information for real-time mode in Structured Streaming, including supported environments, languages, sources, sinks, and operators. For known limitations, see Real-time mode limitations.
Supported languages
Real-time mode supports Scala, Java, and Python.
Compute types
Real-time mode supports the following compute types:
| Compute type | Supported |
|---|---|
| Dedicated (formerly: single user) | ✓ |
| Standard (formerly: shared) | ✓ (Python only) |
| Lakeflow Spark Declarative Pipelines Classic | Not supported |
| Lakeflow Spark Declarative Pipelines Serverless | Not supported |
| Serverless | Not supported |
Execution modes
Real-time mode supports update mode only:
| Execution mode | Supported |
|---|---|
| Update mode | ✓ |
| Append mode | Not supported |
| Complete mode | Not supported |
Sources and sinks
Real-time mode supports the following sources and sinks:
| Source or sink | As source | As sink |
|---|---|---|
| Apache Kafka | ✓ | ✓ |
| Event Hubs (using the Kafka connector) | ✓ | ✓ |
| Kinesis | ✓ (EFO mode only) | Not supported |
| AWS MSK | ✓ | Not supported |
| Delta | Not supported | Not supported |
| Google Pub/Sub | Not supported | Not supported |
| Apache Pulsar | Not supported | Not supported |
| Arbitrary sinks (using `foreach`) | Not applicable | ✓ |
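As a sketch of how a supported source and sink are wired together, the following reads from one Kafka topic and writes to another in update mode. The broker address, topic names, and checkpoint path are hypothetical placeholders; the sketch assumes a cluster where real-time mode is available.

```python
# Hypothetical Kafka-to-Kafka query sketch; the broker, topic names, and
# checkpoint path are placeholders, not a reference configuration.
SOURCE_OPTIONS = {
    "kafka.bootstrap.servers": "broker-1:9092",  # hypothetical broker
    "subscribe": "events-in",                    # hypothetical topic
    "startingOffsets": "latest",
}

def build_query(spark):
    """Return an unstarted update-mode writer for a Kafka-to-Kafka query."""
    events = spark.readStream.format("kafka").options(**SOURCE_OPTIONS).load()
    # Kafka exposes binary key/value columns; cast them before processing.
    projected = events.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
    )
    return (
        projected.writeStream.format("kafka")
        .option("kafka.bootstrap.servers",
                SOURCE_OPTIONS["kafka.bootstrap.servers"])
        .option("topic", "events-out")                          # hypothetical
        .option("checkpointLocation", "/tmp/checkpoints/demo")  # hypothetical
        .outputMode("update")  # real-time mode supports update mode only
    )
```

Start the query with `build_query(spark).start()` on an active `SparkSession`.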
Operators
Real-time mode supports most Structured Streaming operators:
Stateless operations
| Operator | Supported |
|---|---|
| Selection | ✓ |
| Projection | ✓ |
UDFs
| Operator | Supported |
|---|---|
| Scala UDF | ✓ |
| Python UDF | ✓ (see Python UDFs in real-time mode) |
Aggregation
| Operator | Supported |
|---|---|
| sum | ✓ |
| count | ✓ |
| max | ✓ |
| min | ✓ |
| avg | ✓ |
Windowing
| Operator | Supported |
|---|---|
| Tumbling | ✓ |
| Sliding | ✓ |
| Session | Not supported |
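To make the tumbling-versus-sliding distinction concrete, this plain-Python sketch computes the window intervals that contain a given event timestamp. It mirrors the assignment logic behind Spark's `window` function but is an illustration, not the engine's implementation.

```python
def windows_for(ts, size, slide=None):
    """Return the [start, end) intervals of length `size` containing `ts`.

    A tumbling window is a sliding window whose slide equals its size,
    so each timestamp falls into exactly one window.
    """
    slide = size if slide is None else slide
    windows = []
    start = (ts // slide) * slide   # latest window start at or before ts
    while start > ts - size:        # every earlier start still covering ts
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)

# Tumbling 10s windows: one window per event.
print(windows_for(13, size=10))            # [(10, 20)]
# Sliding 10s windows every 5s: the event lands in two windows.
print(windows_for(13, size=10, slide=5))   # [(5, 15), (10, 20)]
```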
Deduplication
| Operator | Supported |
|---|---|
| dropDuplicates | ✓ (the state is unbounded) |
| dropDuplicatesWithinWatermark | Not supported |
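The "state is unbounded" caveat for `dropDuplicates` can be illustrated with a plain-Python sketch (not the Spark implementation): every distinct key ever seen must be retained for the life of the query, so state grows without limit.

```python
def dedup_stream(rows):
    """Yield each distinct row once, in arrival order.

    The `seen` set plays the role of the operator's state store: it is
    never pruned, so memory grows with the number of distinct rows.
    """
    seen = set()  # unbounded: grows for the life of the query
    for row in rows:
        if row not in seen:
            seen.add(row)
            yield row

print(list(dedup_stream(["a", "b", "a", "c", "b"])))  # ['a', 'b', 'c']
```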
Stream to table join
| Operator | Supported |
|---|---|
| Broadcast table join (the table should be small) | ✓ |

Other operators
| Operator | Supported |
|---|---|
| Stream to stream join | Not supported |
| (flat)MapGroupsWithState | Not supported |
| transformWithState | ✓ (see transformWithState in real-time mode) |
| union | ✓ |
| forEach | ✓ |
| forEachBatch | Not supported |
| mapPartitions | Not supported (see Real-time mode limitations) |
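Because `forEach` is supported while `forEachBatch` is not, a custom sink is written as a row-at-a-time writer object. The following is a minimal sketch with a hypothetical stdout writer; the class and its behavior follow the standard `writeStream.foreach()` contract.

```python
class StdoutWriter:
    """Row-at-a-time sink for writeStream.foreach(); hypothetical example.

    Spark calls open() once per partition and epoch, process() once per
    row, and close() when the partition finishes or fails.
    """

    def open(self, partition_id, epoch_id):
        self.rows_seen = 0
        return True  # True means: process rows for this partition

    def process(self, row):
        self.rows_seen += 1
        print(row)

    def close(self, error):
        if error is not None:
            raise error

# Usage on a streaming DataFrame `df`:
#   df.writeStream.foreach(StdoutWriter()).outputMode("update").start()
```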
Special considerations
Some operators and features have specific considerations or differences when used in real-time mode.
transformWithState in real-time mode
For building custom stateful applications, Databricks supports transformWithState, an API in Apache Spark Structured Streaming. See Build a custom stateful application for more information about the API and code snippets.
However, the API behaves differently in real-time mode than in traditional streaming queries that use the micro-batch architecture.
- Real-time mode calls the `handleInputRows(key: String, inputRows: Iterator[T], timerValues: TimerValues)` method for each row, and the `inputRows` iterator returns a single value. Micro-batch mode calls it once for each key, and the `inputRows` iterator returns all values for that key in the micro-batch. Account for this difference when writing your code.
- Event time timers are not supported in real-time mode.
- In real-time mode, timer firing is delayed depending on data arrival:
  - If a timer is scheduled for 10:00:00 but no data arrives, the timer doesn't fire immediately.
  - If data arrives at 10:00:10, the timer fires with a 10-second delay.
  - If no data arrives and the long-running batch is terminating, the timer fires before the batch terminates.
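The difference in how `handleInputRows` is invoked can be illustrated with a plain-Python simulation (not the Spark API itself): micro-batch mode groups all values for a key into one call, while real-time mode makes one call per row.

```python
def run_micro_batch(handle_input_rows, batch):
    """Micro-batch style: one handler call per key, all its values at once."""
    groups = {}
    for key, value in batch:
        groups.setdefault(key, []).append(value)
    for key, values in groups.items():
        handle_input_rows(key, iter(values))

def run_real_time(handle_input_rows, stream):
    """Real-time style: one handler call per row, single-value iterator."""
    for key, value in stream:
        handle_input_rows(key, iter([value]))

calls = []
def handler(key, input_rows):
    calls.append((key, list(input_rows)))

data = [("a", 1), ("a", 2), ("b", 3)]
run_micro_batch(handler, data)   # 2 calls: ("a", [1, 2]) and ("b", [3])
micro_calls = len(calls)
calls.clear()
run_real_time(handler, data)     # 3 calls, one value each
print(micro_calls, len(calls))   # 2 3
```

A handler that sums per key works unchanged in both modes only if it accumulates into state on every call rather than assuming the iterator holds all values for the key.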
Python UDFs in real-time mode
Databricks supports the majority of Python user-defined functions (UDFs) in real-time mode:
Stateless
| UDF type | Supported |
|---|---|
| Python scalar UDF (user-defined scalar functions - Python) | ✓ |
| Arrow scalar UDF | ✓ |
| Pandas scalar UDF (pandas user-defined functions) | ✓ |
| Arrow function (`mapInArrow`) | ✓ |
| Pandas function (`mapInPandas`) | ✓ |
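Assuming pandas is available, the following sketches a pandas map function of the kind used with `mapInPandas`: it receives and yields an iterator of pandas DataFrames. The column name and transformation are hypothetical.

```python
import pandas as pd

def double_values(batches):
    """mapInPandas-style function: iterator of DataFrames in and out.

    Doubles a hypothetical integer `value` column in each batch.
    """
    for pdf in batches:
        pdf = pdf.copy()
        pdf["value"] = pdf["value"] * 2
        yield pdf

# Usage on a streaming DataFrame `df` with an integer `value` column:
#   df.mapInPandas(double_values, schema="value long")
```

Remember that with real-time mode's recommended Arrow batch size of 1, each DataFrame the function receives holds a single row.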
Stateful grouping (UDAF)
| UDF type | Supported |
|---|---|
| | ✓ |
| | Not supported |
Non-stateful grouping (UDAF)
| UDF type | Supported |
|---|---|
| | Not supported |
| | Not supported |
| | Not supported |
Table functions
| UDF type | Supported |
|---|---|
| | Not supported |
| UC UDF | Not supported |
There are several points to consider when using Python UDFs in real-time mode:
- To minimize latency, set the Arrow batch size (`spark.sql.execution.arrow.maxRecordsPerBatch`) to 1.
  - Trade-off: this configuration optimizes for latency at the expense of throughput. For most workloads, this setting is recommended.
  - Increase the batch size only if higher throughput is required to accommodate the input volume, accepting the potential increase in latency.
- Pandas UDFs and functions do not perform well with an Arrow batch size of 1.
  - If you use pandas UDFs or functions, set the Arrow batch size to a higher value (for example, 100 or higher). This implies higher latency, so Databricks recommends using an Arrow UDF or function if possible.
- Due to the performance issue with pandas, `transformWithState` is only supported with the `Row` interface.
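The batch-size guidance above amounts to a one-line configuration change; this sketch assumes an active `SparkSession` named `spark` on a Databricks cluster.

```python
# Latency-first: one record per Arrow batch. Recommended for most
# real-time workloads, at the cost of throughput.
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "1")

# Throughput-first alternative for pandas UDFs and functions, which
# perform poorly with a batch size of 1 (accepts higher latency):
# spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "100")
```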