Real-time mode reference

This page provides reference information for real-time mode in Structured Streaming, including supported environments, languages, sources, sinks, and operators. For known limitations, see Real-time mode limitations.

Supported languages

Real-time mode supports Scala, Java, and Python.

Compute types

Real-time mode supports the following compute types:

| Compute type | Supported |
| --- | --- |
| Dedicated (formerly: single user) | ✓ |
| Standard (formerly: shared) | ✓ (only Python) |
| Lakeflow Spark Declarative Pipelines Classic | Not supported |
| Lakeflow Spark Declarative Pipelines Serverless | Not supported |
| Serverless | Not supported |

Execution modes

Real-time mode supports update mode only:

| Execution mode | Supported |
| --- | --- |
| Update mode | ✓ |
| Append mode | Not supported |
| Complete mode | Not supported |

Sources and sinks

Real-time mode supports the following sources and sinks:

| Source or sink | As source | As sink |
| --- | --- | --- |
| Apache Kafka | ✓ | ✓ |
| Event Hubs (using Kafka connector) | ✓ | ✓ |
| Kinesis | ✓ (only EFO mode) | Not supported |
| AWS MSK | ✓ | Not supported |
| Delta | Not supported | Not supported |
| Google Pub/Sub | Not supported | Not supported |
| Apache Pulsar | Not supported | Not supported |
| Arbitrary sinks (using forEachWriter) | Not applicable | ✓ |

Operators

Real-time mode supports most Structured Streaming operators:

Stateless operations

| Operator | Supported |
| --- | --- |
| Selection | ✓ |
| Projection | ✓ |

UDFs

| Operator | Supported |
| --- | --- |
| Scala UDF | ✓ (with some limitations) |
| Python UDF | ✓ (with some limitations) |

Aggregation

| Operator | Supported |
| --- | --- |
| sum | ✓ |
| count | ✓ |
| max | ✓ |
| min | ✓ |
| avg | ✓ |
| Aggregation functions | ✓ |

Windowing

| Operator | Supported |
| --- | --- |
| Tumbling | ✓ |
| Sliding | ✓ |
| Session | Not supported |

Deduplication

| Operator | Supported |
| --- | --- |
| dropDuplicates | ✓ (the state is unbounded) |
| dropDuplicatesWithinWatermark | Not supported |

Stream to table join

| Operator | Supported |
| --- | --- |
| Broadcast table join (table should be small) | ✓ |

Other operators

| Operator | Supported |
| --- | --- |
| Stream to stream join | Not supported |
| (flat)MapGroupsWithState | Not supported |
| transformWithState | ✓ (with some differences) |
| union | ✓ (with some limitations) |
| forEach | ✓ |
| forEachBatch | Not supported |
| mapPartitions | Not supported (see limitation) |

Special considerations

Some operators and features have specific considerations or differences when used in real-time mode.

transformWithState in real-time mode

For building custom stateful applications, Databricks supports transformWithState, an API in Apache Spark Structured Streaming. See Build a custom stateful application for more information about the API and code snippets.

However, the API behaves differently in real-time mode than in traditional streaming queries, which use the micro-batch architecture.

  • Real-time mode calls the handleInputRows(key: String, inputRows: Iterator[T], timerValues: TimerValues) method for each row.
    • In real-time mode, the inputRows iterator returns a single value per call. In micro-batch mode, the method is called once for each key, and the inputRows iterator returns all of that key's values in the micro-batch.
    • Account for this difference when writing your code.
  • Event time timers are not supported in real-time mode.
  • In real-time mode, timer firing is delayed depending on data arrival:
    • If a timer is scheduled for 10:00:00 but no data arrives, the timer doesn't fire immediately.
    • If data arrives at 10:00:10, the timer fires with a 10-second delay.
    • If no data arrives and the long-running batch is terminating, the timer fires before the batch terminates.
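The per-row versus per-key difference above can be sketched in plain Python. This is an illustrative sketch only, not the real transformWithState API: the class, method, and state store below are hypothetical stand-ins. The point is that a handler must keep its per-key result in state across calls, because in real-time mode each call's iterator yields a single row.

```python
# Hypothetical stand-in for a stateful processor; not the Spark API.
class RunningCountProcessor:
    """Counts rows per key, keeping the count in state between calls."""

    def __init__(self):
        self.state = {}  # stand-in for a per-key state store

    def handle_input_rows(self, key, input_rows):
        # Correct in both modes: never assume the iterator holds *all*
        # rows for the key -- in real-time mode it yields one row.
        count = self.state.get(key, 0)
        for _ in input_rows:
            count += 1
        self.state[key] = count
        return count


# Micro-batch mode: one call per key; iterator holds all rows for the key.
processor_mb = RunningCountProcessor()
micro_batch_total = processor_mb.handle_input_rows("user-1", iter(["r1", "r2", "r3"]))

# Real-time mode: one call per row; iterator yields a single row each time.
processor_rt = RunningCountProcessor()
for row in ["r1", "r2", "r3"]:
    real_time_total = processor_rt.handle_input_rows("user-1", iter([row]))

# Both arrive at the same running count because the count lives in state.
print(micro_batch_total, real_time_total)  # 3 3
```

A handler that instead computed its result from a single iterator pass and discarded it would be correct only in micro-batch mode.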

Python UDFs in real-time mode

Databricks supports the majority of Python user-defined functions (UDFs) in real-time mode:

Stateless

| UDF type | Supported |
| --- | --- |
| Python scalar UDF (User-defined scalar functions - Python) | ✓ |
| Arrow scalar UDF | ✓ |
| Pandas scalar UDF (pandas user-defined functions) | ✓ |
| Arrow function (mapInArrow) | ✓ |
| Pandas function (Map) | ✓ |

Stateful grouping (UDAF)

| UDF type | Supported |
| --- | --- |
| transformWithState (only Row interface) | ✓ |
| applyInPandasWithState | Not supported |

Non-stateful grouping (UDAF)

| UDF type | Supported |
| --- | --- |
| apply | Not supported |
| applyInArrow | Not supported |
| applyInPandas | Not supported |

Table functions

| UDF type | Supported |
| --- | --- |
| UDTF (Python user-defined table functions (UDTFs)) | Not supported |
| UC UDF | Not supported |

There are several points to consider when using Python UDFs in real-time mode:

  • To minimize latency, set the Arrow batch size (spark.sql.execution.arrow.maxRecordsPerBatch) to 1.
    • Trade-off: this optimizes for latency at the expense of throughput. This setting is recommended for most workloads.
    • Increase the batch size only if higher throughput is required to accommodate the input volume, accepting the potential increase in latency.
  • Pandas UDFs and functions do not perform well with an Arrow batch size of 1.
    • If you use pandas UDFs or functions, set the Arrow batch size to a higher value (for example, 100 or higher). This implies higher latency, so Databricks recommends using an Arrow UDF or function instead when possible.
  • Because of this pandas performance issue, transformWithState is only supported with the Row interface.
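As a sketch of the configuration described above, the setting can be applied on an active session. This assumes a Databricks notebook where `spark` (a SparkSession) is already defined; the configuration key is the standard Spark Arrow batch-size setting.

```python
# Latency-first: one record per Arrow batch (the recommendation above).
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "1")

# Throughput-first alternative, e.g. when pandas UDFs are unavoidable:
# spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "100")
```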