latestOffset

Returns the most recent offset available given a read limit.

The start offset can be used to determine how much new data should be read given the limit. For the very first microbatch, start is the return value of initialOffset(). For each subsequent microbatch, start is the offset at which the previous microbatch ended. If there is no data to process, the source can return the same offset as the start offset.
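The flow of the start offset across microbatches can be sketched as follows. This is a minimal simulation, not engine code: `CounterReader` and its `available` attribute are hypothetical stand-ins, and the read limit is ignored here to keep the focus on the offsets.

```python
class CounterReader:
    """Hypothetical stand-in for a streaming reader; not a PySpark class."""

    def __init__(self, available):
        # Total number of records the imaginary source currently holds.
        self.available = available

    def initialOffset(self):
        return {"index": 0}

    def latestOffset(self, start, limit=None):
        # The limit is ignored in this sketch.
        # When nothing new has arrived, this returns the same offset as start.
        return {"index": max(start["index"], self.available)}


reader = CounterReader(available=3)
start = reader.initialOffset()     # first microbatch: start comes from initialOffset()
end = reader.latestOffset(start)   # {"index": 3}
idle = reader.latestOffset(end)    # no new data: equal to its start offset
reader.available = 7               # more records arrive at the source
later = reader.latestOffset(idle)  # {"index": 7}
```

Each call passes the offset returned by the previous call as the new start, which is how the simulated batches continue from one another.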

ReadLimit can be used by the source to limit the amount of data returned. Implement getDefaultReadLimit() to provide the proper ReadLimit if the source can limit data based on source options.

The engine can still call latestOffset() with ReadAllAvailable even if getDefaultReadLimit() returns a different read limit. The source must always respect the ReadLimit that the engine passes in.
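One way to handle both cases is sketched below. To keep the sketch runnable without PySpark, `ReadLimit`, `ReadAllAvailable`, and `ReadMaxRows` are redefined here as simple stand-ins. `CappedReader`, the `maxRowsPerBatch` option name, and the `available` counter are all assumptions made for illustration.

```python
from dataclasses import dataclass


# Hypothetical stand-ins so the sketch runs without PySpark installed.
class ReadLimit:
    pass


@dataclass
class ReadAllAvailable(ReadLimit):
    pass


@dataclass
class ReadMaxRows(ReadLimit):
    maxRows: int


class CappedReader:
    def __init__(self, options, available):
        self.options = options      # source options, e.g. {"maxRowsPerBatch": "5"}
        self.available = available  # records the imaginary source currently holds

    def getDefaultReadLimit(self):
        # "maxRowsPerBatch" is an assumed option name, not a documented one.
        max_rows = self.options.get("maxRowsPerBatch")
        if max_rows is None:
            return ReadAllAvailable()
        return ReadMaxRows(int(max_rows))

    def latestOffset(self, start, limit):
        pending = max(0, self.available - start["index"])
        if isinstance(limit, ReadMaxRows):
            # Cap the batch at the limit passed in by the caller.
            pending = min(pending, limit.maxRows)
        elif not isinstance(limit, ReadAllAvailable):
            # Sketch choice: reject limit types this reader does not know.
            raise ValueError(f"Unsupported read limit: {limit!r}")
        return {"index": start["index"] + pending}


reader = CappedReader({"maxRowsPerBatch": "5"}, available=12)
capped = reader.latestOffset({"index": 0}, reader.getDefaultReadLimit())
everything = reader.latestOffset({"index": 0}, ReadAllAvailable())
```

Here `latestOffset()` decides based on the `limit` argument alone, so a call with `ReadAllAvailable` reads everything even though the reader's default limit is a row cap.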

Added in Databricks Runtime 15.2

Syntax

latestOffset(start: dict, limit: ReadLimit)

Parameters

Parameter	Type	Description
`start`	dict	The start offset of the microbatch to continue reading from.
`limit`	ReadLimit	The limit on the amount of data to be returned by this call.

Returns

dict

A dict, optionally nested, whose keys and values are primitive types such as integer, string, and Boolean.
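As a rough illustration of that shape, the following sketch checks an offset before returning it. `is_valid_offset` is a hypothetical helper, not part of any API, and it accepts only the three types named above; the engine may accept other primitive types that this check would reject.

```python
def is_valid_offset(value, top_level=True):
    """Hypothetical helper: True if value is a dict (possibly nested) of primitives."""
    primitives = (str, int, bool)
    if isinstance(value, dict):
        return all(
            isinstance(key, primitives) and is_valid_offset(item, top_level=False)
            for key, item in value.items()
        )
    # A bare primitive is fine as a nested value, but the offset itself must be a dict.
    return not top_level and isinstance(value, primitives)


flat = is_valid_offset({"index": 10})
nested = is_valid_offset({"partitions": {"0": 5, "1": 8}, "done": False})
with_list = is_valid_offset({"index": [1, 2]})
```

In this sketch `flat` and `nested` pass the check, while `with_list` fails because a list is neither a primitive nor a dict.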

Examples

Python
from pyspark.sql.streaming.datasource import ReadAllAvailable, ReadMaxRows

def latestOffset(self, start, limit):
    # Assume the source has 10 new records between start and latest offset
    if isinstance(limit, ReadAllAvailable):
        return {"index": start["index"] + 10}
    else:  # e.g., limit is ReadMaxRows(5)
        return {"index": start["index"] + min(10, limit.maxRows)}
