latestOffset
Returns the most recent offset available given a read limit.
The start offset can be used to determine how much new data should be read given the limit. For the very first microbatch, start is provided from the return value of initialOffset(). For subsequent microbatches, it continues from the last microbatch. The source can return the same offset as the start offset if there is no data to process.
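The way the start offset is handed from one microbatch to the next can be sketched in plain Python. This is a minimal simulation, not the engine's actual scheduling: `FakeSource`, its `available` counter, and the driving loop are hypothetical stand-ins, and the sketch passes `None` as the limit to keep the focus on offsets.

```python
class FakeSource:
    """Hypothetical stand-in for a streaming reader; not a PySpark class."""

    def __init__(self):
        self.available = 0  # total records the pretend source has produced

    def initialOffset(self):
        return {"index": 0}

    def latestOffset(self, start, limit):
        # The limit is ignored in this sketch (assumption: read everything).
        # With no new data, the returned offset equals the start offset.
        return {"index": max(start["index"], self.available)}


source = FakeSource()
start = source.initialOffset()        # the first microbatch starts here
history = []
for arrived in (3, 0, 4):             # records arriving before each microbatch
    source.available += arrived
    end = source.latestOffset(start, None)
    history.append((start, end))
    start = end                       # the next microbatch continues from here
```

The second iteration receives no new records, so `latestOffset` returns an offset equal to `start`, which is the "no data to process" case described above.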
ReadLimit can be used by the source to limit the amount of data returned. Implement getDefaultReadLimit() to provide the proper ReadLimit if the source can limit data based on source options.
The engine can still call latestOffset() with ReadAllAvailable even if the source returns a different read limit from getDefaultReadLimit(). The source must always respect the ReadLimit that the engine passes in.
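The interaction between getDefaultReadLimit() and latestOffset() might look roughly like the sketch below. To keep it runnable without Spark, the `ReadLimit`, `ReadAllAvailable`, and `ReadMaxRows` classes here are local stand-ins rather than the real imports, and `CappedSource`, its `pending` counter, and the `maxRowsPerBatch` option are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


class ReadLimit:
    """Stand-in for the real ReadLimit base class (assumption)."""


class ReadAllAvailable(ReadLimit):
    """Stand-in: no cap on the data returned."""


@dataclass
class ReadMaxRows(ReadLimit):
    """Stand-in: row cap; the field name mirrors this page's example."""
    maxRows: int


class CappedSource:
    """Hypothetical reader whose default limit comes from a source option."""

    def __init__(self, options, pending):
        self.options = options
        self.pending = pending  # pretend number of unread records

    def getDefaultReadLimit(self):
        # Use a row cap only when the (hypothetical) option is set.
        if "maxRowsPerBatch" in self.options:
            return ReadMaxRows(int(self.options["maxRowsPerBatch"]))
        return ReadAllAvailable()

    def latestOffset(self, start, limit):
        # Honor whichever limit is passed in, not the source's own default.
        if isinstance(limit, ReadAllAvailable):
            step = self.pending
        elif isinstance(limit, ReadMaxRows):
            step = min(self.pending, limit.maxRows)
        else:
            raise ValueError(f"unsupported read limit: {limit!r}")
        return {"index": start["index"] + step}


source = CappedSource({"maxRowsPerBatch": "5"}, pending=10)
capped = source.latestOffset({"index": 0}, source.getDefaultReadLimit())
everything = source.latestOffset({"index": 0}, ReadAllAvailable())
```

Here the same source advances by 5 rows under its default limit and by all 10 rows when it is called with `ReadAllAvailable`, because the branch is chosen from the `limit` argument alone.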
Added in Databricks Runtime 15.2
Syntax
latestOffset(start: dict, limit: ReadLimit)
Parameters
| Parameter | Type | Description |
|---|---|---|
| start | dict | The start offset of the microbatch to continue reading from. |
| limit | ReadLimit | The limit on the amount of data to be returned by this call. |
Returns
dict
A dict, or a nested dict, whose keys and values are primitive types, including Integer, String, and Boolean.
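As a rough illustration of that shape, the helper below checks whether a candidate offset is a dict of primitives, nested to any depth. `is_valid_offset` is a hypothetical function written for this sketch, not part of PySpark, and it accepts only the three types named above, which may be narrower than what the engine actually allows.

```python
PRIMITIVES = (int, str, bool)  # assumption: only the types named in the text


def is_valid_offset(value):
    """Return True if value is a dict of primitives, nested to any depth."""
    if not isinstance(value, dict):
        return False
    for key, item in value.items():
        if not isinstance(key, PRIMITIVES):
            return False
        if isinstance(item, dict):
            # Nested dicts must satisfy the same rule.
            if not is_valid_offset(item):
                return False
        elif not isinstance(item, PRIMITIVES):
            return False
    return True


flat = {"index": 42}
nested = {"partitions": {"p0": 10, "p1": 7}, "done": False}
invalid = {"seen": [1, 2, 3]}  # a list is not a primitive or a dict

flat_ok = is_valid_offset(flat)
nested_ok = is_valid_offset(nested)
invalid_ok = is_valid_offset(invalid)
```

A check like this could be used in unit tests for a custom source to catch offsets that carry lists, sets, or arbitrary objects.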
Examples
from pyspark.sql.streaming.datasource import ReadAllAvailable, ReadMaxRows
def latestOffset(self, start, limit):
    # Assume the source has 10 new records between start and latest offset
    if isinstance(limit, ReadAllAvailable):
        return {"index": start["index"] + 10}
    else:  # e.g., limit is ReadMaxRows(5)
        return {"index": start["index"] + min(10, limit.maxRows)}