Observation

A class to observe named metrics on a DataFrame.

Metrics are aggregation expressions applied to the DataFrame while it is being processed by an action. An Observation instance collects the metrics while the first action is executed. Subsequent actions do not modify the metrics returned by Observation.get. Retrieval of the metric via Observation.get blocks until the first action has finished and metrics become available.

Syntax

Python
from pyspark.sql import Observation

observation = Observation(name=<name>)

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| name | str, optional | The name of the Observation and the metric. Defaults to a random UUID string. |

Properties

| Property | Description |
| --- | --- |
| get | Returns the observed metrics as a dictionary. Waits until the observed dataset finishes its first action. Only the result of the first action is available. |

Notes

This class does not support streaming datasets.

The metrics columns must either contain a literal (for example, lit(42)), or must contain one or more aggregate functions (for example, sum(a) or sum(a + b) + avg(c) - lit(1)). Expressions that contain references to the input DataFrame's columns must always be wrapped in an aggregate function.

Examples

Python
from pyspark.sql.functions import col, count, lit, max
from pyspark.sql import Observation

df = spark.createDataFrame([["Alice", 2], ["Bob", 5]], ["name", "age"])
observation = Observation("my metrics")
observed_df = df.observe(observation, count(lit(1)).alias("count"), max(col("age")))
observed_df.count()
Output
2
Python
observation.get
Output
{'count': 2, 'max(age)': 5}