Observation
A class to observe named metrics on a DataFrame.
Metrics are aggregation expressions applied to the DataFrame while it is being processed by an action. An Observation instance collects the metrics while the first action is executed. Subsequent actions do not modify the metrics returned by Observation.get. Retrieval of the metric via Observation.get blocks until the first action has finished and metrics become available.
Syntax
from pyspark.sql import Observation
observation = Observation(name=<name>)
Parameters
| Parameter | Type | Description |
|---|---|---|
| name | str, optional | The name of the Observation and the metric. Defaults to a random UUID string. |
Properties
| Property | Description |
|---|---|
| get | Returns the observed metrics as a dictionary. Waits until the observed dataset finishes its first action. Only the result of the first action is available. |
Notes
This class does not support streaming datasets.
The metrics columns must either contain a literal (for example, lit(42)), or must contain one or more aggregate functions (for example, sum(a) or sum(a + b) + avg(c) - lit(1)). Expressions that contain references to the input DataFrame's columns must always be wrapped in an aggregate function.
Examples
>>> from pyspark.sql.functions import col, count, lit, max
>>> from pyspark.sql import Observation
>>> df = spark.createDataFrame([["Alice", 2], ["Bob", 5]], ["name", "age"])
>>> observation = Observation("my metrics")
>>> observed_df = df.observe(observation, count(lit(1)).alias("count"), max(col("age")))
>>> observed_df.count()
2
>>> observation.get
{'count': 2, 'max(age)': 5}