Observation

A class to observe named metrics on a DataFrame.

Metrics are aggregation expressions applied to the DataFrame while it is being processed by an action. An Observation instance collects the metrics while the first action is executed. Subsequent actions do not modify the metrics returned by Observation.get. Retrieval of the metric via Observation.get blocks until the first action has finished and metrics become available.

Syntax

Python
from pyspark.sql import Observation

observation = Observation(name=<name>)

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| name | str, optional | The name of the Observation and the metric. Defaults to a random UUID string. |

Properties

| Property | Description |
| --- | --- |
| get | Returns the observed metrics as a dictionary. Waits until the observed dataset finishes its first action. Only the result of the first action is available. |

Notes

This class does not support streaming datasets.

The metrics columns must either contain a literal (for example, lit(42)), or must contain one or more aggregate functions (for example, sum(a) or sum(a + b) + avg(c) - lit(1)). Expressions that contain references to the input DataFrame's columns must always be wrapped in an aggregate function.

Examples

Python
from pyspark.sql.functions import col, count, lit, max
from pyspark.sql import Observation

df = spark.createDataFrame([["Alice", 2], ["Bob", 5]], ["name", "age"])
observation = Observation("my metrics")
observed_df = df.observe(observation, count(lit(1)).alias("count"), max(col("age")))
observed_df.count()
Output
2
Python
observation.get
Output
{'count': 2, 'max(age)': 5}