kde

Generates a Kernel Density Estimate (KDE) plot using Gaussian kernels.

In statistics, kernel density estimation is a non-parametric way to estimate the probability density function (PDF) of a random variable. This function uses Gaussian kernels and includes automatic bandwidth determination.
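To make the idea concrete, the following is a minimal, standard-library sketch of a Gaussian kernel density estimate. It is illustrative only and is not how PySpark computes the plot. The helper name `gaussian_kde`, the sample values, and the treatment of the bandwidth as the standard deviation of each Gaussian kernel are assumptions made for this example.

```python
import math

def gaussian_kde(samples, bandwidth, points):
    """Evaluate a Gaussian kernel density estimate at each point.

    Hypothetical helper: the estimate at x is the average of Gaussian
    bumps centred on each sample, each with standard deviation `bandwidth`.
    """
    n = len(samples)
    # Normalising constant so the estimate integrates to 1.
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    densities = []
    for x in points:
        total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        densities.append(norm * total)
    return densities

# Hypothetical sample values and an evaluation grid from 4.0 to 8.0.
lengths = [5.1, 4.9, 7.0, 6.4, 5.9]
grid = [4.0 + 0.5 * i for i in range(9)]
density = gaussian_kde(lengths, bandwidth=0.3, points=grid)

for x, d in zip(grid, density):
    print(f"{x:.1f}  {d:.4f}")
```

A smaller bandwidth gives a spikier curve that follows individual samples, while a larger one smooths them together.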

Syntax

kde(bw_method, column=None, ind=None, **kwargs)

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `bw_method` | int or float | The method used to calculate the estimator bandwidth. See KernelDensity in PySpark for more information. |
| `column` | str or list of str, optional | Column name or list of names to use for creating the KDE plot. If None (default), all numeric columns are used. |
| `ind` | list of float, NumPy array, or int, optional | Evaluation points for the estimated PDF. If None (default), 1000 equally spaced points are used. If a NumPy array, the KDE is evaluated at those points. If an integer, that many equally spaced points are used. |
| `**kwargs` | optional | Additional keyword arguments. |
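The three forms of `ind` can be sketched as follows. This is a standalone illustration, not PySpark's code: the helper name `evaluation_points` is hypothetical, and extending the grid by half the data range on each side is an assumption modelled on how pandas chooses its default grid.

```python
def evaluation_points(samples, ind=None, default_count=1000):
    """Return the points at which a density would be evaluated.

    Hypothetical helper mirroring the three documented forms of `ind`.
    """
    # Explicit points (list or array-like): use them as given.
    if ind is not None and not isinstance(ind, int):
        return [float(x) for x in ind]

    # None -> default number of points; int -> that many points.
    count = default_count if ind is None else ind

    lo, hi = min(samples), max(samples)
    span = hi - lo
    # Assumption: pad the grid by half the data range on each side.
    start, stop = lo - 0.5 * span, hi + 0.5 * span

    if count == 1:
        return [start]
    step = (stop - start) / (count - 1)
    return [start + i * step for i in range(count)]

print(evaluation_points([0.0, 10.0], ind=5))
print(len(evaluation_points([0.0, 10.0])))
print(evaluation_points([0.0, 10.0], ind=[2, 4, 6]))
```

Passing an integer is usually the simplest way to trade plot smoothness against computation, since each evaluation point requires a pass over the data.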

Returns

plotly.graph_objs.Figure

Examples

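Python
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session.
spark = SparkSession.builder.getOrCreate()

# Build a small DataFrame with three numeric columns.
data = [(5.1, 3.5, 0), (4.9, 3.0, 0), (7.0, 3.2, 1), (6.4, 3.2, 1), (5.9, 3.0, 2)]
columns = ["length", "width", "species"]
df = spark.createDataFrame(data, columns)

# No column is given, so all numeric columns are plotted.
# bw_method=0.3 sets the bandwidth; ind=100 evaluates the KDE at 100 equally spaced points.
df.plot.kde(bw_method=0.3, ind=100)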