DataFrameStatFunctions class

Functionality for statistic functions with DataFrame.

Supports Spark Connect

Syntax

Python
DataFrame.stat

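Methods

Method	Description
`approxQuantile(col, probabilities, relativeError)`	Calculates the approximate quantiles of numerical columns of a DataFrame.
`corr(col1, col2, method)`	Calculates the correlation of two columns as a double value. Currently only the Pearson correlation coefficient is supported.
`cov(col1, col2)`	Calculates the sample covariance of the given columns as a double value.
`crosstab(col1, col2)`	Computes a pair-wise frequency table of the given columns.
`freqItems(cols, support)`	Finds frequent items for columns. False positives are possible.
`sampleBy(col, fractions, seed)`	Returns a stratified sample without replacement, based on the fraction given for each stratum.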

Examples

Approximate quantiles

Python
data = [(1,), (2,), (3,), (4,), (5,)]
df = spark.createDataFrame(data, ["values"])
df.stat.approxQuantile("values", [0.0, 0.5, 1.0], 0.05)

Output
[1.0, 3.0, 5.0]
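
To make the role of `relativeError` more concrete, the following plain-Python sketch lists which values would be acceptable answers for the example above. It relies on the guarantee stated in the PySpark documentation for `approxQuantile`: for N rows, probability p, and error err, the returned value x satisfies floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The helper name `acceptable_values` is hypothetical, and treating ranks as 1-based is an assumption of this sketch. It does not use Spark.

```python
import math

def acceptable_values(values, p, err):
    """Return the values whose rank falls inside the documented error window.

    Assumption of this sketch: ranks are 1-based positions in sorted order.
    """
    ordered = sorted(values)
    n = len(ordered)
    # Clamp the rank window to the valid range 1..N.
    low = max(1, math.floor((p - err) * n))
    high = min(n, math.ceil((p + err) * n))
    return ordered[low - 1:high]

values = [1, 2, 3, 4, 5]
for p in (0.0, 0.5, 1.0):
    print(p, acceptable_values(values, p, 0.05))
```

Under these assumptions the windows are `[1]`, `[2, 3]`, and `[4, 5]`, and each value in the output above (1.0, 3.0, 5.0) falls inside its window. A `relativeError` of 0 narrows the window to the exact quantile, at a potentially much higher computation cost.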

Correlation

Python
df = spark.createDataFrame([(1, 12), (10, 1), (19, 8)], ["c1", "c2"])
df.stat.corr("c1", "c2")

Output
-0.3592106040535498
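
As a cross-check, the same Pearson correlation coefficient can be computed by hand in plain Python. This is only an illustrative sketch of the formula, not how Spark computes it internally; the helper name `pearson_corr` is hypothetical.

```python
import math

def pearson_corr(xs, ys):
    """Pearson correlation: sum of co-deviations over the product of deviation norms."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator

r = pearson_corr([1, 10, 19], [12, 1, 8])
print(round(r, 6))  # -0.359211
```

Here the numerator is -36 and the denominator is sqrt(162 * 62), which agrees with the value returned by `corr` up to rounding.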

Covariance

Python
df = spark.createDataFrame([(1, 12), (10, 1), (19, 8)], ["c1", "c2"])
df.stat.cov("c1", "c2")

Output
-18.0
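
The result is the sample covariance, so the sum of co-deviations is divided by n - 1 rather than n. A minimal plain-Python sketch of that calculation (the helper name `sample_cov` is hypothetical, and this is not Spark's implementation):

```python
def sample_cov(xs, ys):
    """Sample covariance: co-deviations summed, then divided by (n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1)

print(sample_cov([1, 10, 19], [12, 1, 8]))  # -18.0
```

With means 10 and 7, the co-deviations are -45, 0, and 9, which sum to -36; dividing by n - 1 = 2 gives -18.0.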

Cross-tabulation

Python
df = spark.createDataFrame([(1, 11), (1, 11), (3, 10), (4, 8), (4, 8)], ["c1", "c2"])
df.stat.crosstab("c1", "c2").sort("c1_c2").show()

Output
+-----+---+---+---+
|c1_c2| 10| 11|  8|
+-----+---+---+---+
|    1|  0|  2|  0|
|    3|  1|  0|  0|
|    4|  0|  0|  2|
+-----+---+---+---+
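
Each cell of the table is the number of rows that contain that (`c1`, `c2`) pair. The following plain-Python sketch reproduces the same counts with `collections.Counter`; it only illustrates the idea and does not use Spark. The variable names are chosen for this sketch.

```python
from collections import Counter

rows = [(1, 11), (1, 11), (3, 10), (4, 8), (4, 8)]

# Count how often each (c1, c2) pair occurs.
counts = Counter(rows)

# Distinct values of each column become the row and column labels.
c1_values = sorted({c1 for c1, _ in rows})
c2_values = sorted({c2 for _, c2 in rows})

# Missing pairs count as 0 because Counter returns 0 for unseen keys.
table = {c1: {c2: counts[(c1, c2)] for c2 in c2_values} for c1 in c1_values}

for c1, row in table.items():
    print(c1, row)
```

This sketch orders the `c2` labels numerically (8, 10, 11), so the column order differs from the Spark output above even though the counts are the same.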

Frequent items

Python
from pyspark.sql import functions as sf

df = spark.createDataFrame([(1, 11), (1, 11), (3, 10), (4, 8), (4, 8)], ["c1", "c2"])
df2 = df.stat.freqItems(["c1", "c2"])
df2.select([sf.sort_array(c).alias(c) for c in df2.columns]).show()

Output
+------------+------------+
|c1_freqItems|c2_freqItems|
+------------+------------+
|   [1, 3, 4]| [8, 10, 11]|
+------------+------------+
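
To see why false positives can occur, consider a simplified counter-based sketch in the style of the Misra-Gries frequent-elements algorithm. This is a stand-in chosen for illustration and is not Spark's actual implementation; the helper name `frequent_candidates` and the choice of `int(1 / support)` counters are assumptions of this sketch. It keeps every item whose frequency exceeds `support`, but it may also keep items that do not.

```python
def frequent_candidates(items, support):
    """Return a superset of the items whose frequency exceeds `support`.

    Simplified Misra-Gries-style sketch (an assumption for illustration,
    not Spark's implementation): a bounded set of counters is maintained,
    and when it is full, every counter is decremented instead of adding
    the new item.
    """
    capacity = int(1 / support)
    counters = {}
    for item in items:
        if item in counters:
            counters[item] += 1
        elif len(counters) < capacity:
            counters[item] = 1
        else:
            # No free counter: decrement all and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)

c1 = [1, 1, 3, 4, 4]
print(sorted(frequent_candidates(c1, 0.01)))
print(sorted(frequent_candidates(c1, 0.5)))
```

With a support of 0.01 the sketch has room for every distinct value, so it returns 1, 3, and 4. With a support of 0.5 it returns 1 and 4, although neither value makes up half of the rows: both are false positives.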

Stratified sample

Python
from pyspark.sql import functions as sf

dataset = spark.range(0, 100, 1, 5).select((sf.col("id") % 3).alias("key"))
dataset.stat.sampleBy("key", fractions={0: 0.1, 1: 0.2}, seed=0).groupBy("key").count().orderBy("key").show()

Output
+---+-----+
|key|count|
+---+-----+
|  0|    4|
|  1|    9|
+---+-----+
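
Key 2 does not appear in the result because a stratum that is missing from `fractions` is treated as having fraction 0. The following plain-Python sketch illustrates stratified sampling without replacement as an independent keep-or-drop decision per row. It is an approximation for illustration only: the helper name `sample_by` is hypothetical, it uses Python's `random` module rather than Spark's random number generator, and its counts are not expected to match the Spark output above.

```python
import random

def sample_by(rows, key_of, fractions, seed):
    """Keep each row with the probability assigned to its stratum.

    Strata missing from `fractions` get probability 0 and are never kept.
    Each row is considered once, so the sample is without replacement.
    """
    rng = random.Random(seed)
    sampled = []
    for row in rows:
        fraction = fractions.get(key_of(row), 0.0)
        if rng.random() < fraction:
            sampled.append(row)
    return sampled

keys = [i % 3 for i in range(100)]
sample = sample_by(keys, lambda k: k, {0: 0.1, 1: 0.2}, seed=0)
print(sorted(set(sample)))  # never contains key 2
```

Because every row is decided independently, the number of rows kept per stratum only approximates `fraction * stratum size`, which is consistent with the counts of 4 and 9 in the Spark output above for strata of roughly 33 rows each.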
