crosstab

crosstab computes a pairwise frequency table, also known as a contingency table, for two given columns.

Note:

Each column should have fewer than 1e4 distinct values. At most 1e6 non-zero pair frequencies are returned. Pairs that never occur have a count of zero.

Syntax:

  • crosstab(DataFrame, col1, col2)

Parameters:

  • DataFrame: Any SparkR DataFrame
  • col1: String, any column in DataFrame
  • col2: String, any column in DataFrame

Output:

  • Local R Data Frame

# Create SparkR DataFrame
df <- createDataFrame(sqlContext, mtcars)
head(df)
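
The limits in the note above can be checked on the local data before it is moved to Spark. The following is a minimal sketch in base R, assuming the data still fits in local memory (as mtcars does); the variable name n_distinct is illustrative only and is not part of SparkR.

```r
# Count distinct values per column on the local data frame (base R only).
# n_distinct is a hypothetical name used for illustration.
n_distinct <- sapply(mtcars[c("cyl", "gear")], function(x) length(unique(x)))
print(n_distinct)

# crosstab expects fewer than 1e4 distinct values in each column
stopifnot(all(n_distinct < 1e4))
```

For data that only exists in Spark, the same check would have to be done with distributed operations rather than on a local copy.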

SparkR’s crosstab is similar to the table function in base R. SparkR masks the table function with its own version, which converts an existing Spark SQL table into a DataFrame, so calling table in a SparkR session will not return a contingency table as you might expect. The base R version remains available as base::table.

# Create contingency table with df$cyl and df$gear
# Note that a local R data frame is returned
crosstab(df, "cyl", "gear")
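
For comparison, the same counts can be produced locally with base R. This is a sketch on the local mtcars data frame, not on the SparkR DataFrame; the name local_ct is illustrative, and the exact layout of the crosstab result may differ from what base R prints.

```r
# base::table is called with its namespace so the example does not depend on
# whether SparkR has masked table() in the current session.
local_ct <- base::table(cyl = mtcars$cyl, gear = mtcars$gear)
print(local_ct)

# Pairs that never occur (8 cylinders with 4 gears) are kept with a count of 0
local_ct["8", "4"]
```

Qualifying the call with base:: is a deliberate choice here: it keeps the example working whether or not SparkR is attached.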