Use sparklyr in Databricks R notebooks

This notebook shows how to use sparklyr in Databricks notebooks.

Load sparklyr package

library(sparklyr)

Create a sparklyr connection

Use "databricks" as the connection method in spark_connect(). No additional parameters to spark_connect() are required. You do not need to call spark_install() as Spark is already installed on the Databricks cluster.

Note that sc is a special name for the sparklyr connection: when you use that variable name, the notebook automatically displays Spark progress bars and built-in Spark UI viewers.

sc <- spark_connect(method = "databricks")
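As a quick sanity check (not part of the original notebook), you can confirm that the connection is live by asking the cluster for its Spark version:

# Returns the Spark version running on the attached cluster
spark_version(sc)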

Use sparklyr and dplyr APIs

After setting up the sparklyr connection, you can use the sparklyr API. You can combine sparklyr with dplyr or with Spark MLlib.
If you use an extension package that includes third-party JARs, you may need to install those JARs as libraries in your workspace.

library(dplyr)
iris_tbl <- copy_to(sc, iris)
src_tbls(sc)
iris_tbl %>% count()
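Since MLlib is mentioned above, here is a minimal sketch (not in the original notebook) of fitting a Spark ML model on the same Spark DataFrame with sparklyr's ml_* functions; the model and formula are illustrative choices:

# Illustrative example: fit a linear regression with Spark MLlib via sparklyr
fit <- ml_linear_regression(iris_tbl, Sepal_Length ~ Sepal_Width + Petal_Length)
summary(fit)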

Aggregate and visualize data

# Change the default plot height
options(repr.plot.height = 600)
iris_summary <- iris_tbl %>%
  mutate(Sepal_Width = ROUND(Sepal_Width * 2) / 2) %>% # Bucketize Sepal_Width; ROUND is passed through to Spark SQL
  group_by(Species, Sepal_Width) %>%
  summarize(count = n(), Sepal_Length_Mean = mean(Sepal_Length), stdev = sd(Sepal_Length)) %>%
  collect()
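Because dplyr verbs on a Spark table are translated to Spark SQL lazily, functions that sparklyr does not recognize, such as ROUND above, are passed through to Spark SQL verbatim. To inspect the SQL a query generates, you can use dplyr's show_query() (a quick aside, not in the original notebook):

# Inspect the Spark SQL generated for the bucketizing step
iris_tbl %>%
  mutate(Sepal_Width = ROUND(Sepal_Width * 2) / 2) %>%
  show_query()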
library(ggplot2)

ggplot(iris_summary, aes(Sepal_Width, Sepal_Length_Mean, color = Species)) + 
  geom_line(size = 1.2) +
  geom_errorbar(aes(ymin = Sepal_Length_Mean - stdev, ymax = Sepal_Length_Mean + stdev), width = 0.05) +
  geom_text(aes(label = count), vjust = -0.2, hjust = 1.2, color = "black") +
  theme(legend.position = "top")
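When you are done, you can close the sparklyr connection (optional on Databricks, where the cluster manages the Spark lifecycle):

# Disconnect the sparklyr session
spark_disconnect(sc)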