Migrate from SparkR to sparklyr

SparkR was developed as part of Apache Spark, and its design mirrors Spark's Scala and Python APIs, which can feel less intuitive to R practitioners. In addition, SparkR is deprecated in Spark 4.0.

In contrast, sparklyr is focused on providing a more R-friendly experience. It leverages dplyr syntax, which is familiar to users of the tidyverse, with verbs like select(), filter(), and mutate() for DataFrame operations.

sparklyr is the recommended R package for working with Apache Spark. This page explains differences between SparkR and sparklyr across Spark APIs, and provides information about code migration.

Environment setup

Installation

If you are in the Databricks workspace, no installation is required. Load sparklyr with library(sparklyr). To install sparklyr locally outside of Databricks, see Get Started.

Connecting to Spark

Connect to Spark with sparklyr in the Databricks workspace or locally using Databricks Connect:

Workspace:

R
library(sparklyr)
sc <- spark_connect(method = "databricks")

Databricks Connect:

R
sc <- spark_connect(method = "databricks_connect")

For more details and an extended tutorial on Databricks Connect with sparklyr, see Getting Started.

Reading and writing data

sparklyr has a family of spark_read_*() and spark_write_*() functions to load and save data, unlike SparkR's generic read.df() and write.df() functions. There are also unique functions to create Spark DataFrames or Spark SQL temporary views from R data frames in memory.

| Task | SparkR | sparklyr |
|---|---|---|
| Copy data to Spark | createDataFrame() | copy_to() |
| Create temporary view | createOrReplaceTempView() | Use invoke() with the method directly |
| Write data to table | saveAsTable() | spark_write_table() |
| Write data to a specified format | write.df() | spark_write_<format>() |
| Read data from table | tableToDF() | tbl() or spark_read_table() |
| Read data from a specified format | read.df() | spark_read_<format>() |

Loading data

To convert an R data frame to a Spark DataFrame, or to create a temporary view from a DataFrame so you can run SQL against it:

R
mtcars_df <- createDataFrame(mtcars)

In sparklyr, copy_to() creates a temporary view using the specified name. You can reference that name when using SQL directly (for example, with sdf_sql()). copy_to() also caches the data when its memory parameter is TRUE (the default).
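A minimal sparklyr sketch of the same step, assuming an active connection sc:

```r
library(sparklyr)

# Copy an R data frame to Spark; the name also registers a temporary view
mtcars_tbl <- copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)
```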

Creating views

The following code examples show how temporary views are created:

R
createOrReplaceTempView(mtcars_df, "mtcars_tmp_view")
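In sparklyr, copy_to() already registers a view under the given name. For an existing sparklyr DataFrame, you can call the Spark method directly through invoke(); a sketch, assuming mtcars_tbl was created with copy_to():

```r
library(sparklyr)

# Register a temporary view by invoking the underlying Spark method
mtcars_tbl |>
  spark_dataframe() |>
  invoke("createOrReplaceTempView", "mtcars_tmp_view")
```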

Writing data

The following code examples show how data is written:

R
# Save a DataFrame to Unity Catalog
saveAsTable(
  mtcars_df,
  tableName = "<catalog>.<schema>.<table>",
  mode = "overwrite"
)

# Save a DataFrame to local filesystem using Delta format
write.df(
  mtcars_df,
  path = "file:/<path/to/save/delta/mtcars>",
  source = "delta",
  mode = "overwrite"
)

Reading data

The following code examples show how data is read:

R
# Load a Unity Catalog table as a DataFrame
tableToDF("<catalog>.<schema>.<table>")

# Load csv file into a DataFrame
read.df(
  path = "file:/<path/to/read/csv/data.csv>",
  source = "csv",
  header = TRUE,
  inferSchema = TRUE
)

# Load Delta from local filesystem as a DataFrame
read.df(
  path = "file:/<path/to/read/delta/mtcars>",
  source = "delta"
)

# Load data from a table using SQL - Databricks recommends using `tableToDF`
sql("SELECT * FROM <catalog>.<schema>.<table>")
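The sparklyr counterparts use tbl() or the spark_read_<format>() family. A sketch, assuming an active connection sc:

```r
library(sparklyr)

# Load a Unity Catalog table as a (lazy) DataFrame
mtcars_tbl <- tbl(sc, "<catalog>.<schema>.<table>")

# Load csv file into a DataFrame
spark_read_csv(
  sc,
  path = "file:/<path/to/read/csv/data.csv>",
  header = TRUE,
  infer_schema = TRUE
)

# Load Delta from local filesystem as a DataFrame
spark_read_delta(sc, path = "file:/<path/to/read/delta/mtcars>")

# Run SQL directly and get a DataFrame back
sdf_sql(sc, "SELECT * FROM <catalog>.<schema>.<table>")
```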

Processing data

Select and filter

R
# Select specific columns
select(mtcars_df, "mpg", "cyl", "hp")

# Filter rows where mpg > 20
filter(mtcars_df, mtcars_df$mpg > 20)
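With sparklyr, the same operations use dplyr verbs. A sketch, assuming dplyr is attached and mtcars_tbl was created with copy_to():

```r
library(sparklyr)
library(dplyr)

# Select specific columns
mtcars_tbl |> select(mpg, cyl, hp)

# Filter rows where mpg > 20
mtcars_tbl |> filter(mpg > 20)
```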

Add columns

R
# Add a new column 'power_to_weight' (hp divided by wt)
withColumn(mtcars_df, "power_to_weight", mtcars_df$hp / mtcars_df$wt)
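The sparklyr equivalent uses dplyr::mutate(); a sketch, assuming mtcars_tbl was created with copy_to():

```r
library(sparklyr)
library(dplyr)

# Add a new column 'power_to_weight' (hp divided by wt)
mtcars_tbl |> mutate(power_to_weight = hp / wt)
```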

Grouping and aggregation

R
# Calculate average mpg and hp by number of cylinders
mtcars_df |>
  groupBy("cyl") |>
  summarize(
    avg_mpg = avg(mtcars_df$mpg),
    avg_hp = avg(mtcars_df$hp)
  )
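With sparklyr, grouping and aggregation use dplyr verbs; mean() is translated to Spark's avg under the hood. A sketch, assuming mtcars_tbl was created with copy_to():

```r
library(sparklyr)
library(dplyr)

# Calculate average mpg and hp by number of cylinders
mtcars_tbl |>
  group_by(cyl) |>
  summarise(
    avg_mpg = mean(mpg, na.rm = TRUE),
    avg_hp = mean(hp, na.rm = TRUE)
  )
```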

Joins

Suppose we have another dataset with cylinder labels that we want to join to mtcars.

R
# Create another DataFrame with cylinder labels
cylinders <- data.frame(
  cyl = c(4, 6, 8),
  cyl_label = c("Four", "Six", "Eight")
)
cylinders_df <- createDataFrame(cylinders)

# Join mtcars_df with cylinders_df
join(
  x = mtcars_df,
  y = cylinders_df,
  mtcars_df$cyl == cylinders_df$cyl,
  joinType = "inner"
)
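The sparklyr equivalent uses the dplyr join verbs. A sketch, assuming an active connection sc and mtcars_tbl created with copy_to():

```r
library(sparklyr)
library(dplyr)

# Create another DataFrame with cylinder labels
cylinders <- data.frame(
  cyl = c(4, 6, 8),
  cyl_label = c("Four", "Six", "Eight")
)
cylinders_tbl <- copy_to(sc, cylinders, name = "cylinders_tmp", overwrite = TRUE)

# Inner join on the shared 'cyl' column
inner_join(mtcars_tbl, cylinders_tbl, by = "cyl")
```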

User defined functions (UDFs)

To create a custom function for categorization:

R
# Define the custom function
categorize_hp <- function(df) {
  # A real-world example would use case_when() with mutate()
  df$hp_category <- ifelse(df$hp > 150, "High", "Low")
  df
}

SparkR requires defining the output schema explicitly before applying a function:

R
# Define the schema for the output DataFrame
schema <- structType(
  structField("mpg", "double"),
  structField("cyl", "double"),
  structField("disp", "double"),
  structField("hp", "double"),
  structField("drat", "double"),
  structField("wt", "double"),
  structField("qsec", "double"),
  structField("vs", "double"),
  structField("am", "double"),
  structField("gear", "double"),
  structField("carb", "double"),
  structField("hp_category", "string")
)

# Apply the function across partitions
dapply(
  mtcars_df,
  func = categorize_hp,
  schema = schema
)

# Apply the same function to each group of a DataFrame. Note that the schema is still required.
gapply(
  mtcars_df,
  cols = "hp",
  func = categorize_hp,
  schema = schema
)
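With sparklyr, spark_apply() runs an R function over each partition (or each group, with group_by) and infers the output schema by default, so no explicit structType is needed. A sketch, assuming mtcars_tbl was created with copy_to() and categorize_hp is defined as above:

```r
library(sparklyr)

# Apply the function across partitions; the output schema is inferred
spark_apply(mtcars_tbl, categorize_hp)

# Apply the same function to each group of a DataFrame
spark_apply(mtcars_tbl, categorize_hp, group_by = "hp")
```

Schema inference adds overhead; for large jobs you can pass the columns argument explicitly.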

spark.lapply() vs spark_apply()

In SparkR, spark.lapply() operates on R lists rather than DataFrames. There's no direct equivalent in sparklyr, but you can achieve similar behavior with spark_apply() by working with a DataFrame that includes unique identifiers and grouping by those IDs. In some cases, row-wise operations can also provide comparable functionality. For more information about spark_apply(), see Distributing R Computations.

R
# Define a list of integers
numbers <- list(1, 2, 3, 4, 5)

# Define a function to apply
square <- function(x) {
  x * x
}

# Apply the function over list using Spark
spark.lapply(numbers, square)
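A rough sparklyr analogue of the example above, using spark_apply() with an ID column and group_by as described earlier. A sketch, assuming an active connection sc:

```r
library(sparklyr)

# Put the inputs in a DataFrame with a unique ID per element
numbers_tbl <- copy_to(
  sc,
  data.frame(id = 1:5, x = c(1, 2, 3, 4, 5)),
  overwrite = TRUE
)

# Square each element, one group per row
spark_apply(
  numbers_tbl,
  function(df) data.frame(id = df$id, result = df$x * df$x),
  group_by = "id"
)
```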

Machine learning

Full SparkR and sparklyr examples for machine learning are in the Spark ML Guide and sparklyr reference.

note

If you are not using Spark MLlib, Databricks recommends using UDFs to train with the library of your choice (for example, xgboost).

Linear regression

R
# Select features
training_df <- select(mtcars_df, "mpg", "hp", "wt")

# Fit the model using Generalized Linear Model (GLM)
linear_model <- spark.glm(training_df, mpg ~ hp + wt, family = "gaussian")

# View model summary
summary(linear_model)
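The sparklyr equivalent is ml_generalized_linear_regression() (ml_linear_regression() also exists for the plain linear case). A sketch, assuming mtcars_tbl was created with copy_to() and dplyr is attached:

```r
library(sparklyr)
library(dplyr)

# Select features
training_tbl <- mtcars_tbl |> select(mpg, hp, wt)

# Fit the model using a Generalized Linear Model (GLM)
linear_model <- ml_generalized_linear_regression(
  training_tbl,
  mpg ~ hp + wt,
  family = "gaussian"
)

# View model summary
summary(linear_model)
```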

K-means clustering

R
# Apply KMeans clustering with 3 clusters using mpg and hp as features
kmeans_model <- spark.kmeans(mtcars_df, ~ mpg + hp, k = 3)

# Get cluster predictions
predict(kmeans_model, mtcars_df)
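The sparklyr equivalent is ml_kmeans(), with ml_predict() for scoring. A sketch, assuming mtcars_tbl was created with copy_to():

```r
library(sparklyr)

# Apply KMeans clustering with 3 clusters using mpg and hp as features
kmeans_model <- ml_kmeans(mtcars_tbl, ~ mpg + hp, k = 3)

# Get cluster predictions
ml_predict(kmeans_model, mtcars_tbl)
```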

Performance and optimization

Collecting

Both SparkR and sparklyr use collect() to convert Spark DataFrames to R data frames. Only collect small amounts of data back to R data frames, or the Spark driver will run out of memory.

To prevent out of memory errors, SparkR has built-in optimizations in Databricks Runtime that help collect data or execute user-defined functions.

To ensure optimal performance with sparklyr for collecting data and UDFs on Databricks Runtime versions below 14.3 LTS, load the arrow package:

R
library(arrow)

In-memory partitioning

R
# Repartition the SparkDataFrame based on 'cyl' column
repartition(mtcars_df, col = mtcars_df$cyl)

# Repartition the SparkDataFrame to number of partitions
repartition(mtcars_df, numPartitions = 10)

# Coalesce the DataFrame to number of partitions
coalesce(mtcars_df, numPartitions = 1)

# Get number of partitions
getNumPartitions(mtcars_df)
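The sparklyr counterparts live in the sdf_*() family. A sketch, assuming mtcars_tbl was created with copy_to():

```r
library(sparklyr)

# Repartition the DataFrame based on the 'cyl' column
sdf_repartition(mtcars_tbl, partition_by = "cyl")

# Repartition the DataFrame to a number of partitions
sdf_repartition(mtcars_tbl, partitions = 10)

# Coalesce the DataFrame to a number of partitions
sdf_coalesce(mtcars_tbl, partitions = 1)

# Get number of partitions
sdf_num_partitions(mtcars_tbl)
```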

Caching

R
# Cache the DataFrame in memory
cache(mtcars_df)
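In sparklyr, you can cache a registered view by name with tbl_cache(), or persist a DataFrame directly with sdf_persist(). A sketch, assuming an active connection sc and a view named "mtcars_tbl" created with copy_to():

```r
library(sparklyr)

# Cache a temporary view in memory by name
tbl_cache(sc, "mtcars_tbl")

# Or persist a DataFrame with an explicit storage level
sdf_persist(mtcars_tbl, storage.level = "MEMORY_ONLY")
```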