Databricks for R developers
This section provides a guide to developing notebooks and jobs in Databricks using the R language.
A basic workflow for getting started is:
Import code: Either import your own code from files or Git repos, or try a tutorial listed below. Databricks recommends learning to use interactive Databricks notebooks.
Run your code on a cluster: Either create a cluster of your own, or ensure you have permissions to use a shared cluster. Attach your notebook to the cluster, and run the notebook.
Beyond this, you can branch out into more specific topics:
Work with larger data sets using Apache Spark
Automate your workload as a job
Use machine learning to analyze your data
Tutorials
The following tutorials provide example code and notebooks to learn about common workflows. See Import a notebook for instructions on importing notebook examples into your workspace.
Reference
The following subsections list key features and tips to help you begin developing in Databricks with R.
Databricks supports two APIs that provide an R interface to Apache Spark: SparkR and sparklyr.
SparkR
These articles provide an introduction and reference for SparkR. SparkR is an R interface to Apache Spark that provides a distributed data frame implementation. SparkR supports operations like selection, filtering, and aggregation (similar to R data frames) but on large datasets.
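As a rough illustration of the selection, filtering, and aggregation operations mentioned above, the following sketch uses SparkR on R's built-in faithful dataset. It assumes the SparkR package is available and that the code runs where a Spark session can be obtained (for example, a notebook attached to a cluster). The variable names are illustrative only.

```r
library(SparkR)

# Attach to the existing Spark session, or start one if none exists.
sparkR.session()

# Convert the built-in `faithful` data.frame into a distributed SparkDataFrame.
df <- createDataFrame(faithful)

# Selection: keep only the `eruptions` column.
eruptions_only <- select(df, "eruptions")

# Filtering: keep rows where the waiting time is under 50 minutes.
short_waits <- filter(df, df$waiting < 50)

# Aggregation: count the observations for each waiting time.
waiting_counts <- summarize(groupBy(df, df$waiting), count = n(df$waiting))

# head() brings only the first rows back to the driver as an R data.frame.
print(head(eruptions_only))
print(head(short_waits))
print(head(arrange(waiting_counts, desc(waiting_counts$count))))
```

Each call returns another SparkDataFrame, so the work stays distributed until head or collect pulls results back into the R session.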
sparklyr
This article provides an introduction to sparklyr. sparklyr is an R interface to Apache Spark that provides functionality similar to dplyr, broom, and DBI.
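The dplyr-style and DBI-style interfaces described above might look like the following sketch. It assumes the sparklyr and dplyr packages are installed and that the code runs in a Databricks notebook, where spark_connect(method = "databricks") attaches to the cluster's Spark session. The table name mtcars_example is a hypothetical name chosen for this example.

```r
library(sparklyr)
library(dplyr)

# Connect to the Spark session of the attached Databricks cluster.
sc <- spark_connect(method = "databricks")

# Copy a small local data.frame to Spark (mtcars_example is a hypothetical table name).
cars_tbl <- copy_to(sc, mtcars, name = "mtcars_example", overwrite = TRUE)

# dplyr verbs are translated to Spark SQL; collect() returns a local tibble.
result <- cars_tbl %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()
print(result)

# The same connection also works as a DBI connection for plain SQL.
counts <- DBI::dbGetQuery(
  sc,
  "SELECT cyl, COUNT(*) AS n FROM mtcars_example GROUP BY cyl"
)
print(counts)
```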
Comparing SparkR and sparklyr
This article explains key similarities and differences between SparkR and sparklyr.
Work with DataFrames and tables with SparkR and sparklyr
This article describes how to use R, SparkR, sparklyr, and dplyr to work with R data.frames, Spark DataFrames, and Spark tables in Databricks.
Manage code with notebooks and Databricks Git folders
Databricks notebooks support R. These notebooks provide functionality similar to that of Jupyter, but with additions such as built-in visualizations using big data, Apache Spark integrations for debugging and performance monitoring, and MLflow integrations for tracking machine learning experiments. Get started by importing a notebook. Once you have access to a cluster, you can attach a notebook to the cluster and run the notebook.
Databricks Git folders allow users to synchronize notebooks and other files with Git repositories. Databricks Git folders help with code versioning and collaboration, and they can simplify importing a full repository of code into Databricks, viewing past notebook versions, and integrating with IDE development. Get started by cloning a remote Git repository. You can then open or create notebooks with the repository clone, attach the notebook to a cluster, and run the notebook.
Clusters
Databricks Compute provides compute management for both single nodes and large clusters. You can customize cluster hardware and libraries according to your needs. Data scientists will generally begin work either by creating a cluster or using an existing shared cluster. Once you have access to a cluster, you can attach a notebook to the cluster or run a job on the cluster.
For small workloads that require only a single node, data scientists can use single node compute for cost savings.
For detailed tips, see Compute configuration best practices.
Administrators can set up cluster policies to simplify and guide cluster creation.
Single node R and distributed R
Databricks clusters consist of an Apache Spark driver node and zero or more Spark worker (also known as executor) nodes. The driver node maintains attached notebook state, maintains the SparkContext, interprets notebook and library commands, and runs the Spark master that coordinates with Spark executors. Worker nodes run the Spark executors, one Spark executor per worker node.
A single node cluster has one driver node and no worker nodes, with Spark running in local mode to support access to tables managed by Databricks. Single node clusters support RStudio, notebooks, and libraries, and are useful for R projects that don’t depend on Spark for big data or parallel processing. See Single-node or multi-node compute.
For data sizes that R struggles to process (many gigabytes or petabytes), you should use multiple-node or distributed clusters instead. Distributed clusters have one driver node and one or more worker nodes. Distributed clusters support not only RStudio, notebooks, and libraries, but also R packages such as SparkR and sparklyr, which are specifically designed to use distributed clusters through the SparkContext. These packages provide familiar SQL and DataFrame APIs, which enable assigning and running various Spark tasks and commands in parallel across worker nodes. To learn more about sparklyr and SparkR, see Comparing SparkR and sparklyr.
Some SparkR and sparklyr functions that take particular advantage of distributing related work across worker nodes include the following:
sparklyr::spark_apply: Runs arbitrary R code at scale within a cluster. This is especially useful for functionality that is available only in R, or for R packages that are not available in Apache Spark or other Spark packages.
SparkR::dapply: Applies the specified function to each partition of a SparkDataFrame.
SparkR::dapplyCollect: Applies the specified function to each partition of a SparkDataFrame and collects the results back to R as a data.frame.
SparkR::gapply: Groups a SparkDataFrame by using the specified columns and applies the specified R function to each group.
SparkR::gapplyCollect: Groups a SparkDataFrame by using the specified columns, applies the specified R function to each group, and collects the result back to R as a data.frame.
SparkR::spark.lapply: Runs the specified function over a list of elements, distributing the computations with Spark.
For examples, see the notebook Distributed R: User Defined Functions in Spark.
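As a minimal sketch of how some of the SparkR functions listed above are called, the following example uses the built-in mtcars dataset. It assumes SparkR is available and a Spark session can be obtained. The derived column kpl and the conversion factor are illustrative choices, not part of any Databricks API.

```r
library(SparkR)

# Attach to the existing Spark session, or start one if none exists.
sparkR.session()

df <- createDataFrame(mtcars)

# dapply: run an R function on each partition. The output schema must be declared.
schema <- structType(
  structField("mpg", "double"),
  structField("kpl", "double")
)
converted <- dapply(
  df,
  function(part) {
    # `part` is an ordinary R data.frame holding one partition.
    data.frame(mpg = part$mpg, kpl = part$mpg * 0.425144)
  },
  schema
)
print(head(converted))

# gapplyCollect: run an R function per group and collect a local data.frame.
means <- gapplyCollect(
  df,
  "cyl",
  function(key, part) {
    data.frame(cyl = key[[1]], mean_mpg = mean(part$mpg))
  }
)
print(means)

# spark.lapply: distribute a function over the elements of a list.
squares <- spark.lapply(1:4, function(x) x^2)
print(unlist(squares))
```

The Collect variants return a local data.frame directly, so they fit best when the result is small enough to hold in the driver's memory.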
Databricks Container Services
Databricks Container Services lets you specify a Docker image when you create a cluster. Databricks provides the databricksruntime/rbase base image on Docker Hub as an example to launch a Databricks Container Services cluster with R support. See also the Dockerfile that is used to generate this base image.
Libraries
Databricks clusters use the Databricks Runtime, which provides many popular libraries out-of-the-box, including Apache Spark, Delta Lake, and more. You can also install additional third-party or custom R packages into libraries to use with notebooks and jobs.
Start with the default libraries in the Databricks Runtime release notes versions and compatibility. Use Databricks Runtime for Machine Learning for machine learning workloads. For full lists of pre-installed libraries, see the “Installed R libraries” section for the target Databricks Runtime in Databricks Runtime release notes versions and compatibility.
You can customize your environment by using Notebook-scoped R libraries, which allow you to modify your notebook or job environment with libraries from CRAN or other repositories. To do this, you can use the familiar install.packages function from utils. The following example installs the Arrow R package from the default CRAN repository:
install.packages("arrow")
If you need an older version of a package than the one included in the Databricks Runtime, you can use a notebook to run the install_version function from devtools. The following example installs dplyr version 0.7.4 from CRAN:
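# library() stops with an error if devtools is missing; require() only warns.
library(devtools)
# Install a specific, older release of dplyr from CRAN.
install_version(
  package = "dplyr",
  version = "0.7.4",
  repos = "http://cran.r-project.org"
)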
Packages installed this way are available across a cluster. They are scoped to the user who installs them. This enables you to install multiple versions of the same package on the same compute without creating package conflicts.
You can install other libraries as Cluster libraries as needed, for example from CRAN. To do this, in the cluster user interface, click Libraries > Install new > CRAN and specify the library’s name. This approach is especially important when you want to call user-defined functions with SparkR or sparklyr.
For more details, see Libraries.
To install a custom package into a library:
Build your custom package from the command line or by using RStudio.
Copy the custom package file from your development machine over to your Databricks workspace. For options, see Libraries.
Install the custom package into a library by running install.packages.
For example, from a notebook in your workspace:
install.packages(
  pkgs = "/path/to/tar/file/<custom-package>.tar.gz",
  type = "source",
  repos = NULL
)
Or:
%sh R CMD INSTALL /path/to/tar/file/<custom-package>.tar.gz
After you install a custom package into a library, add the library to the search path and then load the library with a single command.
For example:
# Add the library to the search path one time.
.libPaths(c("/path/to/tar/file/", .libPaths()))
# Load the library. You do not need to add the library to the search path again.
library(<custom-package>)
To install a custom package as a library on each node in a cluster, you must use init scripts. See What are init scripts?
Visualizations
Databricks R notebooks support various types of visualizations using the display function.
Jobs
You can automate R workloads as scheduled or triggered notebook jobs in Databricks. See Create and run Databricks Jobs.
For details on creating a job via the UI, see Create a job.
The Jobs API allows you to create, edit, and delete jobs.
The Databricks CLI provides a convenient command line interface for calling the Jobs API.
Machine learning
Databricks supports a wide variety of machine learning (ML) workloads, including traditional ML on tabular data, deep learning for computer vision and natural language processing, recommendation systems, graph analytics, and more. For general information about machine learning on Databricks, see Databricks Runtime for Machine Learning.
For ML algorithms, you can use pre-installed libraries in Databricks Runtime for Machine Learning. You can also install custom libraries.
For machine learning operations (MLOps), Databricks provides a managed service for the open source library MLflow. With MLflow Tracking you can record model development and save models in reusable formats. You can use the MLflow Model Registry to manage and automate the promotion of models towards production. Jobs and Model Serving allow hosting models as batch and streaming jobs and as REST endpoints. For more information and examples, see ML lifecycle management using MLflow or the MLflow R API docs.
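To give a sense of what MLflow Tracking looks like from R, here is a minimal sketch. It assumes the mlflow R package is installed and that a tracking server is reachable (on Databricks, the managed MLflow service). The model, parameter name, and metric name are illustrative choices.

```r
library(mlflow)

# Fit a small linear model on a built-in dataset.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Compute the in-sample root mean squared error.
rmse <- sqrt(mean(residuals(fit)^2))

# Record one run; with() ends the run when the block finishes.
with(mlflow_start_run(), {
  mlflow_log_param("formula", "mpg ~ wt + hp")
  mlflow_log_metric("rmse", rmse)
})
```

Logged parameters and metrics then appear with the run in the MLflow tracking UI, where runs can be compared.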
R developer tools
In addition to Databricks notebooks, you can also use the following R developer tools:
Use SparkR and RStudio Desktop with Databricks Connect.
Use sparklyr and RStudio Desktop with Databricks Connect.
R session customization
In Databricks Runtime 12.2 LTS and above, R sessions can be customized by using site-wide profile (.Rprofile) files. R notebooks will source the file as R code during startup. To modify the file, find the value of R_HOME and modify $R_HOME/etc/Rprofile.site. Note that Databricks has added configuration in the file to ensure proper functionality for hosted RStudio on Databricks. Removing any of it may cause RStudio to not work as expected.
In Databricks Runtime 11.3 LTS and below, this behavior can be enabled by setting the environment variable DATABRICKS_ENABLE_RPROFILE=true.
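One way to locate the site-wide profile from within R is sketched below, using base R only. The helper append_profile_line is a hypothetical function written for this example, and it is demonstrated on a temporary file so that the real Rprofile.site is left untouched. Appending to the real file instead of replacing it keeps the configuration that Databricks added.

```r
# R.home("etc") resolves to the etc directory under R_HOME.
site_profile <- file.path(R.home("etc"), "Rprofile.site")
print(site_profile)
print(file.exists(site_profile))

# Hypothetical helper: append one line of R code to a profile file.
append_profile_line <- function(path, line) {
  cat(line, "\n", file = path, sep = "", append = TRUE)
  invisible(path)
}

# Demonstrate on a temporary file rather than the real site profile.
demo_profile <- tempfile(fileext = ".site")
append_profile_line(demo_profile, "options(digits = 4)")
print(readLines(demo_profile))
```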