RStudio on Databricks

Databricks integrates with RStudio Server, the popular integrated development environment (IDE) for R.

You can use either the Open Source or the Pro edition of RStudio Server on Databricks. To use RStudio Server Pro, you must transfer your existing RStudio Pro license to Databricks (see Get started with RStudio Server Pro).

Note

RStudio integration requires the Databricks Operational Security Package.

RStudio integration architecture

When you use RStudio Server on Databricks, the RStudio Server daemon runs on the driver (or master) node of a Databricks high concurrency cluster. The RStudio web UI is proxied through the Databricks web application, so you do not need to change your cluster network configuration. The following diagram shows the architecture of the RStudio integration components.

Architecture of RStudio on Databricks

Warning

Databricks proxies the RStudio web service from port 8787 on the cluster’s Spark driver. This web proxy is intended for use only with RStudio. If you launch other web services on port 8787, you might expose your users to potential security exploits. Databricks is not responsible for any issues that result from the installation of unsupported software on a cluster.

Requirements

Get started with RStudio Server Open Source

To get started with RStudio Server Open Source on Databricks, you must install RStudio on a high concurrency cluster. You need to perform this installation only once. Installation is usually performed by an administrator.

Install RStudio Server Open Source

To set up RStudio Server Open Source on a Databricks cluster, you must create an init script to install the RStudio Server Open Source binary package. See Cluster Node Initialization Scripts for more details. Here is an example notebook cell that installs an init script for a cluster named <cluster-name>.

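%python
# Shell commands that the init script runs when the cluster starts.
script = """
  # Install gdebi, which installs a .deb package together with its dependencies
  sudo apt-get update
  sudo apt-get install -y gdebi-core alien

  # Download and install the RStudio Server Open Source package
  cd /tmp
  sudo wget https://download2.rstudio.org/rstudio-server-1.1.453-amd64.deb
  sudo gdebi -n rstudio-server-1.1.453-amd64.deb
  sudo rstudio-server restart
"""

# Write the script to the cluster's init script directory on DBFS (True overwrites an existing file).
dbutils.fs.put("/databricks/init/<cluster-name>/rstudio-install.sh", script, True)
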
  1. Update <cluster-name> to the name of your cluster.
  2. Run the code in a notebook to install the script.
  3. Restart the cluster.
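
The notebook cell above hard-codes the RStudio version and the cluster name. As a minimal sketch, the same script text and DBFS path could be generated from parameters so that the cell is easier to reuse. The helper names build_rstudio_init_script and init_script_path are hypothetical and not part of any Databricks API. The download URL pattern and the /databricks/init/<cluster-name>/ location are copied from the example above, and whether other versions are published under the same URL pattern is an assumption.

```python
def build_rstudio_init_script(version="1.1.453"):
    """Return the shell commands from the example above for one RStudio Server version."""
    # Assumption: other versions follow the same package naming as the example.
    package = "rstudio-server-{}-amd64.deb".format(version)
    lines = [
        "sudo apt-get update",
        "sudo apt-get install -y gdebi-core alien",
        "cd /tmp",
        "sudo wget https://download2.rstudio.org/{}".format(package),
        "sudo gdebi -n {}".format(package),
        "sudo rstudio-server restart",
    ]
    return "\n".join(lines) + "\n"


def init_script_path(cluster_name):
    """Return the DBFS location that the example above uses for a cluster's init script."""
    if not cluster_name or "/" in cluster_name:
        raise ValueError("cluster_name must be a non-empty name without '/'")
    return "/databricks/init/{}/rstudio-install.sh".format(cluster_name)


# "my-cluster" is a placeholder cluster name.
print(init_script_path("my-cluster"))
```

In a Databricks notebook, where dbutils is available, the result could then be written with dbutils.fs.put(init_script_path("my-cluster"), build_rstudio_init_script(), True), followed by a cluster restart as in step 3.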

Use RStudio Server Open Source

  1. Display the cluster details of the cluster on which you installed RStudio and click the Apps tab:

    Cluster Apps Tab
  2. In the Apps tab, click the Set up RStudio button. This generates a one-time password for you. Click the show link to display it and copy the password.

    RStudio One-time Password
  3. Click the Open RStudio UI link to open the UI in a new tab. Enter your username and the one-time password in the login form and sign in.

    RStudio Login Form
  4. From the RStudio UI, you can attach the SparkR package and set up a SparkR session to launch Spark jobs on your high concurrency cluster.

    library(SparkR)
    sparkR.session()
    
    RStudio Session
  5. You can also attach the sparklyr package and set up a Spark connection.

    SparkR::sparkR.session()
    library(sparklyr)
    sc <- spark_connect(method = "databricks")
    
    RStudio Session with sparklyr

Get started with RStudio Server Pro

Set up RStudio license server

To use RStudio Server Pro on Databricks, you need to convert your Pro license to a floating license. For assistance, contact support@rstudio.com. After your license is converted, you must set up a license server for RStudio Server Pro.

To set up a license server:

  1. Launch a small instance on your cloud provider network; the license server daemon is very lightweight.
  2. Download and install the corresponding version of RStudio License Server on your instance, and start the service. For detailed instructions, see RStudio Server Pro documentation.
  3. Make sure that the license server port is open to Databricks instances.
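
One way to check step 3 is to attempt a TCP connection to the license server from a notebook attached to the cluster. The following is a minimal sketch using only the Python standard library. The function name port_is_open is hypothetical, and the host and port in the usage comment are placeholders: use the address and port that your own license server listens on.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example call with placeholder values (not real endpoints):
# port_is_open("license-server.example.com", 8989)
```

A True result only shows that the port accepts connections from the driver. It does not confirm that the license server is configured correctly or that a license can be leased.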

Install RStudio Server Pro

To set up RStudio Server Pro on a Databricks cluster, you must create an init script that installs the RStudio Server Pro binary package and configures it to lease its license from your license server. See Cluster Node Initialization Scripts for more details. The following is an example notebook cell that installs an init script for a cluster named <cluster-name>. The script also adds authentication settings that make integration with Databricks smoother.

%python

# Shell commands that the init script runs when the cluster starts.
script = """
  sudo apt-get update
  sudo apt-get install -y gdebi-core alien

  # Installing RStudio Server Pro
  cd /tmp
  sudo wget https://download2.rstudio.org/rstudio-server-pro-1.1.453-amd64.deb
  sudo gdebi -n rstudio-server-pro-1.1.453-amd64.deb

  # Configuring authentication
  sudo echo 'auth-proxy=1' >> /etc/rstudio/rserver.conf
  sudo echo 'auth-proxy-user-header-rewrite=^(.*)$ $1' >> /etc/rstudio/rserver.conf
  sudo echo 'auth-proxy-sign-in-url=<domain>/login.html' >> /etc/rstudio/rserver.conf
  sudo echo 'admin-enabled=1' >> /etc/rstudio/rserver.conf
  sudo echo 'export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin' >> /etc/rstudio/rsession-profile

  # Enabling floating license
  sudo echo 'server-license-type=remote' >> /etc/rstudio/rserver.conf

  # Session configurations
  sudo echo 'session-rprofile-on-resume-default=1' >> /etc/rstudio/rsession.conf
  sudo echo 'allow-terminal-websockets=0' >> /etc/rstudio/rsession.conf

  # Pointing RStudio Server Pro at the license server and restarting it
  sudo rstudio-server license-manager license-server <license-server-url>
  sudo rstudio-server restart
"""

# Write the script to the cluster's init script directory on DBFS (True overwrites an existing file).
dbutils.fs.put("/databricks/init/<cluster-name>/rstudio-install.sh", script, True)

  1. Update <cluster-name> to the name of your cluster.
  2. Replace <domain> with your Databricks URL and <license-server-url> with the URL of your floating license server.
  3. Run the code in a notebook to install the script.
  4. Restart the cluster.
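
Steps 1 and 2 depend on every placeholder being replaced before the script is written to DBFS. As a minimal sketch, a small check could list any <placeholder> tokens that remain in the script text. The function name find_unreplaced_placeholders is hypothetical, and the replacement values in the usage example are invented for illustration only.

```python
import re


def find_unreplaced_placeholders(script):
    """Return the sorted, de-duplicated <placeholder> tokens still present in the text."""
    return sorted(set(re.findall(r"<[a-z][a-z0-9-]*>", script)))


# Two lines taken from the init script above, with placeholders still in place.
template = (
    "sudo echo 'auth-proxy-sign-in-url=<domain>/login.html' >> /etc/rstudio/rserver.conf\n"
    "sudo rstudio-server license-manager license-server <license-server-url>\n"
)
print(find_unreplaced_placeholders(template))

# Hypothetical values, for illustration only.
filled = (
    template
    .replace("<domain>", "https://example.cloud.databricks.com")
    .replace("<license-server-url>", "license-server.example.com:8989")
)
print(find_unreplaced_placeholders(filled))
```

The first call reports both placeholders and the second reports none. Running the same check on the script string and on the DBFS path before calling dbutils.fs.put would catch a forgotten <cluster-name>, <domain>, or <license-server-url>.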

Use RStudio Server Pro

  1. Display the cluster details of the cluster on which you installed RStudio and click the Apps tab:

    Cluster Apps Tab
  2. In the Apps tab, click the Set up RStudio button.

    RStudio One-time Password
  3. You do not need the one-time password. Click the Open RStudio UI link to open an authenticated RStudio Pro session.

  4. From the RStudio UI, you can attach the SparkR package and set up a SparkR session to launch Spark jobs on your cluster.

    library(SparkR)
    sparkR.session()
    
    RStudio Session
  5. You can also attach the sparklyr package and set up a Spark connection.

    SparkR::sparkR.session()
    library(sparklyr)
    sc <- spark_connect(method = "databricks")
    
    RStudio Session with sparklyr

Frequently asked questions (FAQ)

What is the difference between RStudio Server Open Source and RStudio Server Pro?

RStudio Server Pro supports a wide range of enterprise features that are not available on the Open Source edition. You can see a feature comparison on the RStudio Inc website.

In addition, RStudio Server Open Source is distributed under the GNU Affero General Public License (AGPL), while the Pro version comes with a commercial license for organizations that are not able to use AGPL software.

Finally, RStudio Server Pro comes with professional and enterprise support from RStudio Inc., while RStudio Server Open Source comes with no support.

Can I use my RStudio Server Pro license on Databricks?

Yes, if you already have a Pro or Enterprise license for RStudio Server, you can use that license on Databricks. See Get started with RStudio Server Pro to learn how to set up RStudio Server Pro on Databricks.

Where does RStudio Server run? Do I need to manage any additional services/servers?

As shown in the diagram in RStudio integration architecture, the RStudio Server daemon runs on the driver (master) node of your Databricks high concurrency cluster. With RStudio Server Open Source, you do not need to run any additional servers or services. For RStudio Server Pro, however, you must manage a separate instance that runs RStudio License Server.

Can I use RStudio Server on a standard cluster?

No, standard Databricks clusters do not support RStudio Server integration. You must run on a high concurrency cluster.

How should I persist my work on RStudio?

We strongly recommend that you persist your work using a version control system from RStudio. RStudio has great support for various version control systems and allows you to check in and manage your projects.

You can also save your files (code or data) on the Databricks File System (DBFS). For example, files that you save under /dbfs/ are not deleted when your cluster is terminated or restarted.

Important

If you do not persist your code through version control or DBFS, you risk losing your work if an admin restarts or terminates the cluster.

How does RStudio integrate with Databricks R notebooks?

You can move your work between notebooks and RStudio through version control.

What is the working directory?

When you start a project in RStudio, you choose a working directory. By default, this is the home directory on the driver (master) container where RStudio Server is running. You can change this directory if you want.

Can I launch Shiny Apps from RStudio running on Databricks?

Unfortunately, Shiny apps and RStudio Connect integration are not yet supported on Databricks.

I can’t use terminal/git inside RStudio on Databricks. How can I fix that?

Make sure that you have disabled websockets. In RStudio Server Open Source, you can do this from the UI.

RStudio Session

In RStudio Server Pro, you can add allow-terminal-websockets=0 to /etc/rstudio/rsession.conf to disable websockets for all users.
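
Appending the same setting more than once leaves duplicate lines in the configuration file. As a minimal sketch, the setting could be written idempotently: replace an existing entry for the key, or append one if it is missing. The function name ensure_setting is hypothetical, and the example writes to a temporary file rather than /etc/rstudio/rsession.conf, which typically requires root access to modify.

```python
import os
import tempfile


def ensure_setting(conf_path, key, value):
    """Make sure the file contains exactly one 'key=value' line for key.

    Returns True if the file was changed and False if it was already correct.
    """
    new_line = "{}={}".format(key, value)
    lines = []
    if os.path.exists(conf_path):
        with open(conf_path) as f:
            lines = f.read().splitlines()
    existing = [ln for ln in lines if ln.strip().startswith(key + "=")]
    if existing == [new_line]:
        return False
    # Drop any old entries for this key, then append the desired one.
    kept = [ln for ln in lines if not ln.strip().startswith(key + "=")]
    kept.append(new_line)
    with open(conf_path, "w") as f:
        f.write("\n".join(kept) + "\n")
    return True


# Demonstration against a temporary file instead of the real configuration file.
with tempfile.TemporaryDirectory() as tmp:
    demo_conf = os.path.join(tmp, "rsession.conf")
    print(ensure_setting(demo_conf, "allow-terminal-websockets", "0"))  # writes the setting
    print(ensure_setting(demo_conf, "allow-terminal-websockets", "0"))  # nothing to change
```

The first call prints True and the second prints False. On a cluster, the init script shown in Install RStudio Server Pro already adds this setting at startup, so a helper like this would matter only when changing the file by hand.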

I don’t see the Apps tab under cluster details.

This feature is not available to all customers. You must be on the Databricks Operational Security Package. In addition, the Apps tab appears only on high concurrency clusters running Databricks Runtime 4.1 and above. Standard clusters do not support Apps or RStudio integration.