Databricks SDK for R

Note

This article covers the Databricks SDK for R by Databricks Labs, which is in an Experimental state. To provide feedback, ask questions, and report issues, use the Issues tab in the Databricks SDK for R repository in GitHub.

In this article, you learn how to automate Databricks operations in Databricks workspaces with the Databricks SDK for R. This article supplements the Databricks SDK for R documentation.

Note

The Databricks SDK for R does not support the automation of operations in Databricks accounts. To call account-level operations, use a different Databricks SDK, for example:

Before you begin

Before you begin to use the Databricks SDK for R, your development machine must have:

  • A Databricks personal access token for the target Databricks workspace that you want to automate.

    Note

    The Databricks SDK for R supports Databricks personal access token authentication only.

  • R, and optionally an R-compatible integrated development environment (IDE). Databricks recommends RStudio Desktop and uses it in this article’s instructions.

Get started with the Databricks SDK for R

  1. Make your Databricks workspace URL and personal access token available to your R project’s scripts. For example, you can add the following to an R project’s .Renviron file. Replace <your-workspace-url> with your workspace instance URL, for example https://dbc-a1b2345c-d6e7.cloud.databricks.com. Replace <your-personal-access-token> with your Databricks personal access token, for example dapi12345678901234567890123456789012.

    DATABRICKS_HOST=<your-workspace-url>
    DATABRICKS_TOKEN=<your-personal-access-token>
    

    To create a Databricks personal access token, follow the steps at Databricks personal access tokens for workspace users.

    For additional ways to provide your Databricks workspace URL and personal access token, see Authentication in the Databricks SDK for R repository in GitHub.

    Important

    Do not add .Renviron files to version control systems, as this risks exposing sensitive information such as Databricks personal access tokens.

  2. Install the Databricks SDK for R package. For example, in RStudio Desktop, in the Console view (View > Move Focus to Console), run the following commands, one at a time:

    install.packages("devtools")
    library(devtools)
    install_github("databrickslabs/databricks-sdk-r")
    

    Note

    The Databricks SDK for R package is not available on CRAN.

  3. Add code to reference the Databricks SDK for R and to list all of the clusters in your Databricks workspace. For example, in a project’s main.r file, the code might be as follows:

    require(databricks)
    
    client <- DatabricksClient()
    
    list_clusters(client)[, "cluster_name"]
    
  4. Run your script. For example, in RStudio Desktop, in the script editor with the a project’s main.r file active, click Source > Source or Source with Echo.

  5. The list of clusters appears. For example, in RStudio Desktop, this is in the Console view.

Code examples

The following code examples demonstrate how to use the Databricks SDK for R to create and delete clusters, and create jobs.

Create a cluster

This code example creates a cluster with the specified Databricks Runtime version and cluster node type. This cluster has one worker, and the cluster automatically terminates after 15 minutes of idle time.

require(databricks)

client <- DatabricksClient()

response <- create_cluster(
  client = client,
  cluster_name = "my-cluster",
  spark_version = "12.2.x-scala2.12",
  node_type_id = "i3.xlarge",
  autotermination_minutes = 15,
  num_workers = 1
)

# Get the workspace URL to be used in the following results message.
get_client_debug <- strsplit(client$debug_string(), split = "host=")
get_host <- strsplit(get_client_debug[[1]][2], split = ",")
host <- get_host[[1]][1]

# Make sure the workspace URL ends with a forward slash.
if (endsWith(host, "/")) {
} else {
  host <- paste(host, "/", sep = "")
}

print(paste(
  "View the cluster at ",
  host,
  "#setting/clusters/",
  response$cluster_id,
  "/configuration",
  sep = "")
)

Permanently delete a cluster

This code example permanently deletes the cluster with the specified cluster ID from the workspace.

require(databricks)

client <- DatabricksClient()

cluster_id <- readline("ID of the cluster to delete (for example, 1234-567890-ab123cd4):")

delete_cluster(client, cluster_id)

Create a job

This code example creates a Databricks job that can be used to run the specified notebook on the specified cluster. As this code runs, it gets the existing notebook’s path, the existing cluster ID, and related job settings from the user at the console.

require(databricks)

client <- DatabricksClient()

job_name <- readline("Some short name for the job (for example, my-job):")
description <- readline("Some short description for the job (for example, My job):")
existing_cluster_id <- readline("ID of the existing cluster in the workspace to run the job on (for example, 1234-567890-ab123cd4):")
notebook_path <- readline("Workspace path of the notebook to run (for example, /Users/someone@example.com/my-notebook):")
task_key <- readline("Some key to apply to the job's tasks (for example, my-key):")

print("Attempting to create the job. Please wait...")

notebook_task <- list(
  notebook_path = notebook_path,
  source = "WORKSPACE"
)

job_task <- list(
  task_key = task_key,
  description = description,
  existing_cluster_id = existing_cluster_id,
  notebook_task = notebook_task
)

response <- create_job(
  client,
  name = job_name,
  tasks = list(job_task)
)

# Get the workspace URL to be used in the following results message.
get_client_debug <- strsplit(client$debug_string(), split = "host=")
get_host <- strsplit(get_client_debug[[1]][2], split = ",")
host <- get_host[[1]][1]

# Make sure the workspace URL ends with a forward slash.
if (endsWith(host, "/")) {
} else {
  host <- paste(host, "/", sep = "")
}

print(paste(
  "View the job at ",
  host,
  "#job/",
  response$job_id,
  sep = "")
)

Logging

You can use the popular logging package to log messages. This package provides support for multiple logging levels and custom log formats. You can use this package to log messages to the console or to a file. To log messages, do the following:

  1. Install the logging package. For example, in RStudio Desktop, in the Console view (View > Move Focus to Console), run the following commands:

    install.packages("logging")
    library(logging)
    
  2. Bootstrap the logging package, set where to log the messages, and set the logging level. For example, the following code logs all ERROR messages and below to the results.log file.

    basicConfig()
    addHandler(writeToFile, file="results.log")
    setLevel("ERROR")
    
  3. Log messages as needed. For example, the following code logs any errors if the code cannot authenticate or list the names of the available clusters.

    require(databricks)
    require(logging)
    
    basicConfig()
    addHandler(writeToFile, file="results.log")
    setLevel("ERROR")
    
    tryCatch({
      client <- DatabricksClient()
    }, error = function(e) {
      logerror(paste("Error initializing DatabricksClient(): ", e$message))
      return(NA)
    })
    
    tryCatch({
      list_clusters(client)[, "cluster_name"]
    }, error = function(e) {
      logerror(paste("Error in list_clusters(client): ", e$message))
      return(NA)
    })
    

Testing

To test your code, you can use R test frameworks such as testthat. To test your code under simulated conditions without calling Databricks REST API endpoints or changing the state of your Databricks accounts or workspaces, you can use R mocking libraries such as mockery.

For example, given the following file named helpers.r containing a createCluster function that returns information about the new cluster:

library(databricks)

createCluster <- function(
  databricks_client,
  cluster_name,
  spark_version,
  node_type_id,
  autotermination_minutes,
  num_workers
) {
  response <- create_cluster(
    client = databricks_client,
    cluster_name = cluster_name,
    spark_version = spark_version,
    node_type_id = node_type_id,
    autotermination_minutes = autotermination_minutes,
    num_workers = num_workers
  )
  return(response)
}

And given the following file named main.R that calls the createCluster function:

library(databricks)
source("helpers.R")

client <- DatabricksClient()

# Replace <spark-version> with the target Spark version string.
# Replace <node-type-id> with the target node type string.
response = createCluster(
  databricks_client = client,
  cluster_name = "my-cluster",
  spark_version = "<spark-version>",
  node_type_id = "<node-type-id>",
  autotermination_minutes = 15,
  num_workers = 1
)

print(response$cluster_id)

The following file named test-helpers.py tests whether the createCluster function returns the expected response. Rather than creating a cluster in the target workspace, this test mocks a DatabricksClient object, defines the mocked object’s settings, and then passes the mocked object to the createCluster function. The test then checks whether the function returns the new mocked cluster’s expected ID.

# install.packages("testthat")
# install.pacakges("mockery")
# testthat::test_file("test-helpers.R")
lapply(c("databricks", "testthat", "mockery"), library, character.only = TRUE)
source("helpers.R")

test_that("createCluster mock returns expected results", {
  # Create a mock response.
  mock_response <- list(cluster_id = "abc123")

  # Create a mock function for create_cluster().
  mock_create_cluster <- mock(return_value = mock_response)

  # Run the test with the mock function.
  with_mock(
    create_cluster = mock_create_cluster,
    {
      # Create a mock Databricks client.
      mock_client <- mock()

      # Call the function with the mock client.
      # Replace <spark-version> with the target Spark version string.
      # Replace <node-type-id> with the target node type string.
      response <- createCluster(
        databricks_client = mock_client,
        cluster_name = "my-cluster",
        spark_version = "<spark-version>",
        node_type_id = "<node-type-id>",
        autotermination_minutes = 15,
        num_workers = 1
      )

      # Check that the function returned the correct mock response.
      expect_equal(response$cluster_id, "abc123")
    }
  )
})