Databricks SDK for R
Note
This article covers the Databricks SDK for R by Databricks Labs, which is in an Experimental state. To provide feedback, ask questions, and report issues, use the Issues tab in the Databricks SDK for R repository in GitHub.
In this article, you learn how to automate Databricks operations in Databricks workspaces with the Databricks SDK for R. This article supplements the Databricks SDK for R documentation.
Note
The Databricks SDK for R does not support the automation of operations in Databricks accounts. To call account-level operations, use a different Databricks SDK.
Before you begin
Before you begin to use the Databricks SDK for R, your development machine must have:
A Databricks personal access token for the target Databricks workspace that you want to automate.
Note
The Databricks SDK for R supports Databricks personal access token authentication only.
R, and optionally an R-compatible integrated development environment (IDE). Databricks recommends RStudio Desktop and uses it in this article’s instructions.
Get started with the Databricks SDK for R
Make your Databricks workspace URL and personal access token available to your R project’s scripts. For example, you can add the following to an R project’s .Renviron file. Replace <your-workspace-url> with your workspace instance URL, for example https://dbc-a1b2345c-d6e7.cloud.databricks.com. Replace <your-personal-access-token> with your Databricks personal access token, for example dapi12345678901234567890123456789012.
DATABRICKS_HOST=<your-workspace-url>
DATABRICKS_TOKEN=<your-personal-access-token>
To create a Databricks personal access token, follow the steps at Databricks personal access tokens for workspace users.
For additional ways to provide your Databricks workspace URL and personal access token, see Authentication in the Databricks SDK for R repository in GitHub.
Important
Do not add .Renviron files to version control systems, as this risks exposing sensitive information such as Databricks personal access tokens.
Install the Databricks SDK for R package. For example, in RStudio Desktop, in the Console view (View > Move Focus to Console), run the following commands, one at a time:
install.packages("devtools")
library(devtools)
install_github("databrickslabs/databricks-sdk-r")
Note
The Databricks SDK for R package is not available on CRAN.
Add code to reference the Databricks SDK for R and to list all of the clusters in your Databricks workspace. For example, in a project’s main.r file, the code might be as follows:
require(databricks)
client <- DatabricksClient()
list_clusters(client)[, "cluster_name"]
Run your script. For example, in RStudio Desktop, in the script editor with the project’s main.r file active, click Source > Source or Source with Echo.
The list of clusters appears. For example, in RStudio Desktop, this is in the Console view.
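Before running the script, it can help to confirm that the two environment variables from the first step are visible to the R session. The following is a minimal sketch in base R; the helper name missing_env_vars is hypothetical and is not part of the Databricks SDK for R. R reads .Renviron at startup, so restart the R session after editing the file.

```r
# Hypothetical helper (not part of the SDK): report which of the
# required environment variables are unset or empty.
missing_env_vars <- function(vars = c("DATABRICKS_HOST", "DATABRICKS_TOKEN")) {
  values <- Sys.getenv(vars, unset = "")
  vars[!nzchar(values)]
}

missing <- missing_env_vars()
if (length(missing) > 0) {
  message("Not set: ", paste(missing, collapse = ", "))
} else {
  message("DATABRICKS_HOST and DATABRICKS_TOKEN are both set.")
}
```

The check only reports variable names and never prints the token value.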
Code examples
The following code examples demonstrate how to use the Databricks SDK for R to create and delete clusters, and create jobs.
Create a cluster
This code example creates a cluster with the specified Databricks Runtime version and cluster node type. This cluster has one worker, and the cluster automatically terminates after 15 minutes of idle time.
require(databricks)
client <- DatabricksClient()
response <- create_cluster(
  client = client,
  cluster_name = "my-cluster",
  spark_version = "12.2.x-scala2.12",
  node_type_id = "i3.xlarge",
  autotermination_minutes = 15,
  num_workers = 1
)
# Get the workspace URL to be used in the following results message.
get_client_debug <- strsplit(client$debug_string(), split = "host=")
get_host <- strsplit(get_client_debug[[1]][2], split = ",")
host <- get_host[[1]][1]
# Make sure the workspace URL ends with a forward slash.
if (!endsWith(host, "/")) {
  host <- paste0(host, "/")
}
print(paste0(
  "View the cluster at ",
  host,
  "#setting/clusters/",
  response$cluster_id,
  "/configuration"
))
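The example above recovers the workspace URL by parsing the client's debug string. As an alternative sketch, assuming the URL was supplied through the DATABRICKS_HOST environment variable as in the getting-started steps, the link can be built from that variable directly. The helper name cluster_url is hypothetical, and the cluster ID below is a placeholder.

```r
# Hypothetical helper: build the cluster configuration URL from a
# workspace host and a cluster ID.
cluster_url <- function(host, cluster_id) {
  # Strip any trailing slashes, then add exactly one.
  host <- paste0(sub("/+$", "", host), "/")
  paste0(host, "#setting/clusters/", cluster_id, "/configuration")
}

# Assumes DATABRICKS_HOST is set, for example in .Renviron.
host <- Sys.getenv("DATABRICKS_HOST", unset = "https://<your-workspace-url>")
print(cluster_url(host, "1234-567890-ab123cd4"))
```

This avoids depending on the exact layout of the debug string, at the cost of assuming the host came from the environment variable rather than another authentication source.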
Permanently delete a cluster
This code example permanently deletes the cluster with the specified cluster ID from the workspace.
require(databricks)
client <- DatabricksClient()
cluster_id <- readline("ID of the cluster to delete (for example, 1234-567890-ab123cd4):")
delete_cluster(client, cluster_id)
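Because the deletion is permanent, a script might ask the user to retype the cluster ID before calling delete_cluster. The following is a minimal sketch of such a guard in base R; the helper name confirmed is hypothetical, and the values are placeholders. Note that readline returns an empty string immediately when R is not running interactively, so the guard treats an empty ID as unconfirmed.

```r
# Hypothetical helper: TRUE only if the retyped value matches a
# non-empty cluster ID exactly (ignoring surrounding whitespace).
confirmed <- function(cluster_id, typed) {
  nzchar(cluster_id) && identical(trimws(typed), trimws(cluster_id))
}

cluster_id <- "1234-567890-ab123cd4"  # placeholder ID
typed <- "1234-567890-ab123cd4"       # in a script, read this with readline()

if (confirmed(cluster_id, typed)) {
  message("Confirmed. The script could now call delete_cluster(client, cluster_id).")
} else {
  message("Confirmation did not match. Nothing was deleted.")
}
```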
Create a job
This code example creates a Databricks job that can be used to run the specified notebook on the specified cluster. As this code runs, it gets the existing notebook’s path, the existing cluster ID, and related job settings from the user at the console.
require(databricks)
client <- DatabricksClient()
job_name <- readline("Some short name for the job (for example, my-job):")
description <- readline("Some short description for the job (for example, My job):")
existing_cluster_id <- readline("ID of the existing cluster in the workspace to run the job on (for example, 1234-567890-ab123cd4):")
notebook_path <- readline("Workspace path of the notebook to run (for example, /Users/someone@example.com/my-notebook):")
task_key <- readline("Some key to apply to the job's tasks (for example, my-key):")
print("Attempting to create the job. Please wait...")
notebook_task <- list(
  notebook_path = notebook_path,
  source = "WORKSPACE"
)
job_task <- list(
  task_key = task_key,
  description = description,
  existing_cluster_id = existing_cluster_id,
  notebook_task = notebook_task
)
response <- create_job(
  client,
  name = job_name,
  tasks = list(job_task)
)
# Get the workspace URL to be used in the following results message.
get_client_debug <- strsplit(client$debug_string(), split = "host=")
get_host <- strsplit(get_client_debug[[1]][2], split = ",")
host <- get_host[[1]][1]
# Make sure the workspace URL ends with a forward slash.
if (!endsWith(host, "/")) {
  host <- paste0(host, "/")
}
print(paste0(
  "View the job at ",
  host,
  "#job/",
  response$job_id
))
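The tasks argument in the example above is a list of nested lists. To inspect that structure before sending it, the task can be built with a small helper and printed with str(). This is a sketch only: the helper name notebook_job_task is hypothetical, the values are placeholders, and the code does not call the Databricks SDK for R.

```r
# Hypothetical helper: build one notebook task in the same shape as
# the job_task list in the example above.
notebook_job_task <- function(task_key, notebook_path, existing_cluster_id,
                              description = "") {
  list(
    task_key = task_key,
    description = description,
    existing_cluster_id = existing_cluster_id,
    notebook_task = list(notebook_path = notebook_path, source = "WORKSPACE")
  )
}

tasks <- list(
  notebook_job_task(
    task_key = "my-key",
    notebook_path = "/Users/someone@example.com/my-notebook",
    existing_cluster_id = "1234-567890-ab123cd4",
    description = "My job"
  )
)

# Show the nested structure that would be passed as `tasks`.
str(tasks)
```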
Logging
You can use the popular logging package to log messages. This package provides support for multiple logging levels and custom log formats. You can use this package to log messages to the console or to a file. To log messages, do the following:
Install the logging package. For example, in RStudio Desktop, in the Console view (View > Move Focus to Console), run the following commands:
install.packages("logging")
library(logging)
Bootstrap the logging package, set where to log the messages, and set the logging level. For example, the following code logs messages at the ERROR level and higher severity to the results.log file.
basicConfig()
addHandler(writeToFile, file="results.log")
setLevel("ERROR")
Log messages as needed. For example, the following code logs any errors if the code cannot authenticate or list the names of the available clusters.
require(databricks)
require(logging)

basicConfig()
addHandler(writeToFile, file="results.log")
setLevel("ERROR")

tryCatch({
  client <- DatabricksClient()
}, error = function(e) {
  logerror(paste("Error initializing DatabricksClient(): ", e$message))
  return(NA)
})

tryCatch({
  list_clusters(client)[, "cluster_name"]
}, error = function(e) {
  logerror(paste("Error in list_clusters(client): ", e$message))
  return(NA)
})
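The tryCatch pattern above repeats for each call. One way to reduce the repetition is a small wrapper that evaluates an expression, logs any error, and returns a fallback value. The following is a minimal sketch in base R; the helper name try_or_log and its parameters are hypothetical. It logs with message() by default so that it runs without extra packages; a logging function such as logerror from the logging package could presumably be passed as log_fn instead.

```r
# Hypothetical helper: evaluate `expr`; on error, log a labeled
# message and return `fallback` instead of stopping the script.
try_or_log <- function(expr, label, log_fn = message, fallback = NA) {
  tryCatch(
    expr,
    error = function(e) {
      log_fn(paste0("Error in ", label, ": ", conditionMessage(e)))
      fallback
    }
  )
}

ok <- try_or_log(1 + 1, "addition")
failed <- try_or_log(stop("boom"), "demo step")
print(ok)
print(is.na(failed))
```

Because R evaluates arguments lazily, expr is not run until it is referenced inside tryCatch, which is what lets the wrapper catch its errors.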
Testing
To test your code, you can use R test frameworks such as testthat. To test your code under simulated conditions without calling Databricks REST API endpoints or changing the state of your Databricks accounts or workspaces, you can use R mocking libraries such as mockery.
For example, given the following file named helpers.R containing a createCluster function that returns information about the new cluster:
library(databricks)
createCluster <- function(
  databricks_client,
  cluster_name,
  spark_version,
  node_type_id,
  autotermination_minutes,
  num_workers
) {
  response <- create_cluster(
    client = databricks_client,
    cluster_name = cluster_name,
    spark_version = spark_version,
    node_type_id = node_type_id,
    autotermination_minutes = autotermination_minutes,
    num_workers = num_workers
  )
  return(response)
}
And given the following file named main.R that calls the createCluster function:
library(databricks)
source("helpers.R")
client <- DatabricksClient()
# Replace <spark-version> with the target Spark version string.
# Replace <node-type-id> with the target node type string.
response <- createCluster(
  databricks_client = client,
  cluster_name = "my-cluster",
  spark_version = "<spark-version>",
  node_type_id = "<node-type-id>",
  autotermination_minutes = 15,
  num_workers = 1
)
print(response$cluster_id)
The following file named test-helpers.R tests whether the createCluster function returns the expected response. Rather than creating a cluster in the target workspace, this test mocks a DatabricksClient object, defines the mocked object’s settings, and then passes the mocked object to the createCluster function. The test then checks whether the function returns the new mocked cluster’s expected ID.
# install.packages("testthat")
# install.packages("mockery")
# testthat::test_file("test-helpers.R")
lapply(c("databricks", "testthat", "mockery"), library, character.only = TRUE)
source("helpers.R")
test_that("createCluster mock returns expected results", {
  # Create a mock response.
  mock_response <- list(cluster_id = "abc123")
  # Create a mock function for create_cluster().
  mock_create_cluster <- mock(return_value = mock_response)
  # Run the test with the mock function.
  with_mock(
    create_cluster = mock_create_cluster,
    {
      # Create a mock Databricks client.
      mock_client <- mock()
      # Call the function with the mock client.
      # Replace <spark-version> with the target Spark version string.
      # Replace <node-type-id> with the target node type string.
      response <- createCluster(
        databricks_client = mock_client,
        cluster_name = "my-cluster",
        spark_version = "<spark-version>",
        node_type_id = "<node-type-id>",
        autotermination_minutes = 15,
        num_workers = 1
      )
      # Check that the function returned the correct mock response.
      expect_equal(response$cluster_id, "abc123")
    }
  )
})
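As an alternative to replacing create_cluster through a mocking library, the function under test can accept the function it calls as a parameter, so a test can pass in a plain stub. The following is a minimal sketch of that approach in base R. The names createClusterWith, create_fn, and stub_create are hypothetical; in real use, create_fn would presumably be the SDK's create_cluster function.

```r
# Variant of createCluster() that takes the function to call as a
# hypothetical `create_fn` parameter instead of calling the SDK directly.
createClusterWith <- function(create_fn, databricks_client, cluster_name, ...) {
  create_fn(client = databricks_client, cluster_name = cluster_name, ...)
}

# A stub that records the arguments it receives and returns a canned response.
calls <- list()
stub_create <- function(...) {
  calls[[length(calls) + 1]] <<- list(...)
  list(cluster_id = "abc123")
}

response <- createClusterWith(
  stub_create,
  databricks_client = NULL,
  cluster_name = "my-cluster",
  num_workers = 1
)

# Check the canned response and the arguments that were forwarded.
stopifnot(response$cluster_id == "abc123")
stopifnot(calls[[1]]$cluster_name == "my-cluster")
```

The trade-off is an extra parameter on the function under test in exchange for tests that need no mocking package and make no calls to the workspace.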