Databricks SDK for R

Note

This article covers the Databricks SDK for R by Databricks Labs, which is in an experimental state. To provide feedback, ask questions, and report issues, use the Issues tab in the Databricks SDK for R repository on GitHub.

In this article, you learn how to automate operations in Databricks workspaces with the Databricks SDK for R. This article supplements the Databricks SDK for R documentation.

Note

The Databricks SDK for R does not support the automation of operations in Databricks accounts. To call account-level operations, use a different Databricks SDK.

Before you begin

Before you begin to use the Databricks SDK for R, your development machine must have:

A Databricks personal access token for the target Databricks workspace that you want to automate.

Note

The Databricks SDK for R supports only Databricks personal access token authentication.

R and, optionally, an R-compatible integrated development environment (IDE). Databricks recommends RStudio Desktop and uses it in this article's instructions.

Get started with the Databricks SDK for R

Make your Databricks workspace URL and personal access token available to your R project's scripts. For example, you can add the following to an R project's .Renviron file. Replace <your-workspace-url> with your workspace instance URL, for example, https://dbc-a1b2345c-d6e7.cloud.databricks.com. Replace <your-personal-access-token> with your Databricks personal access token, for example, dapi12345678901234567890123456789012.
```
DATABRICKS_HOST=<your-workspace-url>
DATABRICKS_TOKEN=<your-personal-access-token>
```
To create a Databricks personal access token, follow the steps in Databricks personal access tokens for workspace users.

For other ways to provide your Databricks workspace URL and personal access token, see Authentication in the Databricks SDK for R repository on GitHub.

Important

Do not add .Renviron files to version control systems, because doing so can expose sensitive information such as Databricks personal access tokens.
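
Before creating a client, a script can confirm that both variables are actually visible to the R session. The following is a minimal sketch that uses only base R. The helper name missing_databricks_env is hypothetical and is not part of the SDK.

```r
# Hypothetical helper (not part of the Databricks SDK for R):
# returns the names of required environment variables that are unset or empty.
missing_databricks_env <- function(required = c("DATABRICKS_HOST", "DATABRICKS_TOKEN")) {
  values <- Sys.getenv(required, unset = "")
  required[values == ""]
}

missing <- missing_databricks_env()
if (length(missing) > 0) {
  message("Missing environment variables: ", paste(missing, collapse = ", "))
} else {
  message("DATABRICKS_HOST and DATABRICKS_TOKEN are set.")
}
```

Note that R reads .Renviron at startup, so a newly edited file typically takes effect only after the R session restarts.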

Install the Databricks SDK for R package. For example, in RStudio Desktop, in the Console view (View > Move Focus to Console), run the following commands, one at a time:
R
```
install.packages("devtools")
library(devtools)
install_github("databrickslabs/databricks-sdk-r")
```

Note

The Databricks SDK for R package is not available on CRAN.
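
To see whether the installation steps above are still needed, a script can check which packages are already present. This is a small base R sketch. The helper name not_installed is hypothetical, and it only reports missing packages without installing anything.

```r
# Report which packages are not yet installed, without installing anything.
not_installed <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# "databricks" is the package name loaded by the examples in this article.
print(not_installed(c("devtools", "databricks")))
```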

Add code to reference the Databricks SDK for R and to list all of the clusters in your Databricks workspace. For example, in a project's main.r file, the code might be as follows:
R
```
require(databricks)

client <- DatabricksClient()

list_clusters(client)[, "cluster_name"]
```
Run your script. For example, in RStudio Desktop, in the script editor with a project's main.r file active, click Source > Source or Source with Echo.
The list of clusters appears. For example, in RStudio Desktop, the list appears in the Console view.
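
The subscript [, "cluster_name"] in the script above suggests that the cluster list behaves like a data frame, so ordinary base R filtering should apply to it. The sketch below uses a stand-in data frame rather than a live workspace. The state column and its values are assumptions made for illustration.

```r
# Stand-in for the result of list_clusters(client). The "state" column and its
# values are assumptions made for illustration only.
clusters <- data.frame(
  cluster_name = c("etl-nightly", "adhoc-analysis", "ml-training"),
  state = c("RUNNING", "TERMINATED", "RUNNING"),
  stringsAsFactors = FALSE
)

# Keep only the names of running clusters, sorted alphabetically.
running <- sort(clusters[clusters$state == "RUNNING", "cluster_name"])
print(running)
```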

Code examples

The following code examples demonstrate how to use the Databricks SDK for R to create and delete clusters, and to create jobs.

Create a cluster
Permanently delete a cluster
Create a job

Create a cluster

This code example creates a cluster with the specified Databricks Runtime version and cluster node type. The cluster has one worker and automatically terminates after 15 minutes of idle time.

R
require(databricks)

client <- DatabricksClient()

response <- create_cluster(
  client = client,
  cluster_name = "my-cluster",
  spark_version = "12.2.x-scala2.12",
  node_type_id = "i3.xlarge",
  autotermination_minutes = 15,
  num_workers = 1
)

# Get the workspace URL to be used in the following results message.
get_client_debug <- strsplit(client$debug_string(), split = "host=")
get_host <- strsplit(get_client_debug[[1]][2], split = ",")
host <- get_host[[1]][1]

# Make sure the workspace URL ends with a forward slash.
if (!endsWith(host, "/")) {
  host <- paste0(host, "/")
}

print(paste(
  "View the cluster at ",
  host,
  "#setting/clusters/",
  response$cluster_id,
  "/configuration",
  sep = "")
)
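
The example above recovers the workspace URL by splitting the client's debug string. If the host was supplied through the DATABRICKS_HOST environment variable, as in the .Renviron setup earlier, the same link could instead be built directly from that variable. This is a sketch under that assumption. The helper name workspace_ui_url and the fallback URL are hypothetical.

```r
# Hypothetical helper: join a workspace URL and a UI fragment with exactly one slash.
workspace_ui_url <- function(host, fragment) {
  host <- sub("/+$", "", host)  # drop any trailing slashes
  paste0(host, "/", fragment)
}

# Read the host from the environment; the fallback URL is a placeholder.
host <- Sys.getenv("DATABRICKS_HOST", unset = "https://example.cloud.databricks.com")
cluster_id <- "1234-567890-ab123cd4"  # placeholder cluster ID
print(workspace_ui_url(host, paste0("#setting/clusters/", cluster_id, "/configuration")))
```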

Permanently delete a cluster

This code example permanently deletes the cluster with the specified cluster ID from the workspace.

R
require(databricks)

client <- DatabricksClient()

cluster_id <- readline("ID of the cluster to delete (for example, 1234-567890-ab123cd4):")

delete_cluster(client, cluster_id)
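
Because deletion is permanent, it may be worth checking the typed ID before calling delete_cluster. The sketch below is one possible guard. The pattern is an assumption inferred only from the example ID shown in the prompt, not a documented format, and the helper name looks_like_cluster_id is hypothetical.

```r
# Assumed ID shape, based only on the example ID in this article:
# four digits, six digits, then lowercase letters and digits.
looks_like_cluster_id <- function(id) {
  grepl("^[0-9]{4}-[0-9]{6}-[a-z0-9]+$", trimws(id))
}

print(looks_like_cluster_id("1234-567890-ab123cd4"))  # TRUE
print(looks_like_cluster_id("my-cluster"))            # FALSE
```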

Create a job

This code example creates a Databricks job that can be used to run the specified notebook on the specified cluster. As the code runs, it prompts the user at the console for the existing notebook's path, the existing cluster ID, and related job settings.

R
require(databricks)

client <- DatabricksClient()

job_name <- readline("Some short name for the job (for example, my-job):")
description <- readline("Some short description for the job (for example, My job):")
existing_cluster_id <- readline("ID of the existing cluster in the workspace to run the job on (for example, 1234-567890-ab123cd4):")
notebook_path <- readline("Workspace path of the notebook to run (for example, /Users/someone@example.com/my-notebook):")
task_key <- readline("Some key to apply to the job's tasks (for example, my-key):")

print("Attempting to create the job. Please wait...")

notebook_task <- list(
  notebook_path = notebook_path,
  source = "WORKSPACE"
)

job_task <- list(
  task_key = task_key,
  description = description,
  existing_cluster_id = existing_cluster_id,
  notebook_task = notebook_task
)

response <- create_job(
  client,
  name = job_name,
  tasks = list(job_task)
)

# Get the workspace URL to be used in the following results message.
get_client_debug <- strsplit(client$debug_string(), split = "host=")
get_host <- strsplit(get_client_debug[[1]][2], split = ",")
host <- get_host[[1]][1]

# Make sure the workspace URL ends with a forward slash.
if (!endsWith(host, "/")) {
  host <- paste0(host, "/")
}

print(paste(
  "View the job at ",
  host,
  "#job/",
  response$job_id,
  sep = "")
)
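
The tasks argument in the example above is a list of nested lists, one per task. When a script defines several notebook tasks, a small constructor can keep that structure consistent. This is a sketch in base R. The helper name make_notebook_task is hypothetical, and the paths and cluster ID are placeholders.

```r
# Hypothetical helper: build one notebook task in the same shape as job_task above.
make_notebook_task <- function(task_key, notebook_path, existing_cluster_id,
                               description = "") {
  list(
    task_key = task_key,
    description = description,
    existing_cluster_id = existing_cluster_id,
    notebook_task = list(notebook_path = notebook_path, source = "WORKSPACE")
  )
}

# Placeholder paths and cluster ID.
tasks <- list(
  make_notebook_task("ingest", "/Users/someone@example.com/ingest", "1234-567890-ab123cd4"),
  make_notebook_task("report", "/Users/someone@example.com/report", "1234-567890-ab123cd4")
)
str(tasks)
```

A list built this way has the same shape as list(job_task) in the example, so it could presumably be passed as the tasks argument of create_job.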

Logging

You can use the popular logging package to log messages. This package supports multiple logging levels and custom log formats. You can use it to log messages to the console or to a file. To log messages, do the following:

Install the logging package. For example, in RStudio Desktop, in the Console view (View > Move Focus to Console), run the following commands:
R
```
install.packages("logging")
library(logging)
```
Initialize the logging package, set where to log the messages, and set the logging level. For example, the following code logs all messages at the ERROR level and at more severe levels to the results.log file.
R
```
basicConfig()
addHandler(writeToFile, file="results.log")
setLevel("ERROR")
```

Log messages as needed. For example, the following code logs any errors if the code cannot authenticate or list the names of the available clusters.

R
require(databricks)
require(logging)

basicConfig()
addHandler(writeToFile, file="results.log")
setLevel("ERROR")

tryCatch({
  client <- DatabricksClient()
}, error = function(e) {
  logerror(paste("Error initializing DatabricksClient(): ", e$message))
  return(NA)
})

tryCatch({
  list_clusters(client)[, "cluster_name"]
}, error = function(e) {
  logerror(paste("Error in list_clusters(client): ", e$message))
  return(NA)
})
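
The two tryCatch blocks above follow the same pattern, which could be factored into a reusable wrapper. The following is a sketch in base R. The helper name try_logged and its parameters are hypothetical. It reports through message() by default so that it runs without extra packages, and a function such as logerror from the logging package could be passed as log_fn instead.

```r
# Hypothetical wrapper: evaluate expr; on error, report it through log_fn and
# return `default` instead of stopping the script.
try_logged <- function(expr, label, log_fn = message, default = NULL) {
  tryCatch(
    expr,
    error = function(e) {
      log_fn(paste0("Error in ", label, ": ", conditionMessage(e)))
      default
    }
  )
}

ok <- try_logged(sqrt(16), "sqrt")                                   # 4
failed <- try_logged(stop("no credentials"), "DatabricksClient()")  # NULL, after reporting
```

Returning a default value lets the caller test the result, for example with is.null(), before continuing to the next step.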

Testing

To test your code, you can use R test frameworks such as testthat. To test your code under simulated conditions without calling Databricks REST API endpoints or changing the state of your Databricks accounts or workspaces, you can use R mocking libraries such as mockery.

For example, given the following file named helpers.R containing a createCluster function that returns information about the new cluster:

R
library(databricks)

createCluster <- function(
  databricks_client,
  cluster_name,
  spark_version,
  node_type_id,
  autotermination_minutes,
  num_workers
) {
  response <- create_cluster(
    client = databricks_client,
    cluster_name = cluster_name,
    spark_version = spark_version,
    node_type_id = node_type_id,
    autotermination_minutes = autotermination_minutes,
    num_workers = num_workers
  )
  return(response)
}

And given the following file named main.R that calls the createCluster function:

R
library(databricks)
source("helpers.R")

client <- DatabricksClient()

# Replace <spark-version> with the target Spark version string.
# Replace <node-type-id> with the target node type string.
response <- createCluster(
  databricks_client = client,
  cluster_name = "my-cluster",
  spark_version = "<spark-version>",
  node_type_id = "<node-type-id>",
  autotermination_minutes = 15,
  num_workers = 1
)

print(response$cluster_id)

The following file named test-helpers.R tests whether the createCluster function returns the expected response. Rather than creating a cluster in the target workspace, this test mocks a DatabricksClient object, defines the mocked object's settings, and then passes the mocked object to the createCluster function. The test then checks whether the function returns the new mocked cluster's expected ID.

R
# install.packages("testthat")
# install.packages("mockery")
# testthat::test_file("test-helpers.R")
lapply(c("databricks", "testthat", "mockery"), library, character.only = TRUE)
source("helpers.R")

test_that("createCluster mock returns expected results", {
  # Create a mock response.
  mock_response <- list(cluster_id = "abc123")

  # Create a mock function for create_cluster().
  mock_create_cluster <- mock(return_value = mock_response)

  # Run the test with the mock function.
  with_mock(
    create_cluster = mock_create_cluster,
    {
      # Create a mock Databricks client.
      mock_client <- mock()

      # Call the function with the mock client.
      # Replace <spark-version> with the target Spark version string.
      # Replace <node-type-id> with the target node type string.
      response <- createCluster(
        databricks_client = mock_client,
        cluster_name = "my-cluster",
        spark_version = "<spark-version>",
        node_type_id = "<node-type-id>",
        autotermination_minutes = 15,
        num_workers = 1
      )

      # Check that the function returned the correct mock response.
      expect_equal(response$cluster_id, "abc123")
    }
  )
})
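
If replacing create_cluster through with_mock is not an option in your testthat version, one alternative is to pass the cluster-creating function in as an argument and substitute a hand-written fake in tests. This is a sketch of that dependency-injection approach in base R, not the approach used by the test above. The names createClusterWith, create_fn, and fake_create_cluster are hypothetical.

```r
# Variant of createCluster that receives the cluster-creating function as an
# argument (create_fn is a hypothetical parameter, not part of the SDK).
createClusterWith <- function(create_fn, databricks_client, cluster_name, num_workers) {
  create_fn(
    client = databricks_client,
    cluster_name = cluster_name,
    num_workers = num_workers
  )
}

# A hand-written fake that records its arguments and returns a canned response.
calls <- list()
fake_create_cluster <- function(...) {
  calls[[length(calls) + 1]] <<- list(...)
  list(cluster_id = "abc123")
}

response <- createClusterWith(fake_create_cluster, "fake-client", "my-cluster", 1)
stopifnot(response$cluster_id == "abc123")
stopifnot(calls[[1]]$cluster_name == "my-cluster")
```

In non-test code, the real create_cluster function would be passed as create_fn, so no workspace call happens unless the caller opts in.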

Additional resources

For more information, see the Databricks SDK for R documentation.
