Service principals for CI/CD

This article describes how to use service principals for CI/CD with Databricks. A service principal is an identity created for use with automated tools and applications, such as CI/CD platforms.

As a security best practice, Databricks recommends that you give CI/CD platforms access to Databricks resources by using a Databricks service principal and its Databricks access token, instead of your Databricks workspace user or your Databricks personal access token. Some benefits of this approach include the following:

  • You can grant and restrict access to Databricks resources for a Databricks service principal independently of a user. For instance, this allows you to prohibit a Databricks service principal from acting as an admin in your Databricks workspace while still allowing other specific users in your workspace to continue to act as admins.

  • Users do not need to expose their own access tokens to CI/CD platforms.

  • You can temporarily disable or permanently delete a Databricks service principal without impacting other users. For instance, this allows you to pause or remove access from a Databricks service principal that you suspect is being used in a malicious way.

  • If a user leaves your organization, you can remove that user without impacting any Databricks service principal.

To give a CI/CD platform access to your Databricks workspace, do the following:

  1. Create a Databricks service principal in your workspace.

  2. Generate a Databricks access token for the Databricks service principal.

  3. Give this Databricks access token to the CI/CD platform.

To complete Steps 1 and 2, see Manage service principals.

To complete Step 3, complete the instructions in this article.

Optionally, if you also want to use your Databricks workspace with Databricks Repos in a CI/CD platform scenario, see Add Git provider credentials to a Databricks workspace. For example, you may want your Git provider to access your workspace, and you also want to use Databricks Repos in your workspace with your Git provider. However, you don’t need to use Databricks Repos in order to use your workspace with CI/CD platforms.

Requirements

  • The Databricks access token for a Databricks service principal. To create a Databricks service principal and its Databricks access token, see Manage service principals.

  • An account with your Git provider.

Set up GitHub Actions

GitHub Actions must be able to access your Databricks workspace. If you want to use Databricks Repos, your workspace must also be able to access GitHub.

To enable GitHub Actions to access your Databricks workspace, you must register the Databricks access token for your Databricks service principal with GitHub Actions.

If you also want to enable your Databricks workspace to access GitHub when you use Databricks Repos, you must add the GitHub personal access token for a GitHub machine user to your workspace.

Register the Databricks access token for your Databricks service principal with GitHub Actions

This section describes how to enable GitHub Actions to access your Databricks workspace.

As a security best practice, Databricks recommends that you do not enter a Databricks access token directly into the body of a GitHub Actions file. You should register the Databricks access token with GitHub Actions by using GitHub encrypted secrets instead.

GitHub Actions, such as the ones that Databricks lists in Continuous integration and delivery using GitHub Actions, as well as the onpush.yml and onrelease.yml files that are part of the Basic Python Template in dbx for GitHub Actions, rely on GitHub encrypted secrets such as:

  • DATABRICKS_HOST, which is the value https:// followed by your workspace instance name, for example dbc-a1b2345c-d6e7.cloud.databricks.com.

  • DATABRICKS_TOKEN, which is the token_value value that you copied after you created the Databricks access token for the Databricks service principal.

For more information about which GitHub encrypted secrets are required for a GitHub Action, see Manage service principals and the documentation for that GitHub Action.

To add these GitHub encrypted secrets to your GitHub repository, see Creating encrypted secrets for a repository in the GitHub documentation. For other approaches to add these GitHub repository secrets, see Encrypted secrets in the GitHub documentation.
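
As a minimal sketch of how a script that runs inside a CI/CD job might consume these values, the following Python code reads DATABRICKS_HOST and DATABRICKS_TOKEN from the environment and fails early if either one is missing or malformed. The function name load_databricks_settings is hypothetical, and the example assumes that your workflow already exposes both encrypted secrets to the job as environment variables with the same names.

```python
import os
from typing import Mapping, Optional, Tuple


def load_databricks_settings(
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, str]:
    """Return (host, token) read from DATABRICKS_HOST and DATABRICKS_TOKEN.

    Hypothetical helper. It assumes the CI/CD platform passes both
    secrets to the job as environment variables.
    """
    env = os.environ if env is None else env

    # Normalize the host so that later URL building does not produce "//".
    host = env.get("DATABRICKS_HOST", "").strip().rstrip("/")
    token = env.get("DATABRICKS_TOKEN", "").strip()

    if not host.startswith("https://"):
        raise ValueError(
            "DATABRICKS_HOST must be https:// followed by the workspace instance name"
        )
    if not token:
        raise ValueError("DATABRICKS_TOKEN is not set")
    return host, token
```

Validating the values before the first API call tends to produce a clearer error than a failed request, and it avoids printing the token itself in the job log.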

Add the GitHub personal access token for a GitHub machine user to your Databricks workspace

This section describes how to enable your Databricks workspace to access GitHub with Databricks Repos. This is an optional task in CI/CD scenarios.

As a security best practice, Databricks recommends that you use GitHub machine users instead of GitHub personal accounts, for many of the same reasons that you should use a Databricks service principal instead of a Databricks user. To add the GitHub personal access token for a GitHub machine user to your Databricks workspace, do the following:

  1. Create a GitHub machine user, if you do not already have one available. A GitHub machine user is a GitHub personal account, separate from your own GitHub personal account, that you can use to automate activity on GitHub.

    Note

    When you create a new separate GitHub account as a GitHub machine user, you cannot associate it with the email address for your own GitHub personal account. Instead, see your organization’s email administrator about getting a separate email address that you can associate with this new separate GitHub account as a GitHub machine user.

    See your organization’s account administrator about managing the separate email address and its associated GitHub machine user and its GitHub personal access tokens within your organization.

  2. Give the GitHub machine user access to your GitHub repository. See Inviting a team or person in the GitHub documentation. To accept the invitation, you may first need to sign out of your GitHub personal account, and then sign back in as the GitHub machine user.

  3. Sign in to GitHub as the machine user, and then create a GitHub personal access token for that machine user. See Create a personal access token in the GitHub documentation. Be sure to give the GitHub personal access token repo access.

  4. Gather the Databricks access token for your Databricks service principal, your GitHub machine username, and the GitHub personal access token for the machine user, and then follow the instructions in Add Git provider credentials to a Databricks workspace.

Set up Azure Pipelines

Azure Pipelines must be able to access your Databricks workspace. If you also want to use Databricks Repos, your workspace must be able to access Azure Pipelines.

Azure Pipelines YAML pipeline files rely on environment variables to access your Databricks workspace. These environment variables include ones such as:

  • DATABRICKS_HOST, which is the value https:// followed by your workspace instance name, for example dbc-a1b2345c-d6e7.cloud.databricks.com.

  • DATABRICKS_TOKEN, which is the token_value value that you copied after you created the Databricks access token for the Databricks service principal.

To add these environment variables to your Azure pipeline, see Use Azure Key Vault secrets in Azure Pipelines and Set secret variables in the Azure documentation.


Optional for CI/CD scenarios: If your workspace uses Databricks Repos, and you want to enable your workspace to access Azure Pipelines, gather:

  • The Databricks access token for your Databricks service principal

  • Your Azure Pipelines username

Then, Add Git provider credentials to a Databricks workspace.

Set up GitLab CI/CD

GitLab CI/CD must be able to access your Databricks workspace. If you also want to use Databricks Repos, your workspace must be able to access GitLab CI/CD.

To access your Databricks workspace, GitLab CI/CD .gitlab-ci.yml files, such as the one that is part of the Basic Python Template in dbx, rely on custom CI/CD variables such as:

  • DATABRICKS_HOST, which is the value https:// followed by your workspace instance name, for example dbc-a1b2345c-d6e7.cloud.databricks.com.

  • DATABRICKS_TOKEN, which is the token_value value that you copied after you created the Databricks access token for the Databricks service principal.

To add these custom variables to your GitLab CI/CD project, see Add a CI/CD variable to a project in the GitLab CI/CD documentation.

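Optional for CI/CD scenarios: If your workspace uses Databricks Repos, and you want to enable your workspace to access GitLab CI/CD, gather:

  • The Databricks access token for your Databricks service principal

  • Your GitLab CI/CD username

Then, Add Git provider credentials to a Databricks workspace.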

Add Git provider credentials to a Databricks workspace

This section describes how to enable your Databricks workspace to access a Git provider for Databricks Repos. This is optional in CI/CD scenarios. For example, you may only want your Git provider to access your Databricks workspace, but you do not also want to use Databricks Repos in your workspace with your Git provider. If so, then skip this section.

Before you begin, gather the following information and tools:

  • The Databricks access token for your Databricks service principal.

  • The username associated with your Git provider.

  • The access token associated with the user for your Git provider.

  • A tool such as curl or Postman to call the create a Git credential entry operation in the Git Credentials API. You cannot use the Databricks user interface.

In the following instructions, use curl or Postman, replacing:

  • <service-principal-access-token> with the Databricks access token for your Databricks service principal. (Do not use the Databricks personal access token for your workspace user.)

    Tip

    To confirm that you are using the correct token, you can first use the Databricks access token for your Databricks service principal to call the SCIM API 2.0 (Me) for workspaces, and then review the output of the call.

  • <git-provider-access-token> with the access token associated with the user for your Git provider.

  • <git-provider-user-name> with the username associated with your Git provider.

  • <git-provider-short-name> with the short name associated with your Git provider:

    • For GitHub, use GitHub.

    • For Azure Pipelines, use AzureDevOpsServices.

    • For GitLab CI/CD, use GitLab.
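
The token check described in the tip above could be sketched in Python as follows. This is an illustration only: the function names are hypothetical, and the endpoint path /api/2.0/preview/scim/v2/Me is an assumption based on the SCIM API 2.0 (Me) reference, so confirm it against the API documentation for your workspace.

```python
import json
import urllib.request


def build_me_request(host: str, token: str) -> urllib.request.Request:
    """Build a GET request for the SCIM Me endpoint of a workspace.

    host is the workspace URL, for example the value of DATABRICKS_HOST.
    The endpoint path is an assumption; see the lead-in text.
    """
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/preview/scim/v2/Me",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def whoami(host: str, token: str) -> dict:
    """Send the request and return the parsed JSON identity document."""
    with urllib.request.urlopen(build_me_request(host, token)) as response:
        return json.load(response)
```

If the token belongs to your Databricks service principal, the identity that whoami returns should describe the service principal rather than your workspace user.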

Run the following command. Make sure the set-git-credentials.json file is in the same directory where you run this command. This command uses the environment variable DATABRICKS_HOST, representing your Databricks workspace instance URL, for example https://dbc-a1b2345c-d6e7.cloud.databricks.com.

curl -X POST \
${DATABRICKS_HOST}/api/2.0/git-credentials \
--header 'Authorization: Bearer <service-principal-access-token>' \
--data @set-git-credentials.json \
| jq .

set-git-credentials.json:

{
   "personal_access_token": "<git-provider-access-token>",
   "git_username": "<git-provider-user-name>",
   "git_provider": "<git-provider-short-name>"
}

To use Postman instead of curl, do the following:

  1. Create a new HTTP request (File > New > HTTP Request).

  2. In the HTTP verb drop-down list, select POST.

  3. For Enter request URL, enter https://<databricks-instance-name>/api/2.0/git-credentials, where <databricks-instance-name> is your Databricks workspace instance name, for example dbc-a1b2345c-d6e7.cloud.databricks.com.

  4. On the Authorization tab, in the Type list, select Bearer Token.

  5. For Token, enter the Databricks access token for the Databricks service principal (the <service-principal-access-token>).

  6. On the Headers tab, add the Key and Value pair of Content-Type and application/json.

  7. On the Body tab, select raw and JSON.

  8. Enter the following body payload:

    {
       "personal_access_token": "<git-provider-access-token>",
       "git_username": "<git-provider-user-name>",
       "git_provider": "<git-provider-short-name>"
    }
    
  9. Click Send.
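
If you prefer to send the same request from a script, the following Python code is a rough equivalent that uses only the standard library. The function names and the GIT_PROVIDER_SHORT_NAMES mapping are hypothetical, the Content-Type header value is an assumption, and the payload fields mirror the set-git-credentials.json file shown earlier.

```python
import json
import urllib.request

# Short names as listed in this article, keyed by CI/CD platform.
GIT_PROVIDER_SHORT_NAMES = {
    "GitHub": "GitHub",
    "Azure Pipelines": "AzureDevOpsServices",
    "GitLab CI/CD": "GitLab",
}


def build_git_credential_request(
    host: str,
    sp_token: str,
    git_username: str,
    git_token: str,
    platform: str,
) -> urllib.request.Request:
    """Build the POST request that creates a Git credential entry.

    sp_token must be the access token of the Databricks service
    principal, not the personal access token of a workspace user.
    """
    payload = {
        "personal_access_token": git_token,
        "git_username": git_username,
        "git_provider": GIT_PROVIDER_SHORT_NAMES[platform],
    }
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/git-credentials",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {sp_token}",
            # Assumption: the endpoint accepts a plain JSON body.
            "Content-Type": "application/json",
        },
        method="POST",
    )


def create_git_credential(request: urllib.request.Request) -> dict:
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping the request construction separate from the network call lets you inspect the URL, headers, and body before anything is sent, which can help when you debug a CI/CD job.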

Tip

To confirm that the call was successful, you can use the Databricks access token for your Databricks service principal to call the get Git credentials operation in the Git Credentials API, and then review the output.
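
A sketch of that confirmation step in Python might look like the following. The function names are hypothetical, and the response shape (a top-level credentials list whose entries carry a git_username field) is an assumption that you should check against the Git Credentials API reference.

```python
import json
import urllib.request


def build_list_git_credentials_request(
    host: str, sp_token: str
) -> urllib.request.Request:
    """Build the GET request that lists Git credential entries."""
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/git-credentials",
        headers={"Authorization": f"Bearer {sp_token}"},
        method="GET",
    )


def credentials_for_user(body: dict, git_username: str) -> list:
    """Return the entries in the response body that match git_username.

    Assumes the body looks like {"credentials": [{...}, ...]}.
    """
    return [
        entry
        for entry in body.get("credentials", [])
        if entry.get("git_username") == git_username
    ]


def list_git_credentials(host: str, sp_token: str) -> dict:
    """Send the request and return the parsed JSON response."""
    request = build_list_git_credentials_request(host, sp_token)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

An empty result from credentials_for_user suggests that the credential was not stored for the service principal, for example because the create call was made with a different token.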