This article provides an overview of security controls and configurations for deployment and management of Databricks accounts and workspaces. For information about securing your data, see Data governance best practices.
Not all security features are available on all pricing tiers. See the Databricks AWS pricing page to learn how features align to pricing plans.
This article focuses on the most recent (E2) version of the Databricks platform. Some of the features described here may not be supported on legacy deployments that have not migrated to the E2 platform.
In Databricks, a workspace is a Databricks deployment in the cloud that functions as the unified environment in which a specified set of users access all of their Databricks assets. Your organization can choose to have multiple workspaces or just one, depending on your needs.
A Databricks account represents a single entity for purposes of billing and support. An account can include multiple workspaces.
Account admins handle general account management and workspace admins manage the settings and features of individual workspaces in the account. To learn more about Databricks admins, see Databricks administration guide. Admins can deploy workspaces with security configurations including:
An AWS Virtual Private Cloud (VPC) lets you provision a logically isolated section of the AWS Cloud where you can launch AWS resources in a virtual network. The VPC is the network location for your Databricks clusters. By default, Databricks creates and manages a VPC for the Databricks workspace.
You can instead provide your own VPC to host your Databricks clusters, enabling you to maintain more control of your own AWS account and limit outgoing connections. To take advantage of a customer-managed VPC, you must specify a VPC when you first create the Databricks workspace. You can share VPCs across workspaces, but you cannot share subnets across workspaces. For more information, see Customer-managed VPC.
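As a sketch, a customer-managed VPC is registered with the Account API as a network configuration and then referenced when the workspace is created. The helper below builds the request body; the endpoint path and field names follow the Account API's networks endpoint but should be checked against the current API reference, and all IDs shown are hypothetical.

```python
def network_config_payload(network_name, vpc_id, subnet_ids, security_group_ids):
    """Request body for registering a customer-managed VPC with the
    Databricks Account API (POST /api/2.0/accounts/{account_id}/networks).

    The VPC, subnets, and security groups must already exist in your AWS
    account. Subnets cannot be shared across workspaces, so pass a distinct
    set of subnet IDs for each workspace you plan to deploy.
    """
    return {
        "network_name": network_name,
        "vpc_id": vpc_id,
        "subnet_ids": subnet_ids,
        "security_group_ids": security_group_ids,
    }

# Hypothetical usage -- register the configuration, then reference the
# returned network_id in the workspace-creation request:
#
#   import requests
#   resp = requests.post(
#       f"https://accounts.cloud.databricks.com/api/2.0/accounts/{account_id}/networks",
#       auth=(account_admin_user, account_admin_password),
#       json=network_config_payload(
#           "prod-vpc-config",
#           "vpc-0a1b2c3d",
#           ["subnet-aaaa", "subnet-bbbb"],
#           ["sg-cccc"],
#       ),
#   )
```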
Databricks supports adding a customer-managed key to help protect and control access to data. There are three customer-managed key features for different types of data:
Customer-managed keys for managed services: Managed services data in the Databricks control plane is encrypted at rest. You can add a customer-managed key for managed services to help protect and control access to the following types of encrypted data:
Notebook source files that are stored in the control plane.
Notebook results for notebooks that are stored in the control plane.
Secrets stored by the secret manager APIs.
Databricks SQL queries and query history.
Personal access tokens or other credentials used to set up Git integration with Databricks Repos.
For more information, see Customer-managed keys for managed services.
Customer-managed keys for workspace storage: You can configure your own key to encrypt the data on the Amazon S3 bucket in your AWS account that you specified when you created your workspace. You can optionally use the same key to encrypt your cluster’s EBS volumes. For more information, see Customer-managed keys for workspace storage.
For more details on which customer-managed key features in Databricks protect different kinds of data, see Customer-managed keys for encryption.
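As an illustrative sketch, a customer-managed key is registered once at the account level and scoped to one or more use cases. The field names below follow the Account API's customer-managed-keys endpoint, but treat the endpoint path, field names, and use-case values as assumptions to verify against the current reference; the key ARN shown is hypothetical.

```python
def cmk_config_payload(key_arn, key_alias, use_cases):
    """Request body for registering a customer-managed AWS KMS key with the
    Account API (POST /api/2.0/accounts/{account_id}/customer-managed-keys).

    use_cases selects which data the key protects: "MANAGED_SERVICES" for
    control-plane data (notebook sources and results, secrets, queries) and
    "STORAGE" for the workspace root S3 bucket and, optionally, cluster EBS
    volumes.
    """
    allowed = {"MANAGED_SERVICES", "STORAGE"}
    if not set(use_cases) <= allowed:
        raise ValueError(f"use_cases must be a subset of {allowed}")
    return {
        "aws_key_info": {"key_arn": key_arn, "key_alias": key_alias},
        "use_cases": list(use_cases),
    }

# Hypothetical usage: one key for both managed services and workspace storage.
# payload = cmk_config_payload(
#     "arn:aws:kms:us-west-2:111122223333:key/mrk-example",
#     "alias/databricks-cmk",
#     ["MANAGED_SERVICES", "STORAGE"],
# )
```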
Users, groups, and service principals are configured in the Databricks account and workspaces by administrators. For information on how to securely configure identity in Databricks, see Identity best practices.
For REST API authentication, you can use built-in revocable Databricks personal access tokens. You can create personal access tokens in the web application user interface or using the Tokens API.
Workspace admins can use the Token Management API to review current Databricks personal access tokens, delete tokens, and set the maximum lifetime of new tokens for their workspace. You can use the related Permissions API to control which users can create and use tokens to access workspace REST APIs.
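The token workflow above can be sketched as follows. The request bodies reflect the Tokens API (POST /api/2.0/token/create) and the workspace configuration setting that caps new-token lifetimes; the setting name maxTokenLifetimeDays is an assumption to verify against the Token Management API documentation.

```python
def create_token_payload(comment, lifetime_seconds=86400):
    """Body for POST /api/2.0/token/create. The response includes
    token_value (shown only once -- store it securely) and token_info."""
    return {"comment": comment, "lifetime_seconds": lifetime_seconds}

def max_token_lifetime_setting(days):
    """Body for PATCH /api/2.0/workspace-conf, which workspace admins can
    use to cap the lifetime of newly created tokens (values are strings)."""
    return {"maxTokenLifetimeDays": str(days)}

# Hypothetical usage with the requests library and an existing admin token:
#   import requests
#   headers = {"Authorization": f"Bearer {admin_token}"}
#   requests.post(f"{host}/api/2.0/token/create", headers=headers,
#                 json=create_token_payload("ci-pipeline", 7 * 86400))
#   requests.patch(f"{host}/api/2.0/workspace-conf", headers=headers,
#                  json=max_token_lifetime_setting(90))
```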
While Databricks strongly recommends using tokens, Databricks users on AWS can also access REST APIs with their Databricks username and password (native authentication). You can grant and revoke specific users' ability to use native authentication through password access control.
Authentication proves user identity, but it does not enforce the network location of the users. Accessing a cloud service from an unsecured network poses security risks, especially when the user may have authorized access to sensitive or personal data. With IP access lists, you can configure Databricks workspaces so that users connect to the service only through existing networks with a secure perimeter.
Workspace admins can specify the IP addresses (or CIDR ranges) on the public network that are allowed access. These IP addresses could belong to egress gateways or specific user environments. You can also specify IP addresses or subnets to block. For details, see IP access lists.
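An allow list can be created with the IP Access Lists API. The sketch below builds the request body for POST /api/2.0/ip-access-lists; note that the feature must first be enabled for the workspace (via the enableIpAccessLists workspace setting, per the API documentation), and the CIDR range shown is hypothetical.

```python
def ip_access_list_payload(label, list_type, ip_addresses):
    """Body for POST /api/2.0/ip-access-lists.

    list_type is "ALLOW" (permit only these addresses/ranges) or "BLOCK"
    (deny these even if they fall inside an allow list). Entries may be
    single IP addresses or CIDR ranges.
    """
    if list_type not in ("ALLOW", "BLOCK"):
        raise ValueError("list_type must be 'ALLOW' or 'BLOCK'")
    return {"label": label, "list_type": list_type, "ip_addresses": ip_addresses}

# Hypothetical usage: restrict access to a corporate egress gateway.
#   import requests
#   requests.post(f"{host}/api/2.0/ip-access-lists",
#                 headers={"Authorization": f"Bearer {admin_token}"},
#                 json=ip_access_list_payload(
#                     "corp-vpn", "ALLOW", ["203.0.113.0/24"]))
```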
You can also use PrivateLink to block all public internet access to a Databricks workspace.
Databricks provides access to audit logs of activities performed by Databricks users, allowing you to monitor detailed usage patterns. You can configure two types of audit and usage logging:
You can use cluster policies to enforce particular cluster settings, such as instance types, number of nodes, attached libraries, and compute cost, and display different cluster-creation interfaces for different user levels. Managing cluster configurations using policies can help enforce universal governance controls and manage the costs of your compute infrastructure. For more information, see Manage cluster policies.
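As an illustration, the policy below allowlists instance types and caps autoscaling and idle time; it would be posted to the Cluster Policies API (POST /api/2.0/policies/clusters/create), where the definition field is a JSON string of policy elements. The specific element names and instance types are assumptions to check against the cluster policy reference.

```python
import json

def cluster_policy_payload(name, definition):
    """Body for POST /api/2.0/policies/clusters/create. The API expects
    the policy definition serialized as a JSON string."""
    return {"name": name, "definition": json.dumps(definition)}

# Hypothetical policy: pin allowed node types and cap cluster size and cost.
cost_policy = cluster_policy_payload(
    "small-jobs-only",
    {
        "node_type_id": {"type": "allowlist",
                         "values": ["i3.xlarge", "i3.2xlarge"]},
        "autoscale.max_workers": {"type": "range", "maxValue": 10},
        "autotermination_minutes": {"type": "fixed", "value": 30,
                                    "hidden": True},
    },
)
```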
In Databricks, you can use access control lists (ACLs) to configure permission to access objects, such as: notebooks, experiments, models, clusters, jobs, dashboards, queries, and SQL warehouses. All admin users can manage access control lists, as can users who have been given delegated permissions to manage access control lists. See Access control.
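For example, object permissions can be managed programmatically through the Permissions API. The sketch below builds the access_control_list body for PUT or PATCH on /api/2.0/permissions/{object_type}/{object_id}; the group name and permission level are hypothetical, and valid levels vary by object type (check the Permissions API reference).

```python
def permissions_payload(grants):
    """Body for PUT (replace all) or PATCH (update) requests to
    /api/2.0/permissions/{object_type}/{object_id}.

    grants is a list of (principal_field, principal, permission_level)
    tuples, e.g. ("group_name", "data-scientists", "CAN_RESTART").
    """
    return {
        "access_control_list": [
            {field: principal, "permission_level": level}
            for field, principal, level in grants
        ]
    }

# Hypothetical usage: let a group restart (but not reconfigure) a cluster.
#   import requests
#   requests.patch(f"{host}/api/2.0/permissions/clusters/{cluster_id}",
#                  headers={"Authorization": f"Bearer {admin_token}"},
#                  json=permissions_payload(
#                      [("group_name", "data-scientists", "CAN_RESTART")]))
```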
For information about managing access to your organization’s data, see Data governance guide.
You can use Databricks secrets to store credentials and reference them in notebooks and jobs. A secret is a key-value pair that stores secret material for an external data source or other calculation, with a key name unique within a secret scope. You should never hard code secrets or store them in plain text.
You create secrets using either the REST API or CLI, but you must use the Secrets utility (dbutils.secrets) in a notebook or job to read your secrets.
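A minimal sketch of that workflow, using the Secrets API request bodies (the scope and key names are hypothetical):

```python
def create_scope_payload(scope):
    """Body for POST /api/2.0/secrets/scopes/create."""
    return {"scope": scope}

def put_secret_payload(scope, key, value):
    """Body for POST /api/2.0/secrets/put. Writing the same key again
    overwrites the stored value."""
    return {"scope": scope, "key": key, "string_value": value}

# Hypothetical usage:
#   import requests
#   h = {"Authorization": f"Bearer {admin_token}"}
#   requests.post(f"{host}/api/2.0/secrets/scopes/create",
#                 headers=h, json=create_scope_payload("jdbc"))
#   requests.post(f"{host}/api/2.0/secrets/put", headers=h,
#                 json=put_secret_payload("jdbc", "password", "s3cr3t"))
#
# Reading the secret back is only possible inside a notebook or job:
#   password = dbutils.secrets.get(scope="jdbc", key="password")
```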
For information on how to use Databricks secrets, see Secret management.
You can automate some of your security configuration tasks through the Databricks REST APIs using Terraform or AWS Quick Start (CloudFormation) templates. These templates can configure and deploy new workspaces as well as update administrative configurations for existing workspaces. Particularly for large companies with dozens of workspaces, templates enable fast, consistent, automated configuration.
Here are some resources to help you build a comprehensive security solution that meets your organization’s needs:
The Databricks Security and Trust Center, which provides information about the ways in which security is built into every layer of the Databricks Lakehouse Platform.
Security Best Practices, which provides a checklist of security practices, considerations and patterns that you can apply to your deployment, learned from our enterprise engagements.
Data governance best practices to implement data governance controls for your organization.