This article provides an overview of the most important security-related controls and configurations for deployment of the E2 version of the Databricks Unified Data Analytics Platform.
This article illustrates some scenarios using example companies to compare how small and large organizations might handle deployment differently. There are references to the fictional large corporation LargeCorp and the fictional small company SmallCorp. Use these examples as general guides, but every company is different. If you have questions, contact your Databricks representative.
For detailed information about specific security features, see Databricks security guide. Your Databricks representative can also provide you with additional security and compliance documentation for the E2 platform.
This article discusses features that are not available on all pricing plans, deployment types, and regions. Some features are in Public Preview. For questions about availability, contact your Databricks representative.
Talk to your Databricks representative about the features you want. They will help you choose a pricing plan (pricing tier) and deployment type (E2 or custom). Not all features are available on all tiers, deployment types, and regions. See the Databricks AWS pricing page to learn how features align to pricing plans.
This article assumes that your account is on the E2 version of the platform. The following security-related features are available only in E2 deployments:
- Customer-managed VPC: Deploy a Databricks workspace in a VPC in your AWS account. Requires the Premium plan.
- Secure cluster connectivity: VPCs have no open ports and Databricks Runtime cluster nodes have no public IP addresses. With the E2 version of the platform, secure cluster connectivity is enabled by default. Requires the Premium plan.
- Customer-managed keys for managed services: Encrypt notebook and secret data using an AWS KMS key that you manage. This feature is available in Public Preview and requires the Enterprise plan.
Most of the other security-related features discussed in this article are available on the Premium plan. However, some require the Enterprise plan:
- IP access lists: Enforce network location of workspace users.
Single sign-on (SSO) is available on all plans.
A Databricks workspace is an environment for accessing your Databricks assets. The workspace organizes your objects (notebooks, libraries, and experiments) into folders. Your workspace provides access to data and computational resources such as clusters and jobs.
Determine how many workspaces your organization will need, which teams need to collaborate, and your requirements for geographic regions.
A small organization such as our example SmallCorp might only need one or a small number of workspaces. This might also be true of a single division of a larger company that is relatively self-contained. Workspace administrators could be regular users of the workspace. In some cases, a separate department (IT/OpSec) might take on the role of workspace administrator to deploy according to enterprise governance policies and manage permissions, users, and groups.
A large organization such as our example LargeCorp typically requires many workspaces. LargeCorp already has a centralized group (IT/OpSec) that handles all security and administrative functions. That centralized group typically sets up new workspaces and enforces security controls across the company.
Common reasons a large corporation might create separate workspaces:
- Teams handle different levels of confidential information, possibly including personally identifying information. By separating workspaces, teams keep different levels of confidential assets separate without additional complexity such as access control lists. For example, the LargeCorp finance team can easily store its finance-related notebooks separate from workspaces used by other departments.
- Billing is simpler when Databricks usage (DBUs) and cloud compute must be charged back to different budgets.
- Geographic region variations of teams or data sources. Teams in one region might prefer cloud resources based in a different region for cost, network latency, or legal compliance. Each workspace can be defined in a different supported region.
Workspace data plane VPCs can be in supported AWS regions such as us-west-2. However, you cannot use a VPC in us-west-1 if you want to use customer-managed keys to encrypt managed services or workspace storage.
Although workspaces are a common approach to segregate access to resources by team, project, or geography, there are other options. Workspace administrators can use access control lists (ACLs) within a workspace to limit access to resources such as notebooks, folders, jobs, and more, based on user and group memberships. Another option for controlling differential access to data sources in a single workspace is credential passthrough.
An AWS Virtual Private Cloud (VPC) lets you provision a logically isolated section of the AWS Cloud where you can launch AWS resources in a virtual network. The VPC is the network location for your Databricks clusters. By default, Databricks creates and manages a VPC for the Databricks workspace.
Using the Customer-managed VPC feature, you can provide your own customer-managed VPC that hosts clusters for your workspace. You control your IP address range, subnets, route tables, and NAT gateways.
SmallCorp might have only a single customer-managed VPC for Databricks workspaces. If there are two or three workspaces, they may or may not share that VPC, depending on the network architecture and regions. Optionally, SmallCorp can choose to use a VPC that Databricks creates and manages. In that case Databricks creates the VPC automatically, but the cross-account IAM role that SmallCorp provides for Databricks to use needs more permissions.
If yours is a larger organization, you might want to create and specify a customer-managed VPC for Databricks to use. You could have multiple workspaces share a VPC to simplify AWS resource allocation. Or if your organization has VPCs in different AWS accounts and different regions, you might allocate workspaces and VPCs differently.
One reason a bigger company like LargeCorp might group multiple workspaces in a single VPC is to centralize configuration of similar egress rules by VPC. For example, perhaps five departments work on similar data sources so they share the same VPC. However, data analysts in the finance team might need special network access to a special internal database, and it is critical to reduce risk of unauthorized access. By grouping and isolating teams in separate VPCs (which may be in different AWS accounts), data sources can be allowed or blocked at a network level by the appropriate NAT gateway or firewall appliance.
When you design your network architecture and decide whether to share VPCs across workspaces, consider how you want to lock down egress network connections from cluster nodes to individual data sources.
For accounts on the E2 version of the platform, you can use either the account console or the Account API 2.0 to create new workspaces. Alternatively, you can provision workspaces with automation templates using Terraform or AWS Quick Start (CloudFormation).
The general approach for creating new workspaces is:
- Create an IAM cross-account role (delegation credential) that lets Databricks perform relevant tasks with the workspace. Next, use either the account console or the Account API to create a credential configuration that encapsulates the IDs for your new role.
- Create an S3 bucket to store some workspace data such as libraries, logs, and notebook revision history. Do not use this root storage for production customer data. Next, use either the account console or the Account API to create a storage configuration that encapsulates the S3 bucket name.
- Optionally provide a customer-managed VPC. If you have not created a VPC for your workspace yet, do that now but carefully read the VPC requirements before proceeding. Use either the account console or the Account API to create a network configuration that encapsulates IDs for your VPC, subnets, and security groups.
- Optionally provide Customer-managed keys for managed services and Customer-managed keys for workspace storage. This step is available only through the Account API (not the account console). Use the Account API to create a key configuration that contains the ID of your AWS KMS keys.
- Use the account console or the Account API to create a new workspace that references your configuration objects.
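The steps above can be sketched as a series of Account API request bodies. This is a minimal sketch, not a definitive implementation: the base URL, endpoint paths, and JSON field names reflect one reading of the Account API 2.0 and should be verified against the API reference, and every name, ARN, and ID is a hypothetical placeholder. The optional key configuration step is omitted. The code only builds requests; it does not send them.

```python
# Assumed Account API 2.0 base path; verify against the API reference.
ACCOUNT_API = "https://accounts.cloud.databricks.com/api/2.0/accounts/{account_id}"


def credential_config(name, role_arn):
    """Step 1: wrap the cross-account IAM role in a credential configuration."""
    return "/credentials", {
        "credentials_name": name,
        "aws_credentials": {"sts_role": {"role_arn": role_arn}},
    }


def storage_config(name, bucket_name):
    """Step 2: wrap the root S3 bucket in a storage configuration."""
    return "/storage-configurations", {
        "storage_configuration_name": name,
        "root_bucket_info": {"bucket_name": bucket_name},
    }


def network_config(name, vpc_id, subnet_ids, security_group_ids):
    """Step 3 (optional): describe a customer-managed VPC."""
    return "/networks", {
        "network_name": name,
        "vpc_id": vpc_id,
        "subnet_ids": list(subnet_ids),
        "security_group_ids": list(security_group_ids),
    }


def workspace(name, region, credentials_id, storage_configuration_id, network_id=None):
    """Final step: reference the configuration objects created earlier."""
    body = {
        "workspace_name": name,
        "aws_region": region,
        "credentials_id": credentials_id,
        "storage_configuration_id": storage_configuration_id,
    }
    if network_id is not None:  # omitted when Databricks manages the VPC
        body["network_id"] = network_id
    return "/workspaces", body


def full_url(account_id, path):
    """Join the account-scoped base URL with an endpoint path."""
    return ACCOUNT_API.format(account_id=account_id) + path
```

Each function returns a `(path, body)` pair that a caller could send as an authenticated POST. The IDs returned by the first three calls would be passed to `workspace`.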
You can create multiple workspaces for your account by repeating the procedure. Some resources but not all can be shared across workspaces:
- You can reuse a credential configuration across workspaces.
- You can reuse a storage configuration across workspaces.
- You can reuse a VPC, but you cannot reuse the subnets in additional workspaces. Because a network configuration encapsulates the VPC’s subnet IDs, you cannot reuse a network configuration for another workspace.
- You can share a customer-managed key across workspaces.
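Because subnets cannot be shared, workspaces that share a VPC each need their own non-overlapping address ranges. The following minimal sketch uses Python's standard `ipaddress` module to carve a VPC CIDR block into per-workspace subnets. The CIDR block, the `/20` prefix, the choice of two subnets per workspace, and the workspace names are illustrative assumptions only; check the VPC requirements for the actual constraints.

```python
import ipaddress
from itertools import islice


def plan_subnets(vpc_cidr, workspaces, new_prefix=20, subnets_per_workspace=2):
    """Assign each workspace its own non-overlapping subnets from one VPC range."""
    vpc = ipaddress.ip_network(vpc_cidr)
    available = vpc.subnets(new_prefix=new_prefix)  # lazy generator of subnets
    plan = {}
    for name in workspaces:
        chunk = list(islice(available, subnets_per_workspace))
        if len(chunk) < subnets_per_workspace:
            raise ValueError(f"VPC {vpc_cidr} has no room left for workspace {name!r}")
        plan[name] = chunk
    return plan


# Hypothetical example: two workspaces sharing one VPC.
plan = plan_subnets("10.0.0.0/16", ["finance", "hr"])
for name, subnets in plan.items():
    print(name, [str(s) for s in subnets])
```

Each workspace then gets its own network configuration that lists only its own subnet IDs.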
For complete details:
- For account console, see Create and manage workspaces using the account console.
- For Account API, see Create a new workspace using the Account API.
By default, users authenticate with Databricks native authentication, which means that local user accounts and credentials are managed within the Databricks control plane.
Most security-conscious organizations implement single sign-on (SSO) using SAML 2.0. If your SAML 2.0 Identity Provider (IdP) supports multi-factor authentication (MFA), it works with Databricks but the IdP is responsible for the implementation. Databricks does not have access to the user’s SSO credentials. See Set up single sign-on.
If you enable the SAML configuration feature called Allow Auto User Creation, local Databricks accounts for users are provisioned as needed during SSO login. This is sometimes called Just-in-time (JIT) provisioning.
Databricks supports System for Cross-domain Identity Management (SCIM). Most IdPs have built-in support for SCIM synchronization of users and groups, which enables provisioning and deprovisioning of Databricks accounts. Although you can use user names to manage access control lists (ACLs) for Databricks resources such as notebooks, SCIM synchronization of groups makes it even easier to manage Databricks ACLs. See Access control lists (ACLs). SCIM is available in Public Preview.
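To make the shape of SCIM provisioning concrete, the following minimal sketch builds the kind of user and group resources an IdP would send. The schema URNs come from the SCIM 2.0 core schema (RFC 7643). The user name, group name, and IDs are hypothetical, and the workspace endpoint that receives these resources is not shown; consult the SCIM API documentation for it. In practice the IdP connector normally generates these requests for you.

```python
# Schema URNs defined by the SCIM 2.0 core schema (RFC 7643).
SCIM_USER_SCHEMA = "urn:ietf:params:scim:schemas:core:2.0:User"
SCIM_GROUP_SCHEMA = "urn:ietf:params:scim:schemas:core:2.0:Group"


def scim_user(user_name, group_ids=()):
    """Build a SCIM user resource, optionally listing group memberships by ID."""
    resource = {"schemas": [SCIM_USER_SCHEMA], "userName": user_name}
    if group_ids:
        resource["groups"] = [{"value": group_id} for group_id in group_ids]
    return resource


def scim_group(display_name, member_ids=()):
    """Build a SCIM group resource whose members are referenced by ID."""
    return {
        "schemas": [SCIM_GROUP_SCHEMA],
        "displayName": display_name,
        "members": [{"value": member_id} for member_id in member_ids],
    }


# Hypothetical example: a finance group with one member.
group = scim_group("finance-data-team", member_ids=["1001"])
user = scim_user("analyst@example.com")
```

Granting ACLs to the group rather than to individual users means that deprovisioning a user in the IdP removes their access without touching any ACL.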
When SSO is enabled, the default behavior is that only admin users with local passwords can log in to the web application. Using a locally-stored password for login is known as native authentication. For REST APIs, the default behavior is that all users with local (native authentication) passwords can authenticate. Administrators can use the Admin Console or the Permissions API 2.0 to set password permissions to limit which users can connect with native authentication (to the web application or REST API) when SSO is enabled.
For REST API authentication, use built-in revocable Databricks personal access tokens. Users create personal access tokens in the web application user interface.
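As an illustration of token-based authentication, the following minimal sketch builds a REST request that carries a personal access token as a bearer token. The workspace URL and token value are hypothetical placeholders, and the clusters list endpoint is used only as an example target. The request is constructed but not sent.

```python
import urllib.request


def build_request(workspace_url, path, token):
    """Build a GET request that authenticates with a personal access token."""
    url = workspace_url.rstrip("/") + path
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


# Hypothetical workspace URL and token. In real code, read the token from a
# secure location such as an environment variable, never from source code.
request = build_request(
    "https://example.cloud.databricks.com",
    "/api/2.0/clusters/list",
    "dapi-example-token",
)
# To send it: urllib.request.urlopen(request)
```

Because the token is revocable, a leaked token can be deleted without changing the user's password or SSO credentials.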
There is a Token Management API that you can use to review current Databricks personal access tokens, delete tokens, and set the maximum lifetime of new tokens. Use the related Permissions API to set token permissions that define which users can create and use tokens to access workspace REST APIs.
Use token permissions to enforce the principle of least privilege so that any individual user or group has access to REST APIs only if they have a legitimate need.
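A token review can be sketched as a filter over the token records that the Token Management API returns. This is a minimal sketch under stated assumptions: the record fields (`token_id`, `creation_time`, and `expiry_time` in epoch milliseconds, with a negative `expiry_time` meaning the token never expires) are assumed here and should be checked against the API reference, and the sample records are invented.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def flag_tokens(token_infos, max_lifetime_days):
    """Return (token_id, reason) pairs for tokens that violate a lifetime policy."""
    flagged = []
    for info in token_infos:
        expiry = info.get("expiry_time", -1)
        if expiry is None or expiry < 0:
            # Assumption: a missing or negative expiry means the token never expires.
            flagged.append((info["token_id"], "no expiry"))
            continue
        lifetime_days = (expiry - info["creation_time"]) / MS_PER_DAY
        if lifetime_days > max_lifetime_days:
            flagged.append((info["token_id"], "lifetime exceeds limit"))
    return flagged


# Invented sample records in the assumed shape.
sample = [
    {"token_id": "a", "creation_time": 0, "expiry_time": 30 * MS_PER_DAY},
    {"token_id": "b", "creation_time": 0, "expiry_time": -1},
    {"token_id": "c", "creation_time": 0, "expiry_time": 365 * MS_PER_DAY},
]
print(flag_tokens(sample, max_lifetime_days=90))
```

Flagged tokens are candidates for deletion through the Token Management API, after which the owner can create a replacement that complies with the maximum lifetime.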
For workspaces created after the release of Databricks platform version 3.28 (Sept 9-15, 2020), by default only admin users have the ability to generate personal access tokens. Admins must explicitly grant those permissions, whether to the entire users group or on a user-by-user or group-by-group basis. Workspaces created before 3.28 retain the permissions that were in place before this change, which had different defaults. If you are not sure when a workspace was created, review the token permissions for the workspace.
For a complete list of APIs and admin console tools for tokens, see Manage personal access tokens.
While Databricks strongly recommends using tokens, Databricks users on AWS can also access REST APIs using their Databricks username and password (native authentication). You can disable native authentication to REST APIs using password access control and assign access to individual users or groups. You can configure this in both the Admin Console and the Permissions REST API.
Authentication proves user identity, but it does not enforce the network location of the users. Accessing a cloud service from an unsecured network poses security risks, especially when the user may have authorized access to sensitive or personal data. Enterprise network perimeters (for example, firewalls, proxies, DLP, and logging) apply security policies and limit access to external services, so access beyond these controls is assumed to be untrusted.
For example, if an employee walks from the office to a coffee shop, the company can block connections to the Databricks workspace even if the employee has correct credentials to access the web application and the REST API.
Specify the IP addresses (or CIDR ranges) on the public network that are allowed access. These IP addresses could belong to egress gateways or specific user environments. You can also specify IP addresses or subnets to block, even if they are included in the allow list. For example, an allowed IP address range might include a smaller range of infrastructure IP addresses that in practice are outside the actual secure network perimeter.
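The interaction between allow and block lists can be sketched with Python's standard `ipaddress` module. This is a minimal sketch of the logic described above, in which a block entry wins over an allow entry; it is not the service's implementation. It assumes at least one allow list is defined, and all addresses are documentation-range examples.

```python
import ipaddress


def _matches(ip, cidrs):
    """True if the address falls inside any of the given IPs or CIDR ranges."""
    return any(ip in ipaddress.ip_network(cidr, strict=False) for cidr in cidrs)


def is_allowed(client_ip, allow, block):
    """Allow a client only if it is in an allow range and in no block range."""
    ip = ipaddress.ip_address(client_ip)
    if _matches(ip, block):
        return False  # block entries take precedence over allow entries
    return _matches(ip, allow)


# Hypothetical perimeter: the office range is allowed, except for an
# infrastructure sub-range that sits outside the secure perimeter.
allow = ["203.0.113.0/24"]
block = ["203.0.113.128/25"]

print(is_allowed("203.0.113.10", allow, block))   # office user
print(is_allowed("203.0.113.200", allow, block))  # blocked sub-range
print(is_allowed("198.51.100.7", allow, block))   # coffee shop
```

Keeping the block list narrow and the allow list tied to egress gateways makes the policy easier to audit than a long list of individual addresses.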
For details, see IP access lists.
Databricks strongly recommends that you configure audit and usage logging to monitor the activities performed and usage incurred by your Databricks users:
- Billable usage log delivery: Automated delivery of usage logs to an AWS S3 bucket. See Deliver and access billable usage logs.
- Audit log delivery: Automated delivery of audit logs to an AWS S3 bucket. See Configure audit logging.
You must contact your Databricks representative to enable audit logs for your new workspace. After they are enabled, try examining your audit logs. Billable usage logs are enabled by default and viewable on the account console, but you must configure log delivery to an AWS S3 bucket to get the log files themselves.
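A log delivery configuration can be sketched as a request body that ties a log type to an existing credential configuration and storage configuration. This is a minimal sketch: the field names, the log type values, and the pairing of output formats with log types reflect one reading of the Account API and should be verified against the API reference. All names and IDs are hypothetical, and nothing is sent.

```python
# Assumed log types and the output format assumed for each one.
OUTPUT_FORMATS = {"AUDIT_LOGS": "JSON", "BILLABLE_USAGE": "CSV"}


def log_delivery_config(name, log_type, credentials_id, storage_configuration_id,
                        delivery_path_prefix=None):
    """Build the body for creating a log delivery configuration."""
    if log_type not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported log type: {log_type!r}")
    config = {
        "config_name": name,
        "log_type": log_type,
        "output_format": OUTPUT_FORMATS[log_type],
        "credentials_id": credentials_id,
        "storage_configuration_id": storage_configuration_id,
    }
    if delivery_path_prefix:
        # Optional folder inside the bucket that receives the log files.
        config["delivery_path_prefix"] = delivery_path_prefix
    return {"log_delivery_configuration": config}


# Hypothetical IDs: deliver audit logs and usage logs to the same bucket.
audit = log_delivery_config("audit", "AUDIT_LOGS", "cred-id", "storage-id", "audit-logs")
usage = log_delivery_config("usage", "BILLABLE_USAGE", "cred-id", "storage-id", "usage-logs")
```

Using distinct path prefixes keeps the two log types separate even when they share one S3 bucket.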
Use cluster policies to enforce particular cluster settings, such as instance types, number of nodes, attached libraries, and compute cost, and display different cluster-creation interfaces for different user levels. Managing cluster configurations using policies can help enforce universal governance controls and manage the costs of your compute infrastructure.
A small organization like SmallCorp might have a single cluster policy for all clusters.
A large organization like LargeCorp might have more complex policies, for example:
- Customer data analysts who work on extremely large data sets and complex calculations might be allowed to have clusters of up to one hundred nodes.
- Finance team data analysts might be allowed to use clusters of up to ten nodes.
- The human resources department, which works with smaller datasets and simpler notebooks, might only be allowed to have autoscaling clusters of four to eight nodes.
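The three LargeCorp policies above could be sketched as policy definitions like the following. This is a minimal sketch: the attribute paths (`num_workers`, `autoscale.min_workers`, `autoscale.max_workers`) and the `range` limitation type follow one reading of the cluster policy definition format and should be verified against the cluster policies documentation. The policy names are hypothetical.

```python
import json


def max_nodes_policy(max_nodes):
    """Limit the worker count of a cluster to at most max_nodes."""
    return {"num_workers": {"type": "range", "maxValue": max_nodes}}


def autoscale_policy(min_nodes, max_nodes):
    """Keep both autoscaling bounds between min_nodes and max_nodes."""
    bounds = {"type": "range", "minValue": min_nodes, "maxValue": max_nodes}
    return {
        "autoscale.min_workers": dict(bounds),
        "autoscale.max_workers": dict(bounds),
    }


# Hypothetical policy names mapped to the limits described in the list above.
POLICIES = {
    "customer-analytics": max_nodes_policy(100),
    "finance": max_nodes_policy(10),
    "human-resources": autoscale_policy(4, 8),
}

for name, definition in POLICIES.items():
    # A policy definition is typically submitted as a JSON document.
    print(name, json.dumps(definition))
```

A real policy would usually also constrain instance types and other cost-related settings; only node counts are shown here because they are the limits the example describes.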
Admins can also delegate some ACL configurations to non-admin users by granting Can Manage permission. For example, any user with Can Manage permission for a cluster can give other users permission to attach to, restart, resize, and manage that cluster.
You can set the following ACLs for users and groups. Unless otherwise specified, you can modify ACLs using both the web application and REST API:
|Object|Description|
|---|---|
|Table|Enforce access to data tables.|
|Cluster|Manage which users can manage, restart, or attach to clusters. Access to clusters affects security when a cluster is configured with passthrough authentication for data sources. See Data source credential passthrough.|
|Pool|Manage which users can manage or attach to pools. Some APIs and documentation refer to pools as instance pools. Pools reduce cluster start and auto-scaling times by maintaining a set of idle, ready-to-use cloud instances. When a cluster attached to a pool needs an instance, it first attempts to allocate one of the idle instances of the pool. If the pool has no idle instances, it expands by allocating a new instance from the instance provider in order to accommodate the cluster’s request. When a cluster releases an instance, it returns to the pool and is free for another cluster to use. Only clusters attached to a pool can use the idle instances of the pool.|
|Jobs|Manage which users can view, manage, trigger, cancel, or own a job.|
|Notebook|Manage which users can read, run, edit, or manage a notebook.|
|Folder (directory)|Manage which users can read, run, edit, or manage all notebooks in a folder.|
|MLflow registered model and experiment|Manage which users can read, edit, or manage MLflow registered models and experiments.|
|Token|Manage which users can create or use tokens. See also Secure API access.|
|Password|Manage which users can use password login when SSO is enabled and when using REST APIs. See also Authentication and user account provisioning.|
Depending on team size and sensitivity of the information, a small company like SmallCorp or a small team within LargeCorp with its own workspace might allow all non-admin users access to the same objects, like clusters, jobs, notebooks, and directories.
A larger team or organization with very sensitive information would likely want to use all of these access controls to enforce the principle of least privilege so that any individual user has access only to the resources for which they have a legitimate need.
For example, suppose that LargeCorp has three people who need access to a specific workspace folder (which contains notebooks and experiments) for the finance team. LargeCorp can use these APIs to grant directory access only to the finance data team group.
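The finance folder example can be sketched as a Permissions API request body. This is a minimal sketch: the endpoint path, the `access_control_list` body shape, and the permission level names are assumptions based on one reading of the Permissions API 2.0 and should be verified against the API reference. The group name and directory ID are hypothetical, and the request is built but not sent.

```python
# Assumed permission levels for folders; verify against the API reference.
DIRECTORY_LEVELS = {"CAN_READ", "CAN_RUN", "CAN_EDIT", "CAN_MANAGE"}


def directory_permissions_path(directory_id):
    """Assumed Permissions API path for a workspace folder."""
    return f"/api/2.0/permissions/directories/{directory_id}"


def group_acl(group_name, permission_level):
    """Build a request body that grants one permission level to one group."""
    if permission_level not in DIRECTORY_LEVELS:
        raise ValueError(f"unknown permission level: {permission_level!r}")
    return {
        "access_control_list": [
            {"group_name": group_name, "permission_level": permission_level}
        ]
    }


# Hypothetical directory ID and group name for the finance folder.
path = directory_permissions_path("1234")
body = group_acl("finance-data-team", "CAN_EDIT")
```

Granting access to the group instead of to three named users means that membership changes are handled in one place, for example through SCIM group synchronization.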
IAM role credential passthrough allows users to authenticate automatically to S3 buckets from Databricks clusters using the identity that they use to log in to Databricks. An admin creates IAM roles, maps Databricks users to appropriate roles, and assigns those IAM roles to a cluster. Commands that users run on that cluster can read and write data in S3 using their identity.
Alternatively, you can secure access to S3 buckets by configuring an AWS instance profile and assigning it to a cluster. The disadvantage of this approach is that only one instance profile can be assigned to a cluster, and anyone who needs access to the cluster must have access to that instance profile. IAM role credential passthrough, on the other hand, allows multiple users with different data access policies to share one Databricks cluster to access data in S3 while always maintaining data security. Another advantage is data governance. IAM role credential passthrough associates a user with an identity. This in turn enables S3 object logging via CloudTrail. All S3 access is tied directly to the user via the ARN in CloudTrail logs.
Large companies like LargeCorp might want to put their secrets in AWS Systems Manager (formerly called EC2 SSM), then create an IAM role to access it, and then add that IAM role to the set of cluster IAM roles.
For details, see Access S3 buckets using IAM credential passthrough with Databricks SCIM.
You may also want to use the secret manager to set up secrets that you expect your notebooks to need. A secret is a key-value pair that stores secret material for an external data source or other calculation, with a key name unique within a secret scope.
You create secrets using either the REST API or CLI, but you must use the Secrets utility (dbutils.secrets) in a notebook or job to read your secrets.
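The write path and read path for secrets can be sketched as follows. This is a minimal sketch: the endpoint paths and field names reflect one reading of the Secrets API 2.0 and should be verified against the API reference, and the scope name, key name, and value are hypothetical. The requests are built but not sent, and the read call is shown only as a comment because `dbutils` exists only inside a Databricks notebook or job.

```python
def create_scope_request(scope):
    """Build the request that creates a secret scope."""
    return "/api/2.0/secrets/scopes/create", {"scope": scope}


def put_secret_request(scope, key, value):
    """Build the request that stores a secret value under a key in a scope."""
    return "/api/2.0/secrets/put", {
        "scope": scope,
        "key": key,
        "string_value": value,
    }


# Hypothetical scope and key for a database password.
scope_path, scope_body = create_scope_request("finance-jdbc")
secret_path, secret_body = put_secret_request("finance-jdbc", "password", "example-value")

# Inside a notebook or job, the secret is then read with the Secrets utility:
#   password = dbutils.secrets.get(scope="finance-jdbc", key="password")
```

Separating write access (REST API or CLI) from read access (notebooks and jobs) lets an administrator load credentials without exposing them to the users who run the notebooks.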
Alternatively, large companies like LargeCorp might want to put their secrets in AWS Systems Manager (formerly called EC2 SSM), then create an IAM role to access it, and then add that role to the cluster IAM roles. See Data source credential passthrough.
Using Databricks REST APIs, some of your security configuration tasks can be automated using Terraform or AWS Quick Start (CloudFormation) templates. These templates can be used to configure and deploy new workspaces as well as to update administrative configurations for existing workspaces. Particularly for large companies with dozens of workspaces, using templates can enable fast and consistent automated configurations.