Security within Databricks SQL requires administrators to configure access to S3 storage through an instance profile and data object owners to configure fine-grained access using Databricks table access control. Instance profiles allow you to access your data from SQL endpoints without the need to manage, deploy, or rotate AWS keys.
Databricks table access control is an expressive, cloud agnostic, and fine grained security model that provides end-to-end security on your data lake with auditability. Table access control allows setting fine-grained row and column level permissions using SQL GRANT statements. It is an open standard familiar to database and data warehouse users and allows data owners in each department to delegate data access without the need for complex cloud access control configuration.
This article gives an overview of table access control, provides the basic steps to configure table access control, and shows how to implement common patterns for granting access to data objects. It explains how to use credential passthrough for legacy implementations.
In this section:
Table access control enables you to secure the following objects:
CATALOG: controls access to the entire data catalog.
DATABASE: controls access to a database.
TABLE: controls access to a managed or external table.
VIEW: controls access to SQL views.
ANY FILE: controls access to the underlying filesystem. Users granted access to
ANY FILEcan bypass the restrictions put on the catalog, databases, tables, and views by reading from the filesystem directly.
Only Databricks administrators and object owners can grant access to securable objects. A user who creates a database, table, or view in Databricks SQL or using a cluster enabled for table access control becomes its owner. The owner is granted all privileges and can grant privileges to other users. If an object does not have an owner, an administrator can set object ownership. The following table summarizes the available roles and the objects for which each role can grant privileges.
|Role||Can grant access privileges for|
|Databricks administrator||All objects in the catalog and the underlying filesystem.|
|Catalog owner||All objects in the catalog.|
|Database owner||All objects in the database.|
|Table owner||Only the table.|
For more information, see Data object privileges.
This section describes the recommended steps for configuring table access control. It describes when steps are required or optional and the environments in which the steps are performed.
In this section:
- Databricks account on the Premium plan.
- Databricks workspace on the E2 version of the Databricks platform. For information about creating E2 workspaces, see Create and manage workspaces using the account console. All new Databricks accounts and most existing accounts are now E2. If you are not sure which account type you have, contact your Databricks representative.
- Administrator has the Databricks SQL entitlement. To grant the Databricks SQL entitlement:
- In the Databricks Data Science & Engineering workspace, go to the admin console.
- Click the Users tab.
- In the row for your account, click the Databricks SQL access checkbox.
- Click Confirm.
An administrator performs these steps in your IdP (if you use group synchronization) and the Data Science & Engineering workspace admin console.
This step is optional.
A Databricks administrator performs this step in a notebook in the Data Science & Engineering workspace.
Administrators set owners using ALTER statements. To programmatically generate the
ALTER statements required to change object ownership, an administrator can run the following notebook on a Databricks cluster enabled with table access control. The notebook queries the metastore for a set of databases and generates the
ALTER commands to assign ownership to the databases and the tables contained in the databases.
The simplest option is to set the owner to a group of admins. Alternatively, to enable a delegated security model, you can select different owners for each database, giving each the ability to manage permissions on the objects in the database.
An administrator performs these steps in AWS Console, the Data Science & Engineering workspace admin console, and the Databricks SQL admin settings.
For any data you want to be queried in Databricks SQL, an administrator must:
Configure an instance profile that grants access to the underlying storage.
Databricks SQL requires one instance profile with access to any data to be queried across all SQL endpoints, whereas in the Databricks Data Science & Engineering workspace, it is common to have several instance profiles, each with partial permissions. If you have an instance profile that provides global access already registered in Databricks you can reuse it.
Register that credential in Databricks SQL.
For details on both steps, see Configure an instance profile.
An administrator performs these steps in the Databricks SQL query editor.
Databricks SQL administrators and object owners use SQL statements to define access to datasets. This requires all datasets to be registered as tables in the metastore. You can skip this step if you have already created tables in your metastore. However, if the tables were defined using the Hive syntax, you must recreate them.
Start a SQL endpoint.
Run a query in the Databricks SQL query editor to create a table you want users to be able to query.
Example commands to be issued by an administrator user (or any user with
CREATE DATABASE sales; CREATE TABLE sales.purchases LOCATION "s3://mys3bucket/mytable";
Data object owners perform this step in the Databricks SQL query editor.
Data object owners grant privileges to users or groups by issuing GRANT statements. There are several ways to do this depending on the desired complexity of the permissions structure. Databricks recommends you use the groups defined in Step 1.
For each group of users, assign permissions to objects. It is common to do this at the database level. This could be as simple as an administrator or owner issuing the following command in Databricks SQL:
GRANT USAGE, SELECT, READ_METADATA ON DATABASE sales TO `analysts`;
This command gives read access to the
analystgroup on the
salesdatabase. Privileges are inherited, so granting read permission on the database allows read access to all the tables and views stored in the database, including any future objects added to the database. For a detailed explanation of the privileges that can be granted to users and groups, see Privileges.
For common patterns in setting up permissions, see Common patterns.
(Optional, but recommended). It is common to set up private user storage and team storage, which allow users to create their own tables in a sandbox area in which only they (or their team) have access. The following example creates a database called
user1can write data to.
CREATE DATABASE IF NOT EXISTS user1_sandbox LOCATION "s3a://mybucket/home/user1"; GRANT CREATE, USAGE ON DATABASE user1_sandbox TO `firstname.lastname@example.org`;
This command purposefully gives
USAGEpermission but does not make
user1the owner of the database. This allows
user1to read and write objects in the
user1_sandboxdatabase, but critically
user1cannot grant other users access to them, which could be used to circumvent access controls.
This section describes common patterns for granting access to data objects.
In this section:
To grant all Databricks SQL users read-only access to all objects registered in the metastore, an administrator issues the following command:
GRANT USAGE, SELECT, READ_METADATA ON CATALOG TO users;
Often administrators and users are accustomed to working with data access permissions at a group level. In the Databricks Data Science & Engineering workspace, this is typically achieved using an instance profile associated with a cluster that is scoped to allow only a particular group to attach to. To use this pattern, administrators perform the following steps:
For each cluster in the Data Science & Engineering workspace, note the set of users allowed to access the cluster (ideally qualified by the use of a group).
Examine the credentials on the cluster to determine the levels of data access that should be granted to the group for each database.
GRANTstatements, typically on the database:
GRANT USAGE, SELECT, READ_METADATA ON DATABASE telemetry TO `data_science`;
Any object added to this database will be accessible to the group.
To allow data sharing within the same team, you can implement team sandboxes:
CREATE DATABASE IF NOT EXISTS team1_sandbox LOCATION "s3a://mybucket/home/team1"; GRANT CREATE, SELECT, USAGE, READ_METADATA ON DATABASE team1_sandbox TO `team1`;
team1 is a group defined in Set up group synchronization or create groups. The team can safely share data in
team1_sandbox without the ability to share data outside the team.
Databricks includes two functions—
is_member—that allow you to express column- and row-level permissions dynamically in the body of a view definition.
These functions let you implement the following use cases:
- Column-level permissions
- Row-level permissions
- Data masking
For details, see Dynamic view functions.
Databricks supports credential passthrough to control access to cloud storage for limited access patterns. Credential passthrough is a vendor-specific implementation that allows user identity to be passed through to the cloud service storage provider which then verifies permissions on the files themselves.
Credential passthrough allows you to authenticate automatically to S3 buckets from Databricks compute resources using the identity that you use to log in to Databricks. Credential passthrough has two limitations:
- It does not provide fine grained—column or row level—security and as a result can be used only on direct file access.
- Users with passthrough privilege can bypass the restrictions put on tables by reading from the filesystem directly.
It is thus considered to be a legacy approach. Databricks recommends you choose other solutions when available.
There are two modes for accessing data using passthrough:
You can access path-based tables directly; Databricks automatically passes your user identity through as the account used to access the file. This works for direct path-based table access as follows:
SELECT * FROM delta.`s3:/.../myfolder`
This access pattern requires you to access files directly rather than use tables registered in the metastore. Thus you must know the explicit location of the data in the object store without the benefit of the schema browser.
This method does not require
ANY FILE permission.
To have a “cataloged” version of a passthrough table, use can use views. With this method however you have the burden of partition updates, schema drift, and keeping the view definition up to date.
CREATE VIEW v AS SELECT * FROM delta.`s3:/.../myfolder`
Views that use passthrough on path-based tables are not fully supported by all data types and formats.