This article contains references to the term whitelist, a term that Databricks no longer uses. When the term is removed from the software, we’ll remove it from this article.
This article describes how to enable table access control for a cluster.
For information about how to set privileges on a data object once table access control has been enabled on a cluster, see Data object privileges.
Table access control is available in two versions:
- SQL-only table access control, which restricts users to SQL commands. You are restricted to the Apache Spark SQL API, and therefore cannot use Python, Scala, R, RDD APIs, or clients that directly read the data from cloud storage, such as DBUtils.
- Python and SQL table access control, which allows users to run SQL, Python, and PySpark commands. You are restricted to the Spark SQL API and DataFrame API, and therefore cannot use Scala, R, RDD APIs, or clients that directly read the data from cloud storage, such as DBUtils.
Even if table access control is enabled for a cluster, Databricks administrators have access to file-level data.
SQL-only table access control restricts users to SQL commands only.
To enable SQL-only table access control on a cluster and restrict that cluster to use only SQL commands, set the following flag in the cluster’s Spark conf:
Access to SQL-only table access control is not affected by the Enable Table Access Control setting in the admin console. That setting controls only the workspace-wide enablement of Python and SQL table access control.
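As a minimal sketch of setting that flag programmatically, the helper below adds it to the `spark_conf` map of a cluster specification held as a Python dict. It assumes the flag is `spark.databricks.acl.sqlOnly` with the value `true`, which is the name used in Databricks documentation but is not spelled out in the text above, so confirm it before relying on it. The helper name `with_sql_only_acl` and the example cluster name are hypothetical.

```python
from typing import Any, Dict

# Assumption: this is the SQL-only flag; verify against the Databricks docs.
SQL_ONLY_FLAG = "spark.databricks.acl.sqlOnly"


def with_sql_only_acl(cluster_spec: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of cluster_spec whose spark_conf enables the SQL-only flag."""
    spec = dict(cluster_spec)
    # Copy the nested map so the caller's spec is not modified.
    spark_conf = dict(spec.get("spark_conf", {}))
    spark_conf[SQL_ONLY_FLAG] = "true"
    spec["spark_conf"] = spark_conf
    return spec


if __name__ == "__main__":
    # "sql-only-cluster" is a placeholder name.
    print(with_sql_only_acl({"cluster_name": "sql-only-cluster"}))
```

The same key and value can be entered directly in the cluster's Spark config field in the UI as one line, with the key and the value separated by a space.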
Python and SQL table access control lets users run Python commands that use the DataFrame API as well as SQL. When it is enabled on a cluster, users on that cluster:
- Can access Spark only using the Spark SQL API or DataFrame API. In both cases, access to tables and views is restricted by administrators according to the Databricks Data governance model.
- Must run their commands on cluster nodes as a low-privilege user forbidden from accessing sensitive parts of the filesystem or creating network connections to ports other than 80 and 443.
- Only built-in Spark functions can create network connections on ports other than 80 and 443.
- Only admin users or users with ANY FILE privilege can read data from external databases through the PySpark JDBC connector.
- If you want Python processes to be able to access additional outbound ports, you can set the Spark config spark.databricks.pyspark.iptable.outbound.whitelisted.ports to the ports you want to allow access to. The supported format of the configuration value is [port[:port][,port[:port]]...], for example: 21,22,9000:9999. Each port must be within the valid port range.
Attempts to get around these restrictions fail with an exception. The restrictions ensure that users can never use the cluster to access data for which they have no privileges.
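The value format for spark.databricks.pyspark.iptable.outbound.whitelisted.ports described above can be checked before it is applied to a cluster. The following is a minimal sketch of such a check, not Databricks' own parser. The function name `parse_outbound_ports` is hypothetical, and two things are assumed rather than taken from the text: the valid range is the standard TCP port range 0 to 65535, and the first port of a range must not exceed the second.

```python
from typing import List, Tuple

# Assumption: the standard TCP port range; the article does not state the bounds.
MIN_PORT = 0
MAX_PORT = 65535


def parse_outbound_ports(value: str) -> List[Tuple[int, int]]:
    """Parse 'port[:port][,port[:port]]...' into inclusive (start, end) ranges."""
    ranges = []
    for entry in value.split(","):
        entry = entry.strip()
        parts = entry.split(":")
        # Each entry is either a single port or a start:end pair.
        if len(parts) > 2:
            raise ValueError(f"too many ':' in entry {entry!r}")
        if not all(p.isascii() and p.isdigit() for p in parts):
            raise ValueError(f"non-numeric or empty port in entry {entry!r}")
        start, end = int(parts[0]), int(parts[-1])
        # Assumption: a reversed range such as 9999:9000 is treated as invalid.
        if not (MIN_PORT <= start <= end <= MAX_PORT):
            raise ValueError(f"port out of range or reversed in entry {entry!r}")
        ranges.append((start, end))
    return ranges


if __name__ == "__main__":
    print(parse_outbound_ports("21,22,9000:9999"))
```

A single port is returned as a range whose start and end are equal, so callers can treat every entry uniformly.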
Before users can configure Python and SQL table access control, a Databricks admin must:
- Enable table access control for the Databricks workspace.
- Deny users access to clusters that are not enabled for table access control. In practice, that means denying most users permission to create clusters and denying users the Can Attach To permission for clusters that are not enabled for table access control.
For information on both these requirements, see Enable table access control for your workspace.
When you create a cluster, click the Enable table access control and only allow Python and SQL commands option. This option is available only for High Concurrency clusters.
To create the cluster using the REST API, see Create cluster enabled for table access control example.
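As a rough sketch of what such a REST request body might look like, the snippet below builds a cluster specification as a Python dict and serializes it to JSON. It assumes the Spark conf keys `spark.databricks.acl.dfAclsEnabled` and `spark.databricks.repl.allowedLanguages` from the Clusters API 2.0 example; neither key appears in this article, so check them against the linked example. The function name and the Spark version and node type passed in are placeholders, and any settings that make the cluster a High Concurrency cluster are not covered.

```python
import json
from typing import Any, Dict


def table_acl_cluster_payload(
    cluster_name: str,
    spark_version: str,
    node_type_id: str,
    num_workers: int = 1,
) -> Dict[str, Any]:
    """Build a create-cluster request body with table access control settings."""
    return {
        "cluster_name": cluster_name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
        "spark_conf": {
            # Assumed keys; verify against the Clusters API example.
            "spark.databricks.acl.dfAclsEnabled": "true",
            "spark.databricks.repl.allowedLanguages": "python,sql",
        },
    }


if __name__ == "__main__":
    # All three argument values are placeholders, not recommendations.
    payload = table_acl_cluster_payload("table-acl-cluster", "7.3.x-scala2.12", "i3.xlarge")
    print(json.dumps(payload, indent=2))
```

The sketch stops at building the body. Sending it requires an authenticated POST to the workspace's cluster creation endpoint, as described in the linked example.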