Best practices: Data governance on Databricks

This document describes the need for data governance and shares best practices and strategies you can use to implement these techniques across your organization. It demonstrates a typical deployment workflow you can employ using Databricks and cloud-native solutions to secure and monitor each layer from the application down to storage.

Why is data governance important?

Data governance is an umbrella term that encapsulates the policies and practices implemented to securely manage the data assets within an organization. As one of the key tenets of any successful data governance practice, data security is likely to be top of mind at any large organization. Key to data security is the ability for data teams to have superior visibility and auditability of user data access patterns across their organization. Implementing an effective data governance solution helps companies protect their data from unauthorized access and ensures that they have rules in place to comply with regulatory requirements.

Governance challenges

Whether you’re managing the data of a startup or a large corporation, security teams and platform owners face the same challenge: ensuring that this data is secure and managed according to the internal controls of the organization. Regulatory bodies the world over are changing the way we think about how data is captured and stored, and these compliance risks add further complexity to an already tough problem. How, then, do you open your data to those who can drive the use cases of the future? Ultimately, you should adopt data policies and practices that help the business realize value through the meaningful application of what are often vast, ever-growing stores of data. Data teams solve the toughest problems when they have access to many disparate sources of data.

Typical challenges when considering the security and availability of your data in the cloud:

  • Do your current data and analytics tools support access controls on your data in the cloud? Do they provide robust logging of actions taken on the data as it moves through the given tool?
  • Will the security and monitoring solution you put in place now scale as demand on the data in your data lake grows? It can be easy enough to provision and monitor data access for a small number of users. What happens when you want to open up your data lake to hundreds of users? To thousands?
  • Is there anything you can do to be proactive in ensuring that your data access policies are being observed? Monitoring alone is not enough; it only produces more data. You need a solution in place that actively tracks and flags how this information is accessed across the organization.
  • What steps can you take to identify gaps in your existing data governance solution?

How Databricks addresses these challenges

  • Access control: A rich suite of access controls, all the way down to the storage layer. Databricks takes advantage of its cloud backbone by surfacing state-of-the-art AWS security services directly in the platform. Federate your existing AWS data access roles with your identity provider via SSO to simplify managing your users and their secure access to the data lake.
  • Cluster policies: Enable administrators to control access to compute resources.
  • API first: Automate provisioning and permission management with the Databricks REST API.
  • Audit logging: Robust audit logs on actions and operations taken across the workspace, delivered to an S3 bucket that you configure.
  • AWS cloud native: Databricks can leverage AWS CloudTrail and CloudWatch to provide data access information across your deployment account and any others you configure. You can then use this information to power alerts that flag potential wrongdoing.

The following sections illustrate how to use these Databricks features to implement a governance solution.

Set up access control

To set up access control, you secure access to storage and implement fine-grained control of individual tables.

Secure S3 access using instance profiles

One way to secure access to data from your Databricks clusters is to use instance profiles. As a prerequisite, your IAM role must have the necessary permissions to access the S3 bucket, and you must add the instance profile to your Databricks workspace. When you launch a cluster, select the instance profile.

Instance profile

Once your cluster is created, make sure that authorized users can attach notebooks to and use the cluster, and that unauthorized users are denied access:

Cluster permission

Now all users that have access to this cluster can also access the S3 bucket through the instance profile attached to the cluster.

This is a simple and convenient way to set up robust access control and securely manage access to sensitive data. You can also automate this workflow using the Instance Profiles API if you prefer that approach as your user base grows.
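As a minimal sketch of that automation, the following registers an existing IAM role with the workspace through the Instance Profiles API. The workspace URL, role ARN, and the DATABRICKS_TOKEN environment variable are placeholders you would replace with your own values.

import os
import requests

# Placeholder workspace URL and personal access token (assumed to be set).
HOST = "https://<databricks-instance>"
TOKEN = os.environ["DATABRICKS_TOKEN"]

# Register an instance profile with the workspace so it can be attached
# to clusters. The IAM role itself must already exist in AWS with the
# required S3 permissions.
resp = requests.post(
    f"{HOST}/api/2.0/instance-profiles/add",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"instance_profile_arn": "arn:aws:iam::<account-id>:instance-profile/<role-name>"},
)
resp.raise_for_status()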

Implement table access control

You can enable table access control on Databricks to programmatically grant, deny, and revoke access to your data from the Spark SQL API. You can control access to securable objects like databases, tables, views and functions. Consider a scenario where your company has a database to store financial data. You might want your analysts to create financial reports using that data. However, there might be sensitive information in another table in the database that analysts should not access. You can provide the user or group the privileges required to read data from one table, but deny all privileges to access the second table.

In the following illustration, Alice is an admin who owns the shared_data and private_data tables in the Finance database. Alice then provides Oscar, an analyst, with the privileges required to read from shared_data but denies all privileges to private_data.

Grant select

Alice grants SELECT privileges to Oscar to read from shared_data:

Grant select table

Alice denies all privileges to Oscar to access private_data:

Deny statement

You can take this one step further by defining fine-grained access controls to a subset of a table or by setting privileges on derived views of a table.

Deny table
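A rough sketch of the statements in the illustrations above, run as Alice from a notebook on a cluster with table access control enabled. The database and table names follow the example; Oscar's account name, the derived view, and its filter column are hypothetical.

# Oscar can read the shared table but not the private one.
spark.sql("GRANT SELECT ON TABLE finance.shared_data TO `oscar@example.com`")
spark.sql("DENY ALL PRIVILEGES ON TABLE finance.private_data TO `oscar@example.com`")

# Fine-grained access: expose only a filtered subset of a table through a
# derived view and grant access to the view instead of the base table.
# The region column and filter value are hypothetical.
spark.sql("""
  CREATE VIEW IF NOT EXISTS finance.shared_data_emea AS
  SELECT * FROM finance.shared_data WHERE region = 'EMEA'
""")
spark.sql("GRANT SELECT ON VIEW finance.shared_data_emea TO `oscar@example.com`")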

IAM credential passthrough

Let’s look at another scenario where Team Analysts need to read from and write to an S3 bucket named analyst_project_data. The setup is straightforward: configure a high concurrency cluster, attach an instance profile that has permission to read and write to the S3 bucket, and give Team Analysts access to use this cluster. Now, what if access to this bucket needs to be further restricted so that only two analysts, Sarah and Alice, have write access to the analyst_project_data bucket, while all other users have only read access? In that case you need two clusters: one cluster, available to all analysts, with an instance profile attached that grants read access to the bucket; and a second cluster, which only Sarah and Alice can access, with another instance profile attached that grants permission to write to the S3 bucket. As data volumes grow and access policies in organizations become more complicated, it becomes cumbersome to maintain more clusters and IAM roles with different access controls for each. Organizations sometimes have to choose between simplicity and governance while still trying to meet their data security needs.

To solve this problem, Databricks provides IAM credential passthrough, which allows you to authenticate automatically to S3 buckets from Databricks clusters using the identity that you use to log in to Databricks. You can choose to enable credential passthrough when you configure a new cluster. The following image shows the option to enable credential passthrough on a high-concurrency cluster.

IAM passthrough HC

This option is different for a standard cluster where passthrough is limited to a single user.

IAM passthrough SU

You can secure access to S3 buckets using IAM credential passthrough with Databricks SCIM or leverage AWS Identity Federation with SAML 2.0 to provide a seamless way for organizations to centralize data access within their identity provider, letting Databricks pass the credentials down to the storage layer. As a result, multiple users with different data access policies can share one Databricks cluster and access data in S3 while always maintaining data security and governance.
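When you provision such a cluster through the Clusters API rather than the UI, the passthrough option shown in the screenshots roughly corresponds to a Spark configuration setting. The following is a minimal sketch of a high-concurrency cluster definition with passthrough enabled; the cluster name and sizing are placeholders, and additional settings may apply depending on your runtime and languages.

# Minimal sketch of a high-concurrency cluster payload with IAM credential
# passthrough enabled. Name and sizing are placeholders.
passthrough_cluster = {
    "cluster_name": "analyst-passthrough-cluster",
    "spark_version": "latest-stable-scala2.11",
    "node_type_id": "r4.2xlarge",
    "num_workers": 4,
    "spark_conf": {
        "spark.databricks.cluster.profile": "serverless",
        "spark.databricks.repl.allowedLanguages": "sql,python",
        "spark.databricks.passthrough.enabled": "true",
    },
}
# POST this payload to https://<databricks-instance>/api/2.0/clusters/create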

Manage cluster configurations

Cluster policies allow Databricks administrators to define the cluster attributes that are allowed on a cluster, such as instance types, number of nodes, custom tags, and more. When an admin creates a policy and assigns it to a user or a group, those users can create clusters only based on the policies they have access to. This gives administrators a much higher degree of control over what types of clusters can be created.

You define policies in a JSON policy definition and then create cluster policies using the cluster policies UI or Cluster Policies API. A user can create a cluster only if they have the create_cluster permission or access to at least one cluster policy. Extending the requirements of the new analytics project team described below, administrators can create a cluster policy and assign it to one or more users within the project team, who can then create clusters for the team limited to the rules specified in the cluster policy. The image below provides an example of a user who has access to the Project Team Cluster Policy creating a cluster based on the policy definition.

Cluster policy
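A minimal sketch of such a policy created through the Cluster Policies API follows. The policy name mirrors the example above; the allowed instance types, worker limit, auto-termination value, and tag are illustrative placeholders.

import json
import os
import requests

HOST = "https://<databricks-instance>"
TOKEN = os.environ["DATABRICKS_TOKEN"]  # assumed personal access token

# Policy definition: restrict instance types, cap autoscaling, force
# auto-termination, and pin a chargeback tag. Values are placeholders.
definition = {
    "node_type_id": {"type": "allowlist", "values": ["r4.2xlarge", "r4.4xlarge"]},
    "autoscale.max_workers": {"type": "range", "maxValue": 50},
    "autotermination_minutes": {"type": "fixed", "value": 60},
    "custom_tags.team": {"type": "fixed", "value": "new-project-team"},
}

resp = requests.post(
    f"{HOST}/api/2.0/policies/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"name": "Project Team Cluster Policy", "definition": json.dumps(definition)},
)
resp.raise_for_status()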

Automatically provision clusters and grant permissions

With the addition of endpoints for both clusters and permissions, the Databricks REST API 2.0 makes it easy to provision clusters and grant permissions on cluster resources for users and groups at any scale. You can use the Clusters API to create and configure clusters for your specific use case.

Additionally, you can attach an instance profile to the cluster for direct access to any corresponding storage.

You can then use the Permissions API to apply access controls to the cluster.

Preview

The Permissions API is in Private Preview. To enable this feature for your Databricks workspace, contact your Databricks representative.

The following is an example of a configuration that might suit a new analytics project team.

The requirements are:

  • Support the interactive workloads of this team, who are mostly SQL and Python users.
  • Provision a data source in object storage with credentials that give the team access to the data tied to the role.
  • Ensure that users get an equal share of the cluster’s resources.
  • Provision larger, memory optimized instance types.
  • Grant permissions to the cluster such that only this new project team has access to it.
  • Tag this cluster to make sure you can properly do chargebacks on any compute costs incurred.

Deployment script

You deploy this configuration by using the API endpoints in the Clusters and Permissions APIs.

Provision cluster

Endpoint - https://<databricks-instance>/api/2.0/clusters/create

Note

Cost control is enabled by using spot instances.

{
  "autoscale": {
     "min_workers": 2,
     "max_workers": 50
  },
  "cluster_name": "project team interactive cluster",
  "spark_version": "latest-stable-scala2.11",
  "spark_conf": {
     "spark.databricks.cluster.profile": "serverless",
     "spark.databricks.repl.allowedLanguages": "sql,python,r"
  },
  "aws_attributes": {
     "first_on_demand": 1,
     "availability": "SPOT_WITH_FALLBACK",
     "zone_id": "us-youst-2a",
     "instance_profile_arn": "arn:aws:iam::826763667205:instance-profile/test-iam-role",
     "spot_bid_price_percent": 100,
     "ebs_volume_type": "GENERAL_PURPOSE_SSD",
     "ebs_volume_count": 1,
     "ebs_volume_size": 100
  },
  "node_type_id": "r4.2xlarge",
  "ssh_public_keys": [],
  "custom_tags": {
     "ResourceClass": "Serverless",
     "team": "new-project-team"
  },
  "spark_env_vars": {
     "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
  },
  "autotermination_minutes": 60,
  "enable_elastic_disk": "true",
  "init_scripts": []
}

Grant cluster permission

Endpoint - https://<databricks-instance>/api/2.0/permissions/clusters/<cluster_id>

{
  "access_control_list": [
    {
      "group_name": "project team",
      "permission_level": "CAN_MANAGE"
    }
  ]
}
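Tying the two calls together, a minimal deployment sketch might look like the following. The workspace URL, token variable, and local payload file name are assumptions, and the Permissions API must be enabled for your workspace.

import json
import os
import requests

HOST = "https://<databricks-instance>"
TOKEN = os.environ["DATABRICKS_TOKEN"]   # assumed personal access token
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# 1. Provision the cluster from the JSON definition shown above,
#    assumed to be saved locally as cluster.json.
with open("cluster.json") as f:
    cluster_spec = json.load(f)
create_resp = requests.post(
    f"{HOST}/api/2.0/clusters/create", headers=HEADERS, json=cluster_spec
)
create_resp.raise_for_status()
cluster_id = create_resp.json()["cluster_id"]

# 2. Grant the project team permission on the new cluster using the
#    Permissions API payload shown above.
acl = {"access_control_list": [
    {"group_name": "project team", "permission_level": "CAN_MANAGE"}
]}
requests.put(
    f"{HOST}/api/2.0/permissions/clusters/{cluster_id}", headers=HEADERS, json=acl
).raise_for_status()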

You now have a cluster provisioned with secure access to critical data in the lake, locked down to all but the corresponding team, tagged for chargebacks, and configured to meet the requirements of the project. There are additional configuration steps within your host cloud provider account required to implement this solution, though these, too, can be automated to meet the requirements of scale.

Audit access

Configuring access control in Databricks and controlling data access in storage is the first step towards an efficient data governance solution. However, a complete solution requires auditing access to data and providing alerting and monitoring capabilities. Databricks provides a comprehensive set of audit events to log activities performed by Databricks users, allowing enterprises to monitor detailed usage patterns on the platform. To get a complete understanding of what users are doing on the platform and what data is being accessed, you should use both the native Databricks and cloud provider audit logging capabilities.

Make sure you configure audit logging. This involves configuring the right access policy so that Databricks can deliver audit logs to an S3 bucket that you provide. Audit logs are delivered to the S3 bucket periodically, and Databricks guarantees delivery of audit logs within 72 hours of day close.

Here is an example that illustrates how you can analyze audit logs in Databricks: find all user accounts that logged in to a workspace, and where they logged in from. These logs are available in your S3 bucket, partitioned by date. Load the audit logs as a DataFrame in Databricks and register it as a temp table. Then query that table using the appropriate serviceName and actionName:

Audit log query

The output from that cell will look something like this:

Audit table
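A rough sketch along the lines of that cell, assuming audit logs are delivered to a placeholder bucket path partitioned by date and that the field names follow the Databricks audit log schema:

# Load a day of audit logs from the delivery bucket (placeholder path)
# and register them for SQL analysis.
logs = spark.read.json("s3a://<audit-log-bucket>/<delivery-path>/date=2020-06-01/")
logs.createOrReplaceTempView("audit_logs")

# Which accounts logged in, and from which IP addresses? The accounts
# service records the various login action names.
display(spark.sql("""
  SELECT userIdentity.email, sourceIPAddress, timestamp
  FROM audit_logs
  WHERE serviceName = 'accounts'
    AND lower(actionName) LIKE '%login%'
  ORDER BY timestamp DESC
"""))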

There are many more events you can audit. See Configure audit logging for the complete list of audit events, parameters, and audit log schemas.

Monitor the storage layer

You can extend this auditability into your storage layer by implementing native AWS services like CloudTrail and CloudWatch. These can help you to answer questions like:

  • Which data is being accessed in my data lake?
  • Are users writing data to sinks they shouldn’t be? Are they deleting data?
  • Are proper data access controls in place at the application layer?
  • Are there data sources I can retire or consolidate?

In a few steps, you can use CloudTrail to create a CloudWatch Logs stream, which you can then use to create real-time alerts. These alerts can notify data teams of actions taken by their users or other actors, and can be based on a variety of customizable conditions. In this way, you can closely monitor and maintain actions taken on S3 buckets across your data lake. Once these sorts of alerting processes are in place, it becomes much easier to spot gaps in your table or database access controls. Similarly, you can use Databricks to ingest and transform your CloudTrail logs and perform your own up-to-the-minute streaming log analytics using Delta Lake tables.
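As a simplified batch sketch of that last idea, the following reads CloudTrail files from a placeholder S3 prefix, flattens the events, and persists them as a Delta table; a production version would typically use Structured Streaming for continuous ingestion. The bucket path is a placeholder, and selecting requestParameters.bucketName assumes the trail captures S3 data events.

from pyspark.sql.functions import explode, col

# CloudTrail delivers gzipped JSON files whose top-level "Records" array
# holds the individual API events. The path below is a placeholder glob
# over region/year/month/day partitions.
raw = spark.read.json(
    "s3a://<cloudtrail-bucket>/AWSLogs/<account-id>/CloudTrail/*/*/*/*/*.json.gz"
)

# Flatten the Records array into one row per event and keep a few fields
# that are useful for monitoring S3 access.
events = (raw.select(explode(col("Records")).alias("r"))
             .select("r.eventTime", "r.eventName", "r.eventSource",
                     "r.userIdentity.arn", "r.requestParameters.bucketName"))

# Persist as a Delta table for downstream alerting and analysis.
events.write.format("delta").mode("append").saveAsTable("cloudtrail_s3_events")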

Learn more

Here are some resources to help you build a comprehensive data governance solution that meets your organization’s needs:

If you use Immuta, you can benefit from their business partnership with Databricks to provide a native product integration across Unified Data Analytics and data governance. See End-to-end Data Governance with Databricks and Immuta.