description: Learn how to use Delta Lake on Databricks to manage GDPR and CCPA compliance for your data lake.
This article describes how you can use Delta Lake on Databricks to manage General Data Protection Regulation (GDPR) and California Consumer Privacy Act (CCPA) compliance for your data lake. Because Delta Lake adds a transactional layer that provides structured data management on top of your data lake, it can dramatically simplify and speed up your ability to locate and remove personal information (also known as “personal data”) in response to consumer GDPR or CCPA requests.
Your organization may manage hundreds of terabytes worth of personal information in your cloud. Bringing these datasets into GDPR and CCPA compliance is of paramount importance, but this can be a big challenge, especially for larger datasets stored in data lakes.
The challenge typically arises from the following factors:
- When you have large amounts (petabyte scale) of data in the cloud, user data can be stored and distributed across multiple datasets and locations.
- Point or ad hoc queries to find data for specific users are expensive (akin to finding a needle in a haystack), because they often require full table scans. Taking a brute force approach to GDPR/CCPA compliance can result in multiple jobs operating over different tables, resulting in weeks of engineering and operational effort.
- Data lakes are inherently append-only and do not support the ability to perform row level “delete” or “update” operations natively, which means that you must rewrite partitions of data. Typical data lake offerings do not provide ACID transactional capabilities or efficient methods to find relevant data. Moreover, read/write consistency is also a concern: while user data is being redacted from the data lake, processes that read data should be protected from material impacts the way they would with a traditional RDBMS.
- Data hygiene in the data lake is challenging, given that data lakes by design support availability and partition tolerance with eventual consistency. Enforceable and rigorous practices and standards are required to assure cleansed data.
As a result, organizations that manage user data at this scale often end up writing computationally difficult, expensive, and time-consuming data pipelines to deal with GDPR and CCPA. For example, you might upload portions of your data lake into proprietary data warehousing technologies, where GDPR and CCPA compliance-related deletion activities are performed. This adds complexity and reduces data fidelity by forcing multiple copies of the data. Moreover, exporting data from such warehouse technologies back into a data lake may require re-optimization to improve query performance. This too results in multiple copies of data being created and maintained.
To resolve the issues listed above, the optimal approach to making a data lake GDPR- and CCPA-compliant requires:
- “Pseudonymization,” or reversible tokenization of personal information elements (identifiers) to keys (pseudonyms) that cannot be externally identified.
- Storage of information in a manner linked to pseudonyms rather than identifiers.
- Maintenance of strict access and use policies on the combination of the identifiers and pseudonyms.
- Pipelines or bucket policies to remove raw data on timelines that help you comply with applicable law.
- Structuring of pipelines to locate and remove the identifiers, destroying the linkage between the pseudonyms and identifiers.
- ACID capabilities overlaid on top of the data lake to prevent readers from being negatively affected when delete or update operations are carried out on the data lake.
- A high-performance pipeline, supporting, for example, the cleanup of 5 TB of data within 10 minutes.
Delta Lake is a very effective tool for addressing these GDPR and CCPA compliance requirements, because its structured data management system adds transactional capabilities to your data lake. Delta Lake’s well-organized, well-sized, well-indexed, stats-enabled datasets enable quick and easy search, modification, and cleanup of your data using standard SQL DML statements such as DELETE, UPDATE, and MERGE INTO.
The two use cases described in the following sections demonstrate how to convert your existing data to Delta Lake and how to delete and clean up personal information quickly and efficiently. This article also suggests options for pseudonymizing personal information and for improving query performance with Delta Lake.
This use case demonstrates how efficient Delta Lake can be when deleting personal data from your data lake.
The workflow described in this article references a database gdpr containing a sample dataset with 65,000,000 rows and as many distinct customer IDs, amounting to 3.228 GB of data. Customer personal information is captured in the customers table in this database.
The schema of the gdpr.customers table is:

|-- c_customer_sk: integer (nullable = true)
|-- c_customer_id: string (nullable = true)
|-- c_current_cdemo_sk: integer (nullable = true)
|-- c_current_hdemo_sk: integer (nullable = true)
|-- c_current_addr_sk: integer (nullable = true)
|-- c_first_shipto_date_sk: integer (nullable = true)
|-- c_first_sales_date_sk: integer (nullable = true)
|-- c_salutation: string (nullable = true)
|-- c_first_name: string (nullable = true)
|-- c_last_name: string (nullable = true)
|-- c_preferred_cust_flag: string (nullable = true)
|-- c_birth_day: integer (nullable = true)
|-- c_birth_month: integer (nullable = true)
|-- c_birth_year: integer (nullable = true)
|-- c_birth_country: string (nullable = true)
|-- c_email_address: string (nullable = true)
|-- c_last_review_date: string (nullable = true)
The list of customers requesting to be forgotten per GDPR and CCPA comes from a transactional database table, gdpr.customer_delete_keys, that is populated using an online portal. The keys (distinct users) to be deleted represent roughly 10% (337.615 MB) of the keys sampled from the original dataset in gdpr.customers.
The schema of the gdpr.customer_delete_keys table contains the following fields:

|-- c_customer_sk: integer (nullable = true)
|-- c_customer_id: string (nullable = true)
In this table, c_customer_id identifies the customers to be deleted.
To get started with Delta Lake, you need to ingest your raw data (Parquet, CSV, JSON, and so on) and write it out as managed Delta tables. If your data is already in Parquet format, you can use CONVERT TO DELTA to convert the Parquet files in place to Delta tables without rewriting any data. If not, you can use the Apache Spark APIs you’re familiar with to rewrite the data in Delta format. Because Delta Lake uses Parquet, which is an open file format, your converted data won’t be locked in: you can quickly and easily convert your data back into another format if you need to.
This example converts the Parquet table customers in the gdpr database in place:

CONVERT TO DELTA gdpr.customers
After you convert your tables to Delta Lake, you can delete the personal information of the users who have requested to be forgotten.
The following example involves a straightforward delete of customer personal data from the
customers table. A better practice is to pseudonymize all customer personal information in your working tables (prior to receiving a data subject request) and delete the customer entry from the “lookup table” that maps the customer to the pseudonym, while ensuring that data in working tables cannot be used to reconstruct the customer’s identity. For details, see Pseudonymize data.
The following examples reference performance numbers to illustrate the impact of certain performance options. These numbers were recorded on the dataset described above, on a cluster with three worker nodes, each with 90 GB of memory and 12 cores; the driver had 30 GB of memory and 4 cores.
Here is a simple Delta Lake DELETE FROM operation, deleting the customers listed in the customer_delete_keys table from our sample customers table:

DELETE FROM gdpr.customers AS t1 WHERE EXISTS (SELECT c_customer_id FROM gdpr.customer_delete_keys WHERE t1.c_customer_id = c_customer_id)
During testing, this operation took longer than necessary to complete: finding the files to rewrite took 32 seconds and rewriting them took 2.6 minutes. To reduce the time spent finding the relevant files, you can increase the broadcast join threshold:
set spark.sql.autoBroadcastJoinThreshold = 104857600;
This setting instructs Spark to broadcast any table smaller than the threshold (here, 100 MB) when joining it with another table or view. With the larger threshold, finding files dropped to 8 seconds and writing dropped to 1.6 minutes.
You can speed up performance even more with Delta Lake Z-Ordering (multi-dimensional clustering). Z-Ordering creates a range-partition-based arrangement of data and records this information in the Delta table. Delta Lake uses this Z-order index to find the files impacted by the DELETE operation.
To take advantage of Z-Ordering, you must understand how the data you expect to be deleted is spread across the target table. For example, if the data, even for a few keys, is spread across 90% of the files for the dataset, you’ll be rewriting more than 90% of your data. Z-Ordering by relevant key columns reduces the number of files touched and can make rewrites much more efficient.
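To build intuition for why this matters, here is a small, self-contained Python simulation (not Delta Lake itself; the file sizes and row counts are invented for illustration). It compares how many files a delete must rewrite when each customer’s rows are scattered across files versus packed into contiguous key ranges, which is the arrangement Z-Ordering approximates:

```python
import random

random.seed(7)

NUM_FILES = 100
ROWS_PER_FILE = 1_000
NUM_CUSTOMERS = 1_000
ROWS_PER_CUSTOMER = 100  # 100,000 rows total

# Each row is represented only by its customer ID.
rows = [cid for cid in range(NUM_CUSTOMERS) for _ in range(ROWS_PER_CUSTOMER)]

def split_into_files(row_list):
    """Pack rows into fixed-size 'files', keeping each file's set of IDs."""
    return [set(row_list[i * ROWS_PER_FILE:(i + 1) * ROWS_PER_FILE])
            for i in range(NUM_FILES)]

scattered = rows[:]
random.shuffle(scattered)
unclustered_files = split_into_files(scattered)    # e.g. arrival order
clustered_files = split_into_files(sorted(rows))   # clustered by customer ID

# Simulate a delete request covering 5 customers.
deleted = set(random.sample(range(NUM_CUSTOMERS), 5))

def files_touched(files):
    """A file must be rewritten if it holds any row for a deleted customer."""
    return sum(1 for f in files if f & deleted)

print(files_touched(unclustered_files))  # nearly all 100 files
print(files_touched(clustered_files))    # at most 5 files
```

Even though the delete covers only 0.5% of the rows, the scattered layout forces a rewrite of nearly every file, while the clustered layout confines the rewrite to a handful of files.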
In this case, you should Z-Order by the c_customer_id column before running the delete:

OPTIMIZE gdpr.customers ZORDER BY (c_customer_id)
After Z-Ordering, finding files took 7 seconds and writing dropped to 50 seconds.
Depending on how long after a consumer request you delete your data and on your underlying data lake, you may need to delete table history and underlying raw data.
By default, Delta Lake retains table history for 30 days and makes it available for “time travel” and rollbacks. That means that, even after you have deleted personal information from a Delta table, users in your organization may be able to view that historical data and roll back to a version of the table in which the personal information is still stored. If you determine that GDPR or CCPA compliance requires that these stale records be made unavailable for querying before the default retention period is up, you can use the VACUUM function to remove files that are no longer referenced by a Delta table and are older than a specified retention threshold. Once you have removed table history using the VACUUM command, all users lose the ability to view that history and roll back.
To delete all customers who requested that their information be deleted, and then remove all table history older than 7 days (the default retention threshold), you simply run:

VACUUM gdpr.customers
To remove artifacts younger than 7 days, use the RETAIN num HOURS option:
VACUUM gdpr.customers RETAIN 100 HOURS
In addition, if you created Delta tables by using Spark APIs to rewrite non-Parquet files to Delta (as opposed to converting Parquet files to Delta Lake in place), your raw data may still contain personal information that you have deleted or anonymized. Databricks recommends that you set up a retention policy of thirty days or less with your cloud provider to remove raw data automatically.
While the deletion method described above can, strictly speaking, permit your organization to comply with the GDPR and CCPA requirement to perform deletions of personal information, it comes with a number of downsides. The first is that the GDPR does not permit any additional processing of personal information once a valid request to delete has been received. As a consequence, if the data is not stored in a pseudonymized fashion—that is, with personally identifiable information replaced by an artificial identifier or pseudonym—prior to the receipt of the data subject request, you are obligated to simply delete all of the linked information. If, however, you have previously pseudonymized the underlying data, your obligations to delete are satisfied by the simple destruction of any record that links the identifier to the pseudonym (assuming the remaining data is not itself identifiable), and you may retain the remainder of the data.
In a typical pseudonymization scenario, you keep a secured “lookup table” that maps the customer’s personal identifiers (name, email address, etc) to the pseudonym. This has the advantage not only of making deletion easier, but also of allowing you to “restore” the user identity temporarily to update user data over time, an advantage denied in an anonymization scenario, in which by definition a customer’s identity can never be restored, and all customer data is by definition static and historical.
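The lookup-table pattern can be sketched in a few lines of plain Python. The dictionaries below are hypothetical in-memory stand-ins for the Delta tables described in this section; the point is that working data references customers only by pseudonym, so deleting the single lookup row severs the link:

```python
import secrets

def new_pseudonym() -> str:
    # An opaque random token; nothing about the customer is derivable from it.
    return secrets.token_hex(16)

alice_pseudonym = new_pseudonym()

# "Lookup table": the only place identifiers and pseudonyms meet.
customers_lookup = {"alice@example.com": alice_pseudonym}

# Working tables reference customers by pseudonym only.
orders = [{"customer": alice_pseudonym, "amount": 42}]

# Right-to-be-forgotten request: delete only the lookup row. The order rows
# survive but can no longer be linked back to the identifier.
del customers_lookup["alice@example.com"]

assert "alice@example.com" not in customers_lookup
assert orders[0]["customer"] == alice_pseudonym  # retained, non-identifiable
```

The delete touches one small, secured table instead of every working table that references the customer.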
For a simple pseudonymization example, consider the customer table updated in the deletion example. In the pseudonymization scenario, you can create a gdpr.customers_lookup table that contains all customer data that could be used to identify the customer, with an additional column for a pseudonymized email address. You can then use the pseudonymous email address as the key in any data tables that reference customers. When there is a request to forget a customer, you simply delete that customer’s row from the gdpr.customers_lookup table, and the rest of the information remains non-identifiable forever.
The schema of the gdpr.customers_lookup table is:

|-- c_customer_id: string (nullable = true)
|-- c_email_address: string (nullable = true)
|-- c_email_address_pseudonym: string (nullable = true)
|-- c_first_name: string (nullable = true)
|-- c_last_name: string (nullable = true)
In this scenario, put the remaining customer data, which cannot be used to identify the customer, in a separate pseudonymized table with the following schema:
|-- c_email_address_pseudonym: string (nullable = true)
|-- c_customer_sk: integer (nullable = true)
|-- c_current_cdemo_sk: integer (nullable = true)
|-- c_current_hdemo_sk: integer (nullable = true)
|-- c_current_addr_sk: integer (nullable = true)
|-- c_first_shipto_date_sk: integer (nullable = true)
|-- c_first_sales_date_sk: integer (nullable = true)
|-- c_salutation: string (nullable = true)
|-- c_preferred_cust_flag: string (nullable = true)
|-- c_birth_year: integer (nullable = true)
|-- c_birth_country: string (nullable = true)
|-- c_last_review_date: string (nullable = true)
A strong way to pseudonymize personal information is one-way cryptographic hashing and salting with a remembered salt or salts. Hashing turns data into a fixed-length fingerprint that cannot be computationally reversed. Salting adds a random string to the data that will be hashed as a way to frustrate attackers who are using lookup or “rainbow” tables that contain hashes of millions of known email addresses or passwords.
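As a minimal illustration of salted hashing, here is a pure-Python sketch using hashlib, standing in for Spark SQL’s sha2 function (the salt value and email address below are placeholders):

```python
import hashlib

# Placeholder salt; in practice keep it in a secret store, not in code.
SALT = "example-remembered-salt"

def pseudonymize_email(email: str, salt: str) -> str:
    """One-way, fixed-length fingerprint of a salted email address."""
    return hashlib.sha256((salt + email).encode("utf-8")).hexdigest()

p1 = pseudonymize_email("jane@example.com", SALT)
p2 = pseudonymize_email("jane@example.com", SALT)

assert p1 == p2         # deterministic: usable as a stable join key
assert len(p1) == 64    # 256-bit digest, hex-encoded

# Without the salt, a precomputed rainbow table of known email-address
# hashes would not match the stored pseudonym:
assert p1 != hashlib.sha256("jane@example.com".encode("utf-8")).hexdigest()
```

Because the same input always produces the same digest, the pseudonym can key working tables, while the salt keeps the digest out of reach of precomputed lookup tables.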
You can salt the column c_email_address by adding a random secret string literal before hashing. This secret string can be stored using Databricks secrets to add additional security to your salt. If unauthorized Databricks users try to access the secret, they see redacted values.
dbutils.secrets.get(scope = "salt", key = "useremail")
res0: String = [REDACTED]
This is a simple example to illustrate salting. Using the same salt for all of your customer keys is not a good way to mitigate attacks; it just makes the customer keys longer. A more secure approach would be to generate a random salt for each user. See Make your pseudonymization stronger.
Once you salt the column c_email_address, you can hash it and add the hash to the gdpr.customers_lookup table as the c_email_address_pseudonym column:
UPDATE gdpr.customers_lookup SET c_email_address_pseudonym = sha2(c_email_address,256)
Now you can use this value for all of your customer-keyed tables.
To reduce the risk that a compromise of a single salt could have on your database, it is advisable where practical to use different salts (one per customer, or even per user). Provided that the data attached to the pseudonymous identifier does not itself contain any information that can identify an individual, if you delete your record of which salt is related to which user and cannot recreate it, the remaining data should be rendered fully anonymous and therefore fall outside of the scope of the GDPR and the CCPA. Many organizations choose to create multiple salts per user and create fully anonymized data outside of the scope of data protection law by rotating these salts periodically according to business need.
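A per-customer salt store can be sketched as follows (the store layout and customer IDs are hypothetical; a production system would keep the salts in a secured table or secret store). Destroying a single customer’s salt makes that customer’s pseudonym impossible to recompute without touching anyone else’s data:

```python
import hashlib
import secrets

# Hypothetical per-customer salt store: one random salt per customer ID.
salts = {cid: secrets.token_hex(16) for cid in ("c1", "c2", "c3")}

def pseudonym(cid: str, email: str) -> str:
    """Derive a pseudonym from the customer's own salt plus the identifier."""
    return hashlib.sha256((salts[cid] + email).encode("utf-8")).hexdigest()

before = pseudonym("c2", "bob@example.com")
assert len(before) == 64

# Honoring a delete request (or rotating toward fully anonymized data):
# destroy only c2's salt. The pseudonym can no longer be recomputed, so
# c2's remaining rows become effectively anonymous, while the pseudonyms
# of other customers are unaffected.
del salts["c2"]

assert "c2" not in salts and "c1" in salts and "c3" in salts
```

Compromise of one salt then exposes only one customer’s linkage, and salt rotation becomes a row-level operation rather than a full re-keying of the dataset.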
And don’t forget that whether data is “personal” or “identifiable” is not an element-level analysis, but essentially an array-level analysis. So while obvious things like email addresses are clearly personal, combinations of things that by themselves would not be personal can also be personal. See, for example https://aboutmyinfo.org/identity/about: based on an analysis of the 1990 US Census, 87% of the United States population is uniquely identifiable by the three attributes of zip code, date of birth, and gender. So when you’re deciding what should be stored as part of the personal identifiers table or the working tables with only pseudonymous information, make sure to think about whether or not the collision of the seemingly non-identifiable information might itself be identifiable. And make sure for your own privacy compliance that you have internal processes that prevent attempts to re-identify individuals with the information you intended to be non-identifiable (for example differential privacy, privacy preserving histograms, etc.). While it may never be possible to completely prevent re-identification, following these steps will go a long way towards helping.
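A toy version of this array-level analysis, over a handful of invented rows, shows how to flag combinations of seemingly non-identifying attributes that single out one individual:

```python
from collections import Counter

# Invented "non-identifying" working rows: (zip code, date of birth, gender).
rows = [
    ("94105", "1980-01-01", "F"),
    ("94105", "1980-01-01", "M"),
    ("10001", "1975-06-15", "F"),
    ("10001", "1975-06-15", "F"),
]

combo_counts = Counter(rows)

# A combination seen exactly once pinpoints a single individual, even though
# no single attribute is identifying on its own.
unique_combos = [combo for combo, n in combo_counts.items() if n == 1]

print(len(unique_combos))  # 2 of the 4 rows are uniquely identifying
```

Running this kind of k-anonymity-style check over candidate working tables before splitting identifiers from pseudonyms helps you decide which columns must live in the secured lookup table.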
Step 2: Perform deletes showed how to improve Delta Lake query performance by increasing the broadcast threshold and Z-Ordering, and there are additional performance improvement practices that you should also be aware of:
- Ensure that key columns are within the first 32 columns in a table. Delta Lake collects stats on the first 32 columns, and these stats help with identification of files for deletion or update.
- Use the Auto Optimize feature, available in Delta Lake on Databricks, which automatically compacts small files during individual writes to a Delta table and offers significant benefits for tables that are queried actively, especially in situations when Delta Lake would otherwise be encountering multiple small files. See Auto Optimize for guidance about when to use it.
- Reduce the size of the source table (for the BroadcastHashJoin). This helps Delta Lake leverage dynamic file pruning when determining the relevant data for deletes, which helps particularly when delete operations are not on partition boundaries.
- For any modify operation, such as DELETE, be as specific as possible, providing all of your qualifying conditions in the search clause. This narrows down the number of files you hit and prevents transaction conflicts.
- Continuously tune Spark shuffle, cluster utilization, and the storage system for optimal write throughput.
To learn more about Delta Lake on Databricks, see Delta Lake and Delta Engine guide.
For blogs about using Delta Lake for GDPR and CCPA compliance written by Databricks experts, see:
- How to Avoid Drowning in GDPR Data Subject Requests in a Data Lake
- Make Your Data Lake CCPA Compliant with a Unified Approach to Data and Analytics
- Efficient Upserts into Data Lakes with Databricks Delta
To learn about purging personal information in the Databricks workspace, see Manage workspace storage.