Credential redaction

This article provides an overview of how Databricks redacts access keys and credentials in logs.

Credential redaction overview

Credential redaction is a critical security practice that masks sensitive information, such as passwords or API keys, to prevent unauthorized access. Databricks redacts keys and credentials in audit logs and Apache Spark log4j logs to protect your data from information leaks. Databricks automatically redacts cloud credentials and credentials in URIs. Redaction is based on the value retrieved from the secret, regardless of the variable or context in which it's used.
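As a rough illustration of value-based redaction, the following Python sketch masks every occurrence of a known secret value in a log line, whatever variable or context it appears in. It is not Databricks' implementation: the function name redact_secret_values, the sample secret, and the use of [REDACTED] as the replacement here are assumptions made for the example.

```python
from typing import Iterable


def redact_secret_values(line: str, secret_values: Iterable[str]) -> str:
    """Hypothetical helper: mask each known secret value found in a log line."""
    # Replace longer values first so a secret that contains another
    # secret as a substring is masked in full.
    for value in sorted(secret_values, key=len, reverse=True):
        if value:  # skip empty strings, which would match everywhere
            line = line.replace(value, "[REDACTED]")
    return line


# The same value is masked whether it shows up in an assignment or a call.
secrets = ["s3cr3t-value"]
print(redact_secret_values("token=s3cr3t-value sent", secrets))
print(redact_secret_values("connect(password='s3cr3t-value')", secrets))
```

Matching on the value rather than on a variable name means a secret is still masked after it has been copied, concatenated into a larger string, or passed to another function.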

For some credential types, Databricks appends a hash_prefix, which is a short code generated from the credential using MD5. This code is used to check that the credential is valid and hasn't been altered.

Cloud credentials redaction

Redacted cloud credentials can have one of several replacements. Some appear as [REDACTED], while others use a more specific replacement such as REDACTED_POSSIBLE_CLOUD_SECRET_ACCESS_KEY.

Databricks might redact certain long strings that appear randomly generated, even if they are not cloud credentials.

AWS access key redaction

For AWS access keys, Databricks searches for strings starting with AKIA and replaces them with REDACTED_AWS_ACCESS_KEY(hash_prefix).

For example, Databricks logs 2017/01/08: Accessing AWS using AKIADEADBEEFDEADBEEF as 2017/01/08: Accessing AWS using REDACTED_AWS_ACCESS_KEY(655f9d2f).
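The match-and-replace behavior for AWS access keys can be sketched in Python as follows. This is an illustration under stated assumptions, not Databricks' implementation: the regular expression (AKIA followed by 16 uppercase letters or digits), the helper names, and the choice of the first 8 hexadecimal characters of the MD5 digest as the hash prefix are all assumed, so the value in parentheses may differ from what Databricks logs.

```python
import hashlib
import re

# Assumption: an access key ID is "AKIA" plus 16 uppercase letters or digits.
AWS_ACCESS_KEY_PATTERN = re.compile(r"\bAKIA[A-Z0-9]{16}\b")


def hash_prefix(value: str, length: int = 8) -> str:
    """Hypothetical helper: first `length` hex characters of the MD5 digest."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()[:length]


def redact_aws_access_keys(line: str) -> str:
    """Replace each matched access key with a placeholder plus its hash prefix."""
    return AWS_ACCESS_KEY_PATTERN.sub(
        lambda m: f"REDACTED_AWS_ACCESS_KEY({hash_prefix(m.group(0))})",
        line,
    )


print(redact_aws_access_keys("2017/01/08: Accessing AWS using AKIADEADBEEFDEADBEEF"))
```

Because the prefix is derived from the key itself, two log lines that mention the same key receive the same placeholder, which makes it possible to correlate them without exposing the key.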

AWS secret access key redaction

Databricks replaces an AWS secret access key with REDACTED_POSSIBLE_AWS_SECRET_ACCESS_KEY without appending its hash.

For example, Databricks logs 2017/01/08: Accessing AWS using 99Abcdeuw+zXXAxllliupwqqqzDEUFdAtaBrickX as 2017/01/08: Accessing AWS using REDACTED_POSSIBLE_AWS_SECRET_ACCESS_KEY.

Because AWS secret access keys have no explicit identifier, Databricks might also redact other 40-character strings that appear randomly generated but are not AWS secret access keys.
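The following Python sketch shows why a length-based heuristic produces false positives. It is not Databricks' detection logic: the character set, the exact-40-character rule, and the function name redact_possible_secret_keys are assumptions chosen for illustration.

```python
import hashlib
import re

# Assumption: a candidate is a run of exactly 40 base64-style characters
# that is not part of a longer run of the same characters.
CHARSET = r"[A-Za-z0-9/+=]"
POSSIBLE_SECRET_KEY_PATTERN = re.compile(
    rf"(?<!{CHARSET}){CHARSET}{{40}}(?!{CHARSET})"
)


def redact_possible_secret_keys(line: str) -> str:
    """Replace every 40-character candidate with a fixed placeholder (no hash)."""
    return POSSIBLE_SECRET_KEY_PATTERN.sub(
        "REDACTED_POSSIBLE_AWS_SECRET_ACCESS_KEY", line
    )


# The secret access key from the example above is redacted.
print(redact_possible_secret_keys(
    "2017/01/08: Accessing AWS using 99Abcdeuw+zXXAxllliupwqqqzDEUFdAtaBrickX"
))

# A SHA-1 hex digest is also 40 characters long, so it is redacted as well,
# even though it is not a credential.
digest = hashlib.sha1(b"example").hexdigest()
print(redact_possible_secret_keys(f"Checked out commit {digest}"))
```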

Credentials in URI redaction

Databricks detects //username:password@mycompany.com in a URI and replaces username:password with REDACTED_CREDENTIALS(hash_prefix). Databricks computes the hash from username:password (including the :).

For example, Databricks logs 2017/01/08: Accessing https://admin:admin@mycompany.com as 2017/01/08: Accessing https://REDACTED_CREDENTIALS(d2abaa37)@mycompany.com.
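A minimal Python sketch of this URI handling follows. It is an approximation rather than Databricks' implementation: the regular expression, the helper names, and the use of the first 8 hexadecimal characters of the MD5 digest as the hash prefix are assumptions, so the value in parentheses may not match what Databricks logs.

```python
import hashlib
import re

# Assumption: credentials are the "user:password" text between "//" and "@".
URI_CREDENTIALS_PATTERN = re.compile(r"(?<=//)[^/\s:@]+:[^/\s@]+(?=@)")


def hash_prefix(value: str, length: int = 8) -> str:
    """Hypothetical helper: first `length` hex characters of the MD5 digest."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()[:length]


def redact_uri_credentials(line: str) -> str:
    """Replace user:password in a URI; the hash covers the whole pair, colon included."""
    return URI_CREDENTIALS_PATTERN.sub(
        lambda m: f"REDACTED_CREDENTIALS({hash_prefix(m.group(0))})",
        line,
    )


print(redact_uri_credentials("2017/01/08: Accessing https://admin:admin@mycompany.com"))

# A host:port pair is left alone because no "@" follows it.
print(redact_uri_credentials("2017/01/08: Accessing https://mycompany.com:8080/path"))
```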