Security

Databricks redacts credentials in audit logs and in the log4j messages in Spark logs to keep sensitive data from leaking. At logging time, Databricks currently redacts three types of credentials: AWS Access Keys, AWS Secret Access Keys, and credentials embedded in URIs. When it detects one of these secrets, Databricks replaces it with a placeholder. Logging-time redaction is enabled in Databricks Runtime 3.0, Cluster Image Release 2.1.1-db4 and later. For some credential types, Databricks also appends a hash_prefix, which is the first 8 hexadecimal characters of the MD5 checksum of the credential, so that you can verify which credential was redacted. See the examples below for details.
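The hash_prefix described above can be reproduced locally to check which credential a placeholder refers to. The following is a minimal sketch, assuming the credential is encoded as UTF-8 before hashing and that the prefix is the first 8 characters of the lowercase hexadecimal digest; the function name hash_prefix is illustrative and not part of any Databricks API.

```python
import hashlib


def hash_prefix(credential: str) -> str:
    """Return the first 8 hex characters of the MD5 digest of a credential.

    Assumption: the credential is hashed as UTF-8 bytes. The documentation
    does not state the encoding.
    """
    digest = hashlib.md5(credential.encode("utf-8")).hexdigest()
    return digest[:8]


if __name__ == "__main__":
    # Compare this value with the prefix shown in a redacted log line.
    print(hash_prefix("admin:admin"))
```

If the printed prefix matches the one in a REDACTED_... placeholder, the redacted value was most likely the credential you hashed.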

Redact AWS Access Key

For AWS Access Keys, Databricks searches for strings that start with AKIA and replaces them with REDACTED_AWS_ACCESS_KEY(hash_prefix). For example, Databricks will log 2017/01/08: Accessing AWS using AKIADEADBEEFDEADBEEF as 2017/01/08: Accessing AWS using REDACTED_AWS_ACCESS_KEY(655f9d2f).
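The access key replacement can be approximated with a regular expression. This is only a sketch: the pattern below (AKIA followed by 16 uppercase letters or digits) and the function name redact_access_keys are assumptions for illustration, not the matching rule Databricks actually uses.

```python
import hashlib
import re

# Assumption: an access key is "AKIA" followed by 16 uppercase letters or
# digits. The documentation only says the string starts with "AKIA".
ACCESS_KEY_PATTERN = re.compile(r"AKIA[A-Z0-9]{16}")


def redact_access_keys(line: str) -> str:
    """Replace each matched access key with a placeholder and hash prefix."""

    def _replace(match: re.Match) -> str:
        # First 8 hex characters of the MD5 digest of the matched key.
        prefix = hashlib.md5(match.group(0).encode("utf-8")).hexdigest()[:8]
        return f"REDACTED_AWS_ACCESS_KEY({prefix})"

    return ACCESS_KEY_PATTERN.sub(_replace, line)


if __name__ == "__main__":
    print(redact_access_keys(
        "2017/01/08: Accessing AWS using AKIADEADBEEFDEADBEEF"
    ))
```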

Redact AWS Secret Access Key

Databricks replaces an AWS Secret Access Key with REDACTED_POSSIBLE_AWS_SECRET_ACCESS_KEY and does not append a hash. For example, Databricks will log 2017/01/08: Accessing AWS using 99Abcdeuw+zXXAxllliupwqqqzDEUFdAtaBrickX as 2017/01/08: Accessing AWS using REDACTED_POSSIBLE_AWS_SECRET_ACCESS_KEY.

Because AWS Secret Access Keys have no explicit identifier, Databricks may also redact other seemingly random 40-character strings that are not AWS Secret Access Keys.
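The false-positive behavior can be illustrated with a simple pattern-based redactor. This is a sketch under stated assumptions: the pattern (exactly 40 characters from the base64 alphabet, not adjacent to another such character) and the name redact_possible_secret_keys are hypothetical, and the real detection rule is not documented here.

```python
import hashlib
import re

# Assumption (not from the Databricks documentation): a candidate secret is a
# run of exactly 40 characters from the base64 alphabet, with no such
# character immediately before or after it.
SECRET_KEY_PATTERN = re.compile(
    r"(?<![A-Za-z0-9/+=])[A-Za-z0-9/+=]{40}(?![A-Za-z0-9/+=])"
)
PLACEHOLDER = "REDACTED_POSSIBLE_AWS_SECRET_ACCESS_KEY"


def redact_possible_secret_keys(line: str) -> str:
    """Replace every 40-character candidate with the placeholder (no hash)."""
    return SECRET_KEY_PATTERN.sub(PLACEHOLDER, line)


if __name__ == "__main__":
    # A string shaped like a secret access key is redacted.
    print(redact_possible_secret_keys(
        "2017/01/08: Accessing AWS using 99Abcdeuw+zXXAxllliupwqqqzDEUFdAtaBrickX"
    ))
    # Under this assumed pattern, a 40-character SHA-1 hex digest is
    # redacted as well, even though it is not a credential.
    digest = hashlib.sha1(b"").hexdigest()
    print(redact_possible_secret_keys(f"Checked out revision {digest}"))
```

Because a length-based pattern cannot tell a secret key from any other 40-character token, the placeholder says POSSIBLE rather than asserting that a key was found.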

Redact Credentials in URI

Databricks detects //username:password@mycompany.com in a URI and replaces username:password with REDACTED_CREDENTIALS(hash_prefix). Databricks computes the hash from the full username:password string, including the colon. For example, Databricks will log 2017/01/08: Accessing https://admin:admin@mycompany.com as 2017/01/08: Accessing https://REDACTED_CREDENTIALS(d2abaa37)@mycompany.com.
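A rough sketch of the URI case follows. The regular expression and the function name redact_uri_credentials are assumptions made for illustration; the only details taken from the text above are the placeholder format and the fact that the hash covers the whole username:password string.

```python
import hashlib
import re

# Assumption: credentials are the "username:password" text that sits between
# "//" and "@" in a URI. The exact matching rule is not documented.
URI_CREDENTIALS_PATTERN = re.compile(r"(?<=//)([^/\s:@]+:[^/\s@]+)(?=@)")


def redact_uri_credentials(line: str) -> str:
    """Replace username:password in URIs with a placeholder and hash prefix."""

    def _replace(match: re.Match) -> str:
        # Hash the full "username:password" string, including the colon.
        credentials = match.group(1)
        prefix = hashlib.md5(credentials.encode("utf-8")).hexdigest()[:8]
        return f"REDACTED_CREDENTIALS({prefix})"

    return URI_CREDENTIALS_PATTERN.sub(_replace, line)


if __name__ == "__main__":
    print(redact_uri_credentials(
        "2017/01/08: Accessing https://admin:admin@mycompany.com"
    ))
```

The host name stays visible after redaction, so the log line still shows which endpoint was accessed.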