Configure audit logging

Note

This feature is available on the Premium plan and above.

Databricks provides access to audit logs of activities performed by Databricks users, allowing your enterprise to monitor detailed Databricks usage patterns.

Configure audit log delivery

Preview

Audit log delivery is in Public Preview.

As a Databricks Account Owner, you can configure low-latency delivery of audit logs in JSON file format to an AWS S3 storage bucket, where you can make the data available for usage analysis. Databricks delivers a separate JSON file for each workspace in your account approximately every few minutes. Auditable events are typically logged within 15 minutes. For the file naming, delivery rules, and schema, see Audit delivery details and format.

You configure low-latency delivery of audit logs by using the Account API, which is the same API used to configure billable usage log delivery.

Configuration options

To configure audit log delivery, you have the following options.

If you have one workspace in your Databricks account:

  • Follow the instructions in the sections that follow to create a single configuration object for your workspace.

If you have multiple workspaces in the same Databricks account, created using the Account API, you can do any of the following:

  • Share the same configuration (log delivery S3 bucket and IAM role) for all workspaces in the account. This is the default.
  • Use separate configurations for each workspace in the account.
  • Use separate configurations for different groups of workspaces, each sharing a configuration.

If you have multiple workspaces, each associated with a separate Databricks account (that is, they were not created using the Account API), you must create unique storage and credential configuration objects for each account, but you can reuse an S3 bucket or IAM role between these configuration objects.

Note

Even though you use the Account API to configure log delivery, you can configure log delivery for any workspace, including workspaces that were not created using the Account API.

High-level flow

The high-level flow of audit log delivery:

  1. Configure storage. In AWS, create a new AWS S3 bucket.

    Then call the Account API to create a storage configuration object that references the bucket name.

  2. Configure credentials. In AWS, create the appropriate AWS IAM role.

    Then call the Account API to create a credentials configuration object that uses the IAM role’s ARN. The role policy can specify a path prefix for log delivery within your S3 bucket. If you want log delivery configurations for different workspaces that share the S3 bucket but use different path prefixes, you can define a single IAM role that includes multiple path prefixes.

  3. Call the log delivery API.

    Then call the Account API to create an audit log delivery configuration that uses the credentials and storage configuration objects from the previous steps. This step lets you specify whether to associate the log delivery configuration with all workspaces (current and future) in your account or with a specific set of workspaces.

After you complete these steps, you can access the JSON files. The delivery location is <bucket-name>/<delivery-path-prefix>/workspaceId=<workspaceId>/date=<yyyy-mm-dd>/auditlogs_<internal-id>.json. New JSON files are delivered every few minutes, potentially overwriting existing files for each workspace. Auditable events are typically logged within 15 minutes.
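
For example, once delivery begins you can list the delivered files with the AWS CLI. The following is a minimal sketch; the placeholders must match the values in your log delivery configuration.

# Hypothetical example: list the delivered audit log files for one workspace and day.
# Replace the placeholders with the values from your log delivery configuration.
aws s3 ls s3://<bucket-name>/<delivery-path-prefix>/workspaceId=<workspaceId>/date=<yyyy-mm-dd>/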

For more information about accessing these files and analyzing them using Databricks, see Analyze audit logs.

Important

There is a limit to the number of log delivery configurations that you can create for an account. You can create a maximum of two enabled configurations that use the account level (no workspace filter) and two enabled configurations that use the workspace filter. You cannot delete a log delivery configuration, but you can disable it.

Requirements

  • Account owner email address and password to authenticate with the APIs. The email address and password are both case sensitive.

  • Account ID. If you are participating in the Account API Public Preview to create workspaces and want to use the account you were assigned for the preview, get the account ID from your invitation email. For all other accounts, get your workspace account ID from the Usage Overview tab.

    Important

    Contact your Databricks representative if you don’t know your account ID.

How to authenticate to the APIs

The APIs described in this article are published on the accounts.cloud.databricks.com base endpoint for all AWS regional deployments.

Use the following base URL for API requests: https://accounts.cloud.databricks.com/api/2.0/

This REST API requires HTTP basic authentication, which involves setting the HTTP header Authorization. In this article, username refers to your account owner email address. The email address is case sensitive. There are several ways to provide your credentials to tools such as curl.

  • Pass your username and account password with each request in <username>:<password> syntax.

    For example:

    curl -X GET -u <username>:<password> -H "Content-Type: application/json" \
     'https://accounts.cloud.databricks.com/api/2.0/accounts/<account-id>/<endpoint>'
    
  • Apply base64 encoding to your <username>:<password> string and provide it directly in the HTTP header:

    curl -X GET -H "Content-Type: application/json" \
      -H 'Authorization: Basic <base64-username-pw>' \
      'https://accounts.cloud.databricks.com/api/2.0/accounts/<account-id>/<endpoint>'
    
  • Create a .netrc file with machine, login, and password properties:

    machine accounts.cloud.databricks.com
    login <username>
    password <password>
    

    To have curl use the .netrc file, include -n in your curl command:

    curl -n -X GET 'https://accounts.cloud.databricks.com/api/2.0/accounts/<account-id>/workspaces'
    

    This article’s examples use this authentication style.

For the complete API reference, see Account API.

Step 1: Configure storage

Databricks delivers the log to an S3 bucket in your account. You can configure multiple workspaces to use a single S3 bucket, or you can define different workspaces (or groups of workspaces) to use different buckets.

This procedure describes how to set up a single configuration object with a common configuration for one or more workspaces in the account. To use different storage locations for different workspaces, repeat the procedures in this article for each workspace or group of workspaces.

  1. Create the S3 bucket, following the instructions in Configure AWS storage (Account API). If you prefer to script the bucket creation, see the AWS CLI sketch at the end of this section.

  2. Create a Databricks storage configuration record that represents your new S3 bucket. Specify your S3 bucket by calling the create new storage configuration API (POST /accounts/<account-id>/storage-configurations).

    Pass the following:

    • storage_configuration_name — New unique storage configuration name.
    • root_bucket_info — A JSON object with a bucket_name field set to your S3 bucket name.

    Copy the storage_configuration_id value returned in the response body. You will use it to create the log delivery configuration in a later step.

    For example:

    curl -X POST -n \
        'https://accounts.cloud.databricks.com/api/2.0/accounts/<databricks-account-id>/storage-configurations' \
      -d '{
        "storage_configuration_name": "databricks-workspace-storageconf-v1",
        "root_bucket_info": {
          "bucket_name": "my-company-example-bucket"
        }
      }'
    

    Response:

    {
      "storage_configuration_id": "<databricks-storage-config-id>",
      "account_id": "<databricks-account-id>",
      "root_bucket_info": {
        "bucket_name": "my-company-example-bucket"
      },
      "storage_configuration_name": "databricks-workspace-storageconf-v1",
      "creation_time": 1579754875555
    }
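
If you prefer to create the bucket with the AWS CLI instead of the console, the following is a minimal sketch. The bucket name and region are placeholders, and any additional bucket settings from the linked instructions still apply.

# Hypothetical sketch: create the S3 bucket with the AWS CLI.
# The bucket name and region are placeholders; for us-east-1, omit
# the --create-bucket-configuration argument.
aws s3api create-bucket \
  --bucket my-company-example-bucket \
  --region us-west-2 \
  --create-bucket-configuration LocationConstraint=us-west-2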
    

Step 2: Configure credentials

This procedure describes how to set up a single configuration object with a common configuration for one or more workspaces in the account. To use different credentials for different workspaces, repeat the procedures in this article for each workspace or group of workspaces. If you prefer to script the AWS side of this step, see the AWS CLI sketch at the end of this section.

Note

To use different S3 bucket names, you need to create separate IAM roles.

  1. Log into your AWS Console as a user with administrator privileges and go to the IAM service.

  2. Click the Roles tab in the sidebar.

  3. Click Create role.

    1. In Select type of trusted entity, click AWS service.

    2. In Common Use Cases, click EC2.

    3. Click the Next: Permissions button.

    4. Click the Next: Tags button.

    5. Click the Next: Review button.

    6. In the Role name field, enter a role name.

    7. Click Create role. The list of roles displays.

  4. In the list of roles, click the role you created.

  5. Add an inline policy.

    1. On the Permissions tab, click Add inline policy.

    2. In the policy editor, click the JSON tab.

    3. Copy this access policy and modify it. Replace the following values in the policy with your own configuration values:

      • <s3-bucket-name> — The bucket name of your AWS S3 bucket.
      • <s3-bucket-path-prefix> — (Optional) The path to the delivery location in the S3 bucket. If unspecified, the logs are delivered to the root of the bucket. This path must match the delivery_path_prefix argument when you call the log delivery API.
      {
        "Version":"2012-10-17",
        "Statement":[
          {
            "Effect":"Allow",
            "Action":[
              "s3:GetBucketLocation"
            ],
            "Resource":[
              "arn:aws:s3:::<s3-bucket-name>"
            ]
          },
          {
            "Effect":"Allow",
            "Action":[
              "s3:PutObject",
              "s3:GetObject",
              "s3:DeleteObject",
              "s3:PutObjectAcl",
              "s3:AbortMultipartUpload"
            ],
            "Resource":[
              "arn:aws:s3:::<s3-bucket-name>/<s3-bucket-path-prefix>/",
              "arn:aws:s3:::<s3-bucket-name>/<s3-bucket-path-prefix>/*"
            ]
          },
          {
            "Effect":"Allow",
            "Action":[
              "s3:ListBucket",
              "s3:ListMultipartUploadParts",
              "s3:ListBucketMultipartUploads"
            ],
            "Resource":"arn:aws:s3:::<s3-bucket-name>",
            "Condition":{
              "StringLike":{
                "s3:prefix":[
                  "<s3-bucket-path-prefix>",
                  "<s3-bucket-path-prefix>/*"
                ]
              }
            }
          }
        ]
      }
      

      You can customize how the policy uses the path prefix:

      • If you do not want to use the bucket path prefix, remove <s3-bucket-path-prefix>/ (including the final slash) from the policy each time it appears.
      • If you want log delivery configurations for different workspaces that share the S3 bucket but use different path prefixes, you can define an IAM role that includes multiple path prefixes. There are two separate parts of the policy that reference <s3-bucket-path-prefix>. In each case, duplicate the two adjacent lines that reference the path prefix and repeat the pair for every additional path prefix, for example:
      {
        "Resource":[
          "arn:aws:s3:::<mybucketname>/field-team/",
          "arn:aws:s3:::<mybucketname>/field-team/*",
          "arn:aws:s3:::<mybucketname>/finance-team/",
          "arn:aws:s3:::<mybucketname>/finance-team/*"
        ]
      }
      
    4. Click Review policy.

    5. In the Name field, enter a policy name.

    6. Click Create policy.

    7. If you use service control policies to deny certain actions at the AWS account level, ensure that sts:AssumeRole is allowed so that Databricks can assume the cross-account role.

  6. On the role summary page, click the Trust Relationships tab.

  7. Paste this access policy into the editor and replace the following values in the policy with your own configuration values:

    <databricks-account-id> — Your Databricks account ID.

    {
      "Version":"2012-10-17",
      "Statement":[
        {
          "Effect":"Allow",
          "Principal":{
            "AWS":"arn:aws:iam::414351767826:role/SaasUsageDeliveryRole-prod-IAMRole-3PLHICCRR1TK"
          },
          "Action":"sts:AssumeRole",
          "Condition":{
            "StringEquals":{
              "sts:ExternalId":[
                "<databricks-account-id>"
              ]
            }
          }
        }
      ]
    }
    
  8. In the role summary, copy the Role ARN and save it for a later step.

  9. Create a Databricks credentials configuration for your AWS role. Call the Create credential configuration API (POST /accounts/<account-id>/credentials). This request establishes cross-account trust and returns a reference ID to use when you create the log delivery configuration.

    Replace <account-id> with your Databricks account ID. In the request body:

    • Set credentials_name to a name that is unique within your account.
    • Set aws_credentials to an object that contains an sts_role property. That object must specify the role_arn for the role you’ve created.

    The response body includes a credentials_id field, which is the Databricks credentials configuration ID that you need to create the log delivery configuration. Copy this field so you can use it in a later step.

    For example:

     curl -X POST -n \
       'https://accounts.cloud.databricks.com/api/2.0/accounts/<databricks-account-id>/credentials' \
       -d '{
       "credentials_name": "databricks-credentials-v1",
       "aws_credentials": {
         "sts_role": {
           "role_arn": "arn:aws:iam::<aws-account-id>:role/my-company-example-role"
         }
       }
     }'
    

    Example response:

     {
       "credentials_id": "<databricks-credentials-id>",
       "account_id": "<databricks-account-id>",
       "aws_credentials": {
         "sts_role": {
           "role_arn": "arn:aws:iam::<aws-account-id>:role/my-company-example-role",
           "external_id": "<databricks-account-id>"
         }
       },
       "credentials_name": "databricks-credentials-v1",
       "creation_time": 1579753556257
     }
    

    Copy the credentials_id field from the response for later use.
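
If you prefer to script the AWS side of this step, the following is a minimal AWS CLI sketch rather than a replay of the console procedure: it creates the role directly with the trust policy from step 7 (so there is no need to start from the EC2 template and edit the trust relationship afterward) and attaches the access policy from step 5 as an inline policy. The role name, policy name, and local file names are placeholders.

# Hypothetical sketch: create the cross-account role and attach the access policy
# with the AWS CLI. databricks-trust-policy.json and databricks-s3-access-policy.json
# are local files containing the trust policy and access policy JSON shown above;
# the role and policy names are placeholders.
aws iam create-role \
  --role-name my-company-example-role \
  --assume-role-policy-document file://databricks-trust-policy.json

aws iam put-role-policy \
  --role-name my-company-example-role \
  --policy-name databricks-audit-log-delivery \
  --policy-document file://databricks-s3-access-policy.json

# Print the role ARN to use as role_arn in the credentials configuration request.
aws iam get-role --role-name my-company-example-role --query Role.Arn --output text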

Step 3: Call the log delivery API

To configure log delivery, call the Log delivery configuration API (POST /accounts/<account-id>/log-delivery).

You need the following values that you copied in the previous steps:

  • credentials_id — Your Databricks credential configuration ID, which represents your cross-account role credentials.
  • storage_configuration_id — Your Databricks storage configuration ID, which represents your root S3 bucket.

Also set the following fields:

  • log_type — Always set to AUDIT_LOG.

  • output_format — Always set to JSON. For the schema, see Audit log schema.

  • delivery_path_prefix — (Optional) Set to the path prefix. This must match the path prefix that you used in your role policy. The delivery path is <bucket-name>/<delivery-path-prefix>/workspaceId=<workspaceId>/date=<yyyy-mm-dd>/auditlogs_<internal-id>.json.

  • workspace_ids_filter — (Optional) By default, this log configuration applies to all workspaces associated with your account ID. Some types of deployments have only one workspace per account ID, so this field is unnecessary for them. If your account was created for workspace creation with the Account API, you might have multiple workspaces associated with your account ID. You can optionally set this field to an array of workspace IDs that the configuration applies to. If you plan to use different log delivery configurations for different workspaces, set this field explicitly rather than leaving it blank. If you leave it blank and your account ID is later associated with additional workspaces, this configuration also applies to those new workspaces. A workspace can be covered by more than one log delivery configuration, in which case its logs are written to multiple locations.

    Important

    There is a limit to the number of log delivery configurations that you can create for an account. You can create a maximum of two enabled configurations that use the account level (no workspace filter) and two enabled configurations that use the workspace filter. There is an additional uniqueness constraint: two enabled configurations cannot share all of their fields (not including the config_name). You cannot delete a log delivery configuration, but you can disable it. You can re-enable a disabled configuration, but the request fails if it violates the maximum per type (account level or workspace filter) or the uniqueness constraint.

For example:

curl -X POST -n \
  'https://accounts.cloud.databricks.com/api/2.0/accounts/<databricks-account-id>/log-delivery' \
  -d '{
  "log_delivery_configuration": {
    "log_type": "AUDIT_LOG",
    "config_name": "audit log config",
    "output_format": "JSON",
    "credentials_id": "<databricks-credentials-id>",
    "storage_configuration_id": "<databricks-storage-config-id>",
    "delivery_path_prefix": "auditlogs-data",
    "workspace_ids_filter": [
        6383650456894062,
        4102272838062927
    ]
    }
}'

Example response:

{
    "log_delivery_configuration": {
        "config_id": "<config-id>",
        "config_name": "audit log config",
        "log_type": "AUDIT_LOG",
        "output_format": "JSON",
        "account_id": "<account-id>",
        "credentials_id": "<databricks-credentials-id>",
        "storage_configuration_id": "<databricks-storage-config-id>",
        "workspace_ids_filter": [
            6383650456894062,
            4102272838062927
        ],
        "delivery_path_prefix": "auditlogs-data",
        "status": "ENABLED",
        "creation_time": 1591638409000,
        "update_time": 1593108904000
    }
}

Additional features of the log delivery APIs

The log delivery APIs have additional features. See the API reference documentation for details.

Additional operations include:

  • Get all log delivery configurations

  • Get a log delivery configuration by ID

  • Enable or disable a log delivery configuration by ID.

    You cannot delete a log delivery configuration, but you can disable a configuration that you no longer need. This is important because there is a limit to the number of enabled log delivery configurations that you can create for an account: a maximum of two enabled configurations that use the account level (no workspace filter) and two enabled configurations that use the workspace filter. There is an additional uniqueness constraint: two enabled configurations cannot share all of their fields (not including the config_name). You can re-enable a disabled configuration, but the request fails if it violates the maximum per type (account level or workspace filter) or the uniqueness constraint. Example requests for these operations follow.
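
These sketches assume the endpoint patterns documented in the Account API reference; <config-id> is the config_id value returned when you created the log delivery configuration.

# Get all log delivery configurations for the account.
curl -n -X GET \
  'https://accounts.cloud.databricks.com/api/2.0/accounts/<databricks-account-id>/log-delivery'

# Get a single log delivery configuration by ID.
curl -n -X GET \
  'https://accounts.cloud.databricks.com/api/2.0/accounts/<databricks-account-id>/log-delivery/<config-id>'

# Disable a log delivery configuration; set "status" to "ENABLED" to re-enable it.
curl -n -X PATCH \
  'https://accounts.cloud.databricks.com/api/2.0/accounts/<databricks-account-id>/log-delivery/<config-id>' \
  -d '{"status": "DISABLED"}'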

Audit delivery details and format

Once logging is enabled for your account, Databricks automatically starts sending audit logs in human-readable format to your delivery location on a periodic basis. Logs are available within 15 minutes of activation for audit logs configured using the Account API.

  • Encryption: Databricks encrypts audit logs using Amazon S3 server-side encryption.
  • Format: Databricks delivers audit logs in JSON format.
  • Location: The delivery location is <bucket-name>/<delivery-path-prefix>/workspaceId=<workspaceId>/date=<yyyy-mm-dd>/auditlogs_<internal-id>.json. New JSON files are delivered every few minutes, potentially overwriting existing files for each workspace. The delivery path is defined as part of the configuration.
    • Databricks can overwrite the delivered log files in your bucket at any time. If a file is overwritten, the existing content remains, but there may be additional lines for more auditable events.
    • Overwriting ensures exactly-once semantics without requiring read or delete access to your account.
  • Latency: Auditable Databricks events are typically logged within 15 minutes.

Audit log schema

The schema of audit log records is as follows. This section applies to both the Public Preview log delivery with Account API and the legacy audit logs.

  • version: the schema version of the audit log format
  • timestamp: UTC timestamp of the action
  • sourceIPAddress: the IP address of the source request
  • userAgent: the browser or API client used to make the request
  • sessionId: session ID of the action
  • userIdentity: information about the user that made the request
    • email: user email address
  • serviceName: the service that logged the request
  • actionName: the action, such as login, logout, read, write, and so on
  • requestId: unique request ID
  • requestParams: parameter key-value pairs used in the audited event
  • response: response to the request
    • errorMessage: the error message if there was an error
    • result: the result of the request
    • statusCode: HTTP status code that indicates whether the request succeeded
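
For a quick look at these fields in a delivered file, you can use standard command-line tools. The following is a minimal sketch that assumes you have copied one delivered file locally and that the file contains one JSON record per line; the file name is a placeholder.

# Hypothetical example: count events by service and action in one downloaded file.
# Assumes one JSON record per line; the file name is a placeholder.
jq -r '[.serviceName, .actionName] | @tsv' auditlogs.json | sort | uniq -c | sort -rn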

Audit events

The serviceName and actionName properties identify an audit event in an audit log record. The naming convention follows the Databricks REST API 2.0. This section applies to both the Public Preview log delivery with Account API and the legacy audit logs.

Databricks provides audit logs for the following services:

  • accounts
  • clusters
  • dbfs
  • genie
  • globalInitScripts
  • groups
  • iamRole
  • instancePools
  • jobs
  • mlflowExperiment
  • notebook
  • secrets
  • sqlPermissions, which has all the audit logs for table access when table ACLs are enabled.
  • ssh
  • workspace

Note

  • If actions take a long time, the request and response are logged separately but the request and response pair have the same requestId.
  • With the exception of mount-related operations, Databricks audit logs do not include DBFS-related operations. We recommend that you set up S3 server access logging, which can log object-level operations associated with an IAM role. If you map IAM roles to Databricks users, note that Databricks users cannot share IAM roles. A minimal CLI sketch for enabling server access logging follows this note.
  • Automated actions—such as resizing a cluster due to autoscaling or launching a job due to scheduling—are performed by the user System-User.
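
If you decide to enable S3 server access logging for a bucket, the following is a minimal AWS CLI sketch. The bucket names and prefix are placeholders, and the target bucket must already permit the S3 logging service to deliver logs to it.

# Hypothetical sketch: enable S3 server access logging on a data bucket.
# <data-bucket>, <access-log-bucket>, and the prefix are placeholders.
aws s3api put-bucket-logging \
  --bucket <data-bucket> \
  --bucket-logging-status '{"LoggingEnabled": {"TargetBucket": "<access-log-bucket>", "TargetPrefix": "s3-access-logs/"}}'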

Request parameters

The request parameters (field requestParams) for each supported service and action are listed in the following table:

Service Action Request Parameters
accounts add ["targetUserName","endpoint","targetUserId"]
  addPrincipalToGroup ["targetGroupId","endpoint","targetUserId","targetGroupName","targetUserName"]
  changePassword ["newPasswordSource","targetUserId","serviceSource","wasPasswordChanged","userId"]
  createGroup ["endpoint","targetGroupId","targetGroupName"]
  delete ["targetUserId","targetUserName","endpoint"]
  garbageCollectDbToken ["tokenExpirationTime","userId"]
  generateDbToken ["userId","tokenExpirationTime"]
  jwtLogin ["user"]
  login ["user"]
  logout ["user"]
  removeAdmin ["targetUserName","endpoint","targetUserId"]
  removeGroup ["targetGroupId","targetGroupName","endpoint"]
  resetPassword ["serviceSource","userId","endpoint","targetUserId","targetUserName","wasPasswordChanged","newPasswordSource"]
  revokeDbToken ["userId"]
  samlLogin ["user"]
  setAdmin ["endpoint","targetUserName","targetUserId"]
  tokenLogin ["tokenId","user"]
  validateEmail ["endpoint","targetUserName","targetUserId"]
clusters changeClusterAcl ["shardName","aclPermissionSet","targetUserId","resourceId"]
  create ["cluster_log_conf","num_workers","enable_elastic_disk","driver_node_type_id","start_cluster","docker_image","ssh_public_keys","aws_attributes","acl_path_prefix","node_type_id","instance_pool_id","spark_env_vars","init_scripts","spark_version","cluster_source","autotermination_minutes","cluster_name","autoscale","custom_tags","cluster_creator","enable_local_disk_encryption","idempotency_token","spark_conf","organization_id","no_driver_daemon","user_id"]
  createResult ["clusterName","clusterState","clusterId","clusterWorkers","clusterOwnerUserId"]
  delete ["cluster_id"]
  deleteResult ["clusterWorkers","clusterState","clusterId","clusterOwnerUserId","clusterName"]
  edit ["spark_env_vars","no_driver_daemon","enable_elastic_disk","aws_attributes","driver_node_type_id","custom_tags","cluster_name","spark_conf","ssh_public_keys","autotermination_minutes","cluster_source","docker_image","enable_local_disk_encryption","cluster_id","spark_version","autoscale","cluster_log_conf","instance_pool_id","num_workers","init_scripts","node_type_id"]
  permanentDelete ["cluster_id"]
  resize ["cluster_id","num_workers","autoscale"]
  resizeResult ["clusterWorkers","clusterState","clusterId","clusterOwnerUserId","clusterName"]
  restart ["cluster_id"]
  restartResult ["clusterId","clusterState","clusterName","clusterOwnerUserId","clusterWorkers"]
  start ["init_scripts_safe_mode","cluster_id"]
  startResult ["clusterName","clusterState","clusterWorkers","clusterOwnerUserId","clusterId"]
dbfs addBlock ["handle","data_length"]
  create ["path","bufferSize","overwrite"]
  delete ["recursive","path"]
  getSessionCredentials ["mountPoint"]
  mkdirs ["path"]
  mount ["mountPoint","owner"]
  move ["dst","source_path","src","destination_path"]
  put ["path","overwrite"]
  unmount ["mountPoint"]
genie databricksAccess ["duration","approver","reason","authType","user"]
globalInitScripts create ["name","position","script-SHA256","enabled"]
  update ["script_id","name","position","script-SHA256","enabled"]
  delete ["script_id"]
groups addPrincipalToGroup ["user_name","parent_name"]
  createGroup ["group_name"]
  getGroupMembers ["group_name"]
  removeGroup ["group_name"]
iamRole changeIamRoleAcl ["targetUserId","shardName","resourceId","aclPermissionSet"]
instancePools changeInstancePoolAcl ["shardName","resourceId","targetUserId","aclPermissionSet"]
  create ["enable_elastic_disk","preloaded_spark_versions","idle_instance_autotermination_minutes","instance_pool_name","node_type_id","custom_tags","max_capacity","min_idle_instances","aws_attributes"]
  delete ["instance_pool_id"]
  edit ["instance_pool_name","idle_instance_autotermination_minutes","min_idle_instances","preloaded_spark_versions","max_capacity","enable_elastic_disk","node_type_id","instance_pool_id","aws_attributes"]
jobs cancel ["run_id"]
  changeJobAcl ["shardName","aclPermissionSet","resourceId","targetUserId"]
  create ["spark_jar_task","email_notifications","notebook_task","spark_submit_task","timeout_seconds","libraries","name","spark_python_task","job_type","new_cluster","existing_cluster_id","max_retries","schedule"]
  delete ["job_id"]
  deleteRun ["run_id"]
  reset ["job_id","new_settings"]
  resetJobAcl ["grants","job_id"]
  runFailed ["jobClusterType","jobTriggerType","jobId","jobTaskType","runId","jobTerminalState","idInJob","orgId"]
  runNow ["notebook_params","job_id","jar_params","workflow_context"]
  runSucceeded ["idInJob","jobId","jobTriggerType","orgId","runId","jobClusterType","jobTaskType","jobTerminalState"]
  submitRun ["shell_command_task","run_name","spark_python_task","existing_cluster_id","notebook_task","timeout_seconds","libraries","new_cluster","spark_jar_task"]
  update ["fields_to_remove","job_id","new_settings"]
mlflowExperiment deleteMlflowExperiment ["experimentId","path","experimentName"]
  moveMlflowExperiment ["newPath","experimentId","oldPath"]
  restoreMlflowExperiment ["experimentId","path","experimentName"]
notebook attachNotebook ["path","clusterId","notebookId"]
  createNotebook ["notebookId","path"]
  deleteFolder ["path"]
  deleteNotebook ["notebookId","notebookName","path"]
  detachNotebook ["notebookId","clusterId","path"]
  importNotebook ["path"]
  moveNotebook ["newPath","oldPath","notebookId"]
  renameNotebook ["newName","oldName","parentPath","notebookId"]
  restoreFolder ["path"]
  restoreNotebook ["path","notebookId","notebookName"]
  takeNotebookSnapshot ["path"]
secrets createScope ["scope"]
  deleteScope ["scope"]
  deleteSecret ["key","scope"]
  getSecret ["scope","key"]
  listAcls ["scope"]
  listSecrets ["scope"]
  putSecret ["string_value","scope","key"]
sqlPermissions createSecurable ["securable"]
  grantPermission ["permission"]
  removeAllPermissions ["securable"]
  requestPermissions ["requests"]
  revokePermission ["permission"]
  showPermissions ["securable","principal"]
ssh login ["containerId","userName","port","publicKey","instanceId"]
  logout ["userName","containerId","instanceId"]
workspace changeWorkspaceAcl ["shardName","targetUserId","aclPermissionSet","resourceId"]
  fileCreate ["path"]
  fileDelete ["path"]
  moveWorkspaceNode ["destinationPath","path"]
  purgeWorkspaceNodes ["treestoreId"]

Analyze audit logs

You can analyze audit logs using Databricks. The following example uses logs to report on Databricks access and Apache Spark versions. This applies to both the Public Preview log delivery with Account API and the legacy audit logs, although the S3 delivery paths are different and the legacy logs are gzipped.

Load audit logs as a DataFrame and register the DataFrame as a temp table. See Amazon S3 for a detailed guide.

val df = spark.read.json("s3a://bucketName/path/to/auditLogs")
df.createOrReplaceTempView("audit_logs")

List the users who accessed Databricks and from where.

%sql
SELECT DISTINCT userIdentity.email, sourceIPAddress
FROM audit_logs
WHERE serviceName = "accounts" AND actionName LIKE "%login%"

Check the Apache Spark versions used.

%sql
SELECT requestParams.spark_version, COUNT(*)
FROM audit_logs
WHERE serviceName = "clusters" AND actionName = "create"
GROUP BY requestParams.spark_version

Check table data access.

%sql
SELECT *
FROM audit_logs
WHERE serviceName = "sqlPermissions" AND actionName = "requestPermissions"

Legacy audit log delivery

Important

This section describes the legacy audit log delivery framework. For the audit log delivery framework that uses the Account API, see Configure audit log delivery. Databricks recommends migrating to the new framework, which supports low-latency delivery (typically within 15 minutes) of auditable events. To migrate, enable the new log delivery and then disable the legacy audit log delivery configuration.

If your account is enabled for audit logging, the Databricks account owner configures where Databricks sends the logs. Admin users cannot configure audit log delivery.

  1. Log in to the Account Console.

  2. Click the Audit Logs tab.

  3. Configure the S3 bucket and directory:

    • S3 Bucket in <region name>: the S3 bucket where you want to store your audit logs. The bucket must exist.
    • Path: the path to the directory in the S3 bucket where you want to store the audit logs. For example, /databricks/auditlogs. If you want to store the logs at the bucket root, enter /.

    Databricks sends the audit logs to the specified S3 bucket and directory path, partitioned by date. For example, my-bucket/databricks/auditlogs/date=2018-01-15/part-0.json.gz.

Once logging is enabled for your account, Databricks automatically starts sending audit logs in human-readable format to your delivery location on a periodic basis. Logs are available within 72 hours of activation.

  • Encryption: Databricks encrypts audit logs using Amazon S3 server-side encryption.
  • Format: Databricks delivers audit logs as gzip-compressed JSON files with the file extension json.gz.
  • When: Databricks delivers audit logs daily and partitions the logs by date in yyyy-MM-dd format.
  • Other details:
    • Databricks delivers logs within 72 hours after the end of each day.
    • Each audit log record is unique.

Note

  • Databricks can overwrite the delivered log files in your bucket at any time during the three-day period after the log date. After three days, audit files become immutable. In other words, logs for 2018-01-06 are subject to overwrites through 2018-01-09, and you can safely archive them on 2018-01-10.
  • Overwriting ensures exactly-once semantics without requiring read or delete access to your account.

Configure access policy (legacy)

To configure Databricks access to your AWS S3 bucket using an access policy, follow the steps in this section.

Step 1: Generate the access policy (legacy)

In the Databricks Account Console, on the Audit Logs tab:

  1. Click the Generate Policy button. The generated policy should look like:

     {
       "Version": "2012-10-17",
       "Id": "DatabricksAuditLogs",
       "Statement": [
         {
           "Sid": "PutAuditLogs",
           "Effect": "Allow",
           "Principal": {
             "AWS": "arn:aws:iam::090101015318:role/DatabricksAuditLogs-WriterRole-VV4KJWX4FRIK"
           },
           "Action": [
             "s3:PutObject"
           ],
           "Resource": "arn:aws:s3:::AUDIT_LOG_BUCKET/audit_log_path/*"
         },
         {
           "Sid": "DenyNotContainingFullAccess",
           "Effect": "Deny",
           "Principal": {
             "AWS": "arn:aws:iam::090101015318:role/DatabricksAuditLogs-WriterRole-VV4KJWX4FRIK"
           },
           "Action": [
             "s3:PutObject"
           ],
           "Resource": "arn:aws:s3:::AUDIT_LOG_BUCKET/audit_log_path/*",
           "Condition": {
             "StringNotEquals": {
               "s3:x-amz-acl": "bucket-owner-full-control"
             }
           }
         }
       ]
     }
    

    This policy ensures that the Databricks AWS account has write permission on the bucket and directory that you specified. The first section grants Databricks write permissions. Databricks does not have read, list, or delete permission. The second section ensures that you have full control over everything that Databricks writes to your bucket.

  2. Copy the generated JSON policy to your clipboard.

Step 2: Apply the policy to the AWS S3 bucket (legacy)

  1. In the AWS console, go to the S3 service.
  2. Click the name of the bucket where you want to store the audit logs.
  3. Click the Permissions tab.
  4. Click the Bucket Policy button.
  5. Paste the policy string from Step 1.
  6. Click Save.

Step 3: Verify that the policy is applied correctly (legacy)

In the Databricks Account Console, on the Audit Logs tab, click the Verify Access button.

If you see a check mark, audit logs are configured correctly. If verification fails:

  1. Check that you entered the bucket name correctly, and that the AWS region is correct.
  2. Check that you copied the generated policy correctly to AWS.
  3. Contact your AWS account admin.