Data exfiltration protection architecture

This page is a feature-by-feature reference architecture for network-level data exfiltration protection on AWS. Each section describes one control area (identity, Unity Catalog governance, workspace restrictions, monitoring, and cloud-specific network isolation) and links to its implementation guide. For the concepts and security layer priorities behind these controls, see Data exfiltration protection.

To deploy the full set of controls as a single bundle, use the Databricks Security Reference Architecture Terraform module, which implements the Isolated environment architecture end-to-end. See the AWS Security Reference Architecture Terraform module.
To configure controls individually, use the sections below.

Identity and access controls

Identity-based controls are the first line of defense against data exfiltration. Without strong authentication and restricted access paths, a compromised or weakly protected identity can undermine every network-level control.

Unified login with SSO

Apply single sign-on (SSO) across all workspaces in the Databricks account using unified login. This ensures users authenticate through your corporate identity provider rather than using personal accounts or non-SSO methods.

Enable multifactor authentication (MFA) within your identity provider for an additional layer of verification.

See Enable unified login and Configure SSO in Databricks.

Automated identity management

Implement SCIM provisioning to automate user lifecycle management. This ensures that former employees are automatically de-provisioned and cannot access workspaces after departure.

See Sync users and groups from your identity provider using SCIM.

Network access controls

Restrict workspace and account console access to trusted networks:

Account-level IP access lists: Control access to the account console. See Configure IP access lists for the account console.
Workspace-level IP access lists: Control access to individual workspaces. See Configure IP access lists for workspaces.
Private connectivity: Use inbound PrivateLink to eliminate public workspace access entirely. See Configure inbound PrivateLink for workspaces.
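As a rough illustration of the IP access list controls above, the following sketch validates a set of trusted CIDR ranges and assembles a request body. The field names (`label`, `list_type`, `ip_addresses`) reflect an assumption about the IP access lists REST API and should be checked against the API reference. The helper name, the label, and the address ranges are hypothetical placeholders.

```python
import ipaddress
import json


def build_ip_access_list(label, cidrs, list_type="ALLOW"):
    """Validate CIDR blocks and return a request body for an IP access list.

    The body shape is an assumption; verify it against the API reference.
    """
    if list_type not in ("ALLOW", "BLOCK"):
        raise ValueError(f"unsupported list_type: {list_type}")
    normalized = []
    for cidr in cidrs:
        # strict=True rejects entries with host bits set, such as 10.0.0.5/24.
        network = ipaddress.ip_network(cidr, strict=True)
        normalized.append(str(network))
    return {"label": label, "list_type": list_type, "ip_addresses": normalized}


# Placeholder ranges taken from the documentation-only address blocks.
payload = build_ip_access_list("corp-vpn", ["203.0.113.0/24", "198.51.100.7"])
print(json.dumps(payload, indent=2))
```

Validating the ranges before submitting them helps catch typos that would otherwise lock trusted users out or admit an unintended network.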

Data governance controls

Network controls prevent unauthorized egress paths, but data governance controls ensure that even authorized compute resources can only access approved data destinations. Apply these controls regardless of which network security architecture you deploy.

Standard access control

Use Unity Catalog privileges to restrict who can read, write, or modify each catalog, schema, table, and volume. Grant the minimum privileges required for each role and group.

Privileges flow hierarchically: a grant on a catalog applies to all schemas and tables within it. Use this to enforce broad defaults, then narrow access at lower levels for sensitive data.

See Manage privileges in Unity Catalog.

Attribute-based access control (ABAC)

ABAC governs data access based on tags attached to data objects, not just object identity. Use ABAC to enforce policies like "users can only query tables tagged pii=false" or "users in the EU group cannot read tables tagged region=US."

ABAC scales better than per-object GRANTs in large environments where tagging conventions are already in place. It also pairs well with row filters and column masks (below).

See Attribute-based access control in Unity Catalog.

Row filters and column masks

Restrict what users see within a table:

Row filters: Apply a SQL function that determines which rows a user can query. For example, restrict a sales table so each regional manager only sees rows for their region.
Column masks: Apply a SQL function that transforms a column's value before it returns to the user. For example, mask credit card numbers to XXXX-XXXX-XXXX-1234 for non-finance users.

Row filters and column masks are evaluated at query time, so users can't bypass them with SELECT *.

See Row filters and column masks.
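The two examples above can be sketched in plain Python to show the logic involved. This is only an illustration: in Unity Catalog, row filters and column masks are SQL functions attached to a table, and the function names and sample data here are hypothetical.

```python
def mask_card_number(card_number, is_finance_user):
    """Return the full number for finance users, otherwise only the last four digits."""
    if is_finance_user:
        return card_number
    digits = [ch for ch in card_number if ch.isdigit()]
    last_four = "".join(digits[-4:])
    return f"XXXX-XXXX-XXXX-{last_four}"


def filter_rows_for_manager(rows, manager_region):
    """Keep only the rows whose region matches the manager's region."""
    return [row for row in rows if row["region"] == manager_region]


# Hypothetical sample data.
sales = [
    {"region": "EMEA", "amount": 120},
    {"region": "APAC", "amount": 75},
]

print(mask_card_number("4111-1111-1111-1234", is_finance_user=False))
print(filter_rows_for_manager(sales, "EMEA"))
```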

Unity Catalog administrative restrictions

Restrict the creation of data access securables to administrators only:

Storage credentials: Only allow admins to create storage credentials. Apply least-privilege cloud access policies (IAM roles, managed identities) for each credential. See Manage storage credentials.
External locations: Only allow admins to create external locations that map to cloud storage paths. See Manage external locations.
Database connections: Only allow admins to create connections to external databases through Lakehouse Federation. See Manage connections for Lakehouse Federation.
Service credentials: Only allow admins to create service credentials for external cloud services. See Create service credentials.

Grant users permissions to use approved securables rather than create new ones. This prevents users from pointing compute at untrusted storage or endpoints.

Workspace bindings for catalogs

Bind Unity Catalog catalogs to specific workspaces to prevent cross-environment data access. For example, prevent development workspaces from reading production data.

See Workspace-catalog binding.

Storage account policies

Implement firewalls or bucket policies on storage accounts to accept traffic only from approved sources:

Configure S3 bucket policies to allow access only from the Databricks VPC or specific VPC endpoints. Use condition keys to restrict access based on source.
Create IAM roles with minimal permissions and trust policies limiting which Databricks resources can assume them.
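A minimal sketch of such a bucket policy, built as a Python dictionary, is shown below. It denies all S3 actions unless the request arrives through an approved VPC endpoint, using the `aws:SourceVpce` condition key. The bucket name and endpoint ID are hypothetical, and a deny this broad can also block administrators and other services that legitimately need the bucket, so treat it as a starting point to adapt and test rather than a finished policy.

```python
import json


def build_bucket_policy(bucket_name, allowed_vpce_ids):
    """Return a bucket policy that denies requests from outside approved VPC endpoints."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyAccessOutsideApprovedVpcEndpoints",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both bucket-level and object-level operations.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {
                    "StringNotEquals": {"aws:SourceVpce": sorted(allowed_vpce_ids)}
                },
            }
        ],
    }


# Hypothetical bucket name and endpoint ID.
bucket_policy = build_bucket_policy("example-data-bucket", ["vpce-0123456789abcdef0"])
print(json.dumps(bucket_policy, indent=2))
```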

Workspace restrictions

Workspace admin settings control data download and export paths through the Databricks UI. Disable these settings to prevent users from extracting data through the workspace interface.

Setting	Risk mitigated
Disable notebook results download	Users downloading query results to local machines
Disable volume files download	Users downloading volume files to local machines
Disable notebook and file exporting	Users exporting notebooks or files from the workspace
Disable SQL results download	Users downloading SQL query results
Disable MLflow run artifact download	Users downloading MLflow experiment artifacts
Disable results table clipboard	Users copying tabular data to the clipboard

Configure these settings in the workspace admin console under security settings. See Manage your workspace.

Monitoring and detection

Preventive controls reduce the risk of data exfiltration, but monitoring detects when controls fail or when attackers bypass them.

System tables for audit monitoring

Use Databricks system tables to monitor data access patterns. The audit log system table (see Audit log system table reference) captures workspace events, including:

User authentication and access attempts.
Data read and write operations.
Administrative configuration changes.
Credential usage and external location access.

Set up alerts for suspicious activity, such as unusual data volumes, access from unexpected locations, or attempts to access unauthorized resources.
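One simple way to flag unusual data volumes is to compare each user's latest daily total against that user's own history. The sketch below assumes you have already aggregated a per-user series of daily byte counts from your logs. The input shape, the function name, and the three-standard-deviation threshold are all hypothetical choices, not part of any Databricks feature.

```python
from statistics import mean, pstdev


def flag_unusual_volume(daily_bytes_by_user, threshold=3.0):
    """Return users whose latest daily volume exceeds mean + threshold * stdev of prior days."""
    flagged = []
    for user, series in daily_bytes_by_user.items():
        history, latest = series[:-1], series[-1]
        if len(history) < 2:
            continue  # Not enough history to form a baseline.
        limit = mean(history) + threshold * pstdev(history)
        if latest > limit:
            flagged.append(user)
    return flagged


# Hypothetical aggregates: the last value in each list is the most recent day.
volumes = {
    "alice@example.com": [100, 110, 90, 105, 5000],
    "bob@example.com": [200, 210, 190, 205, 207],
}
print(flag_unusual_volume(volumes))
```

A per-user baseline avoids flagging heavy but consistent users, although a production alert would likely need seasonality handling and a minimum-volume floor.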

Cloud-native log integration

Ingest cloud-specific logs to supplement Databricks system tables:

Configure AWS CloudTrail to capture S3 access events and IAM role assumptions, and enable VPC Flow Logs to record network traffic.

Correlate cloud-native logs with Databricks audit logs for complete visibility into data movement across your environment.

AWS architecture

Network isolation

Deploy Databricks in a customer-managed VPC with private subnets (see Configure a customer-managed VPC):

Enable secure cluster connectivity (SCC) to eliminate public IPs on compute resources. See Classic compute plane networking.
Configure security groups to restrict egress to authorized destinations only.
Use route tables to prevent direct internet access.

Private connectivity

Establish private connections to AWS services and the Databricks control plane:

Classic compute plane PrivateLink: Connect to the workspace and the secure cluster connectivity (SCC) relay. See Configure classic private connectivity to Databricks.
Inbound PrivateLink: Enable user access without public internet. See Configure inbound PrivateLink for workspaces.
VPC endpoints: Create an S3 gateway endpoint (no additional charge) and interface endpoints for STS and Kinesis.
VPC endpoint policies: Restrict access to authorized AWS resources only.
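As a sketch of a VPC endpoint policy for the S3 gateway endpoint, the following builds a policy document that allows common object operations only on an approved set of buckets. The bucket names and the helper name are hypothetical, and the action list is an assumption to adjust for your workloads. A real policy would likely also need to include any Databricks-managed buckets that your compute depends on.

```python
import json


def build_s3_endpoint_policy(approved_buckets):
    """Return an endpoint policy that allows S3 access to approved buckets only."""
    resources = []
    for bucket in sorted(approved_buckets):
        resources.append(f"arn:aws:s3:::{bucket}")    # Bucket-level actions.
        resources.append(f"arn:aws:s3:::{bucket}/*")  # Object-level actions.
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowApprovedBucketsOnly",
                "Effect": "Allow",
                "Principal": "*",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:DeleteObject",
                    "s3:ListBucket",
                    "s3:GetBucketLocation",
                ],
                "Resource": resources,
            }
        ],
    }


# Hypothetical bucket names.
endpoint_policy = build_s3_endpoint_policy({"example-data-bucket", "example-logs-bucket"})
print(json.dumps(endpoint_policy, indent=2))
```

Because an endpoint policy is an allow list, traffic through the endpoint to any bucket not listed is refused, which blocks copies to attacker-controlled buckets even when credentials for those buckets are present.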

Egress control

Deploy a third-party firewall appliance (such as Palo Alto) integrated with Gateway Load Balancer to inspect outbound traffic:

Configure firewall rules for approved destinations (for example, PyPI, Maven, and external APIs).
Route internet-bound traffic (0.0.0.0/0) through the firewall.
Route AWS service traffic through VPC endpoints.

Access policies

Implement least-privilege access using IAM and bucket policies:

IAM roles: Create roles with minimal permissions and trust policies limiting which Databricks resources can assume them.
S3 bucket policies: Allow access only from the Databricks VPC or specific VPC endpoints. Use condition keys to restrict access based on source.

Serverless security

Configure serverless egress control to restrict outbound traffic from serverless compute. Define allowed destinations using IP ranges, FQDNs, or private endpoints. See What is serverless egress control?
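To make the allow-list semantics concrete, the sketch below checks a destination against a set of approved FQDNs and IP ranges. It is a conceptual model only and does not reflect how the platform evaluates egress rules. The function name and all destinations are hypothetical, and hostnames are matched exactly with no wildcard support.

```python
import ipaddress


def is_destination_allowed(destination, allowed_fqdns, allowed_cidrs):
    """Return True if the destination matches an approved FQDN or falls in an approved range."""
    try:
        address = ipaddress.ip_address(destination)
    except ValueError:
        # Not an IP literal, so treat it as a hostname and compare case-insensitively.
        hostname = destination.lower().rstrip(".")
        return hostname in {fqdn.lower().rstrip(".") for fqdn in allowed_fqdns}
    return any(address in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


# Hypothetical allow list.
fqdns = ["pypi.org", "files.pythonhosted.org"]
cidrs = ["10.0.0.0/8"]

for target in ["PyPI.org", "10.1.2.3", "exfil.example.com", "8.8.8.8"]:
    print(target, is_destination_allowed(target, fqdns, cidrs))
```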
