Data exfiltration protection architecture
This page is a feature-by-feature reference architecture for network-level data exfiltration protection on AWS. Each section describes one control area, such as identity, Unity Catalog governance, workspace restrictions, monitoring, or cloud-specific network isolation, and links to its implementation guide. For the concepts and security layer priorities behind these controls, see Data exfiltration protection.
- To deploy the full set of controls as a single bundle, use the Databricks Security Reference Architecture Terraform module, which implements the Isolated environment architecture end-to-end. See the AWS Security Reference Architecture Terraform module.
- To configure controls individually, use the guides linked in the sections below.
Identity and access controls
Identity-based controls are the first line of defense against data exfiltration. Without strong authentication and trusted access paths, a compromised or unmanaged identity can undermine every network-level control.
Unified login with SSO
Apply single sign-on (SSO) across all workspaces in the Databricks account using unified login. This ensures users authenticate through your corporate identity provider rather than using personal accounts or non-SSO methods.
Enable multifactor authentication (MFA) within your identity provider for an additional layer of verification.
Automated identity management
Implement SCIM provisioning to automate user lifecycle management. This ensures that former employees are automatically de-provisioned and cannot access workspaces after departure.
See Sync users and groups from your identity provider using SCIM.
Network access controls
Restrict workspace and account console access to trusted networks:
- Account-level IP access lists: Control access to the account console. See Configure IP access lists for the account console.
- Workspace-level IP access lists: Control access to individual workspaces. See Configure IP access lists for workspaces.
- Private connectivity: Use inbound PrivateLink to eliminate public workspace access entirely. See Configure Inbound PrivateLink.
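As a minimal sketch of how an allow list might be prepared before it is applied, the following validates CIDR ranges and builds a request body. The field names (`label`, `list_type`, `ip_addresses`) reflect the IP access lists REST API as commonly documented and should be verified against the linked guides; the label and address ranges are hypothetical documentation addresses.

```python
import ipaddress

def build_ip_access_list(label, cidrs, list_type="ALLOW"):
    """Validate each entry and return a request body for an IP access list.

    Field names are an assumption; confirm them in the implementation guide.
    """
    normalized = []
    for cidr in cidrs:
        # strict=False lets a single host such as 198.51.100.7 become a /32.
        network = ipaddress.ip_network(cidr, strict=False)
        normalized.append(str(network))
    return {"label": label, "list_type": list_type, "ip_addresses": normalized}

def is_allowed(source_ip, payload):
    """Return True if source_ip falls inside any range in the payload."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(c) for c in payload["ip_addresses"])

# Hypothetical corporate egress ranges.
payload = build_ip_access_list("corp-vpn", ["203.0.113.0/24", "198.51.100.7"])
print(payload)
print(is_allowed("203.0.113.50", payload))
```

Validating ranges locally before submitting them helps avoid locking administrators out with a malformed entry.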
Data governance controls
Network controls prevent unauthorized egress paths, but data governance controls ensure that even authorized compute resources can only access approved data destinations. Apply these controls regardless of which network security architecture you deploy.
Standard access control
Use Unity Catalog privileges to restrict who can read, write, or modify each catalog, schema, table, and volume. Grant the minimum privileges required for each role and group.
Privileges flow hierarchically: a grant on a catalog applies to all schemas and tables within it. Use this to enforce broad defaults, then narrow access at lower levels for sensitive data.
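The broad-then-narrow pattern can be sketched as a small helper that emits `GRANT` statements for one group. The catalog, schema, and group names are hypothetical, and the privilege set shown (`USE CATALOG`, `USE SCHEMA`, `SELECT`, `MODIFY`) is one possible minimal baseline rather than a recommended standard.

```python
def grant_statements(catalog, schema, group, read_only=True):
    """Return Unity Catalog GRANT statements scoped to a single schema.

    Granting at the schema level (not the catalog level) keeps the
    inherited scope as small as the role needs.
    """
    principal = f"`{group}`"  # backticks allow group names with spaces or dashes
    target = f"{catalog}.{schema}"
    statements = [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO {principal}",
        f"GRANT USE SCHEMA ON SCHEMA {target} TO {principal}",
        f"GRANT SELECT ON SCHEMA {target} TO {principal}",
    ]
    if not read_only:
        statements.append(f"GRANT MODIFY ON SCHEMA {target} TO {principal}")
    return statements

# Hypothetical names for illustration only.
for stmt in grant_statements("prod", "sales", "sales-analysts"):
    print(stmt)
```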
Attribute-based access control (ABAC)
ABAC governs data access based on tags attached to data objects, not just object identity. Use ABAC to enforce policies like "users can only query tables tagged pii=false" or "users in the EU group cannot read tables tagged region=US."
ABAC scales better than per-object GRANTs in large environments where tagging conventions are already in place. It also pairs well with row filters and column masks (below).
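To make the tag-driven evaluation concrete, here is a toy model of the two example policies above. It is not the Unity Catalog ABAC API; the `Table` class, tag names, and group names are illustrative assumptions that only show how a decision depends on tags rather than on object identity.

```python
from dataclasses import dataclass, field

@dataclass
class Table:
    name: str
    tags: dict = field(default_factory=dict)

def can_query(user_groups, table):
    """Evaluate the two example policies against a table's tags."""
    # Policy 1: only tables explicitly tagged pii=false are queryable.
    # Untagged tables are denied, so a missing tag fails closed.
    if table.tags.get("pii") != "false":
        return False
    # Policy 2: members of the EU group cannot read US-region tables.
    if "EU" in user_groups and table.tags.get("region") == "US":
        return False
    return True

orders_us = Table("orders_us", {"pii": "false", "region": "US"})
customers = Table("customers", {"pii": "true", "region": "EU"})

print(can_query({"EU"}, orders_us))         # EU user, US table
print(can_query({"analysts"}, orders_us))   # non-EU user, US table
print(can_query({"analysts"}, customers))   # PII table
```

Because the policy reads tags, a newly created table is covered as soon as it is tagged, with no per-object grant required.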
Row filters and column masks
Restrict what users see within a table:
- Row filters: Apply a SQL function that determines which rows a user can query. For example, restrict a sales table so each regional manager only sees rows for their region.
- Column masks: Apply a SQL function that transforms a column's value before it returns to the user. For example, mask credit card numbers to XXXX-XXXX-XXXX-1234 for non-finance users.
Row filters and column masks are evaluated at query time, so users can't bypass them with SELECT *.
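The logic of the two examples can be sketched in plain Python. In Unity Catalog the filter and mask would be SQL functions attached to the table; the function names, row shape, and the `is_finance` flag below are hypothetical stand-ins that only illustrate what is evaluated for each query.

```python
def mask_card(card_number, is_finance):
    """Column mask: show only the last four digits to non-finance users."""
    if is_finance:
        return card_number
    return f"XXXX-XXXX-XXXX-{card_number[-4:]}"

def region_filter(row, manager_region):
    """Row filter: keep only rows for the manager's own region."""
    return row["region"] == manager_region

def query_sales(rows, manager_region, is_finance):
    """Apply the filter and the mask, as the engine would at query time."""
    return [
        {**row, "card": mask_card(row["card"], is_finance)}
        for row in rows
        if region_filter(row, manager_region)
    ]

sales = [
    {"region": "EMEA", "amount": 120, "card": "4111-1111-1111-1234"},
    {"region": "APAC", "amount": 80, "card": "5500-0000-0000-0004"},
]
print(query_sales(sales, manager_region="EMEA", is_finance=False))
```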
Unity Catalog administrative restrictions
Restrict the creation of data access securables to administrators only:
- Storage credentials: Only allow admins to create storage credentials. Apply least-privilege cloud access policies (IAM roles, managed identities) for each credential. See Manage storage credentials.
- External locations: Only allow admins to create external locations that map to cloud storage paths. See Manage external locations.
- Database connections: Only allow admins to create connections to external databases through Lakehouse Federation. See Manage connections for Lakehouse Federation.
- Service credentials: Only allow admins to create service credentials for external cloud services. See Create service credentials.
Grant users permissions to use approved securables rather than create new ones. This prevents users from pointing compute at untrusted storage or endpoints.
Workspace bindings for catalogs
Bind Unity Catalog catalogs to specific workspaces to prevent cross-environment data access. For example, prevent development workspaces from reading production data.
Storage account policies
Implement firewalls or bucket policies on storage accounts so that they accept traffic only from approved sources:
- Configure S3 bucket policies to allow access only from the Databricks VPC or specific VPC endpoints. Use condition keys to restrict access based on source.
- Create IAM roles with minimal permissions and trust policies limiting which Databricks resources can assume them.
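A bucket policy that restricts access by source can be sketched as follows, using the `aws:SourceVpce` condition key. The bucket name and endpoint ID are hypothetical, and a blanket deny like this one is an assumption for illustration: a real policy usually needs exceptions (for example, for administrative roles or other required access paths) so that legitimate access is not blocked.

```python
import json

def vpce_only_bucket_policy(bucket, vpce_ids):
    """Deny all S3 actions unless the request arrives via an approved VPC endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideApprovedVpce",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                # Requests from any other source match the Deny.
                "Condition": {
                    "StringNotEquals": {"aws:SourceVpce": list(vpce_ids)}
                },
            }
        ],
    }

# Hypothetical bucket and endpoint identifiers.
policy = vpce_only_bucket_policy("example-data-bucket", ["vpce-0123456789abcdef0"])
print(json.dumps(policy, indent=2))
```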
Workspace restrictions
Workspace admin settings control data download and export paths through the Databricks UI. Disable these settings to prevent users from extracting data through the workspace interface.
| Setting | Risk mitigated |
|---|---|
| Disable notebook results download | Users downloading query results to local machines |
| Disable volume files download | Users downloading volume files to local machines |
| Disable notebook and file exporting | Users exporting notebooks or files from the workspace |
| Disable SQL results download | Users downloading SQL query results |
| Disable MLflow run artifact download | Users downloading MLflow experiment artifacts |
| Disable results table clipboard | Users copying tabular data to the clipboard |
Configure these settings in the workspace admin console under security settings. See Manage your workspace.
Monitoring and detection
Preventive controls reduce the risk of data exfiltration, but monitoring detects when controls fail or when attackers bypass them.
System tables for audit monitoring
Use Databricks system tables to monitor data access patterns. The audit log system table (see Audit log system table reference) captures workspace events, including:
- User authentication and access attempts.
- Data read and write operations.
- Administrative configuration changes.
- Credential usage and external location access.
Set up alerts for suspicious activity, such as unusual data volumes, access from unexpected locations, or attempts to access unauthorized resources.
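One simple form of such an alert is a per-user threshold on read events. The sketch below works on plain dictionaries; the keys `user_email` and `action_name` are assumptions loosely modeled on audit log fields, and the action names and threshold are illustrative placeholders rather than values taken from the audit log reference.

```python
from collections import Counter

def flag_heavy_readers(events, read_actions, threshold):
    """Return users whose count of read events exceeds the threshold."""
    counts = Counter(
        e["user_email"] for e in events if e["action_name"] in read_actions
    )
    return sorted(user for user, n in counts.items() if n > threshold)

# Hypothetical events; action names are placeholders.
events = (
    [{"user_email": "alice@example.com", "action_name": "readTable"}] * 50
    + [{"user_email": "bob@example.com", "action_name": "readTable"}] * 3
    + [{"user_email": "bob@example.com", "action_name": "login"}] * 100
)
print(flag_heavy_readers(events, read_actions={"readTable"}, threshold=10))
```

A fixed threshold is the simplest option; comparing each user against their own historical baseline tends to produce fewer false positives.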
Cloud-native log integration
Ingest cloud-specific logs to supplement Databricks system tables:
- Configure AWS CloudTrail to capture S3 access events and IAM role assumptions, and enable VPC Flow Logs to capture network traffic metadata.
Correlate cloud-native logs with Databricks audit logs for complete visibility into data movement across your environment.
AWS architecture
Network isolation
Deploy Databricks in a customer-managed VPC with private subnets (see Configure a customer-managed VPC):
- Enable secure cluster connectivity (SCC) to eliminate public IPs on compute resources. See Classic compute plane networking.
- Configure security groups to restrict egress to authorized destinations only.
- Use route tables to prevent direct internet access.
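A route table can be checked for direct internet paths with a short audit function. The input shape below (`Routes`, `DestinationCidrBlock`, `GatewayId`) follows the structure returned by the EC2 `DescribeRouteTables` call; the identifiers are hypothetical, and the check only covers IPv4 default routes to an internet gateway.

```python
def internet_routes(route_table):
    """Return routes that send 0.0.0.0/0 directly to an internet gateway."""
    findings = []
    for route in route_table.get("Routes", []):
        is_default = route.get("DestinationCidrBlock") == "0.0.0.0/0"
        # Internet gateway IDs start with "igw-"; other targets (for example
        # a firewall endpoint or NAT gateway) use different keys or prefixes.
        if is_default and route.get("GatewayId", "").startswith("igw-"):
            findings.append(route)
    return findings

# Hypothetical route table for a private subnet.
private_rt = {
    "RouteTableId": "rtb-0example",
    "Routes": [
        {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
        {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0example"},
    ],
}
print(internet_routes(private_rt))
```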
Private connectivity
Establish private connections to AWS services and the Databricks control plane:
- Classic compute plane PrivateLink: Connect to workspace and SCC relay. See Configure classic private connectivity to Databricks.
- Inbound PrivateLink: Enable user access without public internet. See Configure Inbound PrivateLink.
- VPC endpoints: Create S3 gateway endpoint (no cost) and interface endpoints for STS and Kinesis.
- VPC endpoint policies: Restrict access to authorized AWS resources only.
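A VPC endpoint policy for the S3 gateway endpoint might be generated as shown in this sketch. The bucket names are hypothetical and the action list is an assumed minimal set; a working policy must also allow any additional buckets that the platform itself requires in your region, which are not listed here.

```python
import json

def s3_endpoint_policy(buckets):
    """Allow S3 access through the endpoint only for the listed buckets."""
    resources = []
    for bucket in buckets:
        resources.append(f"arn:aws:s3:::{bucket}")    # bucket-level actions
        resources.append(f"arn:aws:s3:::{bucket}/*")  # object-level actions
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowApprovedBucketsOnly",
                "Effect": "Allow",
                "Principal": "*",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:DeleteObject",
                    "s3:ListBucket",
                    "s3:GetBucketLocation",
                ],
                "Resource": resources,
            }
        ],
    }

# Hypothetical bucket names.
endpoint_policy = s3_endpoint_policy(["example-workspace-root", "example-data-bucket"])
print(json.dumps(endpoint_policy, indent=2))
```

Because endpoint policies have no implicit allow, any bucket not listed becomes unreachable through the endpoint, which is the property that blocks writes to attacker-controlled buckets.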
Egress control
Deploy a third-party firewall appliance (such as Palo Alto) integrated with Gateway Load Balancer to inspect outbound traffic:
- Configure firewall rules for approved destinations (for example, PyPI, Maven, and external APIs).
- Route internet-bound traffic (0.0.0.0/0) through the firewall.
- Route AWS service traffic through VPC endpoints.
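The allowlist logic a firewall rule set encodes can be sketched as a domain-suffix match. This is an illustration of the matching rule only, not the configuration syntax of any firewall product; the domain list is an assumed example of package repository hosts and should be replaced with the destinations your workloads actually need.

```python
# Assumed example destinations; adjust to your own approved list.
ALLOWED_DOMAINS = ("pypi.org", "files.pythonhosted.org", "repo1.maven.org")

def is_egress_allowed(hostname, allowed=ALLOWED_DOMAINS):
    """Allow a host only if it equals, or is a subdomain of, an approved domain."""
    host = hostname.lower().rstrip(".")
    # Matching on "." + domain prevents look-alikes such as "evilpypi.org".
    return any(host == d or host.endswith("." + d) for d in allowed)

for name in ("pypi.org", "files.pythonhosted.org", "evilpypi.org", "example.com"):
    print(name, is_egress_allowed(name))
```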
Access policies
Implement least-privilege access using IAM and bucket policies:
- IAM roles: Create roles with minimal permissions and trust policies limiting which Databricks resources can assume them.
- S3 bucket policies: Allow access only from the Databricks VPC or specific VPC endpoints. Use condition keys to restrict access based on source.
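The trust-policy half of this pattern can be sketched with an `sts:ExternalId` condition, which limits who can assume the role even when the principal is known. Both the principal ARN and the external ID below are hypothetical placeholders; use the values given in the storage credential setup guide for your account.

```python
import json

def trust_policy(principal_arn, external_id):
    """Allow only the given principal, presenting the given external ID, to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "sts:AssumeRole",
                # The external ID guards against confused-deputy access.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }

# Placeholder values for illustration only.
role_trust = trust_policy(
    "arn:aws:iam::111122223333:role/example-trusted-role",
    "00000000-0000-0000-0000-000000000000",
)
print(json.dumps(role_trust, indent=2))
```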
Serverless security
Configure serverless egress control to restrict outbound traffic from serverless compute. Define allowed destinations using IP ranges, FQDNs, or private endpoints. See What is serverless egress control?
See also
- Network reference architectures: Network security architectures (managed, hardened, isolated).
- Security and compliance: Security and compliance controls beyond networking.