Data exfiltration protection architecture

This page is a feature-by-feature reference architecture for network-level data exfiltration protection on AWS. Each section describes one control area, such as identity, Unity Catalog governance, workspace restrictions, monitoring, or cloud-specific network isolation, and links to its implementation guide. For the concepts and security layer priorities behind these controls, see Data exfiltration protection.

Identity and access controls

Identity-based controls are the first line of defense against data exfiltration. Without strong authentication and trusted access paths, a compromised or unmanaged identity can undermine even well-designed network-level controls.

Unified login with SSO

Apply single sign-on (SSO) across all workspaces in the Databricks account using unified login. This ensures users authenticate through your corporate identity provider rather than using personal accounts or non-SSO methods.

Enable multifactor authentication (MFA) within your identity provider for an additional layer of verification.

See Enable unified login and Configure SSO in Databricks.

Automated identity management

Implement SCIM provisioning to automate user lifecycle management. This ensures that former employees are automatically de-provisioned and cannot access workspaces after departure.

See Sync users and groups from your identity provider using SCIM.
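
The de-provisioning step that SCIM automates can be sketched as follows. This is a minimal illustration, not the Databricks SCIM connector: the function names and sample users are hypothetical, and the request body follows the generic SCIM 2.0 PatchOp format, so check the exact payload against the SCIM API reference for your identity provider and workspace before relying on it.

```python
# Generic SCIM 2.0 message schema for PATCH operations.
SCIM_PATCH_SCHEMA = "urn:ietf:params:scim:api:messages:2.0:PatchOp"


def users_to_deactivate(idp_users, workspace_users):
    """Return workspace users that no longer exist in the identity provider."""
    return sorted(set(workspace_users) - set(idp_users))


def deactivate_patch_body():
    """Build a generic SCIM 2.0 PatchOp body that marks a user inactive."""
    return {
        "schemas": [SCIM_PATCH_SCHEMA],
        "Operations": [{"op": "replace", "path": "active", "value": False}],
    }


# Hypothetical sample data: one account remains after the employee left.
idp = ["ana@example.com", "li@example.com"]
workspace = ["ana@example.com", "li@example.com", "former@example.com"]
stale = users_to_deactivate(idp, workspace)
print(stale)
```

A SCIM connector runs this comparison continuously, which is why manual offboarding checklists are not a substitute for it.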

Network access controls

Restrict workspace and account console access to trusted networks:

Data governance controls

Network controls prevent unauthorized egress paths, but data governance controls ensure that even authorized compute resources can only access approved data destinations. Apply these controls regardless of which network security architecture you deploy.

Standard access control

Use Unity Catalog privileges to restrict who can read, write, or modify each catalog, schema, table, and volume. Grant the minimum privileges required for each role and group.

Privileges are inherited downward: a grant on a catalog applies to all schemas and tables within it. Reserve catalog-level grants for broad defaults, and grant access to sensitive data at the schema or table level instead.

See Manage privileges in Unity Catalog.
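
As a rough sketch of least-privilege granting, the helper below builds the three statements a group typically needs to read a single table. The helper name and the catalog, schema, table, and group names are hypothetical, and the strings are only printed here, not executed against a workspace.

```python
def grant_read_only(catalog, schema, table, group):
    """Return the GRANT statements a group needs to read one table."""
    principal = f"`{group}`"
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO {principal}",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO {principal}",
        f"GRANT SELECT ON TABLE {catalog}.{schema}.{table} TO {principal}",
    ]


# Hypothetical object and group names, for illustration only.
statements = grant_read_only("main", "sales", "orders", "analysts")
for statement in statements:
    print(statement)
```

Granting SELECT on the table rather than on the catalog keeps the group from reading sibling tables. The helper does no input validation, so it should only be fed trusted identifiers.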

Attribute-based access control (ABAC)

ABAC governs data access based on tags attached to data objects, not just object identity. Use ABAC to enforce policies like "users can only query tables tagged pii=false" or "users in the EU group cannot read tables tagged region=US."

ABAC scales better than per-object GRANTs in large environments where tagging conventions are already in place. It also pairs well with row filters and column masks (below).

See Attribute-based access control in Unity Catalog.
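
The two example policies above can be modeled in a few lines. This is a conceptual sketch of tag-based evaluation only, not the Unity Catalog policy engine or its syntax; the function name and the tag and group values are assumptions taken from the examples.

```python
def can_query(user_groups, table_tags):
    """Evaluate two example tag-based rules against a table's tags."""
    # Rule 1: only tables explicitly tagged pii=false are queryable.
    if table_tags.get("pii") != "false":
        return False
    # Rule 2: members of the EU group cannot read tables tagged region=US.
    if "EU" in user_groups and table_tags.get("region") == "US":
        return False
    return True


print(can_query({"EU"}, {"pii": "false", "region": "US"}))
print(can_query({"EU"}, {"pii": "false", "region": "EU"}))
```

In this sketch an untagged table is denied by default, which is the safer choice when tagging coverage is incomplete.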

Row filters and column masks

Restrict what users see within a table:

  • Row filters: Apply a SQL function that determines which rows a user can query. For example, restrict a sales table so each regional manager only sees rows for their region.
  • Column masks: Apply a SQL function that transforms a column's value before it returns to the user. For example, mask credit card numbers to XXXX-XXXX-XXXX-1234 for non-finance users.

Row filters and column masks are evaluated at query time, so users can't bypass them with SELECT *.

See Row filters and column masks.
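
The effect of the two examples above can be illustrated in plain Python. In Unity Catalog the logic lives in SQL functions attached to the table; the functions and sample rows below are hypothetical stand-ins that only show what the user ends up seeing.

```python
def mask_card_number(card_number, is_finance_user):
    """Show the full number to finance users, otherwise only the last four digits."""
    if is_finance_user:
        return card_number
    return "XXXX-XXXX-XXXX-" + card_number[-4:]


def filter_rows(rows, manager_region):
    """Keep only the rows that belong to the manager's region."""
    return [row for row in rows if row["region"] == manager_region]


# Hypothetical sales rows.
sales = [
    {"region": "EMEA", "card": "4111-1111-1111-1234"},
    {"region": "APAC", "card": "5500-0000-0000-5678"},
]

# What a non-finance EMEA manager would see.
visible = [
    {**row, "card": mask_card_number(row["card"], is_finance_user=False)}
    for row in filter_rows(sales, "EMEA")
]
print(visible)
```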

Unity Catalog administrative restrictions

Restrict the creation of data access securables to administrators only:

  • Storage credentials: Only allow admins to create storage credentials. Apply least-privilege cloud access policies (IAM roles, managed identities) for each credential. See Manage storage credentials.
  • External locations: Only allow admins to create external locations that map to cloud storage paths. See Manage external locations.
  • Database connections: Only allow admins to create connections to external databases through Lakehouse Federation. See Manage connections for Lakehouse Federation.
  • Service credentials: Only allow admins to create service credentials for external cloud services. See Create service credentials.

Grant users permissions to use approved securables rather than create new ones. This prevents users from pointing compute at untrusted storage or endpoints.

Workspace bindings for catalogs

Bind Unity Catalog catalogs to specific workspaces to prevent cross-environment data access. For example, prevent development workspaces from reading production data.

See Workspace-catalog binding.

Storage account policies

Implement firewalls or bucket policies on storage accounts to accept traffic only from approved sources:

  • Configure S3 bucket policies to allow access only from the Databricks VPC or specific VPC endpoints. Use condition keys to restrict access based on source.
  • Create IAM roles with minimal permissions and trust policies limiting which Databricks resources can assume them.
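
A bucket policy that restricts access by source can be sketched as below. The bucket name and VPC endpoint ID are placeholders, and the helper is hypothetical; `aws:SourceVpce` is a standard AWS global condition key.

```python
import json


def vpce_only_bucket_policy(bucket, vpce_id):
    """Build a policy that denies S3 access unless it arrives through one VPC endpoint."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyAccessOutsideApprovedVpcEndpoint",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"StringNotEquals": {"aws:SourceVpce": vpce_id}},
            }
        ],
    }


# Placeholder identifiers, for illustration only.
policy = vpce_only_bucket_policy("example-data-bucket", "vpce-0123456789abcdef0")
print(json.dumps(policy, indent=2))
```

A blanket deny like this also blocks console and administrative access that does not traverse the endpoint, so test it on a non-production bucket and add exceptions for any other access paths you depend on.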

Workspace restrictions

Workspace admin settings control data download and export paths through the Databricks UI. Disable these settings to prevent users from extracting data through the workspace interface.

| Setting | Risk mitigated |
| --- | --- |
| Disable notebook results download | Users downloading query results to local machines |
| Disable volume files download | Users downloading volume files to local machines |
| Disable notebook and file exporting | Users exporting notebooks or files from the workspace |
| Disable SQL results download | Users downloading SQL query results |
| Disable MLflow run artifact download | Users downloading MLflow experiment artifacts |
| Disable results table clipboard | Users copying tabular data to the clipboard |

Configure these settings in the workspace admin console under security settings. See Manage your workspace.
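
One way to keep these settings from drifting is a small checklist audit. The sketch below assumes you have already collected the current state of each setting into a dictionary; the keys are descriptive labels based on the settings listed above, not actual Databricks API setting names.

```python
# Export paths that should be disabled (descriptive labels, not API keys).
REQUIRED_DISABLED = [
    "notebook results download",
    "volume files download",
    "notebook and file exporting",
    "SQL results download",
    "MLflow run artifact download",
    "results table clipboard",
]


def open_export_paths(settings):
    """Return export paths that are still enabled or were not reported at all."""
    # A missing entry is treated as enabled so that gaps are surfaced.
    return [name for name in REQUIRED_DISABLED if settings.get(name, True)]


# Hypothetical snapshot: True means the feature is still enabled.
snapshot = {
    "notebook results download": False,
    "volume files download": False,
    "notebook and file exporting": True,
    "SQL results download": False,
    "MLflow run artifact download": False,
}
print(open_export_paths(snapshot))
```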

Monitoring and detection

Preventive controls reduce the risk of data exfiltration, but monitoring detects when controls fail or when attackers bypass them.

System tables for audit monitoring

Use Databricks system tables to monitor data access patterns (see Monitor costs using system tables). The audit log system table (see Audit log system table reference) captures workspace events, including:

  • User authentication and access attempts.
  • Data read and write operations.
  • Administrative configuration changes.
  • Credential usage and external location access.

Set up alerts for suspicious activity, such as unusual data volumes, access from unexpected locations, or attempts to access unauthorized resources.
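
As an illustration of an "unusual data volume" alert, the sketch below flags users whose event count is far above the median. It works on plain dictionaries; the `user` field, the threshold factor, and the sample events are assumptions, and in practice the input would come from a query over the audit log.

```python
from collections import Counter
from statistics import median


def unusual_readers(events, factor=5):
    """Return users whose event count exceeds `factor` times the median count."""
    counts = Counter(event["user"] for event in events)
    baseline = median(counts.values())
    return sorted(user for user, n in counts.items() if n > factor * baseline)


# Hypothetical read events: "sam" reads far more than the others.
events = (
    [{"user": "ana"}] * 4
    + [{"user": "li"}] * 5
    + [{"user": "sam"}] * 60
)
print(unusual_readers(events))
```

A median baseline is less sensitive to a single heavy user than a mean, but any fixed factor needs tuning against your own traffic.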

Cloud-native log integration

Ingest cloud-specific logs to supplement Databricks system tables:

  • Configure AWS CloudTrail to capture S3 access events and IAM role assumptions, and enable VPC Flow Logs to record network traffic.

Correlate cloud-native logs with Databricks audit logs for complete visibility into data movement across your environment.

AWS architecture

Network isolation

Deploy Databricks in a customer-managed VPC with private subnets (see Configure a customer-managed VPC):

  • Enable secure cluster connectivity to eliminate public IPs on compute resources (see Classic compute plane networking).
  • Configure security groups to restrict egress to authorized destinations only.
  • Use route tables to prevent direct internet access.
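
The route-table rule can be checked automatically. The sketch below assumes route tables shaped like the entries returned by the EC2 DescribeRouteTables API (`Routes`, `DestinationCidrBlock`, `GatewayId`, `NatGatewayId`); the sample table and its IDs are made up, and only IPv4 default routes are inspected.

```python
def direct_internet_routes(route_table):
    """Return default routes that point straight at an internet gateway."""
    return [
        route
        for route in route_table.get("Routes", [])
        if route.get("DestinationCidrBlock") == "0.0.0.0/0"
        and route.get("GatewayId", "").startswith("igw-")
    ]


# Hypothetical private subnet route table: default route goes through a NAT gateway.
private_table = {
    "Routes": [
        {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
        {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0123456789abcdef0"},
    ]
}

# Hypothetical misconfigured table: default route goes directly to an internet gateway.
public_table = {
    "Routes": [
        {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0123456789abcdef0"},
    ]
}

print(direct_internet_routes(private_table))
print(direct_internet_routes(public_table))
```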

Private connectivity

Establish private connections to AWS services and the Databricks control plane:

Egress control

Deploy a third-party firewall appliance (such as Palo Alto) integrated with Gateway Load Balancer to inspect outbound traffic:

  • Configure firewall rules for approved destinations (for example, PyPI, Maven, and external APIs).
  • Route internet-bound traffic (0.0.0.0/0) through the firewall.
  • Route AWS service traffic through VPC endpoints.
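
The allowlisting logic behind such firewall rules can be sketched as a suffix match on hostnames. This is a conceptual model, not a firewall configuration; the domain list is an example allowlist and the function name is hypothetical.

```python
# Example allowlist: package repositories that compute may need to reach.
APPROVED_DOMAINS = {"pypi.org", "files.pythonhosted.org", "repo1.maven.org"}


def is_approved(hostname):
    """Allow an approved domain or any of its subdomains, and nothing else."""
    host = hostname.lower().rstrip(".")
    return any(
        host == domain or host.endswith("." + domain)
        for domain in APPROVED_DOMAINS
    )


print(is_approved("upload.pypi.org"))
print(is_approved("evil-pypi.org"))
```

Matching on `"." + domain` rather than a bare suffix is what keeps lookalike names such as `evil-pypi.org` from passing.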

Access policies

Implement least-privilege access using IAM and bucket policies:

  • IAM roles: Create roles with minimal permissions and trust policies limiting which Databricks resources can assume them.
  • S3 bucket policies: Allow access only from the Databricks VPC or specific VPC endpoints. Use condition keys to restrict access based on source.
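
A trust policy that limits who can assume a role might look like the sketch below. The principal ARN and external ID are placeholders; the real values come from the Databricks setup instructions for the credential you are creating. `sts:ExternalId` is a standard AWS condition key.

```python
def trust_policy(trusted_principal_arn, external_id):
    """Build an IAM trust policy that lets one principal assume the role with an external ID."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": trusted_principal_arn},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


# Placeholder values, for illustration only.
role_trust = trust_policy(
    "arn:aws:iam::111122223333:role/example-trusted-role",
    "example-external-id",
)
print(role_trust["Statement"][0]["Principal"])
```

Pinning both the principal and the external ID means a role cannot be assumed by another tenant that happens to share the same trusted principal.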

Serverless security

Configure serverless egress control for serverless compute (see What is serverless egress control?). Define allowed destinations using IP ranges, FQDNs, or private endpoints.
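
To show how a destination list that mixes IP ranges and FQDNs is evaluated, here is a minimal model using Python's standard `ipaddress` module. It is not the serverless egress control API; the CIDR block, the hostname, and the function name are hypothetical, and private endpoints are not modeled.

```python
import ipaddress

# Hypothetical allowed destinations (documentation-reserved range and domain).
ALLOWED_CIDRS = [ipaddress.ip_network("203.0.113.0/24")]
ALLOWED_FQDNS = {"api.example.com"}


def destination_allowed(destination):
    """Check an IP address against allowed ranges, or a hostname against allowed FQDNs."""
    try:
        address = ipaddress.ip_address(destination)
    except ValueError:
        # Not an IP literal, so treat it as a hostname and require an exact match.
        return destination.lower() in ALLOWED_FQDNS
    return any(address in network for network in ALLOWED_CIDRS)


print(destination_allowed("203.0.113.7"))
print(destination_allowed("other.example.com"))
```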

See also