Monitor and manage Delta Sharing egress costs (for providers)

This article describes tools that you can use to monitor and manage cloud vendor egress costs when you share data and AI assets using Delta Sharing.

Unlike other data sharing platforms, Delta Sharing does not require data replication. This model has many advantages, but it means that your cloud vendor may charge data egress fees when you share data across clouds or regions. If you use Delta Sharing to share data and AI assets within a region, you incur no egress cost.

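To monitor and manage egress charges, Databricks provides the following, each described in a section of this article:

  • Delta Sharing egress pipeline notebooks for monitoring egress usage patterns and costs.

  • Replication patterns that keep shared data in the regions that recipients use.

  • Support for Cloudflare R2 storage, which incurs no egress fees.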

Delta Sharing egress pipeline notebooks

In Databricks Marketplace, the listing Delta Sharing Egress Pipeline includes two notebooks that you can clone and use to monitor egress usage patterns and costs associated with Delta Sharing. Both of these notebooks create and execute a Delta Live Tables pipeline:

  • IP Ranges Mapping Pipeline notebook

  • Egress Cost Analysis Pipeline notebook

When you run these notebooks as a Delta Live Tables template, they automatically generate a detailed cost report. The pipelines join logs with cloud provider IP range tables and Delta Sharing system tables to calculate the egress bytes transferred, attributed by share and recipient.

Complete requirements and instructions are available in the listing.

Replicate data to avoid egress costs

One approach to avoiding egress costs is for the provider to create and sync local replicas of shared data in regions that their recipients are using. Another approach is for recipients to clone the shared data to local regions for active querying, setting up syncs between the shared table and the local clone. This section discusses a number of replication patterns.

Use Delta deep clone for incremental replication

Providers can use DEEP CLONE to replicate Delta tables to external locations across the regions that they share to. Deep clones copy the source table data and metadata to the clone target. Deep clones also enable incremental updates by identifying new data in the source table and refreshing the target accordingly.

CREATE TABLE [IF NOT EXISTS] table_name DEEP CLONE source_table_name
   [TBLPROPERTIES clause] [LOCATION path];

You can schedule a Databricks job to refresh target table data incrementally with recent updates in the shared table, using the following command:

CREATE OR REPLACE TABLE table_name DEEP CLONE source_table_name;

See Clone a table on Databricks and Schedule and orchestrate workflows.

Enable change data feed (CDF) on shared tables for incremental replication

When a table is shared with its CDF, the recipient can access the changes and merge them into a local copy of the table, where users perform queries. In this scenario, recipient access to the data does not cross region boundaries, and egress is limited to refreshing a local copy. If the recipient is on Databricks, they can use a Databricks workflow job to propagate changes to a local replica.
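As a rough sketch of the recipient-side merge described above, a recipient on Databricks might read the change data feed with table_changes and apply it to the local copy. All names here are hypothetical: shared_catalog.sales.orders stands for the shared table, local_catalog.sales.orders for the local copy, and id and amount for its columns. The starting version 6 is a placeholder for wherever the previous refresh stopped, and the query assumes at most one change per key in each commit.

```sql
-- Hypothetical names: shared_catalog.sales.orders (shared table with CDF),
-- local_catalog.sales.orders (local copy), columns id (key) and amount.
-- The starting version (6) is a placeholder; record the last applied version yourself.
MERGE INTO local_catalog.sales.orders AS target
USING (
  -- Keep only the most recent change for each key, ignoring pre-update images.
  SELECT id, amount, _change_type
  FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY _commit_version DESC) AS rn
    FROM table_changes('shared_catalog.sales.orders', 6)
    WHERE _change_type != 'update_preimage'
  )
  WHERE rn = 1
) AS changes
ON target.id = changes.id
WHEN MATCHED AND changes._change_type = 'delete' THEN DELETE
WHEN MATCHED THEN UPDATE SET target.amount = changes.amount
WHEN NOT MATCHED AND changes._change_type != 'delete' THEN
  INSERT (id, amount) VALUES (changes.id, changes.amount);
```

Only the changed rows cross the region boundary during the refresh; subsequent queries run against the local copy.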

To share a table with CDF, you must enable CDF on the table and share it WITH HISTORY.
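For illustration, the two provider-side steps might look like the following. The table name main.sales.orders and the share name orders_share are hypothetical, and the share is assumed to exist already.

```sql
-- Hypothetical names: main.sales.orders (source table), orders_share (existing share).
-- Enable the change data feed on the table.
ALTER TABLE main.sales.orders
  SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

-- Add the table to the share along with its history,
-- so that recipients can read the change data feed.
ALTER SHARE orders_share
  ADD TABLE main.sales.orders WITH HISTORY;
```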

For more information about using CDF, see Use Delta Lake change data feed on Databricks and Add tables to a share.

Use Cloudflare R2 replicas or migrate storage to R2

Cloudflare R2 object storage incurs no egress fees. Replicating or migrating data that you share to R2 enables you to share data using Delta Sharing without incurring egress fees. This section describes how to replicate data to an R2 location and enable incremental updates from source tables.

Requirements

  • Databricks workspace enabled for Unity Catalog.

  • Databricks Runtime 14.3 or above, or SQL warehouse 2024.15 or above.

  • Cloudflare account. See https://dash.cloudflare.com/sign-up.

  • Cloudflare R2 Admin role. See the Cloudflare roles documentation.

  • CREATE STORAGE CREDENTIAL privilege on the Unity Catalog metastore attached to the workspace. Account admins and metastore admins have this privilege by default.

  • CREATE EXTERNAL LOCATION privilege on both the metastore and the storage credential referenced in the external location. Metastore admins have this privilege by default.

  • CREATE MANAGED STORAGE privilege on the external location.

  • CREATE CATALOG on the metastore. Metastore admins have this privilege by default.

Limitations for Cloudflare R2

Providers can’t share R2 tables that use liquid clustering and V2 checkpoint.

Mount an R2 bucket as an external location in Databricks

  1. Create a Cloudflare R2 bucket.

    See Configure an R2 bucket.

  2. Create a storage credential in Unity Catalog that gives access to the R2 bucket.

    See Create the storage credential.

  3. Use the storage credential to create an external location in Unity Catalog.

    See Create an external location to connect cloud storage to Databricks.
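As a sketch of step 3 in SQL, assuming a storage credential named r2_credential already exists from step 2 (the credential name and the external location name r2_external_location are hypothetical):

```sql
-- Hypothetical names: r2_external_location, r2_credential.
-- The storage credential must already exist (step 2).
CREATE EXTERNAL LOCATION IF NOT EXISTS r2_external_location
  URL 'r2://mybucket@my-account-id.r2.cloudflarestorage.com'
  WITH (STORAGE CREDENTIAL r2_credential)
  COMMENT 'R2 bucket for data shared using Delta Sharing';
```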

Create a new catalog using the external location

Create a catalog that uses the new external location as its managed storage location.

See Create catalogs.

When you create the catalog, do the following:

  • Select a Standard catalog type.

  • Under Storage location, select Select a storage location and enter the path to the R2 bucket you defined as an external location. For example, r2://mybucket@my-account-id.r2.cloudflarestorage.com

If you create the catalog using SQL instead, specify the path to the R2 bucket you defined as an external location in the MANAGED LOCATION clause. For example:

  CREATE CATALOG IF NOT EXISTS my_r2_catalog
    MANAGED LOCATION 'r2://mybucket@my-account-id.r2.cloudflarestorage.com'
    COMMENT 'Location for managed tables and volumes to share using Delta Sharing';

Clone the data you want to share to a table in the new catalog

Use DEEP CLONE to replicate tables in S3 to the new catalog that uses R2 for managed storage. Deep clones copy the source table data and metadata to the clone target. Deep clones also enable incremental updates by identifying new data in the source table and refreshing the target accordingly.

CREATE TABLE IF NOT EXISTS new_catalog.schema1.new_table DEEP CLONE old_catalog.schema1.source_table
  LOCATION 'r2://mybucket@my-account-id.r2.cloudflarestorage.com';

You can schedule a Databricks job to refresh target table data incrementally with recent updates in the source table, using the following command:

CREATE OR REPLACE TABLE new_catalog.schema1.new_table DEEP CLONE old_catalog.schema1.source_table;

See Clone a table on Databricks and Schedule and orchestrate workflows.

Share the new table

When you create the share, add the tables that are in the new catalog, stored in R2. The process is the same as adding any table to a share.
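A minimal sketch of these steps in SQL follows. The share name r2_share and the recipient name recipient_name are hypothetical, and the recipient is assumed to have been created already.

```sql
-- Hypothetical names: r2_share, recipient_name.
-- new_catalog.schema1.new_table is the cloned table stored in R2.
CREATE SHARE IF NOT EXISTS r2_share
  COMMENT 'Tables stored in Cloudflare R2';

ALTER SHARE r2_share
  ADD TABLE new_catalog.schema1.new_table;

-- Assumes the recipient already exists.
GRANT SELECT ON SHARE r2_share TO RECIPIENT recipient_name;
```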

See Create and manage shares for Delta Sharing.