Monitor and manage Delta Sharing egress costs (for providers)
This article describes tools that you can use to monitor and manage cloud vendor egress costs when you share data and AI assets using Delta Sharing.
Unlike other data sharing platforms, Delta Sharing does not require data replication. This model has many advantages, but it means that your cloud vendor may charge data egress fees when you share data across clouds or regions. If you use Delta Sharing to share data and AI assets within a region, you incur no egress cost.
To monitor and manage egress charges, Databricks provides:
Notebooks that you can use to run a Delta Live Tables (DLT) pipeline that monitors egress usage patterns and cost.
Instructions for replicating data between regions to avoid egress fees.
Delta Sharing egress pipeline notebooks
In Databricks Marketplace, the listing Delta Sharing Egress Pipeline includes two notebooks that you can clone and use to monitor egress usage patterns and costs associated with Delta Sharing. Both of these notebooks create and execute a Delta Live Tables pipeline:
IP Ranges Mapping Pipeline notebook
Egress Cost Analysis Pipeline notebook
When you run these notebooks as a Delta Live Tables template, they automatically generate a detailed cost report. The pipeline joins logs with cloud provider IP range tables and Delta Sharing system tables to calculate the egress bytes transferred, attributed by share and recipient.
Complete requirements and instructions are available in the listing.
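Once the pipeline has produced its report, a per-share and per-recipient summary could look something like the following sketch. The table name egress_cost_report, its columns (share_name, recipient_name, egress_bytes), and the 0.09 per-GiB rate are all hypothetical placeholders, not the actual output schema of the Marketplace notebooks or a quoted cloud price. Substitute the names your pipeline creates and the rate from your cloud vendor's price list.

```sql
-- Hypothetical table and column names: replace with the ones your pipeline produces.
SELECT
  share_name,
  recipient_name,
  -- Convert bytes to GiB for readability.
  SUM(egress_bytes) / POWER(1024, 3) AS egress_gib,
  -- 0.09 is a placeholder per-GiB rate, not an actual vendor price.
  SUM(egress_bytes) / POWER(1024, 3) * 0.09 AS estimated_cost_usd
FROM egress_cost_report
GROUP BY share_name, recipient_name
ORDER BY egress_gib DESC;
```

Sorting by transferred volume puts the shares and recipients that are the strongest candidates for replication at the top.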
Replicate data to avoid egress costs
One approach to avoiding egress costs is for the provider to create and sync local replicas of shared data in regions that their recipients are using. Another approach is for recipients to clone the shared data to local regions for active querying, setting up syncs between the shared table and the local clone. This section discusses a number of replication patterns.
Use Delta deep clone for incremental replication
Providers can use DEEP CLONE to replicate Delta tables to external locations across the regions that they share to. Deep clones copy the source table data and metadata to the clone target. Deep clones also enable incremental updates by identifying new data in the source table and refreshing the target accordingly.
CREATE TABLE [IF NOT EXISTS] table_name DEEP CLONE source_table_name
[TBLPROPERTIES clause] [LOCATION path];
You can schedule a Databricks job to refresh target table data incrementally with recent updates in the shared table, using the following command:
CREATE OR REPLACE TABLE table_name DEEP CLONE source_table_name;
See Clone a table on Databricks and Schedule and orchestrate workflows.
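As a concrete instance of the two statements above, the following sketch creates a replica and then refreshes it. All names are hypothetical: main.sales.orders stands for the source table, and replica_eu is assumed to be a catalog whose storage is in the recipient's region.

```sql
-- Hypothetical names throughout.
-- Initial replication: copies data and metadata from the source table.
CREATE TABLE IF NOT EXISTS replica_eu.sales.orders
DEEP CLONE main.sales.orders;

-- Refresh: run this statement from a scheduled job.
-- Re-running the deep clone updates the target with changes made to the source.
CREATE OR REPLACE TABLE replica_eu.sales.orders
DEEP CLONE main.sales.orders;
```

You would then share replica_eu.sales.orders with recipients in that region instead of the original table.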
Use Cloudflare R2 replicas or migrate storage to R2
Cloudflare R2 object storage incurs no egress fees. Replicating or migrating data that you share to R2 enables you to share data using Delta Sharing without incurring egress fees. This section describes how to replicate data to an R2 location and enable incremental updates from source tables.
Requirements
Databricks workspace enabled for Unity Catalog.
Databricks Runtime 14.3 or above, or SQL warehouse 2024.15 or above.
Cloudflare account. See https://dash.cloudflare.com/sign-up.
Cloudflare R2 Admin role. See the Cloudflare roles documentation.
CREATE STORAGE CREDENTIAL privilege on the Unity Catalog metastore attached to the workspace. Account admins and metastore admins have this privilege by default.
CREATE EXTERNAL LOCATION privilege on both the metastore and the storage credential referenced in the external location. Metastore admins have this privilege by default.
CREATE MANAGED STORAGE privilege on the external location.
CREATE CATALOG on the metastore. Metastore admins have this privilege by default.
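If the user who performs the setup is not a metastore admin, an admin could grant the privileges listed above with statements along these lines. This is a sketch: the principal user@example.com and the object names r2_credential and r2_location are hypothetical, and the last two grants can only be issued after the storage credential and external location exist.

```sql
-- Hypothetical principal and object names.
-- Metastore-level privileges.
GRANT CREATE STORAGE CREDENTIAL ON METASTORE TO `user@example.com`;
GRANT CREATE EXTERNAL LOCATION ON METASTORE TO `user@example.com`;
GRANT CREATE CATALOG ON METASTORE TO `user@example.com`;

-- Issue after the storage credential has been created.
GRANT CREATE EXTERNAL LOCATION ON STORAGE CREDENTIAL r2_credential TO `user@example.com`;

-- Issue after the external location has been created.
GRANT CREATE MANAGED STORAGE ON EXTERNAL LOCATION r2_location TO `user@example.com`;
```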
Limitations for Cloudflare R2
Providers can’t share R2 tables that use liquid clustering and V2 checkpoint.
Mount an R2 bucket as an external location in Databricks
Create a Cloudflare R2 bucket.
Create a storage credential in Unity Catalog that gives access to the R2 bucket.
Use the storage credential to create an external location in Unity Catalog.
See Create an external location to connect cloud storage to Databricks.
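The last step above can be sketched in SQL as follows, assuming that a storage credential for the bucket already exists. The names r2_location and r2_credential are hypothetical, and the URL reuses the placeholder bucket and account ID from this article.

```sql
-- Hypothetical names; assumes the storage credential r2_credential
-- was already created for the R2 bucket.
CREATE EXTERNAL LOCATION IF NOT EXISTS r2_location
URL 'r2://mybucket@my-account-id.r2.cloudflarestorage.com'
WITH (STORAGE CREDENTIAL r2_credential)
COMMENT 'R2 bucket for Delta Sharing replicas';
```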
Create a new catalog using the external location
Create a catalog that uses the new external location as its managed storage location.
See Create catalogs.
When you create the catalog, do the following:
Select a Standard catalog type.
Under Storage location, select Select a storage location and enter the path to the R2 bucket you defined as an external location. For example,
r2://mybucket@my-account-id.r2.cloudflarestorage.com
To create the catalog using SQL instead, specify the path to the R2 bucket you defined as an external location as the managed location. For example:
CREATE CATALOG IF NOT EXISTS `my-r2-catalog`
MANAGED LOCATION 'r2://mybucket@my-account-id.r2.cloudflarestorage.com'
COMMENT 'Location for managed tables and volumes to share using Delta Sharing';