Managed disaster recovery

Managed disaster recovery (DR) replicates your Databricks deployment to a secondary region so you can recover from a regional outage in minutes. Databricks manages the replication pipeline, the state of the replicated catalogs in the secondary, and the failover process. You do not write or maintain replication scripts.

For the manual approach to disaster recovery, including general DR concepts and best practices, see Disaster recovery.

important

Managed DR is gated. Apply for access through your Databricks account team. Databricks enables managed DR on your account after you are accepted.

What is managed disaster recovery?

Managed DR sits on top of the workspaces and metastores you already operate. You bring two Databricks workspaces, one in your primary region and one in your secondary region, and a metastore in each region. Managed DR then:

Replicates the categories you opt into from the primary to the secondary on a continuous schedule. Both categories are independently optional: Unity Catalog metadata and managed table data, and workspace assets such as notebooks, jobs, SQL warehouses, clusters, and ACLs.
Provides an optional stable URL, a single connection string that always points to the current primary, so clients keep working after failover without reconfiguration.
Enables you to trigger failover when you want to, for a DR test or a real outage.

Workspace asset IDs are preserved across regions, so URLs that reference a workspace asset by ID still resolve after failover.

What gets replicated

Managed DR can replicate the following on every replication cycle. Both categories are optional, so you can enable either or both:

Unity Catalog metadata and data: Unity Catalog managed Delta Lake tables (metadata and data), external tables and volumes (metadata only), views, functions, and all permission grants. The catalog's isolation mode is also replicated. If the source catalog is open, the replica is open. If the source is isolated and bound to the primary workspace, the replica is isolated and bound to the secondary workspace.
Workspace assets: Notebooks, jobs, SQL warehouses, clusters, draft AI/BI dashboards, files, and folders, along with their ACLs. SQL warehouses are replicated in STOPPED state, clusters in TERMINATED state. Job schedules in the secondary are paused.

Ownership of replicated objects

When managed DR creates a replicated securable (catalog, schema, table, view, function, or volume) in the secondary, the initial owner is the Databricks service principal that runs the replication, because Unity Catalog assigns ownership to the identity that creates the object. Managed DR then transfers ownership of the replica to match the owner of the corresponding securable in the primary.

If the owner of a securable in the primary is a user who has been deleted from the account, managed DR cannot transfer ownership, because the principal no longer exists. In this case, the replica securable keeps the Databricks service principal as its owner. To resolve this, assign a valid owner to the securable in the primary and let managed DR replicate the change.
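The ownership rule described above can be summarized as a small decision function. This is only an illustrative model of the documented behavior: `resolve_replica_owner`, its parameters, and the principal names are hypothetical and do not correspond to any Databricks API.

```python
def resolve_replica_owner(primary_owner, account_principals, replication_principal):
    """Return the owner a replica ends up with under the documented rule.

    All names are hypothetical; this models the behavior, it does not call Databricks.
    """
    # Ownership can only be transferred to a principal that still exists in the account.
    if primary_owner in account_principals:
        return primary_owner
    # Otherwise the replica keeps the service principal that created it.
    return replication_principal


principals = {"alice@example.com", "data-eng"}
print(resolve_replica_owner("alice@example.com", principals, "dr-replication-sp"))
print(resolve_replica_owner("deleted-user@example.com", principals, "dr-replication-sp"))
```

The second call models the deleted-owner case: the replica stays owned by the replication service principal until a valid owner is assigned in the primary.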

Requirements

A workspace on the Enterprise plan in both the primary and secondary regions.

The Mission Critical workspace add-on enabled on both workspaces. See Enable Mission Critical on both workspaces.
Serverless compute enabled on both workspaces. Serverless compute is available by default in most Unity Catalog-enabled workspaces. See Connect to serverless compute.
Account admin role with ALL PRIVILEGES on every external location used by the catalogs you plan to replicate.
Account-level SSO with all workspaces enabled, and identities synced to the account through SCIM so users, groups, and service principals exist in both regions.
For stable URLs: a custom URL provisioned for your Databricks domain (contact your account team) and account-level OAuth.
A secondary workspace and Unity Catalog metastore in the secondary region, in the same Databricks account and on the same cloud as your primary. The secondary workspace must match the primary's network, Private Link, and customer-managed key configuration. The secondary metastore must not contain catalogs that share names with replicated catalogs. For workspace asset replication, Databricks deletes any existing in-scope assets in the secondary workspace when initial replication completes. Out-of-scope assets are not affected, so the secondary workspace does not need to be empty.
A corresponding external location and storage credential in the secondary region for each one referenced by your primary catalogs. Managed DR does not replicate external locations or storage credentials automatically; you must create them in the secondary.

Because the secondary workspace's serverless compute reads from the source storage during cross-region replication, both the source and secondary storage must allow Databricks serverless network access in both directions.

If you restrict network access to your source storage or DBFS root, also allow the secondary region's control plane IP addresses at the source storage firewall, and the primary region's control plane IP addresses at the secondary DBFS firewall. For the control plane IP addresses to allow in each region, see Inbound IPs.

ALL PRIVILEGES on all external locations referenced by your replicated catalogs, granted to the IAM roles used by the secondary workspace.
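One requirement above is easy to check ahead of time: the secondary metastore must not contain catalogs that share names with replicated catalogs. The following is a minimal sketch of that check, assuming you have already collected both lists of catalog names yourself. The function name is hypothetical, and the case-insensitive comparison is an assumption.

```python
def conflicting_catalogs(replicated, secondary_existing):
    """Return catalog names that appear in both lists, sorted for stable output.

    Assumption: catalog names are compared case-insensitively.
    """
    replicated_names = {name.lower() for name in replicated}
    existing_names = {name.lower() for name in secondary_existing}
    return sorted(replicated_names & existing_names)


# Hypothetical catalog names for illustration.
conflicts = conflicting_catalogs(["sales", "finance"], ["sales", "scratch"])
if conflicts:
    print("Rename or remove these catalogs in the secondary first:", conflicts)
```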

Enable Mission Critical on both workspaces

Enable the Mission Critical add-on on both your primary and secondary workspaces before you create a failover group. Compute usage on each workspace where you enable the add-on is billed at the Mission Critical rate. Contact your Databricks account team for the current rate.

In the account console, click Workspaces, then click the workspace.
Click the Add-ons tab.
On the Mission Critical card, turn on the toggle and confirm.

Repeat for the secondary workspace.

Optional: stable URL

Databricks recommends using the stable URL. The stable URL always resolves to the current primary workspace, so clients that connect through it do not need to be reconfigured after a failover. The original workspace URL remains valid for direct access to that workspace, but after a failover it keeps pointing to the old primary, now the secondary. Point the following downstream clients at the stable URL instead of the original workspace URL:

The Databricks web UI.
JDBC and ODBC connections to SQL warehouses.
Direct REST API requests.

Stable URLs are supported with front-end (inbound) Private Link. With inbound Private Link, the stable URL uses your custom URL with a stable connection ID rather than the standard workspace URL format.

For example, the stable URL takes the form <my-custom-url>.databricks.com/?c=stable_connection_id. See Configure inbound Private Link for workspaces with managed disaster recovery.

Set up replication

A new failover group transitions through CREATING → INITIAL_REPLICATION → ACTIVE. The first replication cycle copies all in-scope data to the secondary. For large workspaces, the initial workspace asset bootstrap can take up to two weeks. This wait is one-time. After the initial bootstrap completes, replication runs continuously.

During replication, secondary in-scope catalogs are read-only and compute is unavailable in the secondary workspace. To run validation queries without writing to the secondary, Databricks recommends a separate read-only monitor workspace in the secondary region.

To create a failover group:

In the account console, click Resilience.
If you plan to use a stable URL, click the Stable URLs tab, then Create stable URL. Enter a name, select the current primary workspace, and create the stable URL. Point downstream clients (JDBC, ODBC, the Databricks web UI, direct API requests) at the stable URL instead of the original workspace URL.
Click the Failover groups tab, then Create failover group.
Fill in the form:
- Failover group name: A name you choose for the failover group.
- Primary workspace: The workspace that is your primary.
- Secondary workspace: The workspace in the secondary region.
- Replicate workspace assets (optional): Off by default. Turn on to replicate notebooks, jobs, SQL warehouses, clusters, dashboards, files, and folders (and their ACLs) from the primary to the secondary. Requires both workspaces to have the Mission Critical add-on enabled. If you turn on workspace asset replication, Databricks deletes any existing in-scope assets in the secondary when initial replication completes. Out-of-scope assets are not affected.
- Stable URL (optional): The stable URL you created in step 2.
- Replication scope: The catalogs to replicate. You must select a primary workspace before this field is available.
- Storage mappings: For each external location that your replicated catalogs use in the primary region, add an entry that maps its storage path to the corresponding external location you created in the secondary region (see Requirements). You can use * as a wildcard for prefix matching.
Click Create failover group.

For example, an AWS storage mapping might map s3://primary-bucket/data/* to s3://secondary-bucket/data/*.
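To make the wildcard behavior concrete, the sketch below models how a prefix mapping such as the one above could translate a primary path into a secondary path. It is not the managed DR implementation. `resolve_target` is a hypothetical helper, and both the longest-prefix rule and the rule that an exact mapping takes precedence are assumptions made for illustration.

```python
def resolve_target(source_path, mappings):
    """Map a primary storage path to its secondary path, or None if unmapped.

    `mappings` is a list of (source_pattern, target_pattern) pairs. A pattern
    ending in "/*" is treated as a prefix mapping; anything else must match
    exactly. Longest-prefix selection is an assumption.
    """
    best = None
    for src, dst in mappings:
        if src.endswith("/*") and dst.endswith("/*"):
            # Drop the "*" but keep the trailing "/" so only child paths match.
            src_prefix, dst_prefix = src[:-1], dst[:-1]
            if source_path.startswith(src_prefix):
                candidate = dst_prefix + source_path[len(src_prefix):]
                if best is None or len(src_prefix) > best[0]:
                    best = (len(src_prefix), candidate)
        elif source_path == src:
            return dst  # assumption: an exact mapping wins outright
    return best[1] if best else None


mappings = [("s3://primary-bucket/data/*", "s3://secondary-bucket/data/*")]
print(resolve_target("s3://primary-bucket/data/sales/orders", mappings))
print(resolve_target("s3://primary-bucket/raw/events", mappings))
```

A `None` result corresponds to a source location that no mapping covers, which is the situation that the DR_INVALID_CONFIGURATION.MISSING_LOCATION_MAPPING error class reports.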

Resources created by managed DR

When you create a failover group, managed DR provisions auxiliary Unity Catalog resources that the replication pipeline uses to copy data between regions. In both the primary and secondary metastores, managed DR creates:

A connection that points to the workspace in the other region.
A foreign catalog for each replicated catalog. The foreign catalog references the corresponding catalog in the other region.

These resources appear alongside your own catalogs in Catalog Explorer. You can identify them by their comment, which notes that Databricks disaster recovery created and manages them.

important

By default, only a metastore admin can modify or delete these resources. Do not delete the connections or foreign catalogs that managed DR creates. Deleting either one breaks replication for the failover group.

Stable workspace ID

Some tools identify a workspace by its workspace ID instead of its URL, including the Databricks Terraform provider and Databricks Asset Bundles. Each stable URL has a stable workspace ID that resolves to the current primary, so these tools keep targeting the active workspace after a failover. Use the stable workspace ID wherever a tool asks for a workspace ID, in the same way you would use a regular workspace ID.

To find the stable workspace ID, list the stable URLs for your account with the Databricks CLI and read the stable_workspace_id field of the relevant stable URL:

Bash
databricks api get /api/disaster-recovery/v1/accounts/<account-id>/stable-urls

Deploy with Databricks Asset Bundles and Terraform

Databricks Asset Bundles (DABs) and the Databricks Terraform provider target a workspace either by its workspace host URL, or by a combination of the custom URL of the Databricks account and the workspace ID. To keep deploying to the current primary after a failover, set the host to your custom URL (the host portion of the stable URL, not the original per-workspace URL) and specify the stable workspace ID in the workspace_id field. Together they resolve to the current primary, so your CI/CD pipelines keep deploying to the active workspace after a failover, with no configuration change.

New deployments: use the custom URL and stable workspace ID from the first deployment.
Existing deployments: import the state from your previous Terraform project into a new project configured with the custom URL and stable workspace ID, then remove the previous project. Do not repoint an existing project in place. If you do, the deployment no longer recognizes the resources it created against the original per-workspace URL, so a redeploy destroys and recreates them.
DABs: enable workspace asset replication on the failover group. A bundle stores its deployment state in the workspace, and that state reaches the new primary only as part of workspace asset replication.

note

After a failover, the first redeploy recreates any resources that managed DR does not replicate, because they do not exist on the new primary. Replicated resources are left in place. See Limitations for what managed DR does and does not replicate.

Monitor replication

The Failover groups tab shows each failover group's current state, replication point, and any active errors. Possible states:

State	Meaning
`CREATING`	The failover group is being provisioned.
`INITIAL_REPLICATION`	The first replication cycle is in progress. Failover is not yet available.
`ACTIVE`	Replication is in steady state. Failover is available.
`FAILING_OVER`	A failover is in progress.
`FAILOVER_FAILED`, `CREATION_FAILED`, `DELETION_FAILED`	The operation did not complete. Check the failover group's status details for guidance.
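If you script against these states, it can help to encode the table as an enumeration. The sketch below is an assumption-laden model, not an official client: the class and helper names are hypothetical, and treating every state other than ACTIVE as "failover not available" goes slightly beyond what the table states explicitly.

```python
from enum import Enum


class FailoverGroupState(Enum):
    CREATING = "CREATING"
    INITIAL_REPLICATION = "INITIAL_REPLICATION"
    ACTIVE = "ACTIVE"
    FAILING_OVER = "FAILING_OVER"
    FAILOVER_FAILED = "FAILOVER_FAILED"
    CREATION_FAILED = "CREATION_FAILED"
    DELETION_FAILED = "DELETION_FAILED"


# States where the operation did not complete and the status details need review.
FAILED_STATES = {
    FailoverGroupState.FAILOVER_FAILED,
    FailoverGroupState.CREATION_FAILED,
    FailoverGroupState.DELETION_FAILED,
}


def failover_available(state):
    """Failover is documented as available once replication is in steady state."""
    return state is FailoverGroupState.ACTIVE


def needs_attention(state):
    """True when the failover group's status details should be checked."""
    return state in FAILED_STATES


state = FailoverGroupState("INITIAL_REPLICATION")
print(state.name, failover_available(state), needs_attention(state))
```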

Select a failover group's name to open its detail page. Replication runs continuously, but the replication point shows the last time all in-scope resources were copied together. Individual resources might be more current, but data written after the replication point is not guaranteed to exist in the secondary and can be lost during failover.
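The replication point translates directly into a worst-case data-loss window: anything written after it might be missing in the secondary. The following minimal sketch compares that window against an RPO target. The function names and the sample timestamps are hypothetical, and you would supply the replication point yourself, for example by reading it from the detail page.

```python
from datetime import datetime, timedelta, timezone


def replication_lag(replication_point, now):
    """Time elapsed since all in-scope resources were last copied together."""
    return now - replication_point


def within_rpo(replication_point, now, rpo_target):
    """True if a failover right now would lose at most `rpo_target` of writes."""
    return replication_lag(replication_point, now) <= rpo_target


# Hypothetical values for illustration.
point = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
now = datetime(2024, 1, 1, 12, 4, tzinfo=timezone.utc)
print(replication_lag(point, now), within_rpo(point, now, timedelta(minutes=5)))
```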

To monitor historical RPO trends and see the errors that are blocking replication, query the system.replication.states system table. See Replication system table reference. For the most common error classes and how to resolve them, see Reference.

Fail over and fail back

The same procedure covers planned failovers (DR tests, scheduled maintenance) and unplanned failovers (a regional outage). To fail back, repeat the procedure with the regions reversed.

When you trigger a failover, Databricks:

Points the stable URL, if attached, to the new primary region.
Reverses the direction of replication.
Pauses job schedules in the former primary.
Transitions the failover group through FAILING_OVER to INITIAL_REPLICATION.

To fail over:

Notify your team that a failover is starting.
For a planned failover only:
1. In the primary workspace, terminate all running clusters and stop all SQL warehouses.
2. Confirm that writes to the primary have stopped, then wait for replication to catch up. To check, open the failover group's detail page and confirm the replication point is within a few seconds of the time you stopped writes.
In the account console, click Resilience → Failover groups, then click the failover group's name.
Click Fail over.
Select the new primary region and confirm. The failover completes in minutes.
In the new primary, start the compute that was running before the failover. Replicated clusters and SQL warehouses arrive in the new primary in TERMINATED and STOPPED state, respectively.
Manually resume the job schedules you need in the new primary. The former primary's schedules are already paused.

Clients connected through the stable URL continue working after the failover. Repoint clients that still use the original workspace URL to either the stable URL or the new primary's workspace URL.
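For a planned failover, the preparation steps above amount to a short checklist that can be expressed as code. This sketch assumes you gather the inputs yourself (names of running compute, the time writes stopped, and the replication point). The function name is hypothetical, and the 10-second tolerance is an assumed stand-in for "within a few seconds".

```python
from datetime import datetime, timedelta, timezone


def planned_failover_blockers(running_clusters, running_warehouses,
                              writes_stopped_at, replication_point,
                              tolerance=timedelta(seconds=10)):
    """Return the reasons a planned failover should wait; empty means ready."""
    blockers = []
    if running_clusters:
        blockers.append(f"terminate clusters: {sorted(running_clusters)}")
    if running_warehouses:
        blockers.append(f"stop SQL warehouses: {sorted(running_warehouses)}")
    if writes_stopped_at is None:
        blockers.append("writes to the primary have not been stopped")
    elif replication_point < writes_stopped_at - tolerance:
        blockers.append("replication has not caught up to the time writes stopped")
    return blockers


# Hypothetical values for illustration.
stopped = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
point = datetime(2024, 1, 1, 11, 59, 55, tzinfo=timezone.utc)
print(planned_failover_blockers([], ["bi-warehouse"], stopped, point))
```

None of these checks apply to an unplanned failover, where the primary might be unreachable and the data-loss window is whatever the last replication point implies.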

important

In an unplanned failover, data written to the primary after the last replication point might be lost. Confirm that any loss falls within your RPO target.

tip

Test failover regularly, such as once per quarter, so your team is familiar with the procedure before an actual outage.

Tear down managed DR

In the account console, click Resilience → Failover groups, then click the failover group's name and delete it. You cannot turn off Mission Critical while a failover group is active on the workspace.
To stop billing at the Mission Critical rate, turn off Mission Critical on each workspace from the Add-ons tab.

Limitations

Managed DR has the following limitations:

Not replicated: materialized views, streaming tables, Lakeflow pipelines, managed volume data (metadata replicates), Unity Catalog and workspace secrets, ML models, model serving endpoints, vector search indexes, Delta shares, published AI/BI dashboards (drafts replicate), and Spark Structured Streaming outside Lakeflow pipelines. Tables with row filters or column masks and ABAC-tagged resources are flagged as Failed to replicate in the system table, and these failures hold up RPO until you remove the resource from the failover group's scope.
Secondary in-scope catalogs are read-only. Read-only applies only to replicated entities. You can still set up your own replication for securables outside the managed DR scope. However, you cannot run compute on the secondary workspace while managed DR is enabled, which limits operating a do-it-yourself replication pipeline there.
Renaming a Unity Catalog securable triggers a delete and recreate in the secondary. For managed tables, the rename re-replicates the table data on the next cycle. Avoid renaming during steady-state replication.
UNDROP is not propagated to the secondary.
Maximum 300 catalogs per account.
Maximum 100 failover groups per account.
Initial workspace asset bootstrap can take up to 2 weeks for large workspaces.

Reference

When a resource cannot be replicated, the failover group surfaces an error class in the system.replication.states system table, along with a message identifying the affected resource. The following sections cover the most common error classes and how to resolve them. Replication recovers automatically after you correct the underlying problem.

DR_MISSING_DEPENDENCY

An asset references a dependency that does not exist in the secondary, so the asset cannot be replicated. The subclass identifies the missing dependency type and appears as DR_MISSING_DEPENDENCY.CATALOG, .SCHEMA, .TABLE, or .RESOURCE. The resolution is the same for all of them.

Check whether the asset is also broken in the primary because of the missing dependency. If it is, fix or remove the asset in the primary.
If the asset is valid in the primary, the dependency is either not in the replication scope of any failover group, or it is in scope of this or another failover group but failed to replicate. If the dependency is not in scope, edit the failover group's replication scope to also replicate it. If the dependency is already in scope, check system.replication.states for the error blocking its replication and resolve that error.

DR_INVALID_CONFIGURATION.MISSING_LOCATION_MAPPING

Managed DR decides where to place each replicated asset by applying the failover group's storage mappings to the asset's source storage location. A mapping matches a location exactly, or as a prefix that also covers child paths. This error means no mapping covers a source storage location, so managed DR cannot determine where in the secondary to place the asset. For external tables and volumes, a missing mapping means the same location URI is used on the primary and secondary. The storage_location in the message is the unmapped source path.

In the account console, go to Resilience → Failover groups and edit the failover group.
Under Storage mappings, add or widen a mapping so it covers the source location in the message. To cover child paths, map a parent path and add the /* suffix for prefix matching. See storage mappings.
Confirm an external location in the secondary metastore already covers the mapping's target path. The failover group rejects a mapping whose target is not under an existing external location, so create that external location first if it does not exist. See Connect to cloud object storage using Unity Catalog.

DR_INVALID_CONFIGURATION.MISSING_EXTERNAL_LOCATION

A storage mapping resolved a replicated asset to a target path in the secondary, but no external location in the secondary metastore covers that path, so Unity Catalog has nowhere to place the asset's data. The storage_location in the message is the uncovered secondary (target) path.

This usually means one of two things: an external location that previously covered the path was removed or narrowed, or a newly replicated asset resolves to a secondary path that no external location covers. The second case happens, for example, when you create an external table in the primary under a storage path that none of your storage mappings cover. Managed DR then falls back to the table's original path, which no external location in the secondary metastore covers, so the data has nowhere to land.

Identify the uncovered secondary path from the message's storage_location.
Decide which external location in the secondary metastore should cover that path: an existing external location that you extend, or a new one that you create.
Either adjust the failover group's storage mappings so the path resolves under an external location that already exists, or create the external location (with its storage credential) and extend the mapping to point to it. See Connect to cloud object storage using Unity Catalog.
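When working through these steps, the core question is whether a given secondary path falls under any external location. A minimal sketch of that containment check follows, assuming you have exported the secondary metastore's external locations as a name-to-URL dictionary. The function name, the sample locations, and the "most specific location wins" rule are all assumptions.

```python
def covering_location(target_path, external_locations):
    """Return the name of the external location that covers `target_path`, or None.

    Paths are compared on "/" boundaries so that ".../data" does not
    accidentally cover ".../database".
    """
    path = target_path.rstrip("/") + "/"
    best = None
    for name, url in external_locations.items():
        root = url.rstrip("/") + "/"
        if path.startswith(root) and (best is None or len(root) > len(best[1])):
            best = (name, root)  # assumption: prefer the most specific location
    return best[0] if best else None


# Hypothetical external locations for illustration.
locations = {"secondary_data": "s3://secondary-bucket/data"}
print(covering_location("s3://secondary-bucket/data/sales", locations))
print(covering_location("s3://secondary-bucket/database/x", locations))
```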

DR_INTERNAL_ERROR

A system-side fault occurred during replication. No action is required; the system recovers automatically. Contact Databricks support if the problem does not resolve on its own.

DR_INVALID_CONFIGURATION.CROSS_CATALOG_VIEW_PERMISSION

Managed DR replicates the view along with its grants, but a view that references objects in other catalogs also needs its owner to have access to those referenced objects in the secondary, because the view runs with the owner's privileges. This error means the owner is missing that access on the secondary, so you must grant it on the referenced objects there.

Find the objects the view references and the view's owner. Referenced objects appear as fully qualified catalog.schema.object names in the definition; grants must go to the owner, which you can also read from the Owner field in Catalog Explorer.
SQL
```
SHOW CREATE TABLE <catalog>.<schema>.<view>;
```

On the secondary, check the owner's current privileges on each referenced object. Reading a table requires USE CATALOG on its catalog, USE SCHEMA on its schema, and SELECT on the table.

SQL
SHOW GRANTS `<view_owner>` ON CATALOG <ref_catalog>;
SHOW GRANTS `<view_owner>` ON SCHEMA <ref_catalog>.<ref_schema>;
SHOW GRANTS `<view_owner>` ON TABLE <ref_catalog>.<ref_schema>.<ref_table>;

Grant the view owner any missing privileges on each referenced object.

SQL
GRANT USE CATALOG ON CATALOG <ref_catalog> TO `<view_owner>`;
GRANT USE SCHEMA ON SCHEMA <ref_catalog>.<ref_schema> TO `<view_owner>`;
GRANT SELECT ON TABLE <ref_catalog>.<ref_schema>.<ref_table> TO `<view_owner>`;

Confirm that every catalog the view references is included in a failover group's replication scope, so it also exists in the secondary.

For more information, see Manage privileges in Unity Catalog.
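When a view references many objects, writing the grants by hand is tedious. The sketch below generates the same three GRANT statements shown above for each referenced table. It is a hypothetical helper that only builds SQL text. It assumes simple three-part names without dots inside backquoted identifiers, and it leaves running the statements on the secondary to you.

```python
def grant_statements(view_owner, referenced_tables):
    """Build de-duplicated GRANT statements for 'catalog.schema.table' names."""
    statements = []
    seen = set()
    for fully_qualified in referenced_tables:
        catalog, schema, table = fully_qualified.split(".")
        for statement in (
            f"GRANT USE CATALOG ON CATALOG {catalog} TO `{view_owner}`;",
            f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{view_owner}`;",
            f"GRANT SELECT ON TABLE {catalog}.{schema}.{table} TO `{view_owner}`;",
        ):
            # Two tables in the same schema share the catalog and schema grants.
            if statement not in seen:
                seen.add(statement)
                statements.append(statement)
    return statements


# Hypothetical owner and table names for illustration.
for line in grant_statements("data-eng", ["sales.core.orders", "sales.core.customers"]):
    print(line)
```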

DR_INVALID_CONFIGURATION.NETWORK_UNAUTHORIZED_ACCESS

During cross-region table-data replication, the secondary workspace's serverless compute reads data from the source storage, and the storage denied the network connection: a storage firewall or network rule blocked it, or a required private endpoint is missing or unapproved.

Verify that your source and secondary storage allow Databricks serverless network access, as described in Requirements.

DR_INVALID_CONFIGURATION.SERVERLESS_COMPUTE_PERMISSION

Managed DR uses serverless compute in the secondary workspace to copy data, and serverless compute is not permitted there. This usually means serverless is turned off for the account or workspace, or the workspace is not eligible.

Confirm the secondary workspace is eligible. Serverless compute is available by default in Unity Catalog-enabled workspaces in a supported region. See Connect to serverless compute.
Check for an account-wide opt-out. In the account console, go to Settings → Feature enablement and check whether the serverless toggle is present and off.
Enable serverless for the scope you need. To enable every eligible workspace, an account admin turns the account-level serverless toggle on. To enable only the secondary workspace, leave the account-level toggle off and have a workspace admin enable serverless from the workspace's Previews.
If no toggle is available, or serverless still does not run after you enable it, contact your Databricks account team.

DR_INVALID_CONFIGURATION.OVERLAPPING_EXTERNAL_LOCATIONS

When managed DR creates a replicated object at its mapped target path in the secondary, Unity Catalog rejects the path because it overlaps storage that already exists there, such as an existing external location, a leftover securable from an earlier or partial setup, a managed location, or the workspace's default (DBFS) storage.

Identify what occupies the path. See Resolve storage path conflicts.
If the conflicting object should not own the path, remove it. A common cause is a leftover external table or external volume from a previous setup; drop it if it is no longer needed. If it is an external location that should not cover the path, remove or redefine it.
Otherwise, repoint the failover group's storage mapping to a dedicated, non-overlapping target path. Prefer a specific subpath over a broad bucket root, and avoid the workspace's default (DBFS) storage.
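Before choosing a new target path, you can screen it against paths you already know are in use. The following sketch treats two paths as overlapping when one is a parent of, or identical to, the other. This is an assumed simplification of the overlap check that Unity Catalog performs. The function names and sample paths are hypothetical.

```python
def paths_overlap(a, b):
    """True if one path equals or contains the other, compared on "/" boundaries."""
    a_norm = a.rstrip("/") + "/"
    b_norm = b.rstrip("/") + "/"
    return a_norm.startswith(b_norm) or b_norm.startswith(a_norm)


def find_conflicts(target_path, existing_paths):
    """Return the existing paths that overlap the proposed target path."""
    return [path for path in existing_paths if paths_overlap(target_path, path)]


# Hypothetical paths for illustration.
in_use = ["s3://secondary-bucket/data", "s3://secondary-bucket/archive"]
print(find_conflicts("s3://secondary-bucket/data/sales", in_use))
print(find_conflicts("s3://secondary-bucket/dr/sales", in_use))
```

An empty result for a dedicated subpath such as the second call is consistent with the advice above to prefer a specific, non-overlapping target over a broad bucket root.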

DR_UNSUPPORTED_FEATURE

The asset uses a feature that managed DR cannot replicate. The subclass identifies the unsupported feature and appears, for example, as DR_UNSUPPORTED_FEATURE.ABAC_POLICY. There are two ways to resolve this error.

Remove the unsupported feature from the asset in the primary workspace.
If you cannot remove the feature, consider removing the asset from the failover group's replication scope.

For DR concepts and best practices, see Disaster recovery.
