Managed disaster recovery

Preview: This feature is in Public Preview.

Managed disaster recovery (DR) replicates your Databricks deployment to a secondary region so you can recover from a regional outage in minutes. Databricks manages the replication pipeline, the state of the replicated catalogs in the secondary, and the failover process. You do not write or maintain replication scripts.

For the manual approach to disaster recovery, including general DR concepts and best practices, see Disaster recovery.

Important: Managed DR is gated. Apply for access through your Databricks account team. Databricks enables managed DR on your account after your application is accepted.

What is managed disaster recovery?

Managed DR sits on top of the workspaces and metastores you already operate. You bring two Databricks workspaces, one in your primary region and one in your secondary region, and a metastore in each region. Managed DR then:

  • Replicates the categories you opt into from the primary to the secondary on a continuous schedule. There are two categories, and each is independently optional: Unity Catalog metadata and managed table data; and workspace assets such as notebooks, jobs, SQL warehouses, clusters, and ACLs.
  • Provides an optional stable URL, a single connection string that always points to the current primary, so clients keep working after failover without reconfiguration.
  • Enables you to trigger failover when you want to, for a DR test or a real outage.

Workspace asset IDs are preserved across regions, so URLs that reference a workspace asset by ID still resolve after failover.

What gets replicated

Managed DR can replicate the following on every replication cycle. Both categories are optional, so you can enable either or both:

  • Unity Catalog metadata and data: Unity Catalog managed tables in Delta Lake with data, external tables and volumes (metadata only), views, functions, and all permissions grants. The catalog's isolation mode is replicated. If the source catalog is open, the replica is open. If the source is isolated and bound to the primary workspace, the replica is isolated and bound to the secondary workspace.
  • Workspace assets: Notebooks, jobs, SQL warehouses, clusters, draft AI/BI dashboards, files, and folders, along with their ACLs. SQL warehouses are replicated in STOPPED state, clusters in TERMINATED state. Job schedules in the secondary are paused.

Anything not in the preceding list is not replicated. See Limitations for details.

Requirements

  • The Mission Critical workspace add-on enabled on both workspaces. See Enable Mission Critical on both workspaces.
  • Account admin role with ALL PRIVILEGES on every external location used by the catalogs you plan to replicate.
  • Account-level SSO with all workspaces enabled, and identities synced to the account through SCIM so users, groups, and service principals exist in both regions.
  • For stable URLs: a custom URL provisioned for your Databricks domain (contact your account team) and account-level OAuth.
  • A secondary workspace and Unity Catalog metastore in the secondary region, in the same Databricks account and on the same cloud as your primary. The secondary workspace must match the primary's network, Private Link, and customer-managed key configuration. The secondary metastore must not contain catalogs that share names with replicated catalogs. For workspace asset replication, Databricks deletes any existing in-scope assets in the secondary workspace when initial replication completes. Out-of-scope assets are not affected, so the secondary workspace does not need to be empty.
  • A corresponding external location and storage credential in the secondary region for each one referenced by your primary catalogs. Managed DR does not replicate external locations or storage credentials automatically; you must create them in the secondary.
  • ALL PRIVILEGES on all external locations referenced by your replicated catalogs, granted to the IAM roles used by the secondary workspace.
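The requirement that the secondary metastore must not contain catalogs sharing names with replicated catalogs can be checked before you create a failover group. The following is a minimal sketch, assuming you have already gathered both lists of catalog names yourself; the function name and the sample names are hypothetical, and the case-insensitive comparison is a conservative assumption rather than documented managed DR behavior.

```python
def find_name_collisions(replicated_catalogs, secondary_catalogs):
    """Return catalog names present in both lists, sorted.

    Names are compared case-insensitively (an assumption made here to
    err on the side of reporting a collision).
    """
    replicated = {name.lower() for name in replicated_catalogs}
    existing = {name.lower() for name in secondary_catalogs}
    return sorted(replicated & existing)


# Hypothetical catalog names for illustration.
collisions = find_name_collisions(["sales", "finance"], ["Finance", "scratch"])
if collisions:
    print("Rename or remove these catalogs in the secondary metastore:", collisions)
```

An empty result means none of the catalogs you plan to replicate clash with an existing catalog in the secondary metastore.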

Enable Mission Critical on both workspaces

Enable the Mission Critical add-on on both your primary and secondary workspaces before you create a failover group. Compute usage on each workspace where you enable the add-on is billed at the Mission Critical rate. Contact your Databricks account team for the current rate.

  1. In the account console, click Workspaces, then click the workspace.
  2. Click the Add-ons tab.
  3. On the Mission Critical card, turn on the toggle and confirm.

Repeat for the secondary workspace.

Set up replication

A new failover group transitions through CREATING → INITIAL_REPLICATION → ACTIVE. The first replication cycle copies all in-scope data to the secondary. For large workspaces, the initial workspace asset bootstrap can take up to two weeks. This wait is one-time. After the initial bootstrap completes, replication runs continuously.

During replication, secondary in-scope catalogs are read-only and compute is unavailable in the secondary workspace. To run validation queries without writing to the secondary, Databricks recommends a separate read-only monitor workspace in the secondary region.

To create a failover group:

  1. In the account console, click Resilience.
  2. If you plan to use a stable URL, click the Stable URLs tab, then Create stable URL. Enter a name, select the current primary workspace, and create the stable URL. Point downstream clients (JDBC, ODBC, the Databricks web UI, direct API requests) at the stable URL instead of the original workspace URL.
  3. Click the Failover groups tab, then Create failover group.
  4. Fill in the form:
    • Failover group name: A name you choose for the failover group.
    • Primary workspace: The workspace that is your primary.
    • Secondary workspace: The workspace in the secondary region.
    • Replicate workspace assets (optional): Off by default. Turn on to replicate notebooks, jobs, SQL warehouses, clusters, dashboards, files, and folders (and their ACLs) from the primary to the secondary. Requires both workspaces to have the Mission Critical add-on enabled. If you turn on workspace asset replication, Databricks deletes any existing in-scope assets in the secondary when initial replication completes. Out-of-scope assets are not affected.
    • Stable URL (optional): The stable URL you created in step 2.
    • Replication scope: The catalogs to replicate. You must select a primary workspace before this field is available.
    • Storage mappings: For each external location that your replicated catalogs use in the primary region, add an entry that maps its storage path to the corresponding external location you created in the secondary region (see Requirements). You can use * as a wildcard for prefix matching.
  5. Click Create failover group.

For example, an AWS storage mapping might map s3://primary-bucket/data/* to s3://secondary-bucket/data/*.
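To make the wildcard mapping concrete, the sketch below shows one way a prefix mapping with a trailing * could translate a primary path into a secondary path. It illustrates the idea only: the function name is hypothetical, and details such as "first matching entry wins" are assumptions, not a description of how managed DR resolves mappings internally.

```python
def map_storage_path(path, mappings):
    """Translate a primary storage path using the first matching mapping.

    mappings is a list of (source_pattern, target_pattern) pairs.
    A trailing "*" on the source pattern means "match any path with this prefix".
    Returns None when no mapping matches.
    """
    for source, target in mappings:
        if source.endswith("*"):
            prefix = source[:-1]
            if path.startswith(prefix):
                if target.endswith("*"):
                    # Carry the part matched by the wildcard over to the target.
                    return target[:-1] + path[len(prefix):]
                return target
        elif path == source:
            return target
    return None


mappings = [("s3://primary-bucket/data/*", "s3://secondary-bucket/data/*")]
print(map_storage_path("s3://primary-bucket/data/sales/part-0.parquet", mappings))
```

A path that returns None here corresponds to a location with no mapping entry, which is the situation to look for before replication starts.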

Use the stable URL

The stable URL always resolves to the current primary workspace, so clients that connect through it do not need to be reconfigured after a failover. Point the following downstream clients at the stable URL instead of the original workspace URL:

  • The Databricks web UI.
  • JDBC and ODBC connections to SQL warehouses.
  • Direct REST API requests.

Stable URLs are supported with front-end (inbound) Private Link, but the URL format differs from the standard form.

The original workspace URL keeps working, but it does not survive failover. Databricks recommends migrating clients to the stable URL.
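One way to keep clients failover-safe is to hold the host name in a single configuration value and build every request URL from it. The sketch below assumes a hypothetical environment variable WORKSPACE_HOST and a placeholder host name; the real stable URL is whatever you created in the account console. The /api/2.0/clusters/list path is used only as a familiar REST endpoint.

```python
import os


def api_url(host, endpoint):
    """Build a REST API URL for the given workspace host."""
    return f"https://{host}/api/2.0/{endpoint.lstrip('/')}"


# WORKSPACE_HOST is a hypothetical variable name; the default is a placeholder,
# not a real stable URL.
host = os.environ.get("WORKSPACE_HOST", "stable.example.com")
print(api_url(host, "/clusters/list"))
```

With this layout, pointing a client at the stable URL is a one-line configuration change, and no further change is needed after a failover.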

Monitor replication

The Failover groups tab shows each failover group's current state, replication point, and any active errors. Possible states:

| State | Meaning |
| --- | --- |
| CREATING | The failover group is being provisioned. |
| INITIAL_REPLICATION | The first replication cycle is in progress. Failover is not yet available. |
| ACTIVE | Replication is in steady state. Failover is available. |
| FAILING_OVER | A failover is in progress. |
| FAILOVER_FAILED, CREATION_FAILED, DELETION_FAILED | The operation did not complete. Check the failover group's status details for guidance. |
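If you track failover group state in your own tooling, the states listed above can be modeled as an enumeration. This is a minimal sketch with hypothetical names. It treats ACTIVE as the only state in which failover is available, which is stated for ACTIVE and INITIAL_REPLICATION and assumed here for the remaining states.

```python
from enum import Enum


class FailoverGroupState(Enum):
    CREATING = "CREATING"
    INITIAL_REPLICATION = "INITIAL_REPLICATION"
    ACTIVE = "ACTIVE"
    FAILING_OVER = "FAILING_OVER"
    FAILOVER_FAILED = "FAILOVER_FAILED"
    CREATION_FAILED = "CREATION_FAILED"
    DELETION_FAILED = "DELETION_FAILED"


FAILED_STATES = {
    FailoverGroupState.FAILOVER_FAILED,
    FailoverGroupState.CREATION_FAILED,
    FailoverGroupState.DELETION_FAILED,
}


def failover_available(state):
    """True only in steady state (assumption for states the table does not cover)."""
    return state is FailoverGroupState.ACTIVE


def needs_attention(state):
    """True when an operation did not complete and status details should be checked."""
    return state in FAILED_STATES


state = FailoverGroupState("INITIAL_REPLICATION")
print(state.name, failover_available(state), needs_attention(state))
```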

Select a failover group's name to open its detail page. The replication point shows when all in-scope resources were last copied. Data written after the replication point might not exist in the secondary and can be lost during failover.

For historical RPO trends and per-asset replication errors, query the system.replication.states system table. See Replication system table reference.
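As a rough illustration of tracking RPO over time, the sketch below computes the worst replication lag across a series of observations and compares it with a target. The samples are made-up (observation time, replication point) pairs; the sketch does not assume anything about the columns of the system table, so you would need to extract these two timestamps yourself.

```python
from datetime import datetime, timedelta, timezone


def replication_lag(observed_at, replication_point):
    """Time between an observation and the replication point seen at that moment."""
    return observed_at - replication_point


def worst_lag(samples):
    """Largest lag across (observed_at, replication_point) samples."""
    return max(replication_lag(observed, point) for observed, point in samples)


# Hypothetical samples for illustration.
base = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
samples = [
    (base, base - timedelta(minutes=4)),
    (base + timedelta(hours=1), base + timedelta(minutes=48)),
    (base + timedelta(hours=2), base + timedelta(hours=1, minutes=54)),
]
rpo_target = timedelta(minutes=15)  # assumed target, set your own

print("worst lag:", worst_lag(samples))
print("within RPO target:", worst_lag(samples) <= rpo_target)
```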

Fail over and fail back

The same procedure covers planned failovers (DR tests, scheduled maintenance) and unplanned failovers (a regional outage). To fail back, repeat the procedure with the regions reversed.

When you trigger a failover, Databricks:

  • Points the stable URL, if attached, to the new primary region.
  • Reverses the direction of replication.
  • Pauses job schedules in the former primary.
  • Transitions the failover group through FAILING_OVER to INITIAL_REPLICATION.

To fail over:

  1. Notify your team that a failover is starting.

  2. For a planned failover only:

    1. In the primary workspace, terminate all running clusters and stop all SQL warehouses.
    2. Confirm that writes to the primary have stopped, then wait for replication to catch up. To check, open the failover group's detail page and confirm the replication point is within a few seconds of the time you stopped writes.
  3. In the account console, click Resilience > Failover groups, then click the failover group's name.

  4. Click Fail over.

  5. Select the new primary region and confirm. The failover completes in minutes.

  6. In the new primary, start the compute that was running before the failover. Replicated clusters and SQL warehouses arrive in the new primary in TERMINATED and STOPPED state, respectively.

  7. Manually resume the job schedules you need in the new primary. The former primary's schedules are already paused.

Clients connected through the stable URL continue working after the failover. Repoint clients that still use the original workspace URL to either the stable URL or the new primary's workspace URL.
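The catch-up check in the planned failover steps above can be expressed as a simple comparison. This sketch assumes you record the time writes stopped and read the replication point from the failover group's detail page; the function name and the 5-second tolerance are illustrative choices standing in for "within a few seconds".

```python
from datetime import datetime, timedelta, timezone


def replication_caught_up(replication_point, writes_stopped_at,
                          tolerance=timedelta(seconds=5)):
    """True when the replication point is no more than `tolerance`
    behind the time writes to the primary stopped."""
    return replication_point >= writes_stopped_at - tolerance


# Hypothetical timestamps for illustration.
stopped = datetime(2025, 1, 1, 9, 0, 0, tzinfo=timezone.utc)
print(replication_caught_up(stopped - timedelta(seconds=2), stopped))
print(replication_caught_up(stopped - timedelta(minutes=3), stopped))
```

If the check fails, wait for another replication cycle and compare again before you click Fail over.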

Important: In an unplanned failover, data written to the primary after the last replication point might be lost. Confirm that any loss falls within your RPO target.

Tip: Test failover regularly, such as once per quarter, so your team is familiar with the procedure before an actual outage.

Common replication errors

Replication errors appear on the failover group's detail page in the Resilience section of the account console, and in the system.replication.states system table. The following errors are the most common:

| Error | Resolution |
| --- | --- |
| Missing external location | Create the missing external location in the secondary metastore, or update the failover group's storage mappings to include the path. |
| Network setup error | Verify the network connectivity setup of the secondary workspace, including any required network connectivity configurations and approved private endpoints. |
| Missing dependency | A view in the failover group depends on a catalog that is not in the failover group. Add the dependency catalog, or remove the dependent view. |
| Request limit exceeded | The system retries automatically. Contact Databricks support if the error persists. |
| Missing source | Informational. The asset no longer exists in the primary; no action required. |

Tear down managed DR

  1. In the account console, click Resilience > Failover groups, then click the failover group's name and delete it. You cannot turn off Mission Critical while a failover group is active on the workspace.
  2. To stop billing at the Mission Critical rate, turn off Mission Critical on each workspace from the Add-ons tab.

Limitations

Managed DR has the following limitations:

  • Not replicated: materialized views, streaming tables, Lakeflow Spark Declarative Pipelines pipelines, managed volume data (metadata replicates), Unity Catalog and workspace secrets, ML models, model serving endpoints, vector search indexes, Delta shares, published AI/BI dashboards (drafts replicate), and Spark Structured Streaming outside Lakeflow Spark Declarative Pipelines.
  • Tables with row filters or column masks and ABAC-tagged resources are flagged as failed to replicate in the system table, and these failures hold up RPO until you remove the resource from the failover group's scope.
  • Secondary in-scope catalogs are read-only. Read-only applies only to replicated entities. You can still set up your own replication for securables outside the managed DR scope. However, you cannot run compute on the secondary workspace while managed DR is enabled, which limits operating a do-it-yourself replication pipeline there.
  • Renaming a Unity Catalog securable triggers a delete and recreate in the secondary. For managed tables, the rename re-replicates the table data on the next cycle. Avoid renaming during steady-state replication.
  • UNDROP is not propagated to the secondary.
  • Maximum 300 catalogs per account.
  • Maximum 100 failover groups per account.
  • Initial workspace asset bootstrap can take up to two weeks for large workspaces.
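The numeric limits above are easy to check when planning failover groups. The following is a sketch only: the function name and input shape are hypothetical, and it assumes the catalog limit counts distinct catalogs in replication scope across all failover groups in the account, which the list above does not spell out.

```python
MAX_CATALOGS_PER_ACCOUNT = 300
MAX_FAILOVER_GROUPS_PER_ACCOUNT = 100


def check_limits(groups):
    """Return a list of limit violations for a planned layout.

    groups maps a failover group name to the catalog names in its scope.
    """
    problems = []
    if len(groups) > MAX_FAILOVER_GROUPS_PER_ACCOUNT:
        problems.append(
            f"{len(groups)} failover groups exceeds the limit of "
            f"{MAX_FAILOVER_GROUPS_PER_ACCOUNT}"
        )
    # Assumption: the limit applies to distinct catalogs across all groups.
    catalogs = {name for names in groups.values() for name in names}
    if len(catalogs) > MAX_CATALOGS_PER_ACCOUNT:
        problems.append(
            f"{len(catalogs)} catalogs exceeds the limit of "
            f"{MAX_CATALOGS_PER_ACCOUNT}"
        )
    return problems


# Hypothetical plan for illustration.
plan = {"core": ["sales", "finance"], "analytics": ["marketing"]}
print(check_limits(plan) or "Plan is within the documented limits")
```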

For DR concepts and best practices, see Disaster recovery.