Disaster recovery

Disaster recovery (DR) for Databricks replicates workspaces, data, and configurations across cloud regions so your teams keep working when a regional outage takes your primary deployment offline. A complete DR plan covers not only Databricks but the data sources, ingestion tools, BI tools, and schedulers it connects to.

This page covers the concepts, strategies, tooling, and test procedures you need to design and run a cross-region DR solution.

New to DR planning? Start with Disaster recovery industry terminology for definitions of RPO and RTO.

important

Use managed disaster recovery. Databricks recommends managed disaster recovery for cross-region DR on AWS and Azure. It replicates Unity Catalog metadata, managed table data, and workspace assets on a continuous schedule, provides a stable URL that survives failover, and lets you trigger failover from the account console. You don't need to write or maintain replication scripts. Use the DIY guidance on this page only for resources managed DR doesn't replicate, or if you require active-active topologies, cross-cloud replication, or fine-grained control over the replication pipeline.

Intra-region high availability guarantees

The rest of this page covers cross-region DR, but Databricks also provides high availability (HA) inside a single region. Understand these guarantees first. They determine whether you need a separate DR strategy.

HA and DR solve different problems:

  • HA uses availability zone (AZ) redundancy inside a region. If one zone fails, services keep running in the others.
  • DR uses inter-region replication. You run secondary Databricks workspaces in another region and replicate data and configurations to them, then fail over during a regional outage.

If you don't need multi-region DR, Databricks HA might be enough. HA avoids cross-region complexity but doesn't protect against a full-region outage. If you rely on HA alone, verify that the availability zones in your cloud region are sufficiently separated and redundant for your needs.

Intra-region HA guarantees cover the control plane and the compute plane.

Availability of the Databricks control plane

The Databricks control plane is resilient to zone failures and recovers automatically within approximately 15 minutes of a zone failure. Regular zone-failure testing validates this.

All stateless control plane services can lose individual VMs, or all VMs in an entire zone, without taking the service down. Workspace data is stored in databases replicated across zones in the region. Storage accounts that serve Databricks Runtime images are also redundant inside the region, and all regions have secondary storage accounts that take over when the primary is down.

note

The control plane guarantees above apply to Databricks-managed infrastructure. You are responsible for compute-plane zone redundancy, for example, choosing zone-redundant storage for the workspace root bucket and using instance pools that span availability zones.

Zone failure resilience supports at most one zone being down in a region.

Availability of the compute plane

Workspace availability depends on the availability of the control plane.

DBFS root data is not affected by a zone failure unless the root bucket is configured with a one-zone storage option. Amazon S3 data is regional by default, with data redundancy across zones.

Cluster nodes are allocated in a single availability zone (AZ) from the AWS compute provider. If a node is lost, the cluster manager requests replacement nodes from the AWS compute provider in the same AZ. The exception is when the driver node is lost. In that case, the cluster manager restarts the job and cluster. If the AZ containing the cluster experiences a zone failure, the cluster manager restarts the job and cluster in a different AZ.

Terminology

Use these definitions consistently when discussing DR with your team.

Region terminology

This page uses the following region definitions:

  • Primary region: The region where users run daily interactive and automated data analytics workloads.

  • Secondary region: The region where IT teams move workloads temporarily during a primary-region outage.

  • Geo-redundant storage: Asynchronous cross-region replication of persisted storage. See your cloud's documentation:

    Geo-redundant storage across regions (AWS).

important

Do not rely on geo-redundant storage to duplicate Databricks root storage (such as the Amazon S3 bucket Databricks creates for each workspace) across regions. To replicate managed table data, use Delta Deep Clone, and for non-Delta data, convert to Delta first where possible.

Deployment status terminology

This page uses the following deployment-status definitions:

  • Active deployment (sometimes called hot deployment): Users connect to it and run workloads. Jobs and data streams run here on schedule.

  • Passive deployment (sometimes called cold deployment): No processes run here. IT teams keep it ready by automating deployment of code, configuration, and other Databricks objects to it. A passive deployment becomes active only when the active deployment goes down.

    important

    A project can include multiple passive deployments in different regions for additional resilience.

Most teams run one active deployment at a time, which is called the active-passive strategy. The less common active-active strategy runs two simultaneous active deployments.

Disaster recovery industry terminology

Define these two industry terms with your team:

  • Recovery point objective (RPO): The maximum period of data loss your service can tolerate during a major incident. See RPO.

    Databricks doesn't store your primary customer data. That lives in Amazon S3 or other systems you control. The Databricks control plane stores some objects (such as jobs and notebooks), so the Databricks RPO is the maximum period in which changes to those objects can be lost. You're responsible for defining the RPO for your customer data in Amazon S3 and other data sources you control.

  • Recovery time objective (RTO): The maximum time within which a business process must be restored after a disaster. See RTO.
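
To make the link between RPO and a replication schedule concrete, here is a minimal sketch. It assumes a simplified model that this page doesn't state: with a scheduled sync, the worst case is a failure just before a sync run finishes, which loses one full interval plus the duration of one run. The function names and figures are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(sync_interval: timedelta, sync_duration: timedelta) -> timedelta:
    """Upper bound on lost changes under a simple scheduled-sync model.

    Assumption: a failure just before a sync run completes loses everything
    written since the previous run started, i.e. one interval plus one run.
    """
    return sync_interval + sync_duration


def meets_rpo(sync_interval: timedelta, sync_duration: timedelta, rpo: timedelta) -> bool:
    """Return True if the modeled worst case fits inside the target RPO."""
    return worst_case_data_loss(sync_interval, sync_duration) <= rpo
```

Under this model, an hourly sync that takes 10 minutes can lose up to 70 minutes of changes, so it fits a 2-hour RPO but not a 1-hour RPO.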

Disaster recovery and data corruption

A DR solution does not mitigate data corruption. Corrupted data in the primary region is replicated to the secondary region and is corrupted in both regions. To mitigate this kind of failure, use Delta time travel, similar tools, or data backup tools.
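
As an illustration of the time travel option, the following sketch builds a Delta `RESTORE` statement that rolls a table back to a known-good version. The table name and version number are hypothetical. Identifying the last good version (for example, with `DESCRIBE HISTORY`) and running the statement with `spark.sql` or a SQL warehouse are left to you.

```python
def restore_statement(table: str, version: int) -> str:
    """Build a Delta Lake RESTORE statement for a known-good table version."""
    if version < 0:
        raise ValueError("Delta table versions start at 0")
    return f"RESTORE TABLE {table} TO VERSION AS OF {version}"


# Hypothetical table and version, for illustration only.
statement = restore_statement("main.sales.orders", 41)
```

Because replication copies the table as it is, run the restore wherever the corrupted table is the source of truth, then let your normal sync carry the corrected state to the other region.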

Typical recovery workflow

A Databricks DR scenario typically plays out as follows:

  1. A failure hits a critical service in your primary region: a data source, a network, or another dependency the Databricks deployment relies on.
  2. You investigate with your cloud provider.
  3. If the wait is unacceptable, you decide to fail over to your secondary region.
  4. Confirm the same problem doesn't affect your secondary region.
  5. Fail over (for detailed steps, see Test failover):
    1. Stop all workspace activity. Users stop workloads and back up recent changes where possible. Jobs shut down (if the outage hasn't already failed them).
    2. Run the secondary-region recovery procedure to update routing and redirect connections and network traffic.
    3. Repoint downstream systems (BI tools, schedulers, third-party integrations) to the secondary workspace and resume their connections.
    4. After testing, declare the secondary region operational. Users log in to the now-active deployment, and you retrigger scheduled or delayed jobs.
  6. After the primary-region issue is mitigated, confirm the fix.
  7. Fail back (for details, see Test restore (failback)):
    1. Stop all work in the secondary region.
    2. Run the primary-region recovery procedure to redirect routing back.
    3. Replicate any new data back to the primary region. Minimize what needs to replicate. For example, read-only jobs that ran in the secondary deployment might not require write-back.
    4. Test the primary-region deployment.
    5. Declare the primary region active and resume production workloads.

important

Some data loss can occur during these steps. Define how much loss is acceptable for your organization, and how you mitigate it.

Step 1: Understand your business needs

Identify which data services are critical and define their target RPO and RTO. Research each system's real-world tolerance.

DR, failover, and failback carry real costs and risks, including data corruption, data duplication (writing to the wrong storage location), and users making changes in the wrong region.

Map every Databricks integration point that affects your business, and choose the tools and communication channels your plan uses.

Integration points to map

  • Does your DR solution need to accommodate interactive processes, automated processes, or both?
  • Which data services do you use? Some might be on-premises.
  • How does input data get to the cloud?
  • Who consumes this data? What processes consume it downstream?
  • Are there third-party integrations that need to be aware of DR changes?

Tools and communication to plan

  • Can you predefine your configuration and make it modular to accommodate DR solutions in a natural and maintainable way?
  • Which communication tools and channels notify internal teams and third parties (integrations, downstream consumers) of DR failover and failback changes? How do you confirm their acknowledgement?
  • What services, if any, do you shut down until complete recovery is in place?

Step 2: Choose a process that meets your business needs

Default to managed disaster recovery. It handles workspace replication, Unity Catalog metadata, managed-table data, and failover orchestration without custom scripts. Use the DIY guidance below only if you fall outside its scope, for example, resources managed DR doesn't replicate, active-active topologies, cross-cloud replication, or fine-grained control over the replication pipeline.

A DIY solution must replicate the correct data across the control plane, compute plane, and data sources. Redundant workspaces map to different control planes in different regions, so you keep them in sync with a script-based solution, either a synchronization tool or a CI/CD workflow. For the data itself, most teams use Databricks jobs (often scheduled) or Delta Deep Clone to copy tables between regions. You don't need to sync data from within the compute plane (such as from Databricks Runtime workers).

If you use the customer-managed VPC feature (not available with all subscription and deployment types), deploy networks consistently in both regions using template-based tooling like Terraform.

Replicate your data sources across regions as needed.

DR solutions typically involve two (or more) workspaces. Choose between the following strategies based on the disruption length you must tolerate, operational effort, and the cost to fail back to the primary region.

General best practices

General best practices for a successful DR plan include:

  1. Understand which processes are critical to the business and must run in DR.
  2. Clearly identify which services are involved, which data is being processed, what the data flow is, and where it is stored.
  3. Isolate the services and data as much as possible. For example, create a special cloud storage container for the DR data or move Databricks objects needed during a disaster to a separate workspace.
  4. You are responsible for maintaining integrity between primary and secondary deployments for objects not stored in the Databricks control plane.
  5. For data sources, use native AWS tools to replicate data to your DR regions where possible.

warning

Don't store data in the root Amazon S3 bucket used for DBFS root access. DBFS root storage is unsupported for production customer data. Databricks also recommends against storing libraries, configuration files, or init scripts there.

Active-passive solution strategy

This section focuses on the active-passive strategy because it's the most common, the simplest, and the most cost-effective. An active-passive solution syncs data and object changes from your active deployment to a passive deployment in a secondary region. During a DR event, the passive deployment becomes active.

Two common variants:

  • Unified (enterprise-wide): One set of active and passive deployments supports the entire organization.
  • By department or project: Each domain maintains its own DR solution with primary and secondary regions tailored to its needs.

You can also use a passive deployment for read-only workloads, such as user queries, that don't modify data or Databricks objects.

Active-active solution strategy

In an active-active solution, all data processes run in both regions in parallel at all times. Your operations team must mark each job complete only after it succeeds in both regions. Objects can't change in production and must follow strict CI/CD promotion from dev/staging to production.

Active-active is the most complex strategy and costs more because jobs run in both regions, but it offers the lowest RTO and RPO.

You can implement active-active enterprise-wide or by department. You don't need a duplicate workspace for every workload. For example, dev or staging workspaces are often easier to reconstruct from a development pipeline than to keep in sync.

Choose your tooling

There are two main approaches for keeping data in sync between workspaces in your primary and secondary regions:

  • Synchronization client that copies from primary to secondary: A sync client pushes production data and assets from the primary region to the secondary region. Typically, this runs on a scheduled basis, and the schedule frequency depends on your target RTO and RPO.
  • CI/CD tooling for parallel deployment: For production code and assets, use CI/CD tooling that pushes changes to production systems simultaneously to both regions. For example, when pushing code and assets from staging/development to production, a CI/CD system makes it available in both regions at the same time. The core idea is to treat all artifacts in a Databricks workspace as infrastructure-as-code. Most artifacts could be co-deployed to both primary and secondary workspaces, while some artifacts might need to be deployed only after a DR event. For tools, see Automation scripts, samples, and prototypes.

Depending on your needs, you can combine the approaches. For example, use CI/CD for notebook source code, but use synchronization for configuration like pools and access controls.

The following table describes how to handle each type of data with each tooling option.

| Description | How to handle with CI/CD tooling | How to handle with sync tool |
| --- | --- | --- |
| Source code: notebook source exports and source code for packaged libraries | Co-deploy both to primary and secondary. | Synchronize source code from primary to secondary. |
| Users and groups | Manage metadata as config in Git. Alternatively, use the same identity provider (IdP) for both workspaces. Co-deploy user and group data to primary and secondary deployments. | Use SCIM or other automation for both regions. Manual creation is not recommended, but if used must be done for both at the same time. If you use a manual setup, create a scheduled automated process to compare the list of users and groups between the two deployments. |
| Pool configurations | Can be templates in Git. Co-deploy to primary and secondary. However, min_idle_instances in secondary must be zero until the DR event. | Pools can be created with any min_idle_instances when they are synced to a secondary workspace using the API or CLI. |
| Job configurations | Use Databricks Asset Bundles with per-environment targets (for example, prod and dr) to deploy the same job definition to both regions. For the secondary deployment, set concurrency to zero so the job is staged but doesn't run. Change the concurrency value after the secondary deployment becomes active. | If the jobs run on existing (interactive) clusters, the sync client needs to map each one to the corresponding cluster_id in the secondary workspace. |
| Access control lists (ACLs) | Can be templates in Git. Co-deploy to primary and secondary deployments for notebooks, folders, and clusters. However, hold the data for jobs until the DR event. | The Permissions API can set access controls for clusters, jobs, pools, notebooks, and folders. A sync client needs to map to corresponding object IDs for each object in the secondary workspace. Databricks recommends creating a map of object IDs from primary to secondary workspace while syncing those objects before replicating the access controls. |
| Libraries | Include in source code and cluster/job templates. | Sync custom libraries from centralized repositories, DBFS, or cloud storage (can be mounted). |
| Cluster init scripts | Include in the source code if you prefer. | For simpler synchronization, store init scripts in the primary workspace in a common folder or in a small set of folders if possible. |
| Mount points | Include in source code if created only through notebook-based jobs or Command API. | Use jobs. Storage endpoints might change because the workspaces are in different regions. This also depends on your data DR strategy. |
| Table metadata | For Unity Catalog objects (catalogs, schemas, tables, volumes, and grants), co-deploy with the Databricks Terraform provider or Databricks Asset Bundles. For legacy Hive metastore tables, include create-table statements with source code if created through notebook-based jobs or the Command API. | For Unity Catalog objects, read source metadata from system tables or information_schema and replicate to the secondary workspace using the Databricks SDK. For legacy Hive metastore tables, compare metadata definitions between metastores using the Spark Catalog API or SHOW CREATE TABLE via a notebook or scripts. Underlying storage paths can be region-based and might differ between metastore instances. |
| Secrets | Include in source code if created only through Command API. Some secrets content might need to change between the primary and secondary. | Secrets are created in both workspaces via the API. Some secrets content might need to change between the primary and secondary. |
| Cluster configurations | Can be templates in Git. Co-deploy to primary and secondary deployments, although the ones in secondary deployment should be terminated until the DR event. | Clusters are created after they are synced to the secondary workspace using the API or CLI. Those can be explicitly terminated if you want, depending on auto-termination settings. |
| Notebook, job, and folder permissions | Can be templates in Git. Co-deploy to primary and secondary deployments. | Replicate using the Permissions API. |
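
The pool and job rows above can be sketched as a small transformation applied to a template before it is deployed to the secondary workspace. This is an illustration, not a Databricks tool. It assumes settings dictionaries shaped like Instance Pools and Jobs API payloads, and it assumes that the job "concurrency" mentioned in the table corresponds to the `max_concurrent_runs` job setting. The cluster ID map is a hypothetical structure that a sync client would build while syncing clusters.

```python
import copy


def stage_pool_for_secondary(pool: dict) -> dict:
    """Copy a pool template and keep no idle instances until a DR event."""
    staged = copy.deepcopy(pool)
    staged["min_idle_instances"] = 0
    return staged


def stage_job_for_secondary(job: dict, cluster_id_map: dict) -> dict:
    """Copy job settings, block new runs, and remap existing cluster IDs.

    cluster_id_map maps primary cluster IDs to secondary cluster IDs
    (hypothetical values that a sync client would collect while syncing).
    """
    staged = copy.deepcopy(job)
    # Assumption: zero concurrent runs keeps the job staged without running.
    staged["max_concurrent_runs"] = 0
    for task in staged.get("tasks", []):
        old_id = task.get("existing_cluster_id")
        if old_id is not None:
            # Raises KeyError if a cluster has no counterpart in the secondary.
            task["existing_cluster_id"] = cluster_id_map[old_id]
    return staged
```

Both functions return copies, so the primary templates stay untouched and the same source of truth can feed both deployments. During failover, the inverse step restores the original `min_idle_instances` and `max_concurrent_runs` values.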

Choose regions and multiple secondary workspaces

You control when DR triggers and which secondary region you fail over to. You're also responsible for stabilizing the DR environment before resuming normal operations. This typically means creating multiple Databricks workspaces for production and DR, then choosing a secondary failover region.

Before selecting your secondary region, confirm that all the resources and services you depend on (compute types, products, integrations) are available there. Some Databricks services are only available in specific regions.

Step 3: Prep workspaces and do a one-time copy

First, stand up a secondary Databricks workspace (or workspaces) and its supporting metastore in your chosen secondary region. The secondary workspace must mirror the primary's account and identity configuration, adapted to the secondary region, before you can replicate data or assets to it.

If you use managed disaster recovery, Databricks handles the initial bootstrap of in-scope catalogs and workspace assets when you create a failover group. You don't need to run a one-time copy for those resources. Continue with the rest of this section for any data sources or assets that managed DR doesn't replicate.

For a production workspace running outside managed DR's scope, run a one-time copy to sync the passive deployment with the active deployment. This copy handles:

  • Data replication: Use a cloud replication solution or Delta Deep Clone.
  • Token generation: Automate replication and future workloads with generated tokens.
  • Workspace replication: Replicate using the methods in Step 4: Prepare your data sources. For comprehensive guidance on exporting workspace configuration, data, and AI/ML assets, see Export workspace data.
  • Workspace validation: Test the workspace and process to confirm they execute successfully and produce the expected results.

Subsequent syncs run faster than the initial copy, and your tooling logs record what changed and when.
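
For the data replication item, a minimal sketch of generating Delta Deep Clone statements for a list of tables follows. The catalog names `prod` and `prod_dr` are hypothetical, and the sketch assumes the target catalog's storage lives in the secondary region. Running the statements (for example, with `spark.sql` inside a scheduled Databricks job) is not shown.

```python
def deep_clone_statements(tables: list[str], source_catalog: str, target_catalog: str) -> list[str]:
    """Build one DEEP CLONE statement per table, given 'schema.table' names."""
    return [
        f"CREATE OR REPLACE TABLE {target_catalog}.{name} "
        f"DEEP CLONE {source_catalog}.{name}"
        for name in tables
    ]


# Hypothetical table list and catalog names.
statements = deep_clone_statements(["sales.orders", "sales.customers"], "prod", "prod_dr")
```

The same generated statements can serve both the one-time copy and later scheduled syncs, which keeps the initial bootstrap and the ongoing replication on one code path.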

Step 4: Prepare your data sources

Databricks can process a large variety of data sources using batch processing or data streams.

Batch processing from data sources

Batch data usually resides in a source you can replicate or deliver to another region.

For example, data often uploads to cloud storage on a schedule. In DR mode, point those uploads at your secondary-region storage and update workloads to read from and write to that storage.

Data streams

Processing a data stream is a bigger challenge. Streaming data can be ingested from various sources, processed, and sent to a streaming solution:

  • Message queue such as Kafka
  • Database change data capture stream
  • File-based continuous processing
  • File-based scheduled processing, also known as trigger once

In all of these cases, you must configure your data sources to handle DR mode and to use your secondary deployment in your secondary region.

A stream writer stores a checkpoint with information about the data that has been processed. This checkpoint can contain a data location (usually cloud storage) that has to be modified to a new location to ensure a successful restart of the stream. For example, the source subfolder under the checkpoint might store the file-based cloud folder.

This checkpoint must be replicated promptly. Consider aligning the checkpoint interval with the replication schedule of any cloud replication solution you adopt.

The writer performs the checkpoint update, so this applies both to data stream ingestion and to processing that stores results on another streaming source.

For streaming workloads, ensure that checkpoints are configured in customer-managed storage so that they can be replicated to the secondary region for workload resumption from the point of last failure. You might also choose to run the secondary streaming process in parallel to the primary process.
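
Both the batch and the streaming guidance in this step come down to parameterizing storage locations by deployment. The sketch below shows one way to do that. The bucket names, the `RegionStorage` type, and the `resolve_paths` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionStorage:
    landing_root: str      # where batch uploads arrive
    checkpoint_root: str   # customer-managed storage for stream checkpoints


# Hypothetical bucket names; replace with your own storage locations.
STORAGE = {
    "primary": RegionStorage("s3://example-landing-primary", "s3://example-checkpoints-primary"),
    "secondary": RegionStorage("s3://example-landing-secondary", "s3://example-checkpoints-secondary"),
}


def resolve_paths(deployment: str, dataset: str, stream: str) -> dict:
    """Return the input path and checkpoint location for a deployment."""
    storage = STORAGE[deployment]  # KeyError on an unknown deployment name
    return {
        "input_path": f"{storage.landing_root}/{dataset}",
        "checkpoint_location": f"{storage.checkpoint_root}/{stream}",
    }
```

A job that reads its deployment name from a parameter can then switch to DR mode without code changes. This sketch only selects locations. It doesn't rewrite any source paths recorded inside a replicated checkpoint, which you still need to handle as described above.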

Step 5: Implement and test your solution

If you use managed disaster recovery, you can trigger a planned failover from the account console to validate that your setup works end to end. The same procedure covers both DR tests and real outages. See Fail over and fail back.

Test your DR setup regularly. An untested DR plan often fails when you need it. Some teams switch active regions every few months on a schedule to validate assumptions, exercise processes, and keep the team familiar with the runbook.

important

Test your DR solution in real-world conditions on a regular schedule.

If a test reveals a missing object or template, update your plan: remove the dependency, replicate it to the secondary workspace, or make it available another way.

Test the organizational and configuration changes too. Your DR plan affects your deployment pipeline, so the team must know what to keep in sync. After you set up DR workspaces, confirm that your infrastructure, jobs, notebooks, libraries, and other workspace objects are available in the secondary region.

Expand your standard work processes and configuration pipelines to deploy changes to all workspaces. Manage user identities across workspaces, and configure job automation and monitoring for the new workspaces.

Plan and test changes to your configuration tooling.

Configuration changes to plan and test

For each of the following, prepare a plan for failover and test all assumptions:

  • Ingestion: Understand where your data sources are and where those sources get their data. Where possible, parameterize the source and use a separate configuration template for the secondary deployment and region.
  • Execution changes: If you have a scheduler to trigger jobs or other actions, you might need a separate scheduler that works with the secondary deployment or its data sources.
  • Interactive connectivity: Consider how configuration, authentication, and network connections might be affected by regional disruptions for any use of REST APIs, CLI tools, or other services such as JDBC/ODBC.
  • Automation changes: For all automation tools.
  • Outputs: For any tools that generate output data or logs.
  • Downstream changes: For BI tools, dashboards, schedulers, and third-party integrations that read from or write to Databricks, plan how to repoint them at the secondary workspace and notify their owners.

Test failover

Many scenarios can trigger DR: an unexpected outage in the cloud network, cloud storage, or another core service where you can't shut down gracefully; a planned shutdown or outage; or even periodic switching between regions as part of your test cycle.

To test failover, connect to the primary deployment and run your shutdown procedure. Confirm that all jobs complete and clusters terminate.

A sync client (or CI/CD tooling) replicates relevant Databricks objects and resources to the secondary workspace. To activate the secondary workspace, your process might include some or all of the following:

  1. Run tests to confirm that the platform is up to date.
  2. Disable pools and clusters on the primary region so that if the failed service returns online, the primary region does not start processing new data.
  3. Run the recovery process for your data sources (see below).
  4. Start relevant pools (or increase the min_idle_instances to a relevant number).
  5. Start relevant clusters (if not terminated).
  6. Raise the concurrent-run setting for jobs from its staged value and run the relevant jobs. These could be one-time runs or periodic runs.
  7. For any outside tool that uses a URL or domain name for your Databricks workspace, update configurations to account for the new control plane. For example, update URLs for REST APIs and JDBC/ODBC connections. The Databricks web application's customer-facing URL changes when the control plane changes, so notify your organization's users of the new URL.

Recovery process details

  1. Check the date of the latest synced data. See Disaster recovery industry terminology. The details of this step vary depending on how you synchronize data and your unique business needs.
  2. Stabilize your data sources and ensure that they are all available. Include all external data sources, such as Amazon RDS, and your Delta Lake, Parquet, or other files.
  3. Find your streaming recovery point. Set up the process to restart from there and have a process ready to identify and eliminate potential duplicates (Delta Lake makes this easier).
  4. Complete the data flow process and inform the users.
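
Restarting a stream from an earlier recovery point can replay records that were already processed. The sketch below illustrates the deduplication logic from step 3 in plain Python: keep one record per key and prefer the one with the highest ordering value. The field names are hypothetical. On Delta tables you would more likely express the same idea with a `MERGE` statement or `dropDuplicates`.

```python
def deduplicate(records: list[dict], key: str, order_by: str) -> list[dict]:
    """Keep one record per key, preferring the highest order_by value."""
    latest = {}
    for record in records:
        current = latest.get(record[key])
        if current is None or record[order_by] > current[order_by]:
            latest[record[key]] = record
    return list(latest.values())


# Hypothetical replayed batch: id 2 arrives twice with the same sequence number.
replayed = [
    {"id": 1, "seq": 1},
    {"id": 2, "seq": 1},
    {"id": 1, "seq": 2},
    {"id": 2, "seq": 1},
]
unique = deduplicate(replayed, key="id", order_by="seq")
```

The approach depends on each record carrying a stable key and an ordering column, so confirm that your sources provide both before relying on it.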

Test restore (failback)

Failback is easier to control and can be done in a maintenance window. Plan for some or all of the following steps:

  1. Get confirmation that the primary region is restored.
  2. Disable pools and clusters on the secondary region so it doesn't start processing new data.
  3. Sync any new or modified assets in the secondary workspace back to the primary deployment. Depending on the design of your failover scripts, you might be able to run the same scripts to sync objects from the secondary (DR) region to the primary (production) region.
  4. Sync any new data updates back to the primary deployment. Use the audit trails in logs and Delta tables to verify that no data was lost.
  5. Shut down all workloads in the DR region.
  6. Point jobs and users back to the primary region's workspace URL, and repoint downstream connections (BI tools, schedulers, third-party integrations) back to it.
  7. Run tests to confirm that the platform is up to date.
  8. Start relevant pools (or increase the min_idle_instances to a relevant number).
  9. Start relevant clusters (if not terminated).
  10. Restore the concurrent-run setting for jobs, and run relevant jobs. These could be one-time runs or periodic runs.
  11. As needed, set up your secondary region again for future DR.

Automation scripts, samples, and prototypes

For AWS and Azure, managed disaster recovery handles workspace and managed-table replication without custom automation. The references below apply only if you're building a DIY solution outside managed DR's scope.

For DIY DR pipelines, use the Databricks Terraform provider to manage workspace assets as code and co-deploy to primary and secondary regions.
