
Phase 10: Design high availability and disaster recovery

In this phase, you design high availability (HA) and disaster recovery (DR) strategies to ensure business continuity and resilience.

High availability (HA) and disaster recovery (DR) are important facets of the lakehouse, and many organizations implement both to productionize critical use cases on Databricks.

Design high availability strategy

High availability is the ability to recover from an outage affecting a single cloud zone with minimal impact and transparently to users. It is measured as consistent uptime or percentage of uptime (SLA), with recovery typically in seconds or minutes.

Control plane high availability

The Databricks control plane consists of dozens of services, from front-end hosting and authentication to cluster management and endpoint deployment. As of early 2025, the core services of the control plane are multi-zonal, meaning that if a single zone within a cloud region fails, the control plane can recover these services into another zone and continue operating.

Control plane HA characteristics

  • Automatic failover: Services automatically fail over to healthy zones.
  • 15-minute RTO: Failover typically completes within 15 minutes.
  • 0 RPO: No data loss during zonal failures.
  • Multi-zone coverage: Works for cloud regions with multiple availability zones.
Note

This does not apply to cloud regions with a single availability zone. In single-zone regions, a zonal outage is equivalent to a regional outage.

Compute plane high availability

Classic compute high availability

  • Deploy workspace subnets across at least 2 availability zones (recommended: 3).
  • Databricks automatically distributes cluster nodes across availability zones.
  • Configure job retries with exponential backoff for transient failures.
  • Monitor availability zone health and redistribute workloads during zone degradation.
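The job-retry recommendation above can be sketched as a simple exponential-backoff wrapper. The helper below is illustrative, not a Databricks API; Databricks Jobs expose equivalent behavior declaratively through the task-level `max_retries` and `min_retry_interval_millis` settings.

```python
import random
import time


def retry_with_backoff(fn, max_retries=4, base_delay=1.0, max_delay=60.0):
    """Call fn(), retrying transient failures with exponential backoff.

    Illustrative helper, not a Databricks API: on Databricks, prefer the
    built-in task retry settings (max_retries, min_retry_interval_millis).
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # retry budget exhausted; surface the failure
            # Exponential backoff with jitter: ~1s, 2s, 4s, ... capped at max_delay.
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay * random.uniform(0.5, 1.0))
```

The jitter avoids many failed tasks retrying in lockstep against a recovering zone.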

Serverless compute high availability

  • Serverless compute automatically provides multi-zone failover.
  • No additional configuration required.
  • Automatically redistributes workloads during zone failures.

Best practices for compute HA

  • Use serverless compute for automatic multi-zone failover.
  • For classic compute, ensure subnets span multiple availability zones.
  • Configure automatic job retries for transient failures.
  • Test failover procedures regularly to validate HA configurations.
  • Monitor zone health metrics and adjust capacity planning accordingly.

Design disaster recovery strategy

Disaster recovery is the ability to recover from an outage affecting an entire cloud region and continue business operations in a secondary region. It is measured by acceptable downtime (RTO) and data loss (RPO), typically in hours.

DR considerations by layer

DR must be considered at every layer of the data platform:

Clients

  • Clients must be able to switch to the failover workspace URLs.
  • Update DNS or load balancer configurations for automatic failover.

Code and workspace objects

  • Use IaC/Terraform to deploy the platform infrastructure.
  • Keep all code, such as notebooks, in version control.
  • Deploy on both sites using CI/CD.
  • Automate deployment to secondary region.
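One way to keep both sites deployable from the same CI/CD pipeline is to parameterize the deployment by region so the same versioned artifacts are applied to primary and secondary. The structure below is a hypothetical sketch (region names, URLs, and keys are illustrative), not a Databricks Asset Bundles or Terraform schema:

```python
# Hypothetical per-site deployment descriptors; all names are illustrative.
REGIONS = {
    "primary": {"region": "eu-west-1", "workspace_url": "https://primary.example.com"},
    "secondary": {"region": "eu-central-1", "workspace_url": "https://secondary.example.com"},
}


def render_deployment(site: str, version: str) -> dict:
    """Build one deployment descriptor per site so CI/CD applies the same
    versioned artifacts to both the primary and secondary regions."""
    cfg = REGIONS[site]
    return {
        "site": site,
        "region": cfg["region"],
        "workspace_url": cfg["workspace_url"],
        "artifact_version": version,  # identical code version on both sites
    }


deployments = [render_deployment(site, "1.4.2") for site in REGIONS]
```

Pinning one `artifact_version` across both descriptors is what makes a failover safe: the secondary never lags the primary's code.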

Identities

  • Users, service principals, groups.
  • Use SCIM synchronization at the account level.
  • Identity federation ensures consistent access across regions.

Unity Catalog

  • Terraform or scripts to deploy or copy Unity Catalog infrastructure.
  • Terraform or scripts to copy metadata (table definitions, ACLs, and so on).
  • Terraform or scripts to copy Delta Sharing configuration.
  • Deploy metastores in both primary and secondary regions.
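A metadata-replication script typically diffs object definitions between the two metastores and applies whatever is missing on the secondary. A minimal pure-Python sketch of the diff step (table definitions are represented as plain dicts for illustration; this is not the Unity Catalog API):

```python
def missing_tables(primary: dict, secondary: dict) -> dict:
    """Return table definitions present in the primary metastore but not
    yet replicated to the secondary (keys are 'catalog.schema.table')."""
    return {name: ddl for name, ddl in primary.items() if name not in secondary}


# Illustrative metadata snapshots, as a replication script might export them.
primary_meta = {
    "main.sales.orders": "CREATE TABLE main.sales.orders (id BIGINT, amount DOUBLE)",
    "main.sales.customers": "CREATE TABLE main.sales.customers (id BIGINT, name STRING)",
}
secondary_meta = {
    "main.sales.orders": "CREATE TABLE main.sales.orders (id BIGINT, amount DOUBLE)",
}

to_replicate = missing_tables(primary_meta, secondary_meta)
```

A real script would also diff ACLs and Delta Sharing configuration, then execute the missing DDL against the secondary region.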

Data

  • External and managed data: Use geo-redundant storage replication (for example, for raw data and data sources).
  • Delta tables: Delta Deep Clone for replicating tables across regions.
  • Landing zones: Replicate landing zone data to secondary region.
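Delta Deep Clone replication comes down to one SQL statement per table, executed against the secondary region; re-running a deep clone incrementally syncs changes since the previous run. A small helper that builds the statement (the table names are placeholders):

```python
def deep_clone_sql(source_table: str, target_table: str) -> str:
    """Build the Delta DEEP CLONE statement used to replicate a table.

    DEEP CLONE copies both metadata and data files; re-running the same
    statement incrementally syncs changes since the last clone.
    """
    return f"CREATE OR REPLACE TABLE {target_table} DEEP CLONE {source_table}"


# Placeholder names: in practice the source lives in the primary region's
# metastore and the target in the secondary region's metastore.
stmt = deep_clone_sql("prod.sales.orders", "dr.sales.orders")
```

Scheduling this per critical table (for example, hourly) is what sets the achievable RPO for Delta data.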

Streaming endpoints

  • Checkpoint information needs to be synced across regions.
  • Configure dual writes or checkpoint replication.
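Checkpoint replication can be as simple as mirroring the checkpoint directory tree to the secondary region's storage so a restarted stream resumes near the same offsets. The sketch below uses the local filesystem for illustration; real setups copy between cloud storage accounts, and dual writes remain the stricter alternative when in-flight offsets must not be lost:

```python
import shutil
from pathlib import Path


def replicate_checkpoints(primary_dir: str, secondary_dir: str) -> list:
    """Mirror streaming checkpoint files from the primary to the secondary
    region's storage so a restarted stream can resume near the same offsets.
    Local-filesystem sketch for illustration only."""
    src, dst = Path(primary_dir), Path(secondary_dir)
    copied = []
    for f in src.rglob("*"):
        if f.is_file():
            target = dst / f.relative_to(src)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, target)  # preserve timestamps for auditability
            copied.append(str(target))
    return copied
```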

DR design patterns

Active-passive DR

  • Primary region serves all production traffic.
  • Secondary region on standby for failover.
  • Data replicated periodically (for example, hourly, daily).
  • Manual or automated failover on primary region failure.

Active-active DR

  • Both regions serve production traffic.
  • Data replicated continuously or near-real-time.
  • Load balanced across regions.
  • Higher cost but better performance and RTO.

Backup and restore

  • Regular backups to secondary region.
  • Manual restore process on primary region failure.
  • Lowest cost but highest RTO and RPO.
  • Suitable for non-critical workloads.

Define RTO and RPO requirements

Recovery Time Objective (RTO)

  • How long can the business tolerate downtime?
  • Determines failover automation requirements.
  • Influences infrastructure investment.

Recovery Point Objective (RPO)

  • How much data loss is acceptable?
  • Determines replication frequency.
  • Influences backup and replication strategy.

Example RTO/RPO requirements

  • Critical workloads: RTO < 1 hour, RPO < 15 minutes (active-active or active-passive with continuous replication).
  • Important workloads: RTO < 4 hours, RPO < 1 hour (active-passive with hourly replication).
  • Standard workloads: RTO < 24 hours, RPO < 24 hours (backup and restore).
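The tiering above can be encoded as a simple selection rule. The thresholds are taken directly from the example requirements; the function itself is illustrative, and real pattern selection should also weigh cost and business criticality:

```python
def select_dr_pattern(rto_hours: float, rpo_hours: float) -> str:
    """Map RTO/RPO requirements to a DR pattern using the example
    thresholds above (critical / important / standard workloads)."""
    if rto_hours < 1 and rpo_hours < 0.25:  # RTO < 1 h, RPO < 15 min
        return "active-active or active-passive with continuous replication"
    if rto_hours < 4 and rpo_hours < 1:     # RTO < 4 h, RPO < 1 h
        return "active-passive with hourly replication"
    return "backup and restore"             # standard workloads
```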

Design DR testing strategy

DR procedures must be tested regularly to ensure they work when needed. Design a DR testing strategy appropriate for your RTO/RPO requirements.

DR testing patterns

  • Quarterly DR tests: Full failover to secondary region, validate all systems.
  • Monthly runbook validation: Review and update DR procedures.
  • Continuous automated tests: Test individual components regularly.
  • Tabletop exercises: Simulate DR scenarios with stakeholders.

DR testing best practices

  • Document detailed DR runbooks with step-by-step procedures.
  • Test DR procedures during non-peak hours.
  • Validate data integrity after failover.
  • Measure actual RTO and RPO during tests.
  • Update procedures based on test results.
  • Train operations teams on DR procedures.
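Measuring actual RTO and RPO during a test reduces to two timestamp differences: the downtime window, and how far the recovered data lags behind the moment of the outage. A minimal helper (the timestamps are illustrative):

```python
from datetime import datetime, timedelta


def measure_rto_rpo(outage_start: datetime,
                    service_restored: datetime,
                    last_replicated_write: datetime):
    """Actual RTO is the downtime window; actual RPO is how far the
    recovered data lags behind the moment of the outage."""
    rto = service_restored - outage_start
    rpo = outage_start - last_replicated_write
    return rto, rpo
```

Comparing these measured values against the documented targets is what turns a DR test into evidence that the targets are achievable.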

HA and DR recommendations

Recommended

  • Use IaC (Terraform) to deploy infrastructure in both primary and secondary regions.
  • Deploy classic compute across multiple availability zones for HA.
  • Use serverless compute for automatic multi-zone failover.
  • Implement geo-redundant replication for raw data and data sources.
  • Use Delta Deep Clone to replicate critical tables across regions.
  • Use SCIM synchronization for consistent identity management across regions.
  • Document RTO and RPO requirements for all critical workloads.
  • Automate DR failover procedures where possible.
  • Test DR procedures regularly (at least quarterly).
  • Document detailed DR runbooks.

Evaluate based on requirements

  • Evaluate active-active DR for mission-critical workloads with stringent RTO requirements.
  • Balance DR investment with business criticality and RTO/RPO requirements.
  • Consider multi-cloud DR for highest resilience (but highest complexity and cost).

Phase 10 outcomes

After completing Phase 10, you should have:

  • High availability strategy designed (multi-zone deployment for classic compute).
  • Control plane HA characteristics understood (15-minute RTO, 0 RPO).
  • Disaster recovery strategy designed (active-passive, active-active, or backup/restore).
  • RTO and RPO requirements defined for critical workloads.
  • DR design patterns selected based on requirements.
  • DR testing strategy defined with regular testing schedule.
  • IaC approach defined for deploying to primary and secondary regions.
  • Data replication strategy designed (for example, geo-redundant storage, Delta Deep Clone).
  • Identity management strategy designed (SCIM synchronization).
  • DR runbooks documented with detailed procedures.

You have completed all phases of the Databricks deployment guide.