Best practices for reliability

This article covers best practices for reliability, organized by the architectural principles listed in the following sections.

1. Design for failure

Use Delta Lake

Delta Lake is an open source storage format that brings reliability to data lakes. Delta Lake provides ACID transactions, schema enforcement, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs. Delta Lake on Databricks allows you to configure Delta Lake based on your workload patterns. See What is Delta Lake?.
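
As a minimal sketch, the following PySpark snippet shows a Delta table behaving as described; the table name raw_events and the sample data are illustrative.

```python
# A minimal sketch of Delta Lake behavior with PySpark; table name and data are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Writing a DataFrame as a Delta table is an ACID transaction.
df = spark.createDataFrame([(1, "sensor-a"), (2, "sensor-b")], ["id", "source"])
df.write.format("delta").mode("overwrite").saveAsTable("raw_events")

# Appends with a mismatched schema are rejected by schema enforcement
# unless schema evolution is explicitly requested (mergeSchema).
more = spark.createDataFrame([(3, "sensor-c")], ["id", "source"])
more.write.format("delta").mode("append").saveAsTable("raw_events")

# Reads always see a consistent snapshot of the table.
spark.table("raw_events").show()
```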

Use Apache Spark or Photon for distributed compute

Apache Spark, as the compute engine of the Databricks lakehouse, is based on resilient distributed data processing. If an internal Spark task does not return a result as expected, Apache Spark automatically reschedules the missing tasks and continues executing the entire job. This is helpful for failures outside the code, such as a brief network issue or a revoked spot VM. Working with either the SQL API or the Spark DataFrame API comes with this resilience built into the engine.

In the Databricks lakehouse, Photon, a native vectorized engine written entirely in C++, provides high-performance compute that is compatible with Apache Spark APIs.

Automatically rescue invalid or nonconforming data

Invalid or nonconforming data can crash workloads that rely on an established data format. To increase the end-to-end resilience of the whole process, it is best practice to filter out invalid and nonconforming data at ingestion. Supporting rescued data ensures you never lose or miss out on data during ingest or ETL. The rescued data column contains any data that wasn’t parsed, either because it was missing from the given schema, because there was a type mismatch, or because the column casing in the record or file didn’t match that in the schema.

  • Databricks Auto Loader: Auto Loader is the ideal tool for streaming file ingestion and supports rescued data for JSON and CSV. For JSON, for example, any data that could not be parsed against the schema is captured in the rescued data column, which Auto Loader includes in the returned schema as _rescued_data by default when the schema is inferred (see the sketch after this list).

  • Delta Live Tables: Another option for building resilient workflows is Delta Live Tables with quality constraints. See Manage data quality with Delta Live Tables. Out of the box, Delta Live Tables supports three modes: retain, drop, and fail on invalid records. To quarantine identified invalid records, expectation rules can be defined so that invalid records are stored (“quarantined”) in another table. See Quarantine invalid data.
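
As a sketch, the following Auto Loader stream ingests JSON files and keeps nonconforming fields in _rescued_data instead of failing; the source, schema, and checkpoint paths and the bronze_events table are illustrative.

```python
# A minimal sketch of rescued data with Auto Loader; paths and table name are illustrative.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/main/default/checkpoints/events_schema")
    .load("/Volumes/main/default/landing/events")
)

(
    df.writeStream
    .option("checkpointLocation", "/Volumes/main/default/checkpoints/events")
    .trigger(availableNow=True)
    .toTable("bronze_events")  # schema includes _rescued_data for nonconforming records
)
```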

Configure jobs for automatic retries and termination

Distributed systems are complex, and a failure at one point can potentially cascade throughout the system. Databricks jobs support automatic retry policies that determine whether and how many times a failed task run is retried.

On the other hand, a task that hangs can prevent the whole job from finishing, thus incurring high costs. Databricks jobs support a timeout configuration to terminate jobs that take longer than expected.
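
As a sketch, both retries and timeouts can be set per task and per job, for example through the Databricks SDK for Python; the job name, notebook path, cluster ID, and values below are illustrative.

```python
# A minimal sketch of retry and timeout settings via the Databricks SDK for Python.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

w.jobs.create(
    name="nightly-etl",
    tasks=[
        jobs.Task(
            task_key="ingest",
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/etl/ingest"),
            existing_cluster_id="<cluster-id>",  # placeholder
            max_retries=3,                       # retry failed task runs automatically
            min_retry_interval_millis=60_000,    # wait one minute between retries
            retry_on_timeout=False,
            timeout_seconds=3600,                # terminate the task if it runs longer than 1 hour
        )
    ],
    timeout_seconds=7200,                        # terminate the whole job run after 2 hours
)
```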

Use a scalable and production-grade model serving infrastructure

For batch and streaming inference, use Databricks jobs and MLflow to deploy models as Apache Spark UDFs to leverage job scheduling, retries, autoscaling, and so on. See Use MLflow for model inference.
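
As a sketch, a registered MLflow model can be applied to a Delta table for batch scoring as follows; the model URI and table names are illustrative.

```python
# A minimal sketch of batch inference with an MLflow model as a Spark UDF.
import mlflow.pyfunc
from pyspark.sql.functions import struct

predict = mlflow.pyfunc.spark_udf(
    spark,
    model_uri="models:/churn_model/Production",  # illustrative registry URI
    result_type="double",
)

features = spark.table("silver_customer_features")
scored = features.withColumn("prediction", predict(struct(*features.columns)))
scored.write.mode("overwrite").saveAsTable("gold_churn_predictions")
```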

Model serving provides a scalable and production-grade real-time model serving infrastructure. It deploys your machine learning models using MLflow and exposes them as REST API endpoints. This functionality uses serverless compute, which means that the endpoints and associated compute resources are managed and run in the Databricks cloud account.

Use managed services where possible

Leverage managed services of the Databricks Data Intelligence Platform, such as serverless compute, model serving, or Delta Live Tables, where possible. Databricks operates these services in a reliable and scalable way at no extra effort for the customer, making workloads more reliable.

2. Manage data quality

Use a layered storage architecture

Curate data by creating a layered architecture and ensuring data quality increases as data moves through the layers. A common layering approach is:

  • Raw layer (bronze): Source data is ingested into this first layer of the lakehouse and should be persisted there. When all downstream data is created from the raw layer, rebuilding the subsequent layers from this layer is possible if needed.

  • Curated layer (silver): The purpose of the second layer is to hold cleansed, refined, filtered and aggregated data. The goal of this layer is to provide a sound, reliable foundation for analyses and reports across all roles and functions.

  • Final layer (gold): The third layer is created around business or project needs. It provides different views of the data as data products to other business units or projects, prepared around security needs (such as anonymized data) or optimized for performance (such as with preaggregated views). The data products in this layer are seen as the truth for the business.

The final layer should only contain high-quality data and can be fully trusted from a business point of view.
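
As a sketch, the flow through the layers can look like the following; the table names, cleansing rules, and aggregation are illustrative.

```python
# A minimal sketch of a bronze -> silver -> gold flow on Delta tables.

# Bronze: persist the source data as ingested.
raw = spark.read.format("json").load("/Volumes/main/default/landing/orders")
raw.write.format("delta").mode("append").saveAsTable("bronze_orders")

# Silver: cleanse and filter into a reliable foundation for analysis.
silver = (
    spark.table("bronze_orders")
    .dropDuplicates(["order_id"])
    .filter("order_id IS NOT NULL AND amount >= 0")
)
silver.write.format("delta").mode("overwrite").saveAsTable("silver_orders")

# Gold: aggregate into a business-facing data product.
gold = spark.table("silver_orders").groupBy("region").sum("amount")
gold.write.format("delta").mode("overwrite").saveAsTable("gold_revenue_by_region")
```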

Improve data integrity by reducing data redundancy

Copying or duplicating data creates redundancy and leads to lost integrity, lost data lineage, and often different access permissions, which decreases the quality of the data in the lakehouse. A temporary or throwaway copy of data is not harmful on its own; it is sometimes necessary for boosting agility, experimentation, and innovation. However, when these copies become operational and are regularly used for business decisions, they turn into data silos. When data silos get out of sync, data integrity and quality suffer significantly, raising questions such as “Which data set is the master?” or “Is the data set up to date?”.

Actively manage schemas

Uncontrolled schema changes can lead to invalid data and failing jobs that use these data sets. Databricks has several methods to validate and enforce the schema:

  • Delta Lake supports schema validation and schema enforcement by automatically handling schema variations to prevent the insertion of bad records during ingestion.

  • Auto Loader detects the addition of new columns as it processes your data. By default, the addition of a new column causes your streams to stop with an UnknownFieldException. Auto Loader supports several modes for schema evolution.
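
As a sketch, the following combines Auto Loader schema evolution with Delta schema handling; the paths, table name, and chosen mode are illustrative.

```python
# A minimal sketch of managing schema changes during ingestion.
# "rescue" keeps the stream running and routes new or mismatched fields to
# _rescued_data; the default "addNewColumns" stops the stream once so the
# schema can evolve.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/main/default/checkpoints/orders_schema")
    .option("cloudFiles.schemaEvolutionMode", "rescue")
    .load("/Volumes/main/default/landing/orders")
)

# Delta Lake rejects mismatched appends unless schema evolution is explicitly
# requested with mergeSchema.
(
    df.writeStream
    .option("checkpointLocation", "/Volumes/main/default/checkpoints/orders")
    .option("mergeSchema", "true")
    .toTable("bronze_orders")
)
```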

Use constraints and data expectations

Delta tables support standard SQL constraint management clauses that ensure that the quality and integrity of data added to a table are automatically verified. When a constraint is violated, Delta Lake throws an InvariantViolationException error to signal that the new data can’t be added. See Constraints on Databricks.
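
As a sketch, constraints are added with standard SQL clauses; the table, column, and constraint names below are illustrative.

```python
# A minimal sketch of Delta constraints, run from Python via spark.sql.
# Require a column to be non-null.
spark.sql("ALTER TABLE silver_orders ALTER COLUMN order_id SET NOT NULL")

# Add a CHECK constraint; writes that violate it fail instead of
# silently adding bad records.
spark.sql(
    "ALTER TABLE silver_orders ADD CONSTRAINT positive_amount CHECK (amount >= 0)"
)
```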

To further improve this handling, Delta Live Tables supports expectations: expectations define data quality constraints on the contents of a data set. An expectation consists of a description, an invariant, and an action to take when a record fails the invariant. You apply expectations to queries using Python decorators or SQL constraint clauses. See Manage data quality with Delta Live Tables.
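
As a sketch, expectations in Python look like the following; the dataset names and rules are illustrative, and the code runs only as part of a Delta Live Tables pipeline.

```python
# A minimal sketch of Delta Live Tables expectations in Python.
import dlt

@dlt.table
@dlt.expect("valid_timestamp", "event_time IS NOT NULL")       # retain and report violations
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")  # drop invalid records
@dlt.expect_or_fail("positive_amount", "amount >= 0")          # fail the update on violation
def silver_orders():
    return dlt.read_stream("bronze_orders")
```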

Take a data-centric approach to machine learning

Feature engineering, training, inference, and monitoring pipelines are data pipelines. They must be as robust as other production data engineering processes. Data quality is crucial in any ML application, so ML data pipelines should employ systematic approaches to monitoring and mitigating data quality issues. Avoid tools that make it challenging to join data from ML predictions, model monitoring, and so on, with the rest of your data. The simplest way to achieve this is to develop ML applications on the same platform used to manage production data. For example, instead of downloading training data to a laptop, where it is hard to govern and reproduce results, secure the data in cloud storage and make that storage available to your training process.

3. Design for autoscaling

Enable autoscaling for batch workloads

Autoscaling allows clusters to resize automatically based on workloads. Autoscaling can benefit many use cases and scenarios from both a cost and a performance perspective. The documentation provides considerations for determining whether to use autoscaling and how to get the most benefit from it.

For streaming workloads, Databricks recommends using Delta Live Tables with autoscaling. See Use autoscaling to increase efficiency and reduce resource usage.
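
As a sketch, autoscaling is configured by setting a minimum and maximum number of workers on the cluster; the cluster name, Spark version, node type (cloud-specific), and worker counts below are illustrative.

```python
# A minimal sketch of an autoscaling cluster via the Databricks SDK for Python.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute

w = WorkspaceClient()

w.clusters.create(
    cluster_name="batch-etl",
    spark_version="15.4.x-scala2.12",                           # illustrative
    node_type_id="i3.xlarge",                                   # illustrative, cloud-specific
    autoscale=compute.AutoScale(min_workers=2, max_workers=8),  # resize between 2 and 8 workers
).result()  # wait until the cluster is running
```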

Enable autoscaling for SQL warehouse

The scaling parameter of a SQL warehouse sets the minimum and the maximum number of clusters over which queries sent to the warehouse are distributed. The default is a minimum of one and a maximum of one cluster.

To handle more concurrent users for a given warehouse, increase the cluster count. To learn how Databricks adds clusters to and removes clusters from a warehouse, see SQL warehouse sizing, scaling, and queuing behavior.
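
As a sketch, the scaling range can be set when creating or editing a warehouse; the warehouse name, size, and cluster counts below are illustrative.

```python
# A minimal sketch of SQL warehouse scaling via the Databricks SDK for Python.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

w.warehouses.create(
    name="bi-warehouse",
    cluster_size="Medium",
    min_num_clusters=1,
    max_num_clusters=4,   # additional clusters absorb more concurrent queries
    auto_stop_mins=30,
).result()  # wait until the warehouse is available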

Use Delta Live Tables enhanced autoscaling

Databricks enhanced autoscaling optimizes cluster utilization by automatically allocating cluster resources based on workload volume, with minimal impact on the data processing latency of your pipelines.
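
As a sketch, enhanced autoscaling is enabled in the pipeline's cluster settings; the worker counts below are illustrative, and the settings are shown here as a Python dict mirroring the pipeline JSON.

```python
# A minimal sketch of enhanced autoscaling in Delta Live Tables pipeline settings.
pipeline_cluster_settings = {
    "clusters": [
        {
            "label": "default",
            "autoscale": {
                "min_workers": 1,
                "max_workers": 5,
                "mode": "ENHANCED",  # enhanced autoscaling instead of legacy
            },
        }
    ]
}
```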

4. Test recovery procedures

Create regular backups

To recover from a failure, regular backups must be available. The Databricks Labs project migrate allows workspace admins to create backups by exporting most of the assets of their workspaces (the tool uses the Databricks CLI/API in the background). See Databricks Migration Tool. Backups can be used either for restoring workspaces or for importing into a new workspace in case of a migration.

Recover from Structured Streaming query failures

Structured Streaming provides fault-tolerance and data consistency for streaming queries. Using Databricks workflows, you can easily configure your Structured Streaming queries to restart on failure automatically. The restarted query continues where the failed one left off. See Recover from Structured Streaming query failures with workflows.
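
As a sketch, a restartable query only needs a checkpoint location; the table names and checkpoint path below are illustrative. When the job retries the task, the query resumes from the offsets and state recorded in the checkpoint.

```python
# A minimal sketch of a fault-tolerant Structured Streaming query.
(
    spark.readStream.table("bronze_events")
    .groupBy("source")
    .count()
    .writeStream
    .outputMode("complete")
    .option("checkpointLocation", "/Volumes/main/default/checkpoints/event_counts")
    .toTable("gold_event_counts")
)
```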

Recover ETL jobs based on Delta time travel

Despite thorough testing, a job in production can fail or produce unexpected or even invalid data. Sometimes this can be fixed with an additional job after understanding the source of the issue and fixing the pipeline that caused it in the first place. However, often this is not straightforward, and the respective job should be rolled back. Delta time travel allows users to easily roll back changes to an older version or timestamp, repair the pipeline, and restart the fixed pipeline. See What is Delta Lake time travel?.

A convenient way to do so is the RESTORE command.
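
As a sketch, a rollback can look like the following; the table name, version number, and use of version-based (rather than timestamp-based) travel are illustrative.

```python
# A minimal sketch of rolling back a Delta table with time travel.
# Inspect the table history to find the last good version.
spark.sql("DESCRIBE HISTORY gold_revenue_by_region").show()

# Verify the older snapshot before restoring.
spark.sql("SELECT * FROM gold_revenue_by_region VERSION AS OF 5").show()

# Roll the table back, then repair and restart the pipeline.
spark.sql("RESTORE TABLE gold_revenue_by_region TO VERSION AS OF 5")
```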

Use Databricks Workflows and built-in recovery

Databricks Workflows are built for recovery. When a task in a multi-task job fails (and, as a result, all dependent tasks), Databricks Workflows provide a matrix view of the runs that lets you examine the issue that led to the failure. See View runs for a job. Whether it was a short network issue or a real issue in the data, you can fix it and start a repair run in Databricks Workflows. The repair run executes only the failed and dependent tasks and keeps the successful results from the earlier run, saving time and money.

Configure a disaster recovery pattern

A clear disaster recovery pattern is critical for a cloud-native data analytics platform like Databricks. For some companies, it is critical that their data teams can use the Databricks platform even in the rare case of a regional, service-wide outage of the cloud service provider, whether caused by a regional disaster like a hurricane or earthquake or by another source.

Databricks is often a core part of an overall data ecosystem that includes many services, including upstream data ingestion services (batch/streaming), cloud-native storage, downstream tools and services such as business intelligence apps, and orchestration tooling. Some of your use cases might be particularly sensitive to a regional service-wide outage.

Disaster recovery involves a set of policies, tools, and procedures that enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster. A large cloud service like Azure, AWS, or GCP serves many customers and has built-in guards against a single failure. For example, a region is a group of buildings connected to different power sources to guarantee that a single power loss will not shut down a region. However, cloud region failures can happen, and the degree of disruption and its impact on your organization can vary. See Disaster recovery.

Essential parts of a disaster recovery strategy are selecting a strategy (active/active or active/passive), selecting the right toolset, and testing both failover and restore.

5. Automate deployments and workloads

In the Operational excellence article, see Operational Excellence - Automate deployments and workloads.

6. Set up monitoring, alerting, and logging

In the Operational excellence best practices article, see Operational Excellence - Set up monitoring, alerting, and logging.