Best practices for reliability
This article covers best practices for reliability organized by architectural principles listed in the following sections.
1. Design for failure
Use a data format that supports ACID transactions
ACID transactions are a critical feature for maintaining data integrity and consistency. Choosing a data format that supports ACID transactions helps build pipelines that are simpler and much more reliable.
Delta Lake is an open source storage framework that provides ACID transactions as well as schema enforcement, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake is fully compatible with the Apache Spark APIs and is designed for tight integration with structured streaming, allowing you to easily use a single copy of data for both batch and streaming operations and provide incremental processing at scale.
Use a resilient distributed data engine for all workloads
Apache Spark, as the compute engine of the Databricks lakehouse, is based on resilient distributed data processing. If an internal Spark task does not return a result as expected, Apache Spark automatically reschedules the missing tasks and continues to execute the entire job. This is helpful for out-of-code failures, such as a brief network problem or a revoked Spot VM. Working with both the SQL API and the Spark DataFrame API, this resiliency is built into the engine.
In the Databricks lakehouse, Photon, a native vectorized engine entirely written in C++, is a high performance compute compatible with Apache Spark APIs.
Automatically rescue invalid or nonconforming data
Invalid or nonconforming data can cause workloads that rely on an established data format to crash. To increase the end-to-end resiliency of the entire process, it is best practice to filter out invalid and nonconforming data at ingest. Support for rescued data ensures that you never lose or miss data during ingest or ETL. The rescued data column contains any data that was not parsed, either because it was missing from the given schema, because there was a type mismatch, or because the column body in the record or file did not match that in the schema.
Databricks Auto Loader: Auto Loader is the ideal tool for streaming the file ingestion. It supports rescued data for JSON and CSV. For example, for JSON, the rescued data column contains any data that was not parsed, possibly because it was missing from the given schema, because there was a type mismatch, or because the casing of the column did not match. The rescued data column is part of the schema returned by Auto Loader as
_rescued_data
by default when the schema is being inferred.Delta Live Tables: Another option to build workflows for resilience is to use Delta Live Tables with quality constraints. See Manage data quality with pipeline expectations. Delta Live Tables supports three modes by default: retain, drop, and fail on invalid records. To quarantine identified invalid records, expectation rules can be defined in a specific way so that invalid records are stored (“quarantined”) in another table. See Quarantine invalid records.
Configure jobs for automatic retries and termination
Distributed systems are complex, and a failure at one point can potentially cascade throughout the system.
Databricks jobs support retry policies for tasks that determines when and how many times failed runs are retried. See Set a retry policy.
You can configure optional duration thresholds for a task, including an expected completion time for the task and a maximum completion time for the task.
Delta Live Tables also automates failure recovery by using escalating retries to balance speed and reliability. See Development and production modes.
On the other hand, a hanging task can prevent the entire job from completing, resulting in high costs. Databricks jobs support timeout configuration to kill jobs that take longer than expected.
Use scalable and production-grade model serving infrastructure
For batch and streaming inference, use Databricks jobs and MLflow to deploy models as Apache Spark UDFs to leverage job scheduling, retries, autoscaling, and so on. See Deploy models for batch inference and prediction.
Model serving provides a scalable and production-grade infrastructure for real-time model serving. It processes your machine learning models using MLflow and exposes them as REST API endpoints. This functionality uses serverless compute, which means that the endpoints and associated compute resources are managed and run in the Databricks cloud account.
Use managed services where possible
Leverage managed services (serverless compute) from the Databricks Data Intelligence Platform, such as:
These services are operated by Databricks in a reliable and scalable manner at no additional cost to the customer, making workloads more reliable.
2. Manage data quality
Use a layered storage architecture
Curate data by creating a layered architecture and ensuring that data quality increases as data moves through the layers. A common layering approach is:
Raw layer (bronze): Source data gets ingested into the lakehouse into the first layer and should be persisted there. When all downstream data is created from the raw layer, it is possible to rebuild subsequent layers from this layer as needed.
Curated layer (silver): The purpose of the second layer is to hold cleansed, refined, filtered and aggregated data. The goal of this layer is to provide a solid, reliable foundation for analysis and reporting across all roles and functions.
Final layer (gold): The third layer is built around business or project needs. It provides a different view as data products to other business units or projects, preparing data around security needs (such as anonymized data) or optimizing for performance (such as with pre-aggregated views). The data products in this layer are considered as the truth for the business.
The final layer should contain only high quality data and be fully trusted from a business perspective.
Improve data integrity by reducing data redundancy
Copying or duplicating data creates data redundancy and leads to loss of integrity, loss of data lineage, and often different access permissions. This reduces the quality of the data in the lakehouse.
A temporary or disposable copy of data is not harmful in itself - it is sometimes necessary to increase agility, experimentation, and innovation. However, when these copies become operational and are regularly used to make business decisions, they become data silos. When these data silos become out of sync, it has a significant negative impact on data integrity and quality, raising questions such as “Which data set is the master?” or “Is the data set current?
Actively manage schemas
Uncontrolled schema changes can lead to invalid data and failing jobs that use these data sets. Databricks has several methods to validate and enforce the schema:
Delta Lake supports schema validation and schema enforcement by automatically handling schema variations to prevent the insertion of bad records during ingestion. See Schema enforcement.
Auto Loader detects the addition of new columns as it processes your data. By default, the addition of a new column causes your streams to stop with an
UnknownFieldException
. Auto Loader supports several modes for schema evolution.
Use constraints and data expectations
Delta tables support standard SQL constraint management clauses that ensure that the quality and integrity of data added to a table is automatically checked. When a constraint is violated, Delta Lake throws an InvariantViolationException
error to signal that the new data can’t be added. See Constraints on Databricks.
To further improve this handling, Delta Live Tables supports expectations: Expectations define data quality constraints on the contents of a data set. An expectation consists of a description, an invariant, and an action to be taken if a record violates the invariant. Expectations on queries use Python decorators or SQL constraint clauses. See Manage data quality with pipeline expectations.
Take a data-centric approach to machine learning
A guiding principle that remains at the core of the AI vision for the Databricks Data Intelligence Platform is a data-centric approach to machine learning. As generative AI becomes more prevalent, this perspective remains just as important.
The core components of any ML project can simply be thought of as data pipelines: Feature engineering, training, model deployment, inference, and monitoring pipelines are all data pipelines. As such, operationalizing an ML solution requires merging data from prediction, monitoring, and feature tables with other relevant data. Fundamentally, the easiest way to achieve this is to develop AI-powered solutions on the same platform used to manage production data. See Data-centric MLOps and LLMOps
3. Design for autoscaling
Enable autoscaling for ETL workloads
Autoscaling allows clusters to automatically resize based on workloads. Autoscaling can benefit many use cases and scenarios from both a cost and performance perspective. The documentation provides considerations for determining whether to use autoscaling and how to get the most benefit.
For streaming workloads, Databricks recommends using Delta Live Tables with autoscaling. Databricks enhanced autoscaling optimizes cluster utilization by automatically allocating cluster resources based on workload volume, with minimal impact on the data processing latency of your pipelines.
Enable autoscaling for SQL warehouse
The scaling parameter of a SQL warehouse sets the minimum and the maximum number of clusters over which queries sent to the warehouse are distributed. The default is one one cluster without autoscaling.
To handle more concurrent users for a given warehouse, increase the number of clusters. To learn how Databricks adds clusters to and removes clusters from a warehouse, see SQL warehouse sizing, scaling, and queuing behavior.
4. Test recovery procedures
Recover from Structured Streaming query failures
Structured Streaming provides fault tolerance and data consistency for streaming queries. Using Databricks Jobs, you can easily configure your Structured Streaming queries to automatically restart on failure. By enabling checkpointing for a streaming query, you can restart the query after a failure. The restarted query will continue from where the failed query left off. See Structured Streaming checkpoints and Production considerations for Structured Streaming.
Recover ETL jobs using data time travel capabilities
Despite thorough testing, a job may fail in production or produce unexpected, even invalid, data. Sometimes this can be fixed with an additional job after understanding the source of the problem and fixing the pipeline that caused the problem in the first place. However, often this is not straightforward and the job in question should be rolled back. Using Delta Time travel, users can easily roll back changes to an older version or timestamp, repair the pipeline, and restart the fixed pipeline.
A convenient way to do so is the `RESTORE` command.
Leverage a job automation framework with built-in recovery
Databricks Jobs are designed for recovery. When a task in a multi-task job fails (and, as such, all dependent tasks), Databricks Jobs provide a matrix view of the runs that allows you to investigate the problem that caused the failure, see View runs for a single job. Whether it was a short network issue or a real issue in the data, you can fix it and start a repair run in Databricks Jobs. It will run only the failed and dependent tasks and keep the successful results from the earlier run, saving time and money, see Troubleshoot and repair job failures
Configure a disaster recovery pattern
For a cloud-native data analytics platform like Databricks, a clear disaster recovery pattern is critical. It’s critical that your data teams can use the Databricks platform even in the rare event of a regional, service-wide outage of a cloud service provider, whether caused by a regional disaster such as a hurricane, earthquake, or some other source.
Databricks is often a core part of an overall data ecosystem that includes many services, including upstream data ingestion services (batch/streaming), cloud-native storage such as Amazon S3, downstream tools and services such as business intelligence apps, and orchestration tooling. Some of your use cases may be particularly sensitive to a regional service-wide outage.
Disaster recovery involves a set of policies, tools, and procedures that enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster. A large cloud service such as AWS serves many customers and has built-in protections against a single failure. For example, a region is a group of buildings connected to different power sources to ensure that a single power outage will not bring down a region. However, cloud region failures can occur, and the severity of the failure and its impact on your business can vary.
5. Automate deployments and workloads
See Operational Excellence - Automate deployments and workloads.
6. Monitor systems and workloads
See Operational Excellence - Set up monitoring, alerting, and logging.