Best practices for operational excellence

This article covers best practices for operational excellence, organized by the architectural principles listed in the following sections.

1. Optimize build and release processes

Create a dedicated Lakehouse operations team

A common best practice is to have a platform operations team to enable data teams to work on one or more data platforms. This team is responsible for creating blueprints and best practices internally. They provide tools - for example, for infrastructure automation and self-service access - and ensure that security and compliance requirements are met. This puts the burden of securing platform data on a central team, allowing distributed teams to focus on working with data and generating new insights.

Use Enterprise source code management (SCM)

Source code management (SCM) helps developers work more effectively, which can lead to faster release velocity and lower development costs. Having a tool that helps track changes, maintain code integrity, detect bugs, and roll back to previous versions is an important component of your overall solution architecture.

Databricks Git folders allow users to store notebooks or other files in a Git repository, providing features like cloning a repository, committing and pushing, pulling, branch management, and viewing file diffs. Use Git folders for better code visibility and tracking.

Standardize DevOps processes (CI/CD)

Continuous integration and continuous delivery (CI/CD) refers to developing and deploying software in short, frequent cycles using automated pipelines. While this is not a new process, having been ubiquitous in traditional software engineering for decades, it is becoming an increasingly necessary process for data engineering and data science teams. For data products to be valuable, they must be delivered in a timely manner. In addition, consumers must have confidence in the validity of the results within these products. By automating the process of building, testing, and deploying code, development teams can deliver releases more frequently and reliably than with the manual processes that still dominate many data engineering and data science teams. See What is CI/CD on Databricks?.

For more information on best practices for code development using Databricks Git folders, see CI/CD techniques with Git and Databricks Git folders (Repos). Together with the Databricks REST API, you can create automated deployment processes using GitHub Actions, Azure DevOps pipelines, or Jenkins jobs.
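For example, a CI/CD pipeline stage can call the Jobs REST API to trigger a deployed job after tests pass. The following is a minimal Python sketch; the environment variables and the job ID are placeholders for values supplied by your CI system.

```python
import os
import requests

# Placeholders: supplied by the CI system's secrets and configuration.
DATABRICKS_HOST = os.environ["DATABRICKS_HOST"]    # e.g. https://<your-workspace-url>
DATABRICKS_TOKEN = os.environ["DATABRICKS_TOKEN"]  # personal access token or service principal token
JOB_ID = int(os.environ["DEPLOY_JOB_ID"])          # ID of the job created by the deployment step

# Trigger a run of an existing job via the Jobs REST API (jobs/run-now).
response = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    json={"job_id": JOB_ID},
    timeout=60,
)
response.raise_for_status()
print("Triggered run:", response.json()["run_id"])
```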

Standardize MLOps processes

MLOps processes provide reproducibility of ML pipelines, enabling more tightly coupled collaboration across data teams, reducing conflict with DevOps and IT, and accelerating release velocity. As many models are used to drive key business decisions, standardizing MLOps processes ensures that models are developed, tested, and deployed consistently and reliably.

Building and deploying ML models is complex. There are many options available to achieve this, but little in the way of well-defined standards. As a result, over the past few years, we have seen the emergence of machine learning operations (MLOps). MLOps is a set of processes and automation for managing models, data, and code to improve performance, stability, and long-term efficiency in ML systems. It covers data preparation, exploratory data analysis (EDA), feature engineering, model training, model validation, deployment, and monitoring.

MLOps on the Databricks platform can help you optimize the performance and long-term efficiency of your machine learning (ML) system:

  • Always keep your business goals in mind: Just as the core purpose of ML in a business is to enable data-driven decisions and products, the core purpose of MLOps is to ensure that those data-driven applications remain stable, are kept up to date and continue to have positive impacts on the business. When prioritizing technical work on MLOps, consider the business impact: Does it enable new business use cases? Does it improve data teams’ productivity? Does it reduce operational costs or risks?

  • Manage ML models with a specialized but open tool: You can use MLflow, which is designed for the ML model lifecycle, to track and manage ML models (see the sketch after this list). See ML lifecycle management using MLflow.

  • Implement MLOps in a modular fashion: As with any software application, code quality is paramount for an ML application. Modularized code enables testing of individual components and mitigates difficulties with future code refactoring. Define clear steps (like training, evaluation, or deployment), super steps (like training-to-deployment pipeline), and responsibilities to clarify the modular structure of your ML application.
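As a minimal illustration of the MLflow-based tracking mentioned in the list above, the following sketch logs parameters, a metric, and a model artifact for a stand-in scikit-learn training run; the dataset and model are placeholders for illustration only.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data and model for illustration only.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run(run_name="rf-baseline"):
    params = {"n_estimators": 100, "max_depth": 5}
    model = RandomForestClassifier(**params, random_state=42).fit(X_train, y_train)

    # Track parameters, metrics, and the model artifact for reproducibility.
    mlflow.log_params(params)
    mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, artifact_path="model")
```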

This is described in detail in the Databricks ebook The Big Book of MLOps.

Define environment isolation strategy

When an organization uses a data platform like Databricks, there is often a need to have data isolation boundaries between environments (such as development and production) or between organizational operating units.

Isolation standards may vary for your organization, but typically they include the following expectations:

  • Users can only gain access to data based on specified access rules.

  • Data can be managed only by designated people or teams.

  • Data is physically separated in storage.

  • Data can be accessed only in designated environments.

In Databricks, the workspace is the primary data processing environment and there are several scenarios where separate workspaces improve the overall setup, for example:

  • Isolate different business units with their own workspaces to avoid sharing the workspace administrator and to ensure that no assets in Databricks are shared unintentionally between business units.

  • Isolate software development lifecycle environments (such as development, staging, and production). For example, a separate production workspace allows you to test new workspace settings before applying them to production. Or the production environment might require more stringent workspace settings than the development environment. If you must deploy development, staging, and production environments on different virtual networks, you also need different workspaces for the three environments.

  • Split workspaces to overcome resource limitations: Cloud accounts/subscriptions have resource limitations. Splitting workspaces into different subscriptions/accounts is one way to ensure that enough resources are available for each workspace. In addition, Databricks workspaces also have resource limitations. Splitting workspaces ensures that workloads in each workspace always have access to the full set of resources.

However, there are some drawbacks to using multiple workspaces that should be considered as well:

  • Notebook collaboration does not work across workspaces.

  • If you separate development, staging, and production environments into their own workspaces and also separate business units by workspace, consider the limit on the number of workspaces.

  • For multiple workspaces, both setup and maintenance need to be fully automated (by Terraform, ARM, REST API, or other means). This is especially important for migration purposes.

  • If each workspace needs to be secured at the network layer (for example, to protect against data exfiltration), the required network infrastructure can be very expensive, especially for large numbers of workspaces.

It is important to balance the need for isolation against the need for collaboration and the effort required to maintain it.

Define catalog strategy for your enterprise

Along with an environment isolation strategy, organizations need a strategy for structuring and separating metadata and data. Data, including personally identifiable information, payment, or health information, carries a high potential risk, and with the ever-increasing threat of data breaches, it is important to separate and protect sensitive data no matter which organizational strategy you choose. Separate your sensitive data from non-sensitive data, both logically and physically.

An organization can require that certain types of data be stored in specific accounts or buckets in its cloud tenant. The Unity Catalog metastore allows metadata to be structured by its three-level catalog > schema > tables/views/volumes namespace, with storage locations configured at the metastore, catalog, or schema level to meet such requirements.
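For example, a catalog or schema can be pinned to a specific storage location so that its managed data lands in the required bucket. The following is a minimal sketch run from a Databricks notebook (where `spark` is predefined); the catalog, schema, group, and storage paths are hypothetical.

```python
# Hypothetical names and storage paths for illustration only.
# A catalog whose managed tables must live in a specific, compliant bucket.
spark.sql("""
  CREATE CATALOG IF NOT EXISTS finance_prod
  MANAGED LOCATION 's3://corp-finance-prod-bucket/unity/'
""")

# A schema within that catalog can override the location if required.
spark.sql("""
  CREATE SCHEMA IF NOT EXISTS finance_prod.payments
  MANAGED LOCATION 's3://corp-finance-pci-bucket/unity/payments/'
""")

# Grant access only to the designated team, in line with the isolation rules above.
spark.sql(
    "GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG finance_prod TO `finance-analysts`"
)
```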

Organizational and compliance requirements often dictate that you keep certain data only in certain environments. You may also want to keep production data isolated from development environments, or ensure that certain data sets and domains are never merged. In Databricks, the workspace is the primary computing environment and catalogs are the primary data domain. Using the Unity Catalog metastore, administrators and catalog owners can bind catalogs to specific workspaces. These environment-aware bindings help you to ensure that only certain catalogs are available within a workspace, regardless of the specific data object permissions granted to a user.

For a full discussion of these topics, see Unity Catalog best practices.

2. Automate deployments and workloads

Use infrastructure as code (IaC) for deployments and maintenance

Infrastructure as code (IaC) enables developers and operations teams to automatically manage, monitor, and provision resources, instead of manually configuring hardware devices, operating systems, applications, and services.

HashiCorp Terraform is a popular open source tool for creating secure and predictable cloud infrastructure across several cloud providers. The Databricks Terraform provider lets you manage Databricks workspaces and the associated cloud infrastructure with this flexible, powerful tool. The goal of the Databricks Terraform provider is to support all Databricks REST APIs, automating the most complicated aspects of deploying and managing your data platforms. The Databricks Terraform provider is the recommended tool for reliably deploying and managing clusters and jobs, provisioning Databricks workspaces, and configuring data access.

Standardize compute configurations

Standardizing computing environments ensures that the same software, libraries, and configurations are used across all environments. This consistency makes it easier to reproduce results, debug problems, and maintain systems across environments. With standardized environments, teams can save time and resources by eliminating the need to configure and set up environments from scratch. This also reduces the risk of errors and inconsistencies that can occur during manual setup. Standardization also enables the implementation of consistent security policies and practices across all environments. This can help organizations better manage risk and comply with regulatory requirements. Finally, standardization can help organizations better manage costs by reducing waste and optimizing resource utilization.

Standardization covers both environment setup and ongoing resource management. For consistent setup, Databricks recommends using infrastructure as code. To ensure that compute resources launched over time are configured consistently, use compute policies. Databricks workspace administrators can limit a user’s or group’s compute creation privileges based on a set of policy rules. They can enforce Spark configuration settings and cluster-scoped library installations. You can also use compute policies to define T-shirt size clusters (S, M, L) for projects as a standard work environment.
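For example, a policy can restrict node types, cap autoscaling, and fix Spark settings. The following sketch defines a hypothetical medium T-shirt-size policy and creates it through the Cluster Policies REST API; the policy name, attribute values, and environment variables are illustrative.

```python
import json
import os
import requests

# Illustrative policy: restrict node types, cap cluster size, and fix selected settings.
medium_policy = {
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},
    "autoscale.min_workers": {"type": "fixed", "value": 2},
    "autoscale.max_workers": {"type": "range", "maxValue": 8},
    "autotermination_minutes": {"type": "fixed", "value": 60, "hidden": True},
}

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

# Create the policy; workspace admins can then grant groups permission to use it.
resp = requests.post(
    f"{host}/api/2.0/policies/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json={"name": "project-x-medium", "definition": json.dumps(medium_policy)},
    timeout=60,
)
resp.raise_for_status()
print("Created policy:", resp.json()["policy_id"])
```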

Use automated workflows for jobs

Setting up automated workflows for jobs can help reduce unnecessary manual tasks and improve productivity through the DevOps process of creating and deploying jobs. The Data Intelligence Platform provides two ways to do this:

  • Databricks Workflows:

    Databricks Workflows orchestrates data processing, machine learning, and analytics pipelines on the Databricks Data Intelligence Platform. It is a fully managed orchestration service integrated with the Databricks platform:

    • Databricks Jobs are a way to run your data processing and analytics applications in a Databricks workspace. Your job can be a single task or a large, multi-task workflow with complex dependencies. Databricks manages the task orchestration, cluster management, monitoring, and error reporting for all of your jobs.

    • Delta Live Tables is a declarative framework for building reliable, maintainable, and testable data processing pipelines. You define the transformations you want to perform on your data and Delta Live Tables manages task orchestration, cluster management, monitoring, data quality, and error handling.

  • External orchestrators:

    External workflow engines use the comprehensive Databricks REST API to orchestrate Databricks assets, notebooks, and jobs. See, for example, Apache Airflow.

We recommend using Databricks Workflows for all task dependencies in Databricks and, if needed, integrating these encapsulated workflows into the external orchestrator.
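For example, an Airflow DAG can trigger an existing Databricks job, which itself encapsulates the task dependencies, rather than re-implementing them in the orchestrator. The following is a minimal sketch assuming the apache-airflow-providers-databricks package and Airflow 2.4+; the connection ID and job ID are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

# Placeholder values: the Airflow connection and the Databricks job ID are environment-specific.
with DAG(
    dag_id="trigger_databricks_workflow",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Trigger an existing Databricks job; task orchestration inside the job
    # stays in Databricks Workflows.
    run_nightly_pipeline = DatabricksRunNowOperator(
        task_id="run_nightly_pipeline",
        databricks_conn_id="databricks_default",
        job_id=12345,
    )
```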

Use automated and event-driven file ingestion

Event-driven (vs. schedule-driven) file ingestion has several benefits, including efficiency, increased data freshness, and real-time data ingestion. Running a job only when an event occurs ensures that you don’t waste resources, which saves money.

Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage. It can ingest many file formats, such as JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE: point it at an input folder on cloud storage and it automatically processes new files as they arrive.
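A minimal sketch of an Auto Loader stream in a Databricks notebook (where `spark` is predefined); the source path, checkpoint path, and target table are placeholders.

```python
# Placeholder paths and table name for illustration.
source_path = "s3://my-bucket/landing/orders/"
checkpoint_path = "s3://my-bucket/_checkpoints/orders/"

# Incrementally ingest new JSON files as they arrive and append them to a Delta table.
query = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .load(source_path)
    .writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)  # process all available files, then stop
    .toTable("main.bronze.orders_raw")
)
```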

For one-off ingestions, consider using the `COPY INTO` command instead.

Use ETL frameworks for data pipelines

While it is possible to perform ETL tasks manually, there are many benefits to using a framework. A framework brings consistency and repeatability to the ETL process. By providing pre-built functions and tools, a framework can automate common tasks, saving time and resources. ETL frameworks can handle large volumes of data and can be easily scaled up or down as needed. This makes it easier to manage resources and respond to changing business needs. Many frameworks include built-in error handling and logging capabilities, making it easier to identify and resolve problems. And they often include data quality checks and validations to ensure that data meets certain standards before it is loaded into the data warehouse or data lake.

Delta Live Tables is a declarative framework for building reliable, maintainable, and testable data processing pipelines. You define the transformations you want to perform on your data, and Delta Live Tables handles task orchestration, cluster management, monitoring, data quality, and error handling.

With Delta Live Tables, you can define end-to-end data pipelines in SQL or Python: Specify the data source, transformation logic, and target state of the data. Delta Live Tables maintains dependencies and automatically determines the infrastructure on which to run the job.

To manage data quality, Delta Live Tables monitors data quality trends over time and prevents bad data from entering tables through validation and integrity checks with predefined error policies. See What is Delta Live Tables?.
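A minimal sketch of a Delta Live Tables pipeline defined in Python; the source location, table names, and expectation rule are hypothetical.

```python
import dlt
from pyspark.sql import functions as F

# Hypothetical bronze table ingested with Auto Loader; `spark` is provided by the pipeline runtime.
@dlt.table(comment="Raw orders ingested from cloud storage.")
def orders_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("s3://my-bucket/landing/orders/")
    )

# Expectation: drop rows that violate the data quality rule instead of failing the pipeline.
@dlt.table(comment="Cleaned orders with basic quality checks applied.")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
def orders_silver():
    return dlt.read_stream("orders_bronze").withColumn("ingested_at", F.current_timestamp())
```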

Follow the deploy-code approach for ML workloads

Code and models often progress through the software development stages asynchronously. There are two approaches to promoting ML artifacts from development to production:

  • deploy code: An ML project is coded in the development environment, and this code is then moved to the staging environment, where it is tested. Following successful testing, the project code is deployed to the production environment, where it is executed.

  • deploy model: Model training is executed in the development environment. The produced model artifact is then moved to the staging environment for model validation checks, before deployment of the model to the production environment.

See Model deployment patterns.

Databricks recommends a deploy-code approach for the majority of use cases. The main advantages of this approach are:

  • This fits traditional software engineering workflows, using familiar tools like Git and CI/CD systems.

  • It supports automated retraining in a locked-down environment.

  • It requires only the production environment to have read access to prod training data.

  • It provides full control over the training environment, helping to simplify reproducibility.

  • It enables the data science team to use modular code and iterative testing, helping with coordination and development in larger projects.

This is described in detail in the Databricks ebook The Big Book of MLOps.

Use a model registry to decouple code and model lifecycle

Since model lifecycles do not correspond one-to-one to code lifecycles, Unity Catalog allows the full lifecycle of ML models to be managed in its hosted version of the MLflow Model Registry. Models in Unity Catalog extends the benefits of Unity Catalog to ML models, including centralized access control, auditing, lineage, and model discovery across workspaces. Models in Unity Catalog are compatible with the open source MLflow Python client.
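For example, a model logged by a training run can be registered in Unity Catalog and then promoted by alias, keeping the model lifecycle independent of code releases. The following is a minimal sketch; the run ID, catalog, schema, and model name are placeholders.

```python
import mlflow
from mlflow import MlflowClient

# Use Unity Catalog as the model registry.
mlflow.set_registry_uri("databricks-uc")

# Placeholder: the run ID of a training run that logged a model under the "model" artifact path.
model_uri = "runs:/<run_id>/model"

# Register a new version of the model in Unity Catalog (catalog.schema.model_name).
version = mlflow.register_model(model_uri, "ml_prod.customer.churn_classifier")

# Promote this version by alias; serving and batch jobs can reference the alias
# instead of a hard-coded version number.
client = MlflowClient()
client.set_registered_model_alias("ml_prod.customer.churn_classifier", "champion", version.version)
```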

Automate ML experiment tracking

Tracking ML experiments is the process of saving relevant metadata for each experiment and organizing the experiments. This metadata includes experiment inputs/outputs, parameters, models, and other artifacts. The goal of experiment tracking is to create reproducible results across every stage of the ML model development process. Automating this process makes scaling the number of experiments easier, and ensures consistency in the metadata captured across all experiments.

Databricks Autologging is a no-code solution that extends MLflow automatic logging to deliver automatic experiment tracking for machine learning training sessions on Databricks. Databricks Autologging automatically captures model parameters, metrics, files, and lineage information when you train models, with training runs recorded as MLflow tracking runs.
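Databricks Autologging is typically enabled by default on Databricks Runtime for Machine Learning; the same behavior can also be requested explicitly through MLflow autologging, as in this minimal sketch (the model and data are stand-ins).

```python
import mlflow
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Explicitly enable MLflow autologging; Databricks Autologging provides this
# automatically on supported ML runtimes.
mlflow.autolog()

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run(run_name="autologged-gbr"):
    GradientBoostingRegressor().fit(X_train, y_train)
    # Parameters, training metrics, and the fitted model are recorded on the MLflow run
    # without any explicit log_* calls.
```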

Reuse the same infrastructure to manage ML pipelines

The data used for ML pipelines typically comes from the same sources as the data used for other data pipelines. ML and data pipelines are similar in that they both prepare data for business user analysis or model training. Both also need to be scalable, secure, and properly monitored. In both cases, the infrastructure used should support these activities.

Use Databricks Terraform provider to automate deployments of ML environments. ML requires deploying infrastructure such as inference jobs, serving endpoints, and featurization jobs. All ML pipelines can be automated as Workflows with Jobs, and many data-centric ML pipelines can use the more specialized Auto Loader to ingest images and other data and Delta Live Tables to compute features or to monitor metrics.

Be sure to use Model Serving for enterprise-grade deployment of ML models.
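For example, once a model is deployed behind a Model Serving endpoint, applications can score it over REST. The following is a minimal sketch; the endpoint name, input columns, and environment variables are placeholders.

```python
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]
endpoint_name = "churn-classifier"  # placeholder endpoint name

# Score two records against the serving endpoint's REST invocation API.
resp = requests.post(
    f"{host}/serving-endpoints/{endpoint_name}/invocations",
    headers={"Authorization": f"Bearer {token}"},
    json={"dataframe_records": [
        {"tenure_months": 4, "monthly_spend": 79.5},
        {"tenure_months": 36, "monthly_spend": 20.0},
    ]},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```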

Utilize declarative management for complex data and ML projects

Declarative frameworks within MLOps allow teams to define desired outcomes in high-level terms and let the system handle the details of execution, simplifying the deployment and scaling of ML models. These frameworks support continuous integration and deployment, automate testing and infrastructure management, and ensure model governance and compliance, ultimately accelerating time to market and increasing productivity across the ML lifecycle.

Databricks Asset Bundles (DABs) are a tool for streamlining the development of complex data, analytics, and ML projects for the Databricks platform. Bundles make it easy to manage complex projects during active development by providing CI/CD capabilities in your software development workflow using a single, concise, and declarative YAML syntax. By using bundles to automate your project’s testing, deployment, and configuration management, you can reduce errors while promoting software best practices across your organization as templated projects.

3. Manage capacity and quotas

Manage service limits and quotas

Managing service limits and quotas is important for maintaining a well-functioning infrastructure and preventing unexpected costs. Every service launched on a cloud must take limits into account, such as access rate limits, number of instances, number of users, and memory requirements. Check the limits of your cloud provider. These limits must be understood before designing a solution.

Specifically, for the Databricks platform, there are different types of limits:

  • Databricks platform limits: These are specific limits for Databricks resources. The limits for the overall platform are documented in Resource limits.

  • Unity Catalog limits: See Unity Catalog resource quotas.

  • Subscription/account quotas: Databricks leverages cloud resources for its service. For example, workloads on Databricks run on clusters, for which the Databricks platform starts the cloud provider’s virtual machines (VMs). Cloud providers set default quotas on how many VMs can be started at the same time. Depending on the need, these quotas might need to be adjusted. For further details, see Amazon EC2 service quotas.

In a similar way, storage, network, and other cloud services have limitations that must be understood and factored in.

Invest in capacity planning

Capacity planning involves managing cloud resources such as storage, compute, and networking to maintain performance while optimizing costs. Plan for variations in expected load, which can occur for a variety of reasons, including sudden business changes or even world events. Test load variations, including unexpected ones, to ensure your workloads can scale. Ensure that all regions can scale sufficiently to support the total load if one region fails. Consider:

  • Technology and service limitations and cloud constraints. See Manage capacity and quotas.

  • SLAs to determine the services to be used in the design.

  • Cost analysis to determine how much improvement in the application is realized if the cost is increased. Evaluate whether the price is worth the investment.

Understanding and planning for high-priority, high-volume events is important. If the provisioned cloud resources are not sufficient and workloads can’t scale, such spikes in volume can cause an outage.

4. Set up monitoring, alerting, and logging

Establish monitoring processes

Establishing monitoring processes for a data platform is critical for several reasons. Monitoring processes enable early detection of issues such as data quality problems, performance bottlenecks, and system failures, which can help prevent downtime and data loss. They can help identify inefficiencies in the data platform and optimize costs by reducing waste and improving resource utilization. In addition, monitoring processes can help ensure compliance with regulatory requirements and provide audit trails of data access and usage.

Use native and external tools for platform monitoring

The Databricks Data Intelligence Platform has built-in monitoring solutions and integrates external monitoring systems:

  • Databricks Lakehouse Monitoring

    Databricks Lakehouse Monitoring allows you to monitor the statistical properties and quality of the data in all tables in your account. Data quality monitoring provides quantitative measures to track and confirm data consistency over time, and helps identify and alert users to changes in data distribution and model performance. You can also track the performance of machine learning models by monitoring inference tables that contain model inputs and predictions.

    See View Lakehouse Monitoring expenses to understand the cost of Lakehouse Monitoring.

  • SQL warehouse monitoring

    Monitoring the SQL warehouse is essential to understand the load profile over time and to manage the SQL warehouse efficiently. With SQL warehouse monitoring, you can view information, such as the number of queries handled by the warehouse or the number of clusters allocated to the warehouse.

  • Databricks SQL alerts

    Databricks SQL alerts periodically run queries, evaluate defined conditions, and send notifications if a condition is met. You can set up alerts to monitor your business and send notifications when reported data falls outside of expected limits.

    In addition, you can create a Databricks SQL alert based on a metric from a monitor metrics table, for example, to get notified when a statistic moves out of a certain range or if data has drifted in comparison to the baseline table.

  • Auto Loader monitoring

    Auto Loader provides a SQL API for inspecting the state of a stream. Using SQL functions, you can find metadata about files that have been discovered by an Auto Loader stream (a minimal sketch appears at the end of this article). See Monitoring Auto Loader.

    Auto Loader streams can be further monitored using the Apache Spark Streaming Query Listener interface.

  • Job monitoring

    Job monitoring helps you identify and address issues in your Databricks jobs, such as failures, delays, or performance bottlenecks. Job monitoring provides insights into job performance, enabling you to optimize resource utilization, reduce wastage, and improve overall efficiency.

  • Delta Live Tables monitoring

    An event log is created and maintained for every Delta Live Tables pipeline. The event log contains all information related to the pipeline, including audit logs, data quality checks, pipeline progress, and data lineage. You can use the event log to track, understand, and monitor the state of your data pipelines.

  • Streaming monitoring

    Streaming is one of the most important data processing techniques for ingestion and analysis. It provides users and developers with low latency and real-time data processing capabilities for analytics and triggering actions. The Databricks Data Intelligence Platform allows you to monitor Structured Streaming queries.

  • ML and AI monitoring

    Monitoring the performance of models in production workflows is an important aspect of the AI and ML model lifecycle. Inference tables simplify monitoring and diagnostics for models by continuously logging serving request inputs and responses (predictions) from Mosaic AI Model Serving endpoints and saving them into a Delta table in Unity Catalog. You can then use all of the capabilities of the Databricks platform, such as DBSQL queries, notebooks, and Lakehouse Monitoring to monitor, debug, and optimize your models.

    For more details on monitoring model serving, see Monitor model quality and endpoint health.
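As referenced in the Auto Loader monitoring section above, the following minimal sketch inspects the files discovered by an Auto Loader stream with the `cloud_files_state` SQL function and registers a simple Spark streaming query listener. The checkpoint path is a placeholder, and `spark` is assumed to be the notebook's predefined session.

```python
from pyspark.sql.streaming import StreamingQueryListener

# Inspect files discovered by an Auto Loader stream via its checkpoint (placeholder path).
discovered = spark.sql(
    "SELECT * FROM cloud_files_state('s3://my-bucket/_checkpoints/orders/')"
)
discovered.show(truncate=False)

# A simple listener that reports streaming progress; register it once per Spark session.
class ProgressLogger(StreamingQueryListener):
    def onQueryStarted(self, event):
        print(f"Query started: {event.id}")

    def onQueryProgress(self, event):
        print(f"{event.progress.name}: {event.progress.numInputRows} rows in last batch")

    def onQueryTerminated(self, event):
        print(f"Query terminated: {event.id}")

spark.streams.addListener(ProgressLogger())
```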