This article covers best practices for operational excellence, organized by the architectural principles listed in the following sections.
It is a general best practice to have a platform operations team that enables data teams to work on one or more data platforms. This team is responsible for developing blueprints and best practices internally. It provides tooling - for example, for infrastructure automation and self-service access - and ensures that security and compliance needs are met. This way, the burden of securing platform data falls on a central team, so that distributed teams can focus on working with data and producing new insights.
The Databricks Repos feature allows users to store notebooks and other files in a Git repository, providing features like cloning a repository, committing and pushing, pulling, managing branches, and viewing file diffs. Use Repos for better code visibility and tracking. See Git integration with Databricks Repos.
Continuous integration and continuous delivery (CI/CD) refer to developing and delivering software in short, frequent cycles using automation pipelines. While this is by no means a new process, having been ubiquitous in traditional software engineering for decades, it is becoming increasingly necessary for data engineering and data science teams. For data products to be valuable, they must be delivered in a timely manner. Additionally, consumers must have confidence in the validity of outcomes within these products. By automating the building, testing, and deployment of code, development teams can deliver releases more frequently and reliably than with the manual processes still prevalent across many data engineering and data science teams. See What is CI/CD on Databricks?.
For more information about best practices for code development using Databricks Repos, see CI/CD techniques with Git and Databricks Repos. This, together with the Databricks REST API, allows building automated deployment processes with GitHub Actions, Azure DevOps pipelines, or Jenkins jobs.
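As a concrete illustration, an automated deployment pipeline often ends by triggering a Databricks job through the Jobs REST API. The sketch below builds the request body for the `run-now` endpoint; the job ID, parameters, host, and token are placeholders, and the actual HTTP call is shown only in a comment since it depends on your workspace.

```python
import json

def build_run_now_payload(job_id, notebook_params=None):
    """Build the request body for the Jobs API 2.1 run-now endpoint."""
    payload = {"job_id": job_id}
    if notebook_params:
        payload["notebook_params"] = notebook_params
    return payload

# Hypothetical job ID and parameters for illustration.
payload = build_run_now_payload(123, {"env": "staging"})
print(json.dumps(payload))

# A CI step (GitHub Actions, Azure DevOps, Jenkins) would POST this payload:
#   requests.post(f"{host}/api/2.1/jobs/run-now",
#                 headers={"Authorization": f"Bearer {token}"},
#                 json=payload)
```

Keeping the payload construction in a small function makes it easy to unit test the deployment step without calling the workspace.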
Building and deploying ML models is complex. There are many options available to achieve this, but little in the way of well-defined standards. As a result, over the past few years, we have seen the emergence of machine learning operations (MLOps). MLOps is a set of processes and automation for managing models, data, and code to improve performance stability and long-term efficiency in ML systems. It covers data preparation, exploratory data analysis (EDA), feature engineering, model training, model validation, deployment, and monitoring. See MLOps workflows on Databricks.
Always keep your business goals in mind: Just as the core purpose of ML in a business is to enable data-driven decisions and products, the core purpose of MLOps is to ensure that those data-driven applications remain stable, are kept up to date and continue to have positive impacts on the business. When prioritizing technical work on MLOps, consider the business impact: Does it enable new business use cases? Does it improve data teams’ productivity? Does it reduce operational costs or risks?
Manage ML models with a specialized but open tool: It is recommended to track and manage ML models with MLflow, which has been designed with the ML model lifecycle in mind. See MLflow guide.
Implement MLOps in a modular fashion: As with any software application, code quality is paramount for an ML application. Modularized code enables testing of individual components and mitigates difficulties with future code refactoring. Define clear steps (like training, evaluation, or deployment), super steps (like training-to-deployment pipeline), and responsibilities to clarify the modular structure of your ML application.
This is described in detail in the Databricks MLOps whitepaper.
HashiCorp Terraform is a popular open source tool for creating safe and predictable cloud infrastructure across several cloud providers. The Databricks Terraform provider makes it possible to manage Databricks workspaces and the associated cloud infrastructure with this flexible, powerful tool. The goal of the Databricks Terraform provider is to support all Databricks REST APIs, automating the most complicated aspects of deploying and managing your data platforms. The Databricks Terraform provider is the recommended tool to reliably deploy and manage clusters and jobs, provision Databricks workspaces, and configure data access.
Databricks workspace admins can control many aspects of the clusters that are spun up, including available instance types, Databricks versions, and the size of instances by using cluster policies. Workspace admins can enforce some Spark configuration settings, and they can configure multiple cluster policies, allowing certain groups of users to create small clusters or single-user clusters, some groups of users to create large clusters and other groups only to use existing clusters. See Create and manage compute policies.
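A cluster policy is defined as a JSON document mapping cluster attributes to constraints. The sketch below builds such a definition in Python; the Databricks runtime version, node types, and limits are placeholders to adapt to your workspace, and the `fixed`/`allowlist`/`range` constraint types are the ones used by cluster policy definitions.

```python
import json

# A sketch of a cluster policy definition; runtime version, node types,
# and limits are illustrative placeholders.
policy = {
    # Pin the Databricks runtime version users can select.
    "spark_version": {"type": "fixed", "value": "13.3.x-scala2.12"},
    # Restrict instance types to an approved list.
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},
    # Cap autoscaling to keep clusters small.
    "autoscale.max_workers": {"type": "range", "maxValue": 8},
    # Enforce auto-termination and hide the setting from users.
    "autotermination_minutes": {"type": "fixed", "value": 60, "hidden": True},
}
policy_json = json.dumps(policy, indent=2)
```

The resulting JSON would be supplied when creating the policy in the workspace UI or via the API, and different policies can then be assigned to different user groups.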
Workflows with jobs (internal orchestration):
We recommend using workflows with jobs to schedule data processing and data analysis tasks on Databricks clusters with scalable resources. Jobs can consist of a single task or a large, multitask workflow with complex dependencies. Databricks manages task orchestration, cluster management, monitoring, and error reporting for all your jobs. You can run your jobs immediately or periodically through an easy-to-use scheduling system. You can implement job tasks using notebooks, JARs, Delta Live Tables pipelines, or Python, Scala, Spark-submit, and Java applications. See Introduction to Databricks Workflows.
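A multitask workflow is declared as a job specification in which each task names the tasks it depends on. The sketch below shows the shape of such a specification as accepted by the Jobs API; the job name, task keys, and notebook paths are placeholders.

```python
# A sketch of a multitask job specification; task keys and notebook
# paths are illustrative placeholders.
job_spec = {
    "name": "nightly-etl",
    "tasks": [
        {"task_key": "ingest",
         "notebook_task": {"notebook_path": "/Repos/etl/ingest"}},
        {"task_key": "transform",
         "depends_on": [{"task_key": "ingest"}],
         "notebook_task": {"notebook_path": "/Repos/etl/transform"}},
        {"task_key": "report",
         "depends_on": [{"task_key": "transform"}],
         "notebook_task": {"notebook_path": "/Repos/etl/report"}},
    ],
}

def upstream(spec, task_key):
    """Return the task keys a given task depends on."""
    for task in spec["tasks"]:
        if task["task_key"] == task_key:
            return [d["task_key"] for d in task.get("depends_on", [])]
    return []
```

Databricks resolves the `depends_on` edges at run time, so `transform` only starts after `ingest` succeeds, and `report` after `transform`.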
External orchestrators can use the comprehensive Databricks REST API to orchestrate Databricks assets, notebooks, and jobs. See Apache Airflow.
Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage. It can ingest many file formats, such as JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE. Given an input folder on cloud storage, Auto Loader automatically processes new files as they arrive.
For one-off ingestions, consider using the COPY INTO command instead. See Get started using COPY INTO to load data.
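Auto Loader is configured through reader options on a Structured Streaming source. The sketch below collects typical options in a dict (the schema location and input folder are placeholder paths); on a Databricks cluster it would be passed to `spark.readStream` as shown in the comment, which is not executed here because it requires a live Spark session.

```python
# Typical Auto Loader options; paths are illustrative placeholders.
autoloader_options = {
    "cloudFiles.format": "json",                  # input file format
    "cloudFiles.schemaLocation": "/tmp/schema",   # where inferred schema is tracked
    "cloudFiles.inferColumnTypes": "true",        # infer types instead of all-strings
}

# On a Databricks cluster:
# df = (spark.readStream
#       .format("cloudFiles")
#       .options(**autoloader_options)
#       .load("/mnt/raw/events"))                 # placeholder input folder
```

For a one-off load of the same folder, the SQL `COPY INTO` command targeting the table would be the simpler alternative.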
Delta Live Tables is a declarative framework for building reliable, maintainable, and testable data processing pipelines. You define the transformations to perform on your data and Delta Live Tables manages task orchestration, cluster management, monitoring, data quality, and error handling.
With Delta Live Tables, easily define end-to-end data pipelines in SQL or Python: Specify the data source, the transformation logic, and the destination state of the data. Delta Live Tables maintains dependencies and automatically determines the infrastructure to run the job in.
To manage data quality, Delta Live Tables monitors data quality trends over time, preventing bad data from flowing into tables through validation and integrity checks with predefined error policies. See What is Delta Live Tables?.
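The semantics of a Delta Live Tables expectation with a drop policy can be illustrated in pure Python: each expectation pairs a name with a boolean constraint, and rows that fail the constraint are dropped and counted. This is a sketch of the concept, not the `dlt` API, where you would instead decorate a table definition with something like `@dlt.expect_or_drop("valid_id", "id IS NOT NULL")`.

```python
# Pure-Python illustration of "expect or drop" semantics; not the dlt API.
def expect_or_drop(rows, constraint):
    """Keep rows satisfying the constraint; report how many were dropped."""
    kept = [r for r in rows if constraint(r)]
    return kept, len(rows) - len(kept)

rows = [{"id": 1}, {"id": None}, {"id": 3}]
valid, dropped = expect_or_drop(rows, lambda r: r["id"] is not None)
```

In a real pipeline, Delta Live Tables records these drop counts in the event log, which is how quality trends become observable over time.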
The deploy-code approach follows these steps:
Training environment: Develop training code and ancillary code. Then promote the code to staging.
Staging environment: Train model on data subset and test ancillary code. Then promote the code to production.
Production environment: Train model on prod data and test model. Then deploy the model and ancillary pipelines.
The main advantages of this model are:
It fits traditional software engineering workflows, using familiar tools like Git and CI/CD systems.
It supports automated retraining in a locked-down environment.
Only the production environment needs read access to prod training data.
It gives full control over the training environment, which helps to simplify reproducibility.
It enables the data science team to use modular code and iterative testing, which helps with coordination and development in larger projects.
This is described in detail in the MLOps whitepaper.
Since model lifecycles do not correspond one-to-one with code lifecycles, it makes sense for model management to have its own service. MLflow and its Model Registry support managing model artifacts directly via UI and APIs. The loose coupling of model artifacts and code provides flexibility to update production models without code changes, streamlining the deployment process in many cases. Model artifacts are secured using MLflow access controls or cloud storage permissions. See Manage model lifecycle in Unity Catalog.
Databricks Autologging is a no-code solution that extends MLflow automatic logging to deliver automatic experiment tracking for machine learning training sessions on Databricks. Databricks Autologging automatically captures model parameters, metrics, files, and lineage information when you train models, recording training sessions as MLflow tracking runs.
ML pipelines should be automated using many of the same techniques as other data pipelines. Use Databricks Terraform provider to automate deployment. ML requires deploying infrastructure such as inference jobs, serving endpoints, and featurization jobs. All ML pipelines can be automated as Workflows with Jobs, and many data-centric ML pipelines can use the more specialized Auto Loader to ingest images and other data and Delta Live Tables to compute features or to monitor metrics.
Integrating Databricks with CloudWatch enables metrics derived from logs and alerts. CloudWatch Application Insights can help you automatically discover the fields contained in the logs, and CloudWatch Logs Insights provides a purpose-built query language for faster debugging and analysis.
To help monitor clusters, Databricks provides access to Ganglia metrics from the cluster details page, and these include GPU metrics. See Ganglia metrics.
Monitoring the SQL warehouse is essential to understand the load profile over time and to manage the SQL warehouse efficiently. With SQL warehouse monitoring, you can view information such as the number of queries handled by the warehouse or the number of clusters allocated to it.
Auto Loader provides a SQL API for inspecting the state of a stream. With SQL functions, you can find metadata about files that have been discovered by an Auto Loader stream. See Monitoring Auto Loader.
Auto Loader streams can be further monitored using the Apache Spark Streaming Query Listener interface.
An event log is created and maintained for every Delta Live Tables pipeline. The event log contains all information related to the pipeline, including audit logs, data quality checks, pipeline progress, and data lineage. You can use the event log to track, understand, and monitor the state of your data pipelines. See Monitor Delta Live Tables pipelines.
Streaming is one of the most important data processing techniques for ingestion and analysis. It provides users and developers with low latency and real-time data processing capabilities for analytics and triggering actions. The Databricks Data Intelligence Platform allows you to easily monitor Structured Streaming queries. See Monitoring Structured Streaming queries on Databricks.
Additional information can be found in the dedicated UI with real-time metrics and statistics. For more information, see A look at the new Structured Streaming UI in Apache Spark 3.0.
Every service launched in a cloud has limits that must be taken into account, such as access rate limits, number of instances, number of users, and memory requirements. Check the limits for your cloud provider; these limits need to be understood before designing a solution.
Specifically, for the Databricks platform, there are different types of limits:
Databricks platform limits: These are specific limits for Databricks resources. The limits for the overall platform are documented in Limits.
Unity Catalog limits: See Unity Catalog Resource Quotas.
Subscription/account quotas: Databricks leverages cloud resources for its service. For example, workloads on Databricks run on clusters, for which the Databricks platform starts the cloud provider's virtual machines (VMs). Cloud providers set default quotas on how many VMs can be started at the same time. Depending on the need, these quotas may need to be adjusted.
For further details, see Amazon EC2 service quotas.
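A quick back-of-the-envelope check helps verify that a VM quota covers your concurrent clusters: each cluster needs one driver VM plus its worker VMs. All numbers below are illustrative.

```python
# Back-of-the-envelope quota check; cluster sizes and the quota value
# are illustrative placeholders.
def vms_required(clusters):
    """Sum driver + worker VMs across concurrently running clusters."""
    return sum(1 + c["max_workers"] for c in clusters)

clusters = [
    {"name": "etl", "max_workers": 8},   # 1 driver + 8 workers
    {"name": "bi", "max_workers": 4},    # 1 driver + 4 workers
]
needed = vms_required(clusters)          # 9 + 5 = 14 VMs
quota_ok = needed <= 32                  # example account quota
```

Running this kind of check against peak concurrency (including autoscaling maxima) tells you whether a quota increase should be requested before, rather than during, a production incident.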
In a similar way, storage, network, and other cloud services have limitations that need to be understood and factored in.
Plan for fluctuations in the expected load, which can occur for several reasons, such as sudden business changes or even world events. Test variations of load, including unexpected ones, to ensure that your workloads can scale. If a region fails, ensure that the remaining regions can adequately scale to support the total load. Take the following into consideration:
Technology and service limits and limitations of the cloud. See Manage capacity and quota.
SLAs when determining the services to use in the design.
Cost analysis to determine how much improvement will be realized in the application if costs are increased. Evaluate whether the price is worth the investment.
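The cost analysis above can be sketched as a simple rule: scaling up is worthwhile when throughput grows at least as fast as cost. The function name, rule, and numbers below are illustrative, and real evaluations would also weigh SLAs and risk.

```python
# Illustrative cost/benefit rule for a scaling decision.
def worth_scaling(cost_now, cost_scaled, throughput_now, throughput_scaled):
    """Return True if relative throughput gain matches or beats the
    relative cost increase."""
    cost_ratio = cost_scaled / cost_now
    gain_ratio = throughput_scaled / throughput_now
    return gain_ratio >= cost_ratio

# Doubling cost for 2.5x throughput passes this rule:
decision = worth_scaling(100.0, 200.0, 1000.0, 2500.0)
```

Conversely, tripling cost for only double the throughput would fail the rule, signaling that the extra spend needs a justification beyond raw throughput.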