Best practices for interoperability and usability
This article covers best practices for interoperability and usability, organized by architectural principles listed in the following sections.
1. Define standards for integration
Use the Databricks REST API for external integration
The Databricks Lakehouse comes with a comprehensive REST API that allows you to manage nearly all aspects of the platform programmatically. The REST API server runs in the control plane and provides a unified endpoint to manage the Databricks platform. This is the preferred way to integrate Databricks into existing tools, for example for CI/CD or MLOps. For shell-based integration, the Databricks CLI wraps many of the REST APIs in a command-line interface.
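As a minimal sketch of such an integration, the following Python snippet calls the Clusters API list endpoint with a personal access token. The workspace URL and token are placeholders, and the helper names `build_list_clusters_request` and `list_cluster_names` are hypothetical; only the standard library is used.

```python
import json
import urllib.request


def build_list_clusters_request(host: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the Clusters API list endpoint."""
    url = f"{host.rstrip('/')}/api/2.0/clusters/list"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def list_cluster_names(host: str, token: str) -> list[str]:
    """Send the request and return the names of the clusters in the workspace."""
    request = build_list_clusters_request(host, token)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    # The "clusters" array may be missing when the workspace has no clusters.
    return [cluster["cluster_name"] for cluster in payload.get("clusters", [])]
```

A CI/CD job could call `list_cluster_names(host, token)` with values read from its secret store. Separating request construction from sending keeps the integration easy to unit test.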
Use optimized connectors to access data sources from the lakehouse
Databricks offers a variety of ways to help you ingest data into Delta Lake, and the lakehouse provides optimized connectors for many data formats and cloud services. Many of them are already included in the Databricks Runtime. These connectors are often built and optimized for specific data sources. See What data sources connect to Databricks with JDBC?
Use partners available in Partner Connect
Businesses have different needs, and no single tool can meet all of them. Partner Connect allows you to explore and easily integrate with our partners, which cover all aspects of the lakehouse: data ingestion, preparation and transformation, BI and visualization, machine learning, data quality, and so on. Partner Connect lets you create trial accounts with selected Databricks technology partners and connect your Databricks workspace to partner solutions from the Databricks UI. Try partner solutions using your data in the Databricks Lakehouse, and then adopt the solutions that best meet your business needs.
Use Delta Live Tables and Auto Loader
Delta Live Tables is a framework for building reliable, maintainable, and testable data processing pipelines. You define the transformations to perform on your data, and Delta Live Tables manages task orchestration, cluster management, monitoring, data quality, and error handling. See What is Delta Live Tables?
Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage. An essential aspect of both Delta Live Tables and Auto Loader is their declarative nature: without them, you have to build complex pipelines that integrate different cloud services, such as a notification service and a queuing service, to reliably read cloud files based on events and to reliably combine batch and streaming sources.
Auto Loader and Delta Live Tables reduce system dependencies and complexity and significantly improve interoperability with cloud storage and between different paradigms such as batch and streaming. As a side effect, the simplicity of the pipelines increases platform usability.
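To illustrate how little code the declarative approach needs, here is a minimal sketch of an Auto Loader read in Python. It assumes a Spark session on a Databricks cluster is passed in as `spark`; the helper names and all paths are hypothetical.

```python
def auto_loader_options(file_format: str, schema_location: str) -> dict[str, str]:
    """Return the Auto Loader options for incremental ingestion of one file format."""
    return {
        # Format of the incoming files, for example "json", "csv", or "parquet".
        "cloudFiles.format": file_format,
        # Location where Auto Loader keeps the inferred schema between runs.
        "cloudFiles.schemaLocation": schema_location,
    }


def read_new_files(spark, source_path: str, file_format: str, schema_location: str):
    """Return a streaming DataFrame over files as they arrive under source_path."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**auto_loader_options(file_format, schema_location))
        .load(source_path)
    )
```

In a Delta Live Tables pipeline, the same reader would typically be returned from a function decorated with `@dlt.table`, so that the framework manages the target table and orchestration. No notification or queuing service appears in the code.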
Use Infrastructure as Code for deployments and maintenance
HashiCorp Terraform is a popular open source tool for creating safe and predictable cloud infrastructure across several cloud providers. See Operational-Excellence > Use Infrastructure as Code for deployments and maintenance.
2. Prefer open interfaces and open data formats
Use the Delta data format
The Delta Lake framework has many advantages, from reliability features to high-performance enhancements, and it is also a fully open data format. Dozens of third-party tools and applications support Delta Lake. See What is Delta Lake?
Additionally, Delta Lake comes with a Delta Standalone library, which opens the Delta format for development projects. It is a single-node Java library that can read from and write to Delta tables. Specifically, this library provides APIs to interact with table metadata in the transaction log, implementing the Delta Transaction Log Protocol to achieve the transactional guarantees of the Delta format.
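Because the transaction log is an open format, any tool can read it. The following Python sketch is not Delta Standalone and not a complete implementation of the protocol: it only replays the `add` and `remove` actions from the JSON commit files under `_delta_log` to find the data files that currently make up a table. It assumes that all commits are still present as JSON files and ignores checkpoints; the function name `active_files` is hypothetical.

```python
import json
from pathlib import Path


def active_files(table_path: str) -> set[str]:
    """Replay the JSON commits in _delta_log and return the live data file paths."""
    log_dir = Path(table_path) / "_delta_log"
    live: set[str] = set()
    # Commit files are named by zero-padded version, so name order is commit order.
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)  # one JSON action per line
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live
```

Production readers should use a maintained library, which also handles checkpoints, protocol versions, and table metadata.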
Use Delta Sharing to exchange data with partners
Delta Sharing is an open protocol developed by Databricks for secure data sharing with other organizations regardless of which computing platforms they use. A Databricks user, called a “data provider”, can use Delta Sharing to share data with a person or group outside their organization, named a “data recipient”. Data recipients can immediately begin working with the latest version of the shared data. Delta Sharing is available for data in the Unity Catalog metastore.
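On the recipient side, access can be sketched with the open source `delta-sharing` Python client. This is a minimal example under stated assumptions: the provider has sent a profile file, and the share, schema, and table names are placeholders. The helper names are hypothetical; `delta_sharing.load_as_pandas` is the client call that loads the table.

```python
def shared_table_url(profile_path: str, share: str, schema: str, table: str) -> str:
    """Build the '<profile>#<share>.<schema>.<table>' address used by the client."""
    return f"{profile_path}#{share}.{schema}.{table}"


def load_shared_table(profile_path: str, share: str, schema: str, table: str):
    """Load a shared table into a pandas DataFrame on the recipient side."""
    # Imported lazily so this module loads even where the package is not installed.
    import delta_sharing  # pip install delta-sharing

    return delta_sharing.load_as_pandas(
        shared_table_url(profile_path, share, schema, table)
    )
```

For example, `load_shared_table("config.share", "sales", "retail", "orders")` would read the table `orders` from the hypothetical share `sales`, without the recipient needing a Databricks workspace.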
Use MLflow to manage machine learning workflows
MLflow is an open source platform for managing the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. Using MLflow on Databricks combines two advantages: you write your ML workflow with an open and portable tool, and you rely on services operated by Databricks (Tracking Server, Model Registry). Databricks also adds enterprise-grade, scalable model serving, which lets you host MLflow models as REST endpoints. See MLflow guide.
3. Lower the barriers for implementing use cases
Provide a self-service experience across the platform
The Databricks Lakehouse platform has all the capabilities required to provide a self-service experience. While there might be a mandatory approval step, the best practice is to fully automate the setup when a business unit requests access to the lakehouse. Automatically provision their new environment, sync users and use SSO for authentication, provide access control to common data and separate object storages for their own data, and so on. Together with a central data catalog containing semantically consistent and business-ready data sets, this quickly and securely provides access for new business units to the lakehouse capabilities and the data they need.
Use the serverless services of the platform
With serverless compute on the Databricks platform, the compute layer runs in the customer’s Databricks account rather than in the customer’s cloud provider account. Cloud administrators no longer have to manage complex cloud environments that involve adjusting quotas, creating and maintaining networking assets, and joining billing sources. Users benefit from near-zero waiting times for cluster start and improved concurrency on their queries.
Offer predefined clusters and SQL warehouses for each use case
If using serverless services is not possible, remove the burden of defining a cluster (VM type, node size, and cluster size) from end users. This can be achieved in the following ways:
Provide shared clusters as immediate environments for users. On these clusters, use autoscaling with a very low minimum number of nodes to avoid high idle costs.
Use cluster policies to define t-shirt-sized clusters (S, M, L) for projects as a standardized work environment.
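The t-shirt-size approach can be sketched as a set of cluster policy definitions generated from one template. This is an illustration only: the size table, the policy names, the auto-termination value, and the node type are hypothetical and must be adapted to your cloud and workloads. The resulting payload has the shape expected by the Cluster Policies API create endpoint, where `definition` is a JSON string.

```python
import json

# Hypothetical size table: maximum number of workers per t-shirt size.
SIZES = {"S": 2, "M": 8, "L": 32}


def policy_definition(max_workers: int, node_type_id: str) -> dict:
    """Return a policy definition that fixes the VM type and caps autoscaling."""
    return {
        # Users cannot change the VM type.
        "node_type_id": {"type": "fixed", "value": node_type_id},
        # Scale down to a single worker when the cluster is idle.
        "autoscale.min_workers": {"type": "fixed", "value": 1},
        # Users may pick any upper bound up to the size limit.
        "autoscale.max_workers": {"type": "range", "maxValue": max_workers},
        # Terminate idle clusters after 30 minutes (assumed value).
        "autotermination_minutes": {"type": "fixed", "value": 30},
    }


def create_policy_payload(size: str, node_type_id: str) -> dict:
    """Build the request body for creating the policy of one t-shirt size."""
    definition = policy_definition(SIZES[size], node_type_id)
    return {"name": f"tshirt-{size}", "definition": json.dumps(definition)}
```

Calling `create_policy_payload("M", "<node-type>")` for each size yields three policies that differ only in their worker limit, so end users choose a size instead of configuring a cluster.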
4. Ensure data consistency and usability
Offer reusable data-as-products that the business can trust
Producing high-quality data-as-product is a primary goal of any data platform. The idea is that data engineering teams apply product thinking to the curated data: The data assets are their products, and the data scientists, ML and BI engineers, or any other business teams that consume data are their customers. These customers should be able to discover, address, and create value from these data-as-products through a self-service experience without the intervention of the specialized data teams.
Publish data products semantically consistent across the enterprise
A data lake usually contains data from different source systems. These systems sometimes name the same concept differently (such as customer vs. account) or mean different concepts by the same identifier. For business users to easily combine these data sets in a meaningful way, the data must be made homogeneous across all sources to be semantically consistent. In addition, for some data to be valuable for analysis, internal business rules must be applied correctly, such as revenue recognition. To ensure that all users are using the correctly interpreted data, data sets with these rules must be made available and published to Unity Catalog. Access to source data must be limited to teams that understand the correct usage.
Use Unity Catalog for data discovery and lineage exploration
In Unity Catalog, administrators and data stewards manage users and their access to data centrally across all workspaces in a Databricks account. Users in different workspaces can share the same data and, depending on user privileges granted centrally in Unity Catalog, joint data access is possible. See Discover and manage data using Data Explorer.
From a usability perspective, Unity Catalog provides the following two capabilities:
Data Explorer is the main UI for many Unity Catalog features. You can use Data Explorer to view schema details, preview sample data, and see table details and properties. Administrators can view and change owners, and administrators and data object owners can grant and revoke permissions. You can also use Databricks Search, which enables users to find data assets (such as tables, columns, views, dashboards, models, and so on) easily and seamlessly. Users are shown results that are relevant to their search requests and that they have access to.
Data lineage is captured across all queries run on a Databricks cluster or SQL warehouse. Lineage is supported for all languages and is captured down to the column level. Lineage data includes notebooks, workflows, and dashboards related to the query. Lineage can be visualized in Data Explorer in near real-time and retrieved with the Databricks REST API. See Capture and view data lineage with Unity Catalog.
To allow enterprises to provide their users a holistic view of all data across all data platforms, Unity Catalog provides integration with enterprise data catalogs (sometimes referred to as the “catalog of catalogs”).