Databricks architecture overview

The Databricks Unified Data Analytics Platform, from the original creators of Apache Spark, enables data teams to collaborate on solving some of the world's toughest problems.

High-level architecture

Databricks is structured to enable secure cross-functional team collaboration while keeping many backend services managed by Databricks, so you can stay focused on your data science, data analytics, and data engineering tasks.

Databricks operates out of a control plane and a data plane.

  • The control plane includes the backend services that Databricks manages in its own AWS account. Notebook commands and many other workspace configurations are stored in the control plane and encrypted at rest.
  • The data plane is where your data is processed.
    • For most Databricks computation, the compute resources are in your AWS account in what is called the Classic data plane. This is the type of data plane Databricks uses for notebooks, jobs, and for Classic Databricks SQL endpoints.
    • If you enable Serverless compute for Databricks SQL, the compute resources for Databricks SQL are in a shared Serverless data plane. The compute resources for notebooks, jobs, and Classic Databricks SQL endpoints still live in the Classic data plane in the customer account. See Serverless compute.

You can use Databricks connectors so that your clusters can connect to external data sources outside of your AWS account for ingestion or storage. You can also ingest data from external streaming sources, such as event data, IoT data, and more.
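As a sketch of what such a connection looks like from a cluster, the following builds the options for Spark's JDBC data source. The host, database, table, and credentials are hypothetical placeholders; in practice you would fetch credentials from a Databricks secret scope rather than hard-coding them.

```python
# Sketch: connecting a Databricks cluster to an external JDBC source.
# The host, database, table, and credentials below are hypothetical.

def jdbc_read_options(host: str, port: int, database: str,
                      user: str, password: str) -> dict:
    """Build the options passed to spark.read.format("jdbc")."""
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "dbtable": "public.events",   # hypothetical source table
        "user": user,
        "password": password,         # in practice: dbutils.secrets.get(scope, key)
    }

opts = jdbc_read_options("db.example.com", 5432, "analytics",
                         "reader", "example-password")

# On a running cluster you would then load the table into a DataFrame:
# df = spark.read.format("jdbc").options(**opts).load()
```

Because the data plane runs in your AWS account, the cluster nodes making this connection are your own EC2 instances, so network access to the external source is governed by your VPC configuration.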

Although architectures can vary depending on custom configurations, the following diagram represents the most common structure and flow of data for Databricks on AWS environments.

The following diagram describes the overall architecture of the Classic data plane. For architectural details about the Serverless data plane that is used for Serverless SQL endpoints, see Serverless compute.

[Diagram: Databricks architecture]

Your data lake is stored at rest in your own AWS account.

Job results reside in storage in your account.

Interactive notebook results are stored in a combination of the control plane (partial results for presentation in the UI) and your AWS storage. If you want interactive notebook results stored only in your cloud account storage, you can ask your Databricks representative to enable interactive notebook results in the customer account for your workspace. Note that some metadata about results, such as chart column names, continues to be stored in the control plane. This feature is in Public Preview.

E2 architecture

In September 2020, Databricks released the E2 version of the platform, which provides:

  • Multi-workspace accounts: Create multiple workspaces per account using the Account API 2.0.
  • Customer-managed VPCs: Create Databricks workspaces in your own VPC rather than using the default architecture in which clusters are created in a single AWS VPC that Databricks creates and configures in your AWS account.
  • Secure cluster connectivity: Also known as “No Public IPs,” secure cluster connectivity lets you launch clusters in which all nodes have only private IP addresses, providing enhanced security.
  • Customer-managed keys for managed services (Public Preview): Provide KMS keys to encrypt notebook and secret data in the Databricks-managed control plane.
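To illustrate the multi-workspace feature above, the following is a minimal sketch of constructing an Account API 2.0 request to list the workspaces in an account. The account ID and credentials are placeholders, and the exact endpoint and authentication scheme should be verified against the Account API reference.

```python
# Sketch: building an Account API 2.0 request to list workspaces.
# The account ID and credentials below are placeholders, not real values.
from base64 import b64encode

ACCOUNTS_HOST = "https://accounts.cloud.databricks.com"

def list_workspaces_request(account_id: str, username: str, password: str):
    """Return (url, headers) for GET /api/2.0/accounts/{account_id}/workspaces."""
    url = f"{ACCOUNTS_HOST}/api/2.0/accounts/{account_id}/workspaces"
    token = b64encode(f"{username}:{password}".encode()).decode()
    headers = {"Authorization": f"Basic {token}"}
    return url, headers

url, headers = list_workspaces_request(
    "11111111-2222-3333-4444-555555555555",  # placeholder account ID
    "admin@example.com", "example-password")

# With the requests library you would then send:
# resp = requests.get(url, headers=headers)
```

Note that the Account API is served from accounts.cloud.databricks.com, not from an individual workspace URL, since it operates at the account level across workspaces.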

Along with features like token management, IP access lists, cluster policies, and IAM credential passthrough, the E2 architecture makes the Databricks platform on AWS more secure, more scalable, and simpler to manage.

New accounts (except for select custom accounts) are created on the E2 platform, and most existing accounts have been migrated. If you are unsure whether your account is on the E2 platform, contact your Databricks representative.