Databricks architecture overview

The Databricks Unified Data Analytics Platform, from the original creators of Apache Spark, enables data teams to collaborate in order to solve some of the world’s toughest problems.

Databricks excels at enabling data scientists, data engineers, and data analysts to work together on use cases such as:

  • Applying advanced analytics for machine learning and graph processing at scale
  • Using deep learning to harness the power of unstructured data for AI, image interpretation, automatic translation, natural language processing, and more
  • Making data warehousing fast, simple, and scalable
  • Proactively detecting threats with data science and AI
  • Analyzing high-velocity sensor and time-series IoT data in real time
  • Making GDPR data subject requests easy to execute

High-level architecture

Databricks is structured to enable secure cross-functional team collaboration while Databricks manages many of the backend services, so you can stay focused on your data science, data analytics, and data engineering tasks.

Although architectures can vary depending on custom configurations, the following diagram represents the most common structure and flow of data for Databricks on AWS environments.

Databricks architecture

Databricks operates out of a control plane and a data plane.

The control plane includes the backend services that Databricks manages in its own AWS account. Commands that you run exist in the control plane, with your code fully encrypted. Saved commands reside in the data plane.

The data plane is managed by your AWS account and is where your data resides. This is also where data is processed. This diagram assumes that data has already been ingested into Databricks, but you can ingest data from external data sources, such as events data, streaming data, IoT data, and more. You can connect to external data sources outside of your AWS account for storage as well, using Databricks connectors.

Your data always resides in your AWS account in the data plane, not the control plane, so you always maintain full control and ownership of your data without lock-in.
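
The split between the two planes described above can be summarized as a small lookup table. The following is only an illustrative sketch: the names `Plane`, `OWNER`, `RESIDENCY`, and `items_in` are hypothetical and are not part of any Databricks API. The entries restate what this section says about where things reside.

```python
from enum import Enum


class Plane(Enum):
    CONTROL = "control plane"
    DATA = "data plane"


# Which AWS account each plane lives in, as described in this section.
OWNER = {
    Plane.CONTROL: "Databricks AWS account",
    Plane.DATA: "your AWS account",
}

# Where each kind of item resides, as described in this section.
# The keys are illustrative labels, not official terminology.
RESIDENCY = {
    "backend services": Plane.CONTROL,
    "commands you run (code encrypted)": Plane.CONTROL,
    "saved commands": Plane.DATA,
    "your data": Plane.DATA,
    "data processing": Plane.DATA,
}


def items_in(plane: Plane) -> list[str]:
    """Return the items that reside in the given plane, sorted by name."""
    return sorted(name for name, p in RESIDENCY.items() if p is plane)


if __name__ == "__main__":
    for plane in Plane:
        print(f"{plane.value} ({OWNER[plane]}): {', '.join(items_in(plane))}")
```

Keeping the mapping in one table makes the key point easy to check: everything labeled as your data or its processing maps to the plane in your own AWS account.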

E2 architecture

In September 2020, Databricks released the E2 version of the platform, which provides:

  • Multi-workspace accounts: Create multiple workspaces per account using the Account API.
  • Customer-managed VPCs: Create Databricks workspaces in your own VPC rather than using the default architecture in which clusters are created in a single AWS VPC that Databricks creates and configures in your AWS account.
  • Secure cluster connectivity: Also known as “No Public IPs,” secure cluster connectivity lets you launch clusters in which all nodes have only private IP addresses, providing enhanced security.
  • Customer-managed keys for notebooks (Public Preview): Provide KMS keys to encrypt notebooks in the Databricks-managed control plane.
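
As a rough illustration of creating a workspace through the Account API, the sketch below only builds the HTTP request and does not send it. The host name, endpoint path, and field names (`workspace_name`, `aws_region`, `credentials_id`, `storage_configuration_id`, `network_id`) are assumptions based on the Account API as commonly documented; verify them against the current API reference before use. The helper name `build_create_workspace_request` and all IDs in the usage example are hypothetical.

```python
import json
from typing import Optional

# Assumed Account API host; confirm against the official API reference.
ACCOUNTS_HOST = "https://accounts.cloud.databricks.com"


def build_create_workspace_request(
    account_id: str,
    workspace_name: str,
    aws_region: str,
    credentials_id: str,
    storage_configuration_id: str,
    network_id: Optional[str] = None,
) -> tuple[str, str]:
    """Build (url, json_body) for a create-workspace call without sending it."""
    url = f"{ACCOUNTS_HOST}/api/2.0/accounts/{account_id}/workspaces"
    payload = {
        "workspace_name": workspace_name,
        "aws_region": aws_region,
        "credentials_id": credentials_id,
        "storage_configuration_id": storage_configuration_id,
    }
    # Assumption: a network configuration ID is supplied only when the
    # workspace should be created in a customer-managed VPC.
    if network_id is not None:
        payload["network_id"] = network_id
    return url, json.dumps(payload, sort_keys=True)


if __name__ == "__main__":
    # All identifiers below are placeholders.
    url, body = build_create_workspace_request(
        account_id="<account-id>",
        workspace_name="analytics-dev",
        aws_region="us-west-2",
        credentials_id="<credentials-id>",
        storage_configuration_id="<storage-configuration-id>",
    )
    print("POST", url)
    print(body)
```

Separating request construction from sending keeps the example runnable offline; an actual call would also need authentication, which is omitted here.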

Along with features like token management, IP access lists, cluster policies, and IAM credential passthrough, the E2 architecture makes the Databricks platform on AWS more secure, more scalable, and simpler to manage.
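
To make the idea behind IP access lists concrete, here is a minimal, generic sketch of checking a client address against allow and block lists of CIDR ranges. It is not the Databricks implementation or API: the function name `is_allowed` is hypothetical, and the rule that a block entry overrides an allow entry is an assumption made for this example.

```python
import ipaddress


def is_allowed(client_ip: str, allow: list[str], block: list[str]) -> bool:
    """Return True if client_ip matches an allow range and no block range.

    Assumption for this sketch: block ranges take precedence over allow ranges.
    """
    addr = ipaddress.ip_address(client_ip)
    # Reject first if the address falls in any blocked CIDR range.
    if any(addr in ipaddress.ip_network(cidr) for cidr in block):
        return False
    # Otherwise accept only if it falls in some allowed CIDR range.
    return any(addr in ipaddress.ip_network(cidr) for cidr in allow)


if __name__ == "__main__":
    # Addresses come from ranges reserved for documentation.
    allow = ["203.0.113.0/24"]
    block = ["203.0.113.9/32"]
    for ip in ("203.0.113.7", "203.0.113.9", "198.51.100.1"):
        print(ip, is_allowed(ip, allow, block))
```

Only the Python standard library `ipaddress` module is used, so the sketch runs as is.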

New accounts—except for select custom accounts—are created on the E2 platform, and most existing accounts have been migrated. If you are unsure whether your account is on the E2 platform, contact your Databricks representative.