Data engineering with Databricks

Databricks provides an end-to-end data engineering solution that empowers data engineers, software developers, SQL developers, analysts, and data scientists to deliver high-quality data for downstream analytics, AI, and operational applications.

The following image shows the architecture of the Databricks data engineering stack, including Jobs, Lakeflow Connect, DLT, and the Databricks Runtime.

[Diagram: Databricks data engineering overview]

  • Lakeflow Connect simplifies data ingestion with connectors to popular enterprise applications, databases, cloud storage, message buses, and local files. A subset of these connectors is available as managed connectors, which provide a simple UI and a configuration-based ingestion service with minimal operational overhead, without requiring you to use the underlying DLT APIs and infrastructure. For cloud storage sources, ingestion is typically built on Auto Loader (see the first sketch after this list).

  • DLT is a declarative framework that lowers the complexity of building and managing efficient batch and streaming data pipelines. DLT runs on the performance-optimized Databricks Runtime, and the DLT flows API uses the same DataFrame API as Apache Spark and Structured Streaming. A flow can write to streaming tables and sinks, such as a Kafka topic, using streaming semantics, or to a materialized view using batch semantics. DLT also automatically orchestrates the execution of flows, sinks, streaming tables, and materialized views by encapsulating them in a single pipeline (see the second sketch after this list).

  • Jobs provides reliable orchestration and production monitoring for any data and AI workload. A job consists of one or more tasks that run notebooks, pipelines, managed connectors, SQL queries, machine learning training, and model deployment and inference. Jobs also supports custom control-flow logic, such as branching with if/else conditions and looping with for-each tasks (see the third sketch after this list).

  • Databricks Runtime for Apache Spark is a reliable, performance-optimized compute environment for running Spark workloads, both batch and streaming. Databricks Runtime includes Photon, a high-performance vectorized query engine native to Databricks, and infrastructure optimizations such as autoscaling. You can run Spark and Structured Streaming workloads on the Databricks Runtime by packaging your programs as notebooks, JARs, or Python wheels (see the final sketch after this list).
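
The following is a minimal sketch of cloud storage ingestion with Auto Loader, which underpins file-based ingestion on Databricks. The paths and the target table name are hypothetical placeholders, and the sketch assumes the ambient spark session available in Databricks notebooks; managed connectors for enterprise applications are configured through the UI rather than code like this.

```python
# Incrementally ingest new JSON files from a cloud storage path into a
# Delta table. All paths and table names here are hypothetical.
(
    spark.readStream.format("cloudFiles")                 # Auto Loader source
    .option("cloudFiles.format", "json")                  # format of incoming files
    .option("cloudFiles.schemaLocation",
            "/Volumes/main/default/checkpoints/orders_schema")  # schema tracking
    .load("/Volumes/main/default/landing/orders/")        # landing-zone directory
    .writeStream
    .option("checkpointLocation", "/Volumes/main/default/checkpoints/orders")
    .trigger(availableNow=True)                           # drain the backlog, then stop
    .toTable("main.default.orders_bronze")                # target Delta table
)
```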
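
The next sketch shows the shape of a DLT pipeline in Python, with hypothetical source paths and table names, again assuming the ambient spark session. Decorating a function that returns a streaming DataFrame defines a streaming table; decorating a function that returns a batch DataFrame defines a materialized view. DLT infers the dependency between the two datasets and orchestrates both when the pipeline runs.

```python
import dlt
from pyspark.sql import functions as F

# Streaming table: populated incrementally with streaming semantics.
@dlt.table(name="orders_bronze", comment="Raw orders ingested with Auto Loader.")
def orders_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/default/landing/orders/")  # hypothetical path
    )

# Materialized view: recomputed with batch semantics from the table above.
# DLT detects this dependency and runs the datasets in the right order.
@dlt.table(name="daily_revenue", comment="Revenue aggregated by day.")
def daily_revenue():
    return (
        spark.read.table("orders_bronze")
        .groupBy(F.to_date("order_ts").alias("order_date"))
        .agg(F.sum("amount").alias("revenue"))
    )
```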
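
For orchestration, the sketch below uses the Databricks SDK for Python to create a two-task job in which a notebook task runs only after a pipeline task succeeds. The job name, pipeline ID, and notebook path are hypothetical, and the sketch assumes serverless job compute (otherwise each task needs a cluster specification); jobs can equally be defined in the UI or with Databricks Asset Bundles.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # reads credentials from the environment

created = w.jobs.create(
    name="nightly-orders",  # hypothetical job name
    tasks=[
        # Task 1: refresh a DLT pipeline (hypothetical pipeline ID).
        jobs.Task(
            task_key="refresh_pipeline",
            pipeline_task=jobs.PipelineTask(pipeline_id="<pipeline-id>"),
        ),
        # Task 2: run a notebook, but only after the pipeline task succeeds.
        jobs.Task(
            task_key="publish_report",
            depends_on=[jobs.TaskDependency(task_key="refresh_pipeline")],
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/reports/publish"),
        ),
    ],
)
print(f"Created job {created.job_id}")
```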
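
Finally, a workload that runs directly on the Databricks Runtime is an ordinary Apache Spark program. The Structured Streaming sketch below, with hypothetical table names and paths, runs unchanged whether it is packaged as a notebook, a JAR, or a Python wheel; Photon and autoscaling are applied by the platform rather than by your code.

```python
from pyspark.sql import SparkSession, functions as F

# On Databricks a session already exists; getOrCreate() simply returns it.
spark = SparkSession.builder.appName("orders-stream").getOrCreate()

# Stream order events from a Delta table (hypothetical name) and
# aggregate revenue into one-minute windows.
events = spark.readStream.table("main.default.orders_bronze")

revenue_per_minute = (
    events.withWatermark("order_ts", "10 minutes")
    .groupBy(F.window("order_ts", "1 minute"))
    .agg(F.sum("amount").alias("revenue"))
)

# Append finalized windows to a results table once the watermark passes.
query = (
    revenue_per_minute.writeStream
    .outputMode("append")
    .option("checkpointLocation", "/Volumes/main/default/checkpoints/revenue")
    .trigger(availableNow=True)
    .toTable("main.default.revenue_per_minute")
)
query.awaitTermination()  # block until the available backlog is processed
```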

Additional resources