Data engineering with Databricks
Databricks provides an end-to-end data engineering solution that empowers data engineers, software developers, SQL developers, analysts, and data scientists to deliver high-quality data for downstream analytics, AI, and operational applications.
The following image shows the architecture of the Databricks data engineering system, including Jobs, Lakeflow Connect, DLT, and the Databricks Runtime.
- Lakeflow Connect simplifies data ingestion with connectors to popular enterprise applications, databases, cloud storage, message buses, and local files. A subset of these connectors is available as managed connectors. Managed connectors provide a simple UI and a configuration-based ingestion service with minimal operational overhead, without requiring you to use the underlying DLT APIs and infrastructure.
- DLT provides a declarative framework that lowers the complexity of building and managing efficient batch and streaming data pipelines. DLT runs on the performance-optimized Databricks Runtime, and the DLT flows API uses the same DataFrame API as Apache Spark and Structured Streaming. A flow can write into streaming tables and sinks, such as a Kafka topic, using streaming semantics, or it can write to a materialized view using batch semantics. In addition, DLT automatically orchestrates the execution of flows, sinks, streaming tables, and materialized views by encapsulating and running them as a pipeline.
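  A minimal sketch of what a declarative DLT pipeline definition can look like in Python, assuming it runs inside a DLT pipeline (where `import dlt` and the `spark` session are provided by the runtime; the storage path and column names here are hypothetical):

  ```python
  import dlt
  from pyspark.sql import functions as F

  # Streaming table: ingests new files incrementally with streaming semantics.
  @dlt.table(comment="Raw events ingested from cloud storage")
  def raw_events():
      return (
          spark.readStream.format("cloudFiles")      # Auto Loader
          .option("cloudFiles.format", "json")
          .load("/Volumes/main/default/raw_events/")  # hypothetical path
      )

  # Materialized view: batch semantics over the table above. DLT infers the
  # dependency from dlt.read() and orchestrates execution order automatically.
  @dlt.table(comment="Daily event counts")
  def daily_event_counts():
      return (
          dlt.read("raw_events")
          .groupBy(F.to_date("event_time").alias("event_date"))  # hypothetical column
          .count()
      )
  ```

  Note that this code defines datasets rather than running them imperatively: DLT reads the definitions, builds the dependency graph, and runs the whole thing as a pipeline.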
- Jobs provides reliable orchestration and production monitoring for any data and AI workload. A job can consist of one or more tasks that run notebooks, pipelines, managed connectors, SQL queries, machine learning training, and model deployment and inference. Jobs also supports custom control-flow logic, such as branching with if/else statements and looping with for each statements.
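  As an illustration of tasks with control flow, here is a hedged sketch of a job definition in the JSON shape used by the Jobs API (the job name, notebook paths, and the `row_count` task value are hypothetical; the condition assumes an upstream task sets that value):

  ```json
  {
    "name": "nightly-etl",
    "tasks": [
      {
        "task_key": "ingest",
        "notebook_task": { "notebook_path": "/Workspace/etl/ingest" }
      },
      {
        "task_key": "check_rows",
        "depends_on": [{ "task_key": "ingest" }],
        "condition_task": {
          "op": "GREATER_THAN",
          "left": "{{tasks.ingest.values.row_count}}",
          "right": "0"
        }
      },
      {
        "task_key": "transform_each",
        "depends_on": [{ "task_key": "check_rows", "outcome": "true" }],
        "for_each_task": {
          "inputs": "[\"bronze\", \"silver\", \"gold\"]",
          "task": {
            "task_key": "transform_iteration",
            "notebook_task": {
              "notebook_path": "/Workspace/etl/transform",
              "base_parameters": { "layer": "{{input}}" }
            }
          }
        }
      }
    ]
  }
  ```

  The `condition_task` branches on a value produced by the `ingest` task, and `for_each_task` runs the transform notebook once per input value.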
- Databricks Runtime for Apache Spark is a reliable and performance-optimized compute environment for running Spark workloads, including batch and streaming. Databricks Runtime provides Photon, a high-performance Databricks-native vectorized query engine, and various infrastructure optimizations like autoscaling. You can run your Spark and Structured Streaming workloads on the Databricks Runtime by building your Spark programs as notebooks, JARs, or Python wheels.
Additional resources
- Data engineering concepts introduces the core concepts behind data engineering in Databricks.
- What is Delta Lake? describes the optimized storage layer that provides the foundation for tables in a Databricks lakehouse.
- To learn about best practices for data engineering in Databricks, see Data engineering best practices.
- Databricks notebooks are a popular tool for collaboration and development.
- If you primarily work with SQL queries and BI tools, see Databricks SQL.
- See Databricks Mosaic AI if you are architecting machine learning solutions.