# Apache Spark overview

Apache Spark is the technology powering compute clusters and SQL warehouses in Databricks.

This page provides an overview of the documentation in this section.

## Get started

Get started working with Apache Spark on Databricks.

| Topic | Description |
| --- | --- |
| Apache Spark on Databricks | Get answers to frequently asked questions about Apache Spark on Databricks. |
| Tutorial: Load and transform data using Apache Spark DataFrames | Follow a step-by-step guide to loading and transforming data with Spark DataFrames in Python, R, or Scala (a minimal PySpark sketch follows this table). |
| PySpark basics | Learn the basics of PySpark by walking through simple examples. |
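As a taste of the tutorial above, here is a minimal PySpark sketch that builds a small DataFrame and applies a simple transformation. The column names and values are invented for illustration; the tutorial itself covers loading from real data sources.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks a SparkSession named `spark` is created for you;
# the builder call matters only when running outside Databricks.
spark = SparkSession.builder.getOrCreate()

# A small in-memory DataFrame (schema and values are illustrative).
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 41), ("Carol", 29)],
    schema=["name", "age"],
)

# Transform: filter rows and derive a new column.
result = df.filter(F.col("age") > 30).withColumn("age_next_year", F.col("age") + 1)
result.show()
```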

## Additional resources

Explore other Spark capabilities and documentation.

| Topic | Description |
| --- | --- |
| Set Spark configuration properties on Databricks | Set Spark configuration properties to customize settings in your compute environment and optimize performance (a short configuration sketch follows this table). |
| Structured Streaming | Read an overview of Structured Streaming, a near-real-time processing engine. |
| Diagnose cost and performance issues using the Spark UI | Learn to use the Spark UI for performance tuning, debugging, and cost optimization of Spark jobs. |
| Use Apache Spark MLlib on Databricks | Apply distributed machine learning with Spark MLlib and integrate it with popular ML frameworks. |
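For the configuration row above, here is a minimal sketch of setting a Spark SQL property for the current session from a notebook. The property shown, `spark.sql.shuffle.partitions`, is a standard Spark setting; the value 64 is an arbitrary example, and compute-wide properties are set in the compute configuration rather than in code.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Session-scoped setting: control the number of partitions used for
# shuffles in Spark SQL (Spark's default is 200; 64 is illustrative).
spark.conf.set("spark.sql.shuffle.partitions", "64")

# Read the setting back to confirm it took effect.
print(spark.conf.get("spark.sql.shuffle.partitions"))
```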

## Spark APIs

Work with Spark using your preferred programming language.

| Topic | Description |
| --- | --- |
| Reference for Apache Spark APIs | Overview of the Apache Spark API reference, with links to reference documentation for Spark SQL, DataFrames, and RDD operations across supported languages. |
| PySpark | Use Python with Spark, including PySpark basics, custom data sources, and Python-specific optimizations. |
| Pandas API on Spark | Use familiar pandas syntax with the scalability of Spark for distributed data processing (a short sketch follows this table). |
| R for Spark | Work with R and Spark using SparkR and sparklyr for statistical computing and data analysis. |
| Scala for Spark | Build high-performance Spark applications in Scala with native Spark APIs and type safety. |
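To illustrate the pandas API on Spark row above, here is a minimal sketch; the data is invented for illustration. Pandas-style operations on a pandas-on-Spark DataFrame are executed by Spark, so the same code scales to distributed datasets.

```python
import pyspark.pandas as ps

# A small pandas-on-Spark DataFrame (names and ages are made up).
psdf = ps.DataFrame({"name": ["Alice", "Bob", "Carol"], "age": [34, 41, 29]})

# Familiar pandas syntax, executed by Spark under the hood.
over_30 = psdf[psdf["age"] > 30]
print(over_30.sort_values("age"))
```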