Apache Spark overview
Apache Spark is the technology powering compute clusters and SQL warehouses in Databricks.
This page provides an overview of the documentation in this section.
Get started
Get started working with Apache Spark on Databricks.
Topic | Description |
---|---|
Apache Spark on Databricks FAQ | Get answers to frequently asked questions about Apache Spark on Databricks. |
Tutorial: Load and transform data using Apache Spark DataFrames | Follow a step-by-step guide for working with Spark DataFrames in Python, R, or Scala for data loading and transformation; a minimal PySpark sketch follows this table. |
PySpark basics | Learn the basics of using PySpark by walking through simple examples. |
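As a taste of the DataFrame workflow covered in the tutorial, here is a minimal PySpark sketch. The file path and column names are hypothetical placeholders, not part of any Databricks dataset.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks a SparkSession named `spark` already exists; building one here
# keeps the sketch runnable elsewhere too.
spark = SparkSession.builder.getOrCreate()

# Load a CSV file into a DataFrame. The path and columns are placeholders --
# substitute a file from your own workspace.
df = spark.read.csv("/path/to/your/file.csv", header=True, inferSchema=True)

# Transform: drop rows missing a value, derive a new column, and aggregate.
result = (
    df.filter(F.col("category").isNotNull())
      .withColumn("category_upper", F.upper(F.col("category")))
      .groupBy("category_upper")
      .count()
)

result.show(5)
```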
Additional resources
Explore other Spark capabilities and documentation.
Topic | Description |
---|---|
Spark configuration properties | Set Spark configuration properties to customize settings in your compute environment and optimize performance. |
Structured Streaming | Read an overview of Structured Streaming, a near real-time processing engine (a minimal sketch follows this table). |
Apache Spark UI | Learn to use the Spark UI for performance tuning, debugging, and cost optimization of Spark jobs. |
Machine learning with Spark MLlib | Perform distributed machine learning using Spark MLlib and integrate with popular ML frameworks. |
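To make the Structured Streaming row above concrete, here is a minimal sketch using the built-in `rate` source and a console sink; the configuration value, row rate, and window size are arbitrary choices for illustration, not recommendations.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Example of setting a Spark configuration property (see the configuration
# row above); the value here is arbitrary.
spark.conf.set("spark.sql.shuffle.partitions", "8")

# The built-in "rate" source emits rows with a `timestamp` and an
# incrementing `value` column -- useful for experimenting without real data.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Count events per 10-second window, a typical near real-time aggregation.
counts = stream.groupBy(F.window("timestamp", "10 seconds")).count()

# Write results to the console; "complete" mode re-emits the full aggregate
# on each trigger.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination(30)  # run for ~30 seconds, then stop the query
query.stop()
```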
Spark APIs
Work with Spark using your preferred programming language.
Topic | Description |
---|---|
Apache Spark API reference | Get an API reference overview for Apache Spark, including links to references for Spark SQL, DataFrames, and RDD operations across supported languages. |
Python | Use Python with Spark, including PySpark basics, custom data sources, and Python-specific optimizations. |
Pandas API on Spark | Leverage familiar pandas syntax with the scalability of Spark for distributed data processing (a short sketch follows this table). |
R | Work with R and Spark using SparkR and sparklyr for statistical computing and data analysis. |
Scala | Build high-performance Spark applications using Scala with native Spark APIs and type safety. |
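As a brief illustration of the pandas API on Spark row above, the following sketch runs familiar pandas operations on Spark; the data is made up for the example.

```python
import pyspark.pandas as ps

# pandas-style syntax, executed distributed on Spark. The data below is
# invented purely for illustration.
psdf = ps.DataFrame(
    {"city": ["Oslo", "Lima", "Pune"], "temp_c": [4, 19, 31]}
)

# Familiar pandas operations -- filtering, sorting, aggregating -- run on the
# cluster rather than on a single machine.
warm = psdf[psdf["temp_c"] > 10].sort_values("temp_c")
print(warm.head())
print(psdf["temp_c"].mean())
```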