Reference for Apache Spark APIs
Databricks is built on top of Apache Spark, a unified analytics engine for big data and machine learning. For more information, see Apache Spark on Databricks.
Apache Spark provides DataFrame APIs in several languages for operating on large datasets; each includes over 100 operators.
PySpark APIs for Python developers. See Tutorial: Load and transform data using Apache Spark DataFrames. Key classes include:
SparkSession - The entry point to programming Spark with the Dataset and DataFrame API.
DataFrame - A distributed collection of data grouped into named columns. See DataFrames and DataFrame-based MLlib.
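A minimal PySpark sketch of these two classes; the sample rows and column names are illustrative:

```python
from pyspark.sql import SparkSession

# On Databricks a SparkSession named `spark` already exists;
# getOrCreate() returns it instead of building a new one.
spark = SparkSession.builder.getOrCreate()

# A small illustrative DataFrame: a distributed collection of rows
# grouped into the named columns `name` and `age`.
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], schema=["name", "age"])

# DataFrame operators such as filter and select compose lazily;
# show() triggers execution.
df.filter(df.age > 40).select("name").show()
```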
SparkR APIs for R developers. Key classes include:
SparkSession - The entry point into SparkR. See Starting Point: SparkSession.
SparkDataFrame - A distributed collection of data grouped into named columns. See Datasets and DataFrames, Creating DataFrames, and Creating SparkDataFrames.
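A comparable SparkR sketch, using R's built-in faithful dataset as illustrative input:

```r
library(SparkR)

# On Databricks a SparkR session already exists; sparkR.session()
# is the standard entry point when starting from scratch.
sparkR.session()

# A SparkDataFrame: a distributed collection of data grouped into
# named columns, here built from a local R data.frame.
df <- createDataFrame(faithful)

# filter and select are SparkR operators; head() triggers execution.
head(select(filter(df, df$waiting > 70), df$eruptions))
```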
Scala APIs for Scala developers. Key classes include:
SparkSession - The entry point to programming Spark with the Dataset and DataFrame API. See Starting Point: SparkSession.
Dataset - A strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row. See Datasets and DataFrames, Creating Datasets, Creating DataFrames, and DataFrame functions.
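A minimal Scala sketch of this typed/untyped relationship; the Person case class is a hypothetical domain type, and the snippet is written in spark-shell/notebook style:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical domain type for illustration.
case class Person(name: String, age: Long)

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// A strongly typed Dataset of domain-specific objects.
val ds: Dataset[Person] = Seq(Person("Alice", 34), Person("Bob", 45)).toDS()

// Its untyped view: in Scala, DataFrame is an alias for Dataset[Row].
val df: DataFrame = ds.toDF()

// A functional (typed) transformation...
ds.filter(_.age > 40).show()
// ...and the equivalent relational (untyped) operation.
df.filter($"age" > 40).show()
```

Because DataFrame is just a type alias for Dataset[Row], both views run on the same engine; only the compile-time typing differs.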
Java APIs for Java developers. Key classes include:
SparkSession - The entry point to programming Spark with the Dataset and DataFrame API. See Starting Point: SparkSession.
Dataset - A strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row. See Datasets and DataFrames, Creating Datasets, Creating DataFrames, and DataFrame functions.
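The same relationship sketched in Java, where there is no DataFrame alias and the untyped view appears directly as Dataset&lt;Row&gt;; the class name and sample data are illustrative:

```java
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DatasetExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().getOrCreate();

    // A strongly typed Dataset<String>.
    Dataset<String> ds = spark.createDataset(
        Arrays.asList("alice", "bob"), Encoders.STRING());

    // The untyped view: in Java, a DataFrame is represented as
    // Dataset<Row>. toDF() names the single column "value" by default.
    Dataset<Row> df = ds.toDF();

    // A relational operation on the untyped view.
    df.filter(df.col("value").startsWith("a")).show();

    spark.stop();
  }
}
```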
To learn how to use the Apache Spark APIs on Databricks, see the tutorials and reference pages linked above. For Java, you can run Java code as a JAR job.