PySpark on Databricks

This article describes the fundamentals of PySpark, the Python API for Apache Spark, on Databricks.

Databricks is built on top of Apache Spark, a unified analytics engine for big data and machine learning. PySpark lets you interface with Apache Spark using Python, a flexible language that is easy to learn, implement, and maintain, and one that offers many options for data visualization in Databricks. PySpark combines the power of Python with the distributed processing of Apache Spark.

APIs and libraries

As an API for Apache Spark, PySpark comes equipped with many APIs and libraries that enable and support powerful functionality, including:

  • Processing of structured data with relational queries using Spark SQL and DataFrames. Spark SQL lets you mix SQL queries with Spark programs. With Spark DataFrames, you can efficiently read, write, transform, and analyze data using Python and SQL, so whichever interface you use, the work runs on the Spark engine. See PySpark Getting Started. A minimal DataFrame and SQL sketch appears after this list.

  • Scalable processing of streams with Structured Streaming. You express a streaming computation the same way you would express a batch computation on static data, and the Spark SQL engine runs it incrementally and continuously as streaming data arrives. See Structured Streaming Overview. A short streaming sketch appears after this list.

  • pandas data structures and data analysis tools that run on Apache Spark with the Pandas API on Spark. The Pandas API on Spark lets you scale a pandas workload to any size by distributing it across multiple nodes, with a single codebase that works with pandas (tests, smaller datasets) and with Spark (production, distributed datasets). See Pandas API on Spark Overview. A brief example appears after this list.

  • Machine learning algorithms with MLlib. MLlib is a scalable machine learning library built on Spark that provides a uniform set of APIs to help users create and tune practical machine learning pipelines. See Machine Learning Library Overview. A small pipeline sketch appears after this list.

  • Graphs and graph-parallel computation with GraphX. GraphX introduces a new directed multigraph with properties attached to each vertex and edge, and exposes graph computation operators, algorithms, and builders to simplify graph analytics tasks. See GraphX Overview.
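
As a rough sketch of mixing the DataFrame API with Spark SQL, the following assumes a SparkSession named spark (Databricks notebooks provide one automatically) and uses a small, made-up dataset:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks a SparkSession named `spark` already exists; getOrCreate()
# reuses it and only builds a new session when run elsewhere.
spark = SparkSession.builder.getOrCreate()

# Small, illustrative DataFrame with named columns.
df = spark.createDataFrame(
    [("Alice", "Sales", 4300), ("Bob", "Sales", 5100), ("Cara", "HR", 3900)],
    ["name", "dept", "salary"],
)

# DataFrame API: filter and aggregate in Python.
df.filter(F.col("salary") > 4000).groupBy("dept").count().show()

# Spark SQL: register a temporary view and query it with SQL.
df.createOrReplaceTempView("employees")
spark.sql("SELECT dept, AVG(salary) AS avg_salary FROM employees GROUP BY dept").show()
```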
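
For Structured Streaming, here is a minimal sketch that uses the built-in rate source (which simply emits a timestamped counter) and an in-memory sink; the source, window size, and query name are illustrative choices, not requirements:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# The built-in `rate` source generates rows with a timestamp and a value,
# which makes it convenient for demonstrations.
events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Express the computation as you would on static data: a windowed count.
counts = events.groupBy(F.window("timestamp", "10 seconds")).count()

# The Spark SQL engine runs the query incrementally as rows arrive.
query = (
    counts.writeStream
    .outputMode("complete")
    .format("memory")          # in-memory table, queryable as `rate_counts`
    .queryName("rate_counts")
    .start()
)
# Inspect with spark.sql("SELECT * FROM rate_counts"); call query.stop() when done.
```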
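
A brief example of the Pandas API on Spark (available as pyspark.pandas in Spark 3.2 and later), again with made-up values:

```python
import pyspark.pandas as ps

# pandas-style DataFrame backed by Spark; the same code works on small
# test data and on distributed production data.
psdf = ps.DataFrame({"dept": ["Sales", "Sales", "HR"], "salary": [4300, 5100, 3900]})

# Familiar pandas idioms, executed by Spark under the hood.
print(psdf.groupby("dept")["salary"].mean())

# Move between the pandas API and regular Spark DataFrames when needed.
sdf = psdf.to_spark()
psdf_again = sdf.pandas_api()
```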
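
And a small MLlib pipeline sketch, assuming a SparkSession named spark and a tiny fabricated training set; the column names, algorithm, and parameters are illustrative only:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()

# Tiny fabricated dataset: two numeric features and a binary label.
train = spark.createDataFrame(
    [(0.0, 1.1, 0.0), (2.0, 1.0, 1.0), (2.0, 1.3, 1.0), (0.0, 1.2, 0.0)],
    ["f1", "f2", "label"],
)

# A Pipeline chains feature assembly and the estimator behind one uniform API.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
model = Pipeline(stages=[assembler, lr]).fit(train)

model.transform(train).select("f1", "f2", "prediction").show()
```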

DataFrames, transformations, and lazy evaluation

Apache Spark DataFrames are datasets organized into named columns. They are two-dimensional labeled data structures with columns of different types. DataFrames provide a rich set of functions that let you solve common data analysis problems efficiently, and they make it easy to transform data with built-in methods for sorting, filtering, and aggregation.

Fundamental to Apache Spark are two categories of data processing operations: transformations and actions. An action, such as count, first, or collect, returns a value. A transformation, such as filter or groupBy, returns a DataFrame but doesn't execute until an action triggers it. This is known as lazy evaluation. Lazy evaluation also lets you chain multiple operations, because Spark defers their execution rather than running each one immediately when it is defined.
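
To make the distinction concrete, here is a minimal sketch (assuming a SparkSession named spark and made-up data) in which transformations only build a plan and actions trigger execution:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("Alice", "Sales", 4300), ("Bob", "Sales", 5100), ("Cara", "HR", 3900)],
    ["name", "dept", "salary"],
)

# Transformations: these only describe the computation; nothing runs yet.
high_paid = df.filter(F.col("salary") > 4000)
by_dept = high_paid.groupBy("dept").count()

# Actions: these trigger execution of the whole chained plan.
print(by_dept.count())    # number of result rows
print(df.first())         # first row of the original DataFrame
rows = by_dept.collect()  # materialize the results on the driver
```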

Spark tutorials

In addition to the Apache Spark Tutorial, which walks you through loading and transforming data using DataFrames, the Apache Spark documentation also provides quickstarts and guides for learning Spark, including the following articles:

PySpark reference

Databricks maintains its own version of the PySpark APIs and corresponding reference, which can be found in these sections: