Koalas

Koalas is an open-source project that provides a drop-in replacement for pandas. pandas, a Python package commonly used by data scientists, provides easy-to-use data structures and data analysis tools, but it does not scale out to big data. Koalas fills this gap by providing pandas-equivalent APIs that run on Apache Spark. Koalas is useful not only for pandas users but also for PySpark users, because it supports many tasks that are difficult to do with PySpark, such as plotting data directly from a PySpark DataFrame.
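As a rough illustration of what "drop-in replacement" means in practice, the sketch below runs the same helper on a pandas DataFrame and, when Koalas is importable, on a Koalas DataFrame. The helper name `mean_weight_by_species` and the sample data are made up for this example. The Koalas branch assumes a working Spark environment, so it is skipped when the `databricks.koalas` package cannot be imported.

```python
import pandas as pd

try:
    # Koalas is distributed as the `koalas` package and needs PySpark to run.
    import databricks.koalas as ks
except ImportError:
    ks = None  # Koalas is not installed; only the pandas path runs


def mean_weight_by_species(df):
    """Hypothetical helper written against the shared DataFrame API."""
    return df.groupby("species")["weight"].mean()


# Small illustrative dataset (invented for this example).
pdf = pd.DataFrame(
    {
        "species": ["koala", "koala", "wombat"],
        "weight": [8.0, 10.0, 26.0],
    }
)

# Single-machine pandas computation.
pandas_result = mean_weight_by_species(pdf)

if ks is not None:
    # Convert to a Koalas DataFrame; the same helper now runs on Spark.
    kdf = ks.from_pandas(pdf)
    # to_pandas() collects the distributed result back to the driver.
    koalas_result = mean_weight_by_species(kdf).to_pandas().sort_index()
```

The point of the sketch is that `mean_weight_by_species` does not change between the two paths; only the construction of the DataFrame does. Collecting with `to_pandas()` is reasonable here because the aggregated result is small, but it should be avoided for large results.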

Requirements

  • Koalas is included on clusters running Databricks Runtime 7.3 and above.
  • To use Koalas on a cluster running Databricks Runtime 7.0 or below, install Koalas as a Databricks PyPI library.
  • To use Koalas in an IDE, notebook server, or other custom applications that connect to a Databricks cluster, install Databricks Connect and follow the Koalas installation instructions.

Notebook

The following notebook shows how to migrate from pandas to Koalas.

pandas to Koalas notebook
