Koalas

Important

This documentation has been retired and might not be updated. The products, services, or technologies mentioned in this content are no longer supported. See Pandas API on Spark.

Note

Koalas is deprecated. If you try to use Koalas on clusters that run Databricks Runtime 10.0 (unsupported) and above, an informational message recommends that you use Pandas API on Spark instead.

Koalas provides a drop-in replacement for pandas. Commonly used by data scientists, pandas is a Python package that provides easy-to-use data structures and data analysis tools. However, pandas does not scale out to big data. Koalas fills this gap by providing pandas-equivalent APIs that work on Apache Spark. Koalas is useful not only for pandas users but also for PySpark users, because Koalas supports many tasks that are difficult to do with PySpark, such as plotting data directly from a PySpark DataFrame.
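As a rough illustration of what "drop-in" means here, the following sketch writes one function against the DataFrame calls that pandas and Koalas share, then runs it with whichever library is passed in. The helper name `total_by_group` and the sample data are hypothetical. The Koalas import assumes an environment where the `databricks.koalas` package is available; where it is not, the sketch runs the pandas path only.

```python
import pandas as pd

try:
    # Assumption: Koalas is installed (for example, on Databricks Runtime 7.3 through 9.1).
    import databricks.koalas as ks
except ImportError:
    ks = None  # Koalas is unavailable; only the pandas path below will run.


def total_by_group(lib):
    """Build a small DataFrame with `lib` (pandas or Koalas) and sum values per group.

    Hypothetical helper for illustration; it uses only calls both libraries provide.
    """
    df = lib.DataFrame({"group": ["a", "b", "a"], "value": [1, 2, 3]})
    return df.groupby("group")["value"].sum()


# pandas: the computation runs on a single machine.
pandas_totals = total_by_group(pd).sort_index()
print(pandas_totals.to_dict())

if ks is not None:
    # Koalas: the same code, executed on Apache Spark.
    # to_pandas() collects the (small) result to the driver for printing.
    koalas_totals = total_by_group(ks).to_pandas().sort_index()
    print(koalas_totals.to_dict())
```

In this sketch the only difference between the two paths is the module passed to the function, which is the sense in which an existing pandas workflow can often be moved to Koalas by changing the import. Not every pandas call has a Koalas equivalent, so treat this as a starting point rather than a guarantee.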

Requirements

  • Koalas is included on clusters running Databricks Runtime 7.3 through 9.1. For clusters running Databricks Runtime 10.0 and above, use Pandas API on Spark instead.

  • To use Koalas on a cluster running Databricks Runtime 7.0 or below, install Koalas as a Databricks PyPI library.

  • To use Koalas in an IDE, notebook server, or other custom application that connects to a Databricks cluster, install Databricks Connect and follow the Koalas installation instructions.

Notebook

The following notebook shows how to migrate from pandas to Koalas.

pandas to Koalas notebook
