Databricks for Python developers

This section provides a guide to developing notebooks and jobs in Databricks using the Python language.

Python APIs

PySpark API

PySpark is the Python API for Apache Spark. These links provide an introduction to and reference for PySpark.

pandas API (Koalas)

pandas is a Python package that provides data structures designed to make working with “relational” or labeled data easy and intuitive. Koalas implements the pandas DataFrame API on top of Apache Spark.
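Because Koalas mirrors the pandas DataFrame API, code written against pandas largely carries over. The sketch below uses plain pandas (assumed installed); the commented lines show the hypothetical Koalas equivalent, which runs the same operations on Spark:

```python
import pandas as pd

# Typical pandas DataFrame operations: filter, sort, aggregate.
df = pd.DataFrame({"name": ["Alice", "Bob", "Carol"], "age": [34, 45, 29]})
over_30 = df[df["age"] > 30].sort_values("age")
mean_age = df["age"].mean()

# With Koalas, the same API runs distributed on Spark:
# import databricks.koalas as ks
# kdf = ks.DataFrame({"name": [...], "age": [...]})
# kdf[kdf["age"] > 30].sort_values("age")
```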


Visualizations

Databricks Python notebooks support various types of visualizations using the display function.

You can also use the following third-party libraries to create visualizations in Databricks Python notebooks.
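For example, a chart built with matplotlib (one such third-party library, assumed installed) renders inline in a notebook; the sketch below uses the headless Agg backend so it also runs outside a notebook:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; notebooks render figures inline instead
import matplotlib.pyplot as plt

# A simple bar chart of ages by name.
fig, ax = plt.subplots()
ax.bar(["Alice", "Bob", "Carol"], [34, 45, 29])
ax.set_xlabel("name")
ax.set_ylabel("age")
fig.savefig("ages.png")
```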


Interoperability

These articles describe features that support interoperability between PySpark and pandas.

This article describes features that support interoperability between Python and SQL.


Tools

In addition to Databricks notebooks, you can use the following Python developer tools:


Libraries

Databricks runtimes include many popular libraries. You can also install additional third-party or custom Python libraries to use with notebooks and jobs running on Databricks clusters.

Cluster-based libraries

Cluster-based libraries are available to all notebooks and jobs running on the cluster. For information about installing cluster-based libraries, see Install a library on a cluster.

Notebook-scoped libraries

Notebook-scoped libraries are available only to the notebook in which they are installed and must be reinstalled for each session.

  • For an overview of different options you can use to install Python libraries within Databricks, see Python environment management.
  • For information about notebook-scoped libraries in Databricks Runtime 6.4 ML and above and Databricks Runtime 7.1 and above, see Notebook-scoped Python libraries.
  • For information about notebook-scoped libraries in Databricks Runtime 7.0 and below, see Library utilities.

Machine learning

For general information about machine learning on Databricks, see Machine learning and deep learning guide.

To get started with machine learning using the scikit-learn library, use the following notebook. It covers data loading and preparation; model training, tuning, and inference; and model deployment and management with MLflow.

10-minute tutorial: machine learning on Databricks with scikit-learn
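The core workflow the tutorial walks through (load data, split, train, evaluate) can be sketched with scikit-learn alone; the MLflow tracking step is noted in a comment rather than run here:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a sample dataset and hold out a test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Train a classifier and evaluate it on the held-out data.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))

# In the tutorial, MLflow (e.g. mlflow.sklearn.autolog()) records the
# parameters, metrics, and model artifacts for deployment and management.
```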