Databricks for Python developers

This section provides a guide to developing notebooks and jobs in Databricks using the Python language.

Python APIs

PySpark API

PySpark is the Python API for Apache Spark. These links provide an introduction to and reference for PySpark.
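
For example, a minimal sketch of basic PySpark DataFrame operations (in a Databricks notebook the spark session is already defined; getOrCreate() reuses it rather than starting a new one):

  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  # Reuse the existing Spark session if one is already running.
  spark = SparkSession.builder.getOrCreate()

  # Build a small DataFrame, then filter and aggregate it.
  df = spark.createDataFrame(
      [("Alice", 34), ("Bob", 45), ("Cathy", 29)],
      schema=["name", "age"],
  )
  df.filter(F.col("age") > 30).select(F.avg("age").alias("avg_age")).show()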

Pandas API on Spark

Note

This feature is available on clusters that run Databricks Runtime 10.0 and Databricks Runtime 10.0 Photon and above. For clusters that run Databricks Runtime 9.1 LTS and Databricks Runtime 9.1 LTS Photon and below, use Koalas instead.

pandas is a Python package commonly used by data scientists. However, pandas does not scale out to big data. Pandas API on Spark fills this gap by providing pandas-equivalent APIs that work on Apache Spark.
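
For example, a minimal sketch (assuming a cluster running Databricks Runtime 10.0 or above, where the pyspark.pandas module is available):

  import pyspark.pandas as ps

  # Create a pandas-on-Spark DataFrame using familiar pandas syntax;
  # the data and computation are distributed across the cluster.
  psdf = ps.DataFrame({"category": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

  # pandas-style operations run as Spark jobs under the hood.
  print(psdf.groupby("category").sum())

  # Convert to a regular PySpark DataFrame when needed.
  sdf = psdf.to_spark()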

Koalas

Note

This feature is deprecated on clusters that run Databricks Runtime 10.0 and Databricks Runtime 10.0 Photon and above. For clusters that run Databricks Runtime 10.0 and above, use Pandas API on Spark instead.

Koalas provides a drop-in replacement for pandas. pandas is a Python package commonly used by data scientists. However, pandas does not scale out to big data. Koalas fills this gap by providing pandas-equivalent APIs that work on Apache Spark.
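
For example, a minimal sketch for clusters running Databricks Runtime 9.1 LTS and below (on newer runtimes, use the Pandas API on Spark example above instead):

  import databricks.koalas as ks

  # Koalas mirrors the pandas API on top of Spark (legacy equivalent
  # of pandas API on Spark).
  kdf = ks.DataFrame({"category": ["a", "b", "a"], "value": [1, 2, 3]})
  print(kdf.groupby("category").sum())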

Visualizations

Databricks Python notebooks support various types of visualizations using the display function.
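
For example, a minimal sketch (assuming a Databricks Python notebook, where spark and display are predefined):

  # display() renders a DataFrame as an interactive table in the notebook,
  # with built-in options for switching to chart visualizations.
  df = spark.createDataFrame(
      [("Engineering", 120), ("Sales", 80), ("Marketing", 45)],
      schema=["department", "headcount"],
  )
  display(df)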

You can also use the following third-party libraries to create visualizations in Databricks Python notebooks.

Interoperability

These articles describe features that support interoperability between PySpark and pandas.

This article describes features that support interoperability between Python and SQL.
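
As a minimal sketch of both kinds of interoperability (assuming a Databricks notebook where spark is predefined; column and view names below are illustrative):

  import pandas as pd

  # PySpark <-> pandas: convert in either direction.
  pdf = pd.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})
  sdf = spark.createDataFrame(pdf)   # pandas -> PySpark
  pdf2 = sdf.toPandas()              # PySpark -> pandas

  # Python <-> SQL: register a DataFrame as a temporary view and query it
  # with SQL from Python.
  sdf.createOrReplaceTempView("values_view")
  spark.sql("SELECT id, value * 2 AS doubled FROM values_view").show()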

Notebooks

For information about working with Python in Databricks notebooks, see Use notebooks. For instance:

  • You can override a notebook’s default language by specifying the language magic command %<language> at the beginning of a cell. For example, you can run Python code in a cell within a notebook that has a default language of R, Scala, or SQL. For Python, the language magic command is %python (see the example after this list).
  • In Databricks Runtime 7.4 and above, you can display Python docstring hints by pressing Shift+Tab after entering a completable Python object.
  • Python notebooks support error highlighting. The line of code that throws the error is highlighted in the cell.
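
For example, a cell like the following runs as Python in a notebook whose default language is SQL (a sketch; the magic command must be the first line of the cell):

  %python
  # This cell runs as Python even though the notebook's default language is SQL.
  print("Hello from the Python cell")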

Tools

In addition to Databricks notebooks, you can use the following Python developer tools:

For information about additional tools for working with Databricks, see Developer tools and guidance.

Libraries

  • The Databricks SQL Connector for Python is a Python library that allows you to use Python code to run SQL commands on Databricks resources (see the sketch after this list).
  • pyodbc allows you to connect from your local Python code through ODBC to data in Databricks resources.
  • Databricks runtimes include many popular libraries. You can also install additional third-party or custom Python libraries to use with notebooks and jobs running on Databricks clusters.
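
A minimal sketch of the Databricks SQL Connector for Python (assuming the databricks-sql-connector package is installed; the hostname, HTTP path, and token values are placeholders taken from your cluster's or SQL warehouse's connection details):

  from databricks import sql

  # Placeholder connection details; copy these from the compute resource's
  # JDBC/ODBC settings and use a valid access token.
  with sql.connect(
      server_hostname="<workspace-hostname>",
      http_path="<http-path>",
      access_token="<personal-access-token>",
  ) as connection:
      with connection.cursor() as cursor:
          cursor.execute("SELECT 1 AS answer")
          for row in cursor.fetchall():
              print(row)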

Cluster-based libraries

Cluster-based libraries are available to all notebooks and jobs running on the cluster.

Notebook-scoped libraries

Notebook-scoped libraries are available only to the notebook on which they are installed and must be reinstalled for each session.
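
For example, running the following in a notebook cell installs a library for the current notebook session only (a sketch; the %pip magic is available on recent Databricks Runtime versions):

  %pip install matplotlib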

Machine learning

For general information about machine learning on Databricks, see Databricks Machine Learning guide.

To get started with machine learning using the scikit-learn library, use the following notebook. It covers data loading and preparation; model training, tuning, and inference; and model deployment and management with MLflow.

10-minute tutorial: machine learning on Databricks with scikit-learn
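
The following is a minimal sketch in the same spirit as the tutorial (not the notebook itself), assuming scikit-learn and MLflow are available, as they are in Databricks Runtime for Machine Learning:

  import mlflow
  import mlflow.sklearn
  from sklearn.datasets import load_iris
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.metrics import accuracy_score
  from sklearn.model_selection import train_test_split

  # Load and split a small example dataset.
  X, y = load_iris(return_X_y=True)
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

  # Train a model and track the run with MLflow.
  with mlflow.start_run():
      model = RandomForestClassifier(n_estimators=100, random_state=42)
      model.fit(X_train, y_train)
      accuracy = accuracy_score(y_test, model.predict(X_test))
      mlflow.log_metric("accuracy", accuracy)
      mlflow.sklearn.log_model(model, "model")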

To get started with GraphFrames, a package for Apache Spark that provides DataFrame-based graphs, use the following notebook. It covers creating GraphFrames from vertex and edge DataFrames, performing simple and complex graph queries, building subgraphs, and using standard graph algorithms such as breadth-first search and shortest paths.

GraphFrames Python notebook
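
A minimal sketch of the GraphFrames workflow (assuming the graphframes package is attached to the cluster and spark is predefined in the notebook):

  from graphframes import GraphFrame

  # The vertex DataFrame must have an "id" column; edges need "src" and "dst".
  vertices = spark.createDataFrame(
      [("a", "Alice"), ("b", "Bob"), ("c", "Charlie")],
      ["id", "name"],
  )
  edges = spark.createDataFrame(
      [("a", "b", "friend"), ("b", "c", "follow")],
      ["src", "dst", "relationship"],
  )
  g = GraphFrame(vertices, edges)

  # A simple query on the graph's edges, then a breadth-first search.
  g.edges.filter("relationship = 'follow'").show()
  g.bfs(fromExpr="name = 'Alice'", toExpr="name = 'Charlie'").show()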

Jobs

You can run a Python script by calling the Create a new job operation (POST /jobs/create) in the Jobs API, specifying the spark_python_task field in the request body.
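
For example, a sketch that calls the operation with the requests library; the workspace URL, access token, script path, and cluster settings below are placeholder values:

  import requests

  # Placeholder values; use your workspace URL and a valid access token.
  workspace_url = "https://<databricks-instance>"
  token = "<personal-access-token>"

  job_spec = {
      "name": "my-python-script-job",
      "new_cluster": {
          "spark_version": "10.4.x-scala2.12",
          "node_type_id": "<node-type-id>",
          "num_workers": 2,
      },
      "spark_python_task": {
          "python_file": "dbfs:/scripts/my_script.py",
          "parameters": ["--date", "2022-01-01"],
      },
  }

  response = requests.post(
      f"{workspace_url}/api/2.1/jobs/create",
      headers={"Authorization": f"Bearer {token}"},
      json=job_spec,
  )
  print(response.json())  # Contains the new job_id on success.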