Databricks Runtime with Conda

Beta

This is a Beta release. It is intended for experimental use cases and not for production workloads.

Databricks Runtime with Conda is a Databricks runtime based on Conda environments instead of Python virtual environments (virtualenvs). Databricks Runtime with Conda provides an updated and optimized list of default packages and a flexible Python environment for advanced users who require maximum control over packages and environments.

What is Conda?

Conda is an open source package and environment management system. As a package manager, you use Conda to install Python packages from your desired channels (or repositories); Databricks Runtime with Conda uses the Anaconda repository. As an environment manager, you use Conda to easily create, save, load, and switch between Python environments. Conda environments are compatible with PyPI packages.

What is in Databricks Runtime with Conda?

Databricks Runtime with Conda ships with two installed Conda environments: databricks-standard and databricks-minimal.

  • The databricks-standard environment includes updated versions of many popular Python packages. It is intended as a drop-in replacement for existing notebooks that run on Databricks Runtime and is the default environment for Databricks Conda-based runtimes.
  • The databricks-minimal environment contains the minimum set of packages required for PySpark and Databricks Python notebook functionality. It is ideal if you want to customize the runtime with the Python packages you need.

The packages included in each environment are listed in the Databricks Runtime Release Notes.

Manage environments

One of the key advantages of the Conda package management system is its first-class support for environments.

Root environments

Databricks Runtime with Conda ships with two default Conda environments: databricks-standard and databricks-minimal. We refer to these as root environments.

Select a root environment

When you launch a cluster running Databricks Runtime with Conda from the cluster UI, you can choose which of the two environments to activate by setting the DATABRICKS_ROOT_CONDA_ENV environment variable on the cluster. Acceptable values are databricks-standard (the default) and databricks-minimal.

You can also launch clusters using the REST API. Here is an example request that launches a cluster with the databricks-minimal environment.

{
  "cluster_name": "my-cluster",
  "spark_version": "5.4.x-conda-scala2.11",
  "node_type_id": "i3.xlarge",
  "spark_env_vars": {
    "DATABRICKS_ROOT_CONDA_ENV": "databricks-minimal"
  },
  "aws_attributes": {
    "availability": "SPOT",
    "zone_id": "us-west-2a"
  },
  "num_workers": 10
}
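
As an illustrative sketch, you can submit this request with the Python requests library. The host, token, and environment variable names below are assumptions; substitute your own workspace URL and personal access token.

import os
import requests

# Sketch: create a cluster with the databricks-minimal root environment by calling
# the Clusters API (POST /api/2.0/clusters/create). DATABRICKS_HOST and
# DATABRICKS_TOKEN are illustrative names for your workspace URL and access token.
host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

cluster_spec = {
    "cluster_name": "my-cluster",
    "spark_version": "5.4.x-conda-scala2.11",
    "node_type_id": "i3.xlarge",
    "spark_env_vars": {"DATABRICKS_ROOT_CONDA_ENV": "databricks-minimal"},
    "aws_attributes": {"availability": "SPOT", "zone_id": "us-west-2a"},
    "num_workers": 10,
}

response = requests.post(
    f"{host}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
)
response.raise_for_status()
print(response.json())  # the response includes the new cluster_id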

Important

  • Because the root environment determines which Python binary executable is used, PYSPARK_PYTHON no longer controls the Python binary; any value set for PYSPARK_PYTHON is ignored.
  • You cannot create and activate new environments inside Databricks notebooks. Every notebook operates in a unique environment that is cloned from the root environment.
  • You cannot switch environments using notebook shell commands.

Environment activation

Each Databricks notebook clones the root environment and activates the new environment before executing the first command. This offers several benefits:

  • All package management activity inside the notebook is isolated from other notebooks.
  • You can use conda and pip commands without having to worry about the location of the root environment.
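
For example, a quick way to confirm from a notebook which Python interpreter, and therefore which cloned environment, is in use (the exact path on your cluster will differ):

import sys

# The interpreter path points into the notebook's cloned Conda environment,
# not directly into the databricks-standard or databricks-minimal root environment.
print(sys.executable)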

Manage Python libraries

Like standard Databricks Runtime versions, Databricks Runtime with Conda supports three library modes: Workspace, cluster-installed, and notebook-scoped. This section reviews these options in the context of Conda Python packages.

Workspace and cluster-scoped libraries

You can install all supported Python library formats (whl, wheelhouse.zip, egg, and PyPI) on clusters running Databricks Runtime with Conda. The Databricks library manager uses the pip command provided by Conda to install packages in the root Conda environment. The packages are accessible to all notebooks and jobs attached to the cluster.

Note

If a notebook is attached to a cluster before a Workspace library is attached, you must detach and reattach the notebook to the cluster to use the new library.
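
As an illustrative sketch, you can also attach a PyPI package as a cluster library programmatically through the Libraries API; the host, token, cluster ID, and package shown here are placeholders.

import os
import requests

# Sketch: install a PyPI package as a cluster library via the Libraries API
# (POST /api/2.0/libraries/install). Replace the placeholders with your values.
host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

payload = {
    "cluster_id": "1234-567890-abcde123",
    "libraries": [{"pypi": {"package": "simplejson==3.16.0"}}],
}

response = requests.post(
    f"{host}/api/2.0/libraries/install",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
response.raise_for_status()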

Notebook-scoped libraries

You can install Python libraries using Databricks Library utilities. The libraries are installed in the notebook’s Conda environment and are accessible only within that notebook. Databricks Runtime with Conda supports two methods for installing packages using Databricks Library utilities: requirements files and YAML package specifications.

Requirements files

A requirements file contains a list of packages to be installed using pip. You install a requirements file as a notebook-scoped library in the same way that you install whl and egg libraries. The name of the file must end with requirements.txt. For example:

dbutils.library.install("dbfs:/path/to/file/a_requirements.txt")

See Requirements File Format for more information on requirements.txt files.
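
For example, a minimal sketch that writes a requirements file to DBFS from a notebook and installs it as a notebook-scoped library; the path and package pins are illustrative.

# Write an illustrative requirements file to DBFS. The file name must end with requirements.txt.
dbutils.fs.put(
  "dbfs:/home/myScripts/example_requirements.txt",
  "simplejson==3.16.0\ntabulate>=0.8",
  True)

# Install the listed packages into this notebook's Conda environment only.
dbutils.library.install("dbfs:/home/myScripts/example_requirements.txt")

# Restart the Python process so the newly installed packages can be imported.
dbutils.library.restartPython()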

YAML package specifications

You can use the YAML format to install notebook-scoped libraries on Databricks Runtime 5.5 with Conda and above. The format specifies a list of packages to be installed using Conda, along with the channels from which the packages are installed. For example:

dbutils.library.updateCondaEnv(
"""channels:
  - default
dependencies:
  - numpy=1.16.4""")
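
As another illustrative sketch, the same mechanism can pull packages from an additional channel; the channel and package pin below are assumptions, not requirements.

# Sketch: add the conda-forge channel alongside the default channel and pin a package.
dbutils.library.updateCondaEnv(
"""channels:
  - conda-forge
  - default
dependencies:
  - tabulate=0.8.3""")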

Use conda and pip commands

Every Python notebook (or Python cell) attached to a cluster running Databricks Runtime with Conda runs in an activated Conda environment. Therefore, you can use conda and pip commands to list and install packages. Any modifications made to the current environment this way are restricted to the notebook and the driver, and they are reset when you detach and reattach the notebook.

%sh
conda env list

%sh
conda install matplotlib -y

Tip

When you run shell commands inside notebooks using %sh, you cannot respond to interactive prompts. To avoid blocking, pass the -y (--yes) flag to conda and pip commands.

Use Conda inside cluster initialization scripts

When you install Conda packages using cluster initialization scripts, your script can assume that it runs inside the activated root Conda environment, either databricks-minimal or databricks-standard. Any packages installed using conda or pip commands in init scripts are exposed to all notebooks attached to the cluster. As an example, the following notebook code snippet writes an init script that installs the fast.ai packages on all nodes of the cluster.

dbutils.fs.put("dbfs:/home/myScripts/fast.ai", "conda install -c pytorch -c fastai fastai -y", True)
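
To have the script run when a cluster starts, you would then configure it as a cluster-scoped init script, for example by referencing its DBFS path in the cluster specification. A sketch of the relevant fragment, reusing the cluster specification format from the REST API example above (the cluster name and sizing are placeholders):

# Sketch: reference the generated script as a cluster-scoped init script in a
# Clusters API request body. Only the init_scripts entry is specific to this example.
cluster_spec = {
    "cluster_name": "my-fastai-cluster",
    "spark_version": "5.4.x-conda-scala2.11",
    "node_type_id": "i3.xlarge",
    "num_workers": 2,
    "init_scripts": [{"dbfs": {"destination": "dbfs:/home/myScripts/fast.ai"}}],
}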

Use Conda when SSHing into containers

When you SSH into the container on your cluster’s driver node, you log in as the user ubuntu. Although the root Conda environment is activated for this user, the user does not have the privileges to modify it. If you try to install Conda or pip packages as ubuntu, the command fails with the following error message:

EnvironmentNotWritableError: The current user does not have write permissions to the target environment.

The solution is to switch to the root user (using the sudo su - command) before modifying the root environment.

Important

Prefixing conda install commands with sudo does not work. You must switch to the root user in an interactive shell.

Limitations

The following features are not supported on Databricks Runtime with Conda:

  • GPU instances. If you are looking for a Conda-based runtime that supports GPUs, consider Databricks Runtime for ML.
  • Python 2.7
  • Creating and activating a separate environment before a cluster running Databricks Runtime with Conda starts.
  • Data-centric security features, including Table ACLs.

Note

Databricks Runtime with Conda is an experimental and evolving feature. We are actively working to improve it and resolve all the limitations. We recommend using the latest released version of Databricks Runtime with Conda.

FAQ

When should I use Databricks Container Services (DCS) and when should I use Conda to customize my Python environment?
If your desired customizations are restricted to Python packages, you can start with the databricks-minimal Conda environment and customize it based on your needs. However, if you need JVM, native, or R customizations, DCS is a better choice. In addition, if you need to install many packages and cluster startup time becomes a bottleneck, DCS can help you launch clusters faster.
When should I use init scripts vs. cluster-installed libraries?
Whenever possible, we recommend using cluster-scoped libraries to install Python libraries that are needed by all users and notebooks of a cluster; cluster-scoped libraries give you more flexibility and visibility. Use cluster initialization scripts when cluster-scoped or notebook-scoped libraries are not sufficient, for example when you install native packages that are not supported by the Databricks library manager.