TensorFlow

TensorFlow is an open-source framework for machine learning created by Google. It supports deep-learning and general numerical computations on CPUs, GPUs, and clusters of GPUs. It is subject to the terms and conditions of the Apache 2.0 License.

The following sections provide guidance on installing TensorFlow on Databricks and give an example of running TensorFlow programs.

Note

This is not a comprehensive guide to TensorFlow. For full documentation, see the TensorFlow website.

TensorFlow versions included in Databricks Runtime ML

Databricks Runtime for Machine Learning includes TensorFlow and TensorBoard so you can use these libraries without installing any packages. Here are the TensorFlow versions included:

Databricks Runtime ML version    TensorFlow version
7.0 - 7.2                        2.2.0
6.3 - 6.6                        1.15.0
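
For scripts that need to branch on the bundled version, the table can be encoded as a small lookup. This is only a sketch: the helper name `bundled_tensorflow_version` is hypothetical (it is not a Databricks API), and the data covers only the two rows listed above.

```python
# Hypothetical helper: maps a Databricks Runtime ML version to the bundled
# TensorFlow version, using only the rows of the table above.
_BUNDLED = [
    ((7, 0), (7, 2), "2.2.0"),
    ((6, 3), (6, 6), "1.15.0"),
]

def bundled_tensorflow_version(runtime_version):
    """Return the bundled TensorFlow version, or None if not in the table."""
    major, minor = (int(part) for part in runtime_version.split(".")[:2])
    for low, high, tf_version in _BUNDLED:
        # Tuples compare element by element, so (7, 1) falls in (7, 0)..(7, 2).
        if low <= (major, minor) <= high:
            return tf_version
    return None

print(bundled_tensorflow_version("7.1"))  # 2.2.0
print(bundled_tensorflow_version("6.4"))  # 1.15.0
print(bundled_tensorflow_version("5.5"))  # None
```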

Install TensorFlow

This section provides instructions for installing, upgrading, or downgrading TensorFlow on Databricks Runtime for Machine Learning and Databricks Runtime, for example so that you can try out the latest features in TensorFlow. Due to package dependencies, the installed version might be incompatible with other pre-installed packages. After installation, you can verify the installed version by executing the following command in a Python notebook:

import tensorflow as tf
print([tf.__version__, tf.test.is_gpu_available()])
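
To turn the printed version into an explicit check, the version string can be compared against the pin used at install time. The sketch below assumes a simple glob-style comparison is enough; `matches_pin` is a hypothetical helper, and it does not implement pip's full version-specifier rules.

```python
from fnmatch import fnmatchcase

def matches_pin(installed, pin):
    """Return True if an installed version string matches a wildcard pin
    such as "2.2.*" (plain glob match, not full pip specifier semantics)."""
    return fnmatchcase(installed, pin)

print(matches_pin("2.2.0", "2.2.*"))   # True
print(matches_pin("1.15.3", "2.2.*"))  # False

# In a notebook, after installation (requires TensorFlow to be installed):
#   import tensorflow as tf
#   assert matches_pin(tf.__version__, "2.2.*"), tf.__version__
```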

Install TensorFlow 2.2 on Databricks Runtime 7.0

Init script:

#!/bin/bash

pip install --no-cache-dir tensorflow-cpu==2.2.*

Install TensorFlow 1.15 on Databricks Runtime 7.0 ML

Databricks recommends installing TensorFlow 1.15 on Databricks Runtime 7.0 ML using %pip and %conda magic commands in a notebook cell. To enable this feature, set the Spark configuration spark.databricks.conda.condaMagic.enabled to true in the cluster settings.

CPU clusters:

%pip install --force-reinstall --no-cache-dir tensorflow-cpu==1.15.*

GPU clusters:

%pip install --force-reinstall --no-cache-dir tensorflow-gpu==1.15.*

Install TensorFlow 1.15 on Databricks Runtime 7.0

Init script:

#!/bin/bash

pip install --no-cache-dir tensorflow-cpu==1.15.*

Install TensorFlow 2.2 on Databricks Runtime 6.6 ML

Init script for CPU clusters:

#!/bin/bash

set -e

pip install tensorflow-cpu==2.2.* setuptools==41.* grpcio==1.24.*

Init script for GPU clusters:

#!/bin/bash

set -e

apt-get remove -y --auto-remove cuda-toolkit-10-0
apt-get update
apt-get install -y --no-install-recommends --allow-downgrades \
  libnccl2=2.4.8-1+cuda10.1 \
  libnccl-dev=2.4.8-1+cuda10.1 \
  cuda-libraries-10-1 \
  libcudnn7=7.6.4.38-1+cuda10.1 \
  libcudnn7-dev=7.6.4.38-1+cuda10.1 \
  libcublas10=10.2.1.243-1 \
  libcublas-dev=10.2.1.243-1 \
  cuda-libraries-dev-10-1 \
  cuda-compiler-10-1
ln -sfn cuda-10.1 /usr/local/cuda

conda uninstall -y cudatoolkit tensorboard

pip install tensorflow==2.2.* setuptools==41.* grpcio==1.24.*

Install TensorFlow 2.2 on Databricks Runtime 6.6

Init script:

#!/bin/bash

set -e

/databricks/python/bin/python -V
/databricks/python/bin/pip install tensorflow-cpu==2.2.* setuptools==41.*

Install TensorFlow 2.2 on Databricks Runtime 5.5 LTS ML

Init script for CPU clusters:

#!/bin/bash

set -e

/databricks/python/bin/python -V
. /databricks/conda/etc/profile.d/conda.sh
conda activate /databricks/python

pip install --upgrade pip
pip install tensorflow-cpu==2.2.* setuptools==41.* grpcio==1.24.*

Init script for GPU clusters:

#!/bin/bash

set -e

apt-get remove -y --auto-remove cuda-toolkit-10-0
apt-get update
apt-get install -y --no-install-recommends --allow-downgrades \
  libnccl2=2.4.8-1+cuda10.1 \
  libnccl-dev=2.4.8-1+cuda10.1 \
  cuda-libraries-10-1 \
  libcudnn7=7.6.4.38-1+cuda10.1 \
  libcudnn7-dev=7.6.4.38-1+cuda10.1 \
  libcublas10=10.2.1.243-1 \
  libcublas-dev=10.2.1.243-1 \
  cuda-libraries-dev-10-1 \
  cuda-compiler-10-1
ln -sfn cuda-10.1 /usr/local/cuda

/databricks/python/bin/python -V
. /databricks/conda/etc/profile.d/conda.sh
conda activate /databricks/python

pip install --upgrade pip
pip install tensorflow==2.2.* setuptools==41.* grpcio==1.24.*

Install TensorFlow 2.2 on Databricks Runtime 5.5 LTS

Init script for CPU clusters:

#!/bin/bash

set -e

/databricks/python/bin/python -V
/databricks/python/bin/pip install tensorflow-cpu==2.2.* setuptools==41.* pyasn1==0.4.6
/databricks/python/bin/pip uninstall -y numpy
rm -rf /databricks/python/lib/python3.5/site-packages/numpy
/databricks/python/bin/pip install numpy==1.18.4

Init script for GPU clusters:

#!/bin/bash

set -e

apt-get update
apt-get install -y gnupg-curl

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_10.1.243-1_amd64.deb
dpkg -i cuda-repo-ubuntu1604_10.1.243-1_amd64.deb
apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub

wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb
dpkg -i nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb

apt-get update
apt-get install -y --no-install-recommends --allow-downgrades \
  libnccl2=2.4.8-1+cuda10.1 \
  libnccl-dev=2.4.8-1+cuda10.1 \
  cuda-libraries-10-1 \
  libcudnn7=7.6.4.38-1+cuda10.1 \
  libcudnn7-dev=7.6.4.38-1+cuda10.1 \
  libcublas10=10.2.1.243-1 \
  libcublas-dev=10.2.1.243-1 \
  cuda-libraries-dev-10-1 \
  cuda-compiler-10-1
ln -sfn cuda-10.1 /usr/local/cuda

/databricks/python/bin/python -V
/databricks/python/bin/pip install tensorflow==2.2.* setuptools==41.*
/databricks/python/bin/pip uninstall -y numpy
rm -rf /databricks/python/lib/python3.5/site-packages/numpy
/databricks/python/bin/pip install numpy==1.18.4

TensorFlow 2 known issues

TensorFlow 2 has a known incompatibility with Python pickling. You might encounter it if you use PySpark, HorovodRunner, Hyperopt, or any other packages that depend on pickling. The workaround is to explicitly import TensorFlow modules inside your functions. Here is an example:

import tensorflow as tf

def bad_func(_):
  tf.keras.Sequential()

# You might see an error.
sc.parallelize(range(0)).foreach(bad_func)

def good_func(_):
  import tensorflow as tf
  tf.keras.Sequential()

# No error.
sc.parallelize(range(0)).foreach(good_func)
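
The same workaround applies to any function that is shipped to workers. The following sketch assumes TensorFlow 2.x is installed on the cluster; `score_partition` is a hypothetical name, and the untrained one-layer model is only a placeholder for a real model.

```python
def score_partition(rows):
    """Hypothetical per-partition function for use with rdd.mapPartitions."""
    # Import inside the function so that `tf` is a local name and the
    # serialized function does not reference a module-level TensorFlow object.
    import tensorflow as tf

    # Placeholder model with random weights; a real job would load a trained one.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
    for row in rows:
        features = tf.constant([[float(row)]])
        yield float(model(features)[0, 0])

# On a cluster with TensorFlow 2.x (sc is the notebook's SparkContext):
#   predictions = sc.parallelize(range(4)).mapPartitions(score_partition).collect()
```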

Install TensorFlow 1.15 on Databricks Runtime 5.5 LTS ML

Databricks recommends installing TensorFlow 1.15 on Databricks Runtime 5.5 LTS ML using an init script.

Init script for CPU clusters:

#!/bin/bash

set -e

/databricks/python/bin/python -V
. /databricks/conda/etc/profile.d/conda.sh
conda install -y conda=4.6
conda activate /databricks/python

conda install -y tensorflow-mkl=1.15 setuptools=41

Init script for GPU clusters:

#!/bin/bash

set -e

/databricks/python/bin/python -V
. /databricks/conda/etc/profile.d/conda.sh
conda install -y conda=4.6
conda activate /databricks/python

conda install -y tensorflow-gpu=1.15 setuptools=41

Install TensorFlow 1.15 on Databricks Runtime 5.5 LTS

Databricks recommends installing TensorFlow 1.15 on Databricks Runtime 5.5 LTS using an init script.

Init script for CPU clusters:

#!/bin/bash

set -e

/databricks/python/bin/python -V
/databricks/python/bin/pip install tensorflow-cpu==1.15.* setuptools==41.*

Init script for GPU clusters:

#!/bin/bash

set -e

apt-get update
apt-get install -y gnupg-curl

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_10.0.130-1_amd64.deb
dpkg -i cuda-repo-ubuntu1604_10.0.130-1_amd64.deb
apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub

wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb
dpkg -i nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb

apt-get update
apt-get install -y --no-install-recommends cuda-libraries-10-0 libcudnn7=7.4.2.24-1+cuda10.0

/databricks/python/bin/python -V
/databricks/python/bin/pip install tensorflow-gpu==1.15.* setuptools==41.*

TensorBoard

TensorBoard is a suite of visualization tools for debugging, optimizing, and understanding TensorFlow, PyTorch, and other machine learning programs.

Use TensorBoard

Use TensorBoard on Databricks Runtime 7.2 and above

Starting TensorBoard in Databricks is no different from starting it in a Jupyter notebook on your local computer.

  1. Load the %tensorboard magic command and define your log directory.

    %load_ext tensorboard
    experiment_log_dir = <log-directory>
    
  2. Invoke the %tensorboard magic command.

    %tensorboard --logdir $experiment_log_dir
    

    The TensorBoard server starts and displays the user interface inline in the notebook. It also provides a link to open TensorBoard in a new tab.

    The following screenshot shows the TensorBoard UI started in a populated log directory.

    [Screenshot: TensorBoard UI]

You can also start TensorBoard by using TensorBoard’s notebook module directly.

from tensorboard import notebook
notebook.start("--logdir {}".format(experiment_log_dir))

Use TensorBoard on Databricks Runtime 7.1 and below

To start TensorBoard from your notebook, use the dbutils.tensorboard utility.

dbutils.tensorboard.start("/tmp/tensorflow_log_dir")

This command displays a link that, when clicked, opens TensorBoard in a new tab.

When started using this API, TensorBoard continues to run until you either stop it with dbutils.tensorboard.stop() or shut down your cluster.

Note

If you attach TensorFlow to your cluster as a Databricks library, you may need to reattach your notebook before starting TensorBoard.

TensorBoard logs and directories

TensorBoard visualizes your machine learning programs by reading logs generated by TensorBoard callbacks and functions in TensorFlow or PyTorch. To generate logs for other machine learning libraries, you can write logs directly using TensorFlow file writers (see Module: tf.summary for TensorFlow 2.x and Module: tf.compat.v1.summary for the older API in TensorFlow 1.x).
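
As an illustration of writing logs directly, the sketch below uses the TensorFlow 2.x `tf.summary` file writer to record one scalar per step. It assumes TensorFlow 2.x is installed; the helper name `write_scalars`, the tag, and the path in the usage comment are placeholders chosen for this example.

```python
def write_scalars(log_dir, tag, values):
    """Write one scalar per step to log_dir using the TensorFlow 2.x summary API."""
    import tensorflow as tf  # imported lazily; assumes TensorFlow 2.x

    writer = tf.summary.create_file_writer(log_dir)
    with writer.as_default():
        for step, value in enumerate(values):
            tf.summary.scalar(tag, value, step=step)
    writer.flush()

# Example call; the path and tag are placeholders:
#   write_scalars("/tmp/tensorflow_log_dir/run_1", "loss", [0.9, 0.5, 0.3])
```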

To make sure that your experiment logs are reliably stored, Databricks recommends writing logs to DBFS (that is, a log directory under /dbfs/) rather than to the ephemeral cluster file system. For each experiment, start TensorBoard in a unique directory. For each run of your machine learning code in the experiment that generates logs, set the TensorBoard callback or file writer to write to a subdirectory of the experiment directory. That way, the data in the TensorBoard UI is separated into runs.
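
One way to follow this layout is to derive a fresh subdirectory for every run. The sketch below is only one possible convention: `new_run_dir` is a hypothetical helper, and the timestamp-based run name is an assumption made for this example.

```python
import os
import time

def new_run_dir(experiment_log_dir, run_name=None):
    """Create and return a per-run subdirectory of the experiment log directory."""
    if run_name is None:
        # Timestamp-based name; an arbitrary convention chosen for this sketch.
        run_name = time.strftime("run_%Y%m%d_%H%M%S")
    run_dir = os.path.join(experiment_log_dir, run_name)
    os.makedirs(run_dir, exist_ok=True)
    return run_dir

# Usage (experiment_log_dir is your experiment directory under /dbfs/):
#   run_dir = new_run_dir(experiment_log_dir)
#   callback = tf.keras.callbacks.TensorBoard(log_dir=run_dir)
# TensorBoard itself is started with --logdir set to experiment_log_dir,
# so each run directory shows up as a separate run in the UI.
```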

Read the official TensorBoard documentation to get started using TensorBoard to log information for your machine learning program.

Manage TensorBoard processes

The TensorBoard processes started within Databricks notebooks are linked to your notebook session and are terminated when the notebook is detached or the REPL is restarted, for example, when you clear the state of the notebook.

To list the TensorBoard servers currently running on your cluster, with their corresponding log directories and process IDs, run notebook.list() from the TensorBoard notebook module.

To manually kill a TensorBoard process, send it a termination signal using %sh kill -15 pid. Improperly killed TensorBoard processes may corrupt notebook.list().
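
The same termination signal can also be sent from a Python cell. This is a minimal sketch using only the standard library; `terminate_process` is a hypothetical helper, and the process ID in the usage comment is a placeholder for one reported by notebook.list().

```python
import os
import signal

def terminate_process(pid):
    """Send SIGTERM (signal 15) to the process with the given ID."""
    os.kill(pid, signal.SIGTERM)

# Usage, with a process ID taken from notebook.list():
#   terminate_process(12345)  # 12345 is a placeholder
```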

Known issues

  • The --window_title option of TensorBoard is overridden on Databricks.
  • By default, TensorBoard scans a port range to select a port to listen on. If there are too many TensorBoard processes running on the cluster, all ports in the port range may be unavailable. You can work around this limitation by specifying a port number with the --port argument. The specified port should be between 6006 and 6106.
  • For download links to work, you should open TensorBoard in a new tab.
  • When using TensorBoard 1.15.0, the Projector tab is blank. As a workaround, to visit the projector page directly, you can replace #projector in the URL with data/plugin/projector/projector_binary.html.
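
For the port workaround, a free port in the stated range can be looked up before starting TensorBoard. The sketch below treats the bounds 6006 and 6106 as inclusive, which is an assumption; `find_free_port` is a hypothetical helper. Another process can still claim the port between the check and the TensorBoard start.

```python
import socket

def find_free_port(first=6006, last=6106):
    """Return the first port in [first, last] that can be bound, or None."""
    for port in range(first, last + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind(("", port))
            except OSError:
                continue  # port is in use; try the next one
            return port
    return None

port = find_free_port()
print(port)

# In a notebook, the result could then be passed to the magic command:
#   %tensorboard --logdir $experiment_log_dir --port $port
```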

Use TensorFlow on a single node

To test and migrate single-machine TensorFlow workflows, you can start with a driver-only cluster on Databricks by setting the number of workers to zero. Although Apache Spark is not functional in this configuration, it is a cost-effective way to run single-machine TensorFlow workflows. The following notebook shows how you can run TensorFlow (1.x and 2.x) with TensorBoard monitoring on a driver-only cluster.

TensorFlow 1.15/2.x notebook
