Databricks Runtime 10.1 for Machine Learning (Unsupported)

Databricks Runtime 10.1 for Machine Learning provides a ready-to-go environment for machine learning and data science based on Databricks Runtime 10.1 (Unsupported). Databricks Runtime ML contains many popular machine learning libraries, including TensorFlow, PyTorch, and XGBoost. It also supports distributed deep learning training using Horovod.

For more information, including instructions for creating a Databricks Runtime ML cluster, see Databricks Runtime for Machine Learning.

New features and improvements

Databricks Runtime 10.1 ML is built on top of Databricks Runtime 10.1. For information on what’s new in Databricks Runtime 10.1, including Apache Spark MLlib and SparkR, see the Databricks Runtime 10.1 (Unsupported) release notes.

Enhancements to Databricks AutoML

In Databricks Runtime 10.1, Databricks AutoML includes improved semantic type detection, new alerts for potential data issues during training, new capabilities to prevent overfitting models, and the ability to split the input dataset into train, validation, and test sets chronologically.

Additional semantic type detections

AutoML now supports additional semantic type detection:

  • Numeric columns that contain categorical labels are treated as a categorical type.

  • String columns that contain English text are treated as a text feature.

You can now also add annotations to specify a column data type. For details, see Semantic type detection.
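As a rough illustration of the first rule above, the sketch below flags a numeric column as categorical when every value is a whole number and the column has only a few distinct values. The function name `looks_categorical` and the `max_distinct` threshold are hypothetical; this is not AutoML's actual detection logic, which these notes do not describe.

```python
from typing import Optional, Sequence


def looks_categorical(values: Sequence[Optional[float]], max_distinct: int = 10) -> bool:
    """Hypothetical heuristic: whole-number values with few distinct levels."""
    present = [v for v in values if v is not None]  # ignore missing values
    if not present:
        return False
    all_integral = all(float(v).is_integer() for v in present)
    return all_integral and len(set(present)) <= max_distinct


ratings = [1, 3, 2, 3, 1, 2, 2, 3]   # label-like codes stored as numbers
prices = [19.99, 5.25, 7.5, 101.0]   # genuinely continuous values

print(looks_categorical(ratings))
print(looks_categorical(prices))
```

When a heuristic of this kind guesses wrong, the column annotations mentioned above are the way to state the intended type explicitly.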

Alerts during training for potential data issues

AutoML now detects and generates alerts for potential issues with the dataset. Example alerts include unsupported column types and high cardinality columns. These alerts appear on the experiment page under the new Alerts tab. Additional information on alerts is included in the data exploration notebook. For more information, see Run the experiment and monitor the results.

Reduced model overfitting

Two new capabilities reduce the chances of overfitting a model when using AutoML:

  • AutoML now reports test metrics in addition to validation and training metrics.

  • AutoML now uses early stopping. It stops training and tuning models if the validation metric is no longer improving.

Split dataset into train/validation/test sets chronologically

For classification and regression problems, you can split the dataset into train, validation, and test sets chronologically. See Split data into train/validation/test sets for details.
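A chronological split can be pictured as sorting the rows by a time column and cutting the ordered rows at fixed positions, so that every training row precedes every validation row, which in turn precedes every test row. The sketch below is a minimal plain-Python illustration; the function name, the `day` column, and the 60/20/20 proportions are assumptions for the example, not AutoML's documented behavior.

```python
from datetime import date


def chronological_split(rows, time_key, train_pct=60, val_pct=20):
    """Sort rows by time, then cut into train/validation/test (oldest first)."""
    ordered = sorted(rows, key=lambda row: row[time_key])
    n = len(ordered)
    train_end = n * train_pct // 100            # integer math avoids float rounding
    val_end = n * (train_pct + val_pct) // 100
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Ten daily observations, deliberately out of order.
rows = [{"day": date(2021, 1, d), "y": d % 3} for d in (5, 1, 9, 3, 7, 2, 10, 4, 8, 6)]
train, val, test = chronological_split(rows, "day")

print(len(train), len(val), len(test))
print(max(r["day"] for r in train) < min(r["day"] for r in val))
```

Because the newest rows are held out, the test metric reflects how a model trained on past data performs on later data.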

Enhancements to Databricks Feature Store

Databricks Feature Store now supports additional data types for feature tables: BinaryType, DecimalType, and MapType. For more information, see Supported data types.

MLflow

The following improvements are available starting in MLflow version 1.21.0, which is included in Databricks Runtime 10.1 ML.

  • [Models] Upgrade the fastai model flavor to support fastai v2 (2.4.1 and above).

  • [Models] Introduce an mlflow.prophet model flavor for Prophet time series models.

  • [Scoring] Fix a schema enforcement error that incorrectly cast date-like strings to datetime objects.

Hyperopt

SparkTrials now supports the early_stop_fn parameter for fmin. Use this early stopping function to specify the conditions under which Hyperopt should stop hyperparameter tuning before the maximum number of evaluations is reached. For example, you can use this parameter to end tuning if the objective function is no longer decreasing. For details, see fmin().
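The sketch below shows one way such an early stopping function could look. It assumes the callback shape used by open-source Hyperopt, where the function receives the trials object plus any state it returned on the previous call, and returns a `(stop, state)` pair. The names `stop_when_no_improvement`, `FakeTrials`, and `evaluations_before_stop` are hypothetical, and `FakeTrials` only imitates the `trials.trials[-1]["result"]["loss"]` layout so that the example runs without Hyperopt installed.

```python
def stop_when_no_improvement(patience):
    """Build a callback that stops after `patience` trials without a new best loss."""
    def stop_fn(trials, best_loss=None, stale_rounds=0):
        latest = trials.trials[-1]["result"]["loss"]
        if best_loss is None or latest < best_loss:
            return False, [latest, 0]            # new best: reset the counter
        stale_rounds += 1
        return stale_rounds >= patience, [best_loss, stale_rounds]
    return stop_fn


class FakeTrials:
    """Stand-in for a Hyperopt trials object (assumed layout, illustration only)."""
    def __init__(self):
        self.trials = []

    def record(self, loss):
        self.trials.append({"result": {"loss": loss}})


def evaluations_before_stop(losses, stop_fn):
    """Replay a sequence of losses and report how many were evaluated."""
    trials, state = FakeTrials(), []
    for count, loss in enumerate(losses, start=1):
        trials.record(loss)
        stop, state = stop_fn(trials, *state)
        if stop:
            return count
    return len(losses)


evaluations_run = evaluations_before_stop(
    [0.9, 0.7, 0.7, 0.71, 0.8, 0.6], stop_when_no_improvement(patience=3)
)
print(evaluations_run)
```

In open-source Hyperopt the corresponding `fmin` keyword is `early_stop_fn`, so a function like this would be passed as `early_stop_fn=stop_when_no_improvement(3)` alongside `trials=SparkTrials()`. Hyperopt also ships a ready-made variant, `hyperopt.early_stop.no_progress_loss`.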

Major changes to Databricks Runtime ML Python environment

Python packages upgraded

  • automl 1.3.1 => 1.4.1

  • feature_store 0.3.4 => 0.3.5

  • holidays 0.11.2 => 0.11.3.1

  • horovod 0.22.1 => 0.23.0

  • hyperopt 0.2.5.db2 => 0.2.5.db4

  • imbalanced-learn 0.8.0 => 0.8.1

  • lightgbm 3.1.1 => 3.3.0

  • mlflow 1.20.2 => 1.21.0

  • petastorm 0.11.2 => 0.11.3

  • plotly 5.1.0 => 5.3.0

  • pytorch 1.9.0 => 1.9.1

  • spacy 3.1.2 => 3.1.3

  • sparkdl 2.2.0_db3 => 2.2.0_db4

  • torchvision 0.10.0 => 0.10.1

  • transformers 4.9.2 => 4.11.3

Python packages added

  • fasttext => 0.9.2

  • tensorboard-plugin-profile => 2.5.0

Deprecations

MLlib automated MLflow tracking is deprecated on clusters that run Databricks Runtime 10.1 ML and above. Instead, use MLflow PySpark ML autologging by calling mlflow.pyspark.ml.autolog(). Autologging is enabled by default with Databricks Autologging.

System environment

The system environment in Databricks Runtime 10.1 ML differs from Databricks Runtime 10.1 as follows:

Libraries

The following sections list the libraries included in Databricks Runtime 10.1 ML that differ from those included in Databricks Runtime 10.1.

Python libraries

Databricks Runtime 10.1 ML uses Virtualenv for Python package management and includes many popular ML packages.

In addition to the packages specified in the following sections, Databricks Runtime 10.1 ML also includes the following packages:

  • hyperopt 0.2.5.db4

  • sparkdl 2.2.0-db4

  • feature_store 0.3.5

  • automl 1.4.1

Note

Databricks Runtime 10.1 ML includes scikit-learn version 0.24 instead of version 1.0 due to incompatibility issues. The scikit-learn package interacts with many other packages in Databricks Runtime 10.1 ML.

You can upgrade to scikit-learn version 1.0; however, Databricks does not support this version.

To upgrade, use notebook-scoped libraries. From a notebook, run %pip install --upgrade "scikit-learn>=1.0,<1.1".

An alternative is to use this cluster init script:

    #!/bin/bash
    set -e
    pip install --upgrade "scikit-learn>=1.0,<1.1"
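After either approach, it can be useful to confirm which scikit-learn version the notebook actually sees. The following sketch uses the standard library's `importlib.metadata` and checks the version against the `>=1.0,<1.1` range from the commands above. The helper names `parse_version` and `in_upgrade_range` are hypothetical, and the parsing is deliberately simplified to leading numeric components.

```python
from importlib import metadata


def parse_version(text):
    """Return the leading numeric components, for example '1.0.2' -> (1, 0, 2)."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break                      # stop at suffixes such as 'rc1' or 'post1'
        parts.append(int(piece))
    return tuple(parts)


def in_upgrade_range(version_text):
    """True when the major.minor pair satisfies >=1.0,<1.1."""
    return (1, 0) <= parse_version(version_text)[:2] < (1, 1)


try:
    installed = metadata.version("scikit-learn")
    print(installed, in_upgrade_range(installed))
except metadata.PackageNotFoundError:
    print("scikit-learn is not installed in this environment")
```

On an unmodified Databricks Runtime 10.1 ML cluster, the check would report the bundled 0.24 release as outside the range.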

Python libraries on CPU clusters

| Library | Version | Library | Version | Library | Version |
|---|---|---|---|---|---|
| absl-py | 0.11.0 | Antergos Linux | 2015.10 (ISO-Rolling) | appdirs | 1.4.4 |
| argon2-cffi | 20.1.0 | astor | 0.8.1 | astunparse | 1.6.3 |
| async-generator | 1.10 | attrs | 20.3.0 | backcall | 0.2.0 |
| bcrypt | 3.2.0 | bleach | 3.3.0 | blis | 0.7.4 |
| boto3 | 1.16.7 | botocore | 1.19.7 | cachetools | 4.2.4 |
| catalogue | 2.0.6 | certifi | 2020.12.5 | cffi | 1.14.5 |
| chardet | 4.0.0 | clang | 5.0 | click | 7.1.2 |
| cloudpickle | 1.6.0 | cmdstanpy | 0.9.68 | configparser | 5.0.1 |
| convertdate | 2.3.2 | cryptography | 3.4.7 | cycler | 0.10.0 |
| cymem | 2.0.5 | Cython | 0.29.23 | databricks-automl-runtime | 0.2.3 |
| databricks-cli | 0.14.3 | dbus-python | 1.2.16 | decorator | 5.0.6 |
| defusedxml | 0.7.1 | dill | 0.3.2 | diskcache | 5.2.1 |
| distlib | 0.3.3 | distro-info | 0.23ubuntu1 | entrypoints | 0.3 |
| ephem | 4.1 | facets-overview | 1.0.0 | fasttext | 0.9.2 |
| filelock | 3.0.12 | Flask | 1.1.2 | flatbuffers | 1.12 |
| fsspec | 0.9.0 | future | 0.18.2 | gast | 0.4.0 |
| gitdb | 4.0.7 | GitPython | 3.1.12 | google-auth | 1.22.1 |
| google-auth-oauthlib | 0.4.2 | google-pasta | 0.2.0 | grpcio | 1.39.0 |
| gunicorn | 20.0.4 | gviz-api | 1.10.0 | h5py | 3.1.0 |
| hijri-converter | 2.2.2 | holidays | 0.11.3.1 | horovod | 0.23.0 |
| htmlmin | 0.1.12 | huggingface-hub | 0.0.19 | idna | 2.10 |
| ImageHash | 4.2.1 | imbalanced-learn | 0.8.1 | importlib-metadata | 3.10.0 |
| ipykernel | 5.3.4 | ipython | 7.22.0 | ipython-genutils | 0.2.0 |
| ipywidgets | 7.6.3 | isodate | 0.6.0 | itsdangerous | 1.1.0 |
| jedi | 0.17.2 | Jinja2 | 2.11.3 | jmespath | 0.10.0 |
| joblib | 1.0.1 | joblibspark | 0.3.0 | jsonschema | 3.2.0 |
| jupyter-client | 6.1.12 | jupyter-core | 4.7.1 | jupyterlab-pygments | 0.1.2 |
| jupyterlab-widgets | 1.0.0 | keras | 2.6.0 | Keras-Preprocessing | 1.1.2 |
| kiwisolver | 1.3.1 | koalas | 1.8.2 | korean-lunar-calendar | 0.2.1 |
| lightgbm | 3.3.0 | llvmlite | 0.37.0 | LunarCalendar | 0.0.9 |
| Mako | 1.1.3 | Markdown | 3.3.3 | MarkupSafe | 2.0.1 |
| matplotlib | 3.4.2 | missingno | 0.5.0 | mistune | 0.8.4 |
| mleap | 0.18.1 | mlflow-skinny | 1.21.0 | multimethod | 1.6 |
| murmurhash | 1.0.5 | nbclient | 0.5.3 | nbconvert | 6.0.7 |
| nbformat | 5.1.3 | nest-asyncio | 1.5.1 | networkx | 2.5 |
| nltk | 3.6.1 | notebook | 6.3.0 | numba | 0.54.1 |
| numpy | 1.19.2 | oauthlib | 3.1.0 | opt-einsum | 3.3.0 |
| packaging | 20.9 | pandas | 1.2.4 | pandas-profiling | 3.1.0 |
| pandocfilters | 1.4.3 | paramiko | 2.7.2 | parso | 0.7.0 |
| pathy | 0.6.0 | patsy | 0.5.1 | petastorm | 0.11.3 |
| pexpect | 4.8.0 | phik | 0.12.0 | pickleshare | 0.7.5 |
| Pillow | 8.2.0 | pip | 21.0.1 | plotly | 5.3.0 |
| preshed | 3.0.5 | prometheus-client | 0.10.1 | prompt-toolkit | 3.0.17 |
| prophet | 1.0.1 | protobuf | 3.17.2 | psutil | 5.8.0 |
| psycopg2 | 2.8.5 | ptyprocess | 0.7.0 | pyarrow | 4.0.0 |
| pyasn1 | 0.4.8 | pyasn1-modules | 0.2.8 | pybind11 | 2.8.0 |
| pycparser | 2.20 | pydantic | 1.8.2 | Pygments | 2.8.1 |
| PyGObject | 3.36.0 | PyMeeus | 0.5.11 | PyNaCl | 1.4.0 |
| pyodbc | 4.0.30 | pyparsing | 2.4.7 | pyrsistent | 0.17.3 |
| pystan | 2.19.1.1 | python-apt | 2.0.0+ubuntu0.20.4.6 | python-dateutil | 2.8.1 |
| python-editor | 1.0.4 | pytz | 2020.5 | PyWavelets | 1.1.1 |
| PyYAML | 5.4.1 | pyzmq | 20.0.0 | regex | 2021.4.4 |
| requests | 2.25.1 | requests-oauthlib | 1.3.0 | requests-unixsocket | 0.2.0 |
| rsa | 4.7.2 | s3transfer | 0.3.7 | sacremoses | 0.0.46 |
| scikit-learn | 0.24.1 | scipy | 1.6.2 | seaborn | 0.11.1 |
| Send2Trash | 1.5.0 | setuptools | 52.0.0 | setuptools-git | 1.2 |
| shap | 0.39.0 | simplejson | 3.17.2 | six | 1.15.0 |
| slicer | 0.0.7 | smart-open | 5.2.0 | smmap | 3.0.5 |
| spacy | 3.1.3 | spacy-legacy | 3.0.8 | spark-tensorflow-distributor | 1.0.0 |
| sqlparse | 0.4.1 | srsly | 2.4.1 | ssh-import-id | 5.10 |
| statsmodels | 0.12.2 | tabulate | 0.8.7 | tangled-up-in-unicode | 0.1.0 |
| tenacity | 6.2.0 | tensorboard | 2.6.0 | tensorboard-data-server | 0.6.1 |
| tensorboard-plugin-profile | 2.5.0 | tensorboard-plugin-wit | 1.8.0 | tensorflow-cpu | 2.6.0 |
| tensorflow-estimator | 2.6.0 | termcolor | 1.1.0 | terminado | 0.9.4 |
| testpath | 0.4.4 | thinc | 8.0.9 | threadpoolctl | 2.1.0 |
| tokenizers | 0.10.3 | torch | 1.9.1+cpu | torchvision | 0.10.1+cpu |
| tornado | 6.1 | tqdm | 4.59.0 | traitlets | 5.0.5 |
| transformers | 4.11.3 | typer | 0.3.2 | typing-extensions | 3.7.4.3 |
| ujson | 4.0.2 | unattended-upgrades | 0.1 | urllib3 | 1.25.11 |
| virtualenv | 20.4.1 | visions | 0.7.4 | wasabi | 0.8.2 |
| wcwidth | 0.2.5 | webencodings | 0.5.1 | websocket-client | 0.57.0 |
| Werkzeug | 1.0.1 | wheel | 0.36.2 | widgetsnbextension | 3.5.1 |
| wrapt | 1.12.1 | xgboost | 1.4.2 | zipp | 3.4.1 |

Python libraries on GPU clusters

| Library | Version | Library | Version | Library | Version |
|---|---|---|---|---|---|
| absl-py | 0.11.0 | Antergos Linux | 2015.10 (ISO-Rolling) | appdirs | 1.4.4 |
| argon2-cffi | 20.1.0 | astor | 0.8.1 | astunparse | 1.6.3 |
| async-generator | 1.10 | attrs | 20.3.0 | backcall | 0.2.0 |
| bcrypt | 3.2.0 | bleach | 3.3.0 | blis | 0.7.4 |
| boto3 | 1.16.7 | botocore | 1.19.7 | cachetools | 4.2.4 |
| catalogue | 2.0.6 | certifi | 2020.12.5 | cffi | 1.14.5 |
| chardet | 4.0.0 | clang | 5.0 | click | 7.1.2 |
| cloudpickle | 1.6.0 | cmdstanpy | 0.9.68 | configparser | 5.0.1 |
| convertdate | 2.3.2 | cryptography | 3.4.7 | cycler | 0.10.0 |
| cymem | 2.0.5 | Cython | 0.29.23 | databricks-automl-runtime | 0.2.3 |
| databricks-cli | 0.14.3 | dbus-python | 1.2.16 | decorator | 5.0.6 |
| defusedxml | 0.7.1 | dill | 0.3.2 | diskcache | 5.2.1 |
| distlib | 0.3.3 | distro-info | 0.23ubuntu1 | entrypoints | 0.3 |
| ephem | 4.1 | facets-overview | 1.0.0 | fasttext | 0.9.2 |
| filelock | 3.0.12 | Flask | 1.1.2 | flatbuffers | 1.12 |
| fsspec | 0.9.0 | future | 0.18.2 | gast | 0.4.0 |
| gitdb | 4.0.7 | GitPython | 3.1.12 | google-auth | 1.22.1 |
| google-auth-oauthlib | 0.4.2 | google-pasta | 0.2.0 | grpcio | 1.39.0 |
| gunicorn | 20.0.4 | gviz-api | 1.10.0 | h5py | 3.1.0 |
| hijri-converter | 2.2.2 | holidays | 0.11.3.1 | horovod | 0.23.0 |
| htmlmin | 0.1.12 | huggingface-hub | 0.0.19 | idna | 2.10 |
| ImageHash | 4.2.1 | imbalanced-learn | 0.8.1 | importlib-metadata | 3.10.0 |
| ipykernel | 5.3.4 | ipython | 7.22.0 | ipython-genutils | 0.2.0 |
| ipywidgets | 7.6.3 | isodate | 0.6.0 | itsdangerous | 1.1.0 |
| jedi | 0.17.2 | Jinja2 | 2.11.3 | jmespath | 0.10.0 |
| joblib | 1.0.1 | joblibspark | 0.3.0 | jsonschema | 3.2.0 |
| jupyter-client | 6.1.12 | jupyter-core | 4.7.1 | jupyterlab-pygments | 0.1.2 |
| jupyterlab-widgets | 1.0.0 | keras | 2.6.0 | Keras-Preprocessing | 1.1.2 |
| kiwisolver | 1.3.1 | koalas | 1.8.2 | korean-lunar-calendar | 0.2.1 |
| lightgbm | 3.3.0 | llvmlite | 0.37.0 | LunarCalendar | 0.0.9 |
| Mako | 1.1.3 | Markdown | 3.3.3 | MarkupSafe | 2.0.1 |
| matplotlib | 3.4.2 | missingno | 0.5.0 | mistune | 0.8.4 |
| mleap | 0.18.1 | mlflow-skinny | 1.21.0 | multimethod | 1.6 |
| murmurhash | 1.0.5 | nbclient | 0.5.3 | nbconvert | 6.0.7 |
| nbformat | 5.1.3 | nest-asyncio | 1.5.1 | networkx | 2.5 |
| nltk | 3.6.1 | notebook | 6.3.0 | numba | 0.54.1 |
| numpy | 1.19.2 | oauthlib | 3.1.0 | opt-einsum | 3.3.0 |
| packaging | 20.9 | pandas | 1.2.4 | pandas-profiling | 3.1.0 |
| pandocfilters | 1.4.3 | paramiko | 2.7.2 | parso | 0.7.0 |
| pathy | 0.6.0 | patsy | 0.5.1 | petastorm | 0.11.3 |
| pexpect | 4.8.0 | phik | 0.12.0 | pickleshare | 0.7.5 |
| Pillow | 8.2.0 | pip | 21.0.1 | plotly | 5.3.0 |
| preshed | 3.0.5 | prompt-toolkit | 3.0.17 | prophet | 1.0.1 |
| protobuf | 3.17.2 | psutil | 5.8.0 | psycopg2 | 2.8.5 |
| ptyprocess | 0.7.0 | pyarrow | 4.0.0 | pyasn1 | 0.4.8 |
| pyasn1-modules | 0.2.8 | pybind11 | 2.8.1 | pycparser | 2.20 |
| pydantic | 1.8.2 | Pygments | 2.8.1 | PyGObject | 3.36.0 |
| PyMeeus | 0.5.11 | PyNaCl | 1.4.0 | pyodbc | 4.0.30 |
| pyparsing | 2.4.7 | pyrsistent | 0.17.3 | pystan | 2.19.1.1 |
| python-apt | 2.0.0+ubuntu0.20.4.6 | python-dateutil | 2.8.1 | python-editor | 1.0.4 |
| pytz | 2020.5 | PyWavelets | 1.1.1 | PyYAML | 5.4.1 |
| pyzmq | 20.0.0 | regex | 2021.4.4 | requests | 2.25.1 |
| requests-oauthlib | 1.3.0 | requests-unixsocket | 0.2.0 | rsa | 4.7.2 |
| s3transfer | 0.3.7 | sacremoses | 0.0.46 | scikit-learn | 0.24.1 |
| scipy | 1.6.2 | seaborn | 0.11.1 | Send2Trash | 1.5.0 |
| setuptools | 52.0.0 | setuptools-git | 1.2 | shap | 0.39.0 |
| simplejson | 3.17.2 | six | 1.15.0 | slicer | 0.0.7 |
| smart-open | 5.2.0 | smmap | 3.0.5 | spacy | 3.1.3 |
| spacy-legacy | 3.0.8 | spark-tensorflow-distributor | 1.0.0 | sqlparse | 0.4.1 |
| srsly | 2.4.1 | ssh-import-id | 5.10 | statsmodels | 0.12.2 |
| tabulate | 0.8.7 | tangled-up-in-unicode | 0.1.0 | tenacity | 6.2.0 |
| tensorboard | 2.6.0 | tensorboard-data-server | 0.6.1 | tensorboard-plugin-profile | 2.5.0 |
| tensorboard-plugin-wit | 1.8.0 | tensorflow | 2.6.0 | tensorflow-estimator | 2.6.0 |
| termcolor | 1.1.0 | terminado | 0.9.4 | testpath | 0.4.4 |
| thinc | 8.0.9 | threadpoolctl | 2.1.0 | tokenizers | 0.10.3 |
| torch | 1.9.1+cu111 | torchvision | 0.10.1+cu111 | tornado | 6.1 |
| tqdm | 4.59.0 | traitlets | 5.0.5 | transformers | 4.11.3 |
| typer | 0.3.2 | typing-extensions | 3.7.4.3 | ujson | 4.0.2 |
| unattended-upgrades | 0.1 | urllib3 | 1.25.11 | virtualenv | 20.4.1 |
| visions | 0.7.4 | wasabi | 0.8.2 | wcwidth | 0.2.5 |
| webencodings | 0.5.1 | websocket-client | 0.57.0 | Werkzeug | 1.0.1 |
| wheel | 0.36.2 | widgetsnbextension | 3.5.1 | wrapt | 1.12.1 |
| xgboost | 1.4.2 | zipp | 3.4.1 | | |

Spark packages containing Python modules

| Spark Package | Python Module | Version |
|---|---|---|
| graphframes | graphframes | 0.8.2-db1-spark3.2 |

R libraries

The R libraries are identical to the R libraries in Databricks Runtime 10.1.

Java and Scala libraries (Scala 2.12 cluster)

In addition to Java and Scala libraries in Databricks Runtime 10.1, Databricks Runtime 10.1 ML contains the following JARs:

CPU clusters

| Group ID | Artifact ID | Version |
|---|---|---|
| com.typesafe.akka | akka-actor_2.12 | 2.5.23 |
| ml.combust.mleap | mleap-databricks-runtime_2.12 | 0.17.0-4882dc3 |
| ml.dmlc | xgboost4j-spark_2.12 | 1.4.1 |
| ml.dmlc | xgboost4j_2.12 | 1.4.1 |
| org.graphframes | graphframes_2.12 | 0.8.1-db6-spark3.2 |
| org.mlflow | mlflow-client | 1.20.2 |
| org.mlflow | mlflow-spark | 1.20.2 |
| org.scala-lang.modules | scala-java8-compat_2.12 | 0.8.0 |
| org.tensorflow | spark-tensorflow-connector_2.12 | 1.15.0 |

GPU clusters

| Group ID | Artifact ID | Version |
|---|---|---|
| com.typesafe.akka | akka-actor_2.12 | 2.5.23 |
| ml.combust.mleap | mleap-databricks-runtime_2.12 | 0.18.1-23eb1ef |
| ml.dmlc | xgboost4j-gpu_2.12 | 1.4.1 |
| ml.dmlc | xgboost4j-spark-gpu_2.12 | 1.4.1-spark3.2 |
| org.graphframes | graphframes_2.12 | 0.8.2-db1-spark3.2 |
| org.mlflow | mlflow-client | 1.21.0 |
| org.mlflow | mlflow-spark | 1.21.0 |
| org.scala-lang.modules | scala-java8-compat_2.12 | 0.8.0 |
| org.tensorflow | spark-tensorflow-connector_2.12 | 1.15.0 |