This section provides instructions and examples of how to install, configure, and run some of the most popular third-party ML tools in Databricks. Databricks provides these examples on a best-effort basis. Because they are external libraries, they may change in ways that are not easy to predict. If you need additional support for third-party tools, consult the documentation, mailing lists, forums, or other support options provided by the library vendor or maintainer.
The H2O Flow UI provides a user-friendly, clickable interface for machine learning, along with useful visualizations for viewing your jobs. To enable H2O Flow on a Databricks cluster, you must set up an SSH tunnel to the Apache Spark driver. After H2O starts, run the following command from your local machine to forward local port 54321 to the same port on the Spark driver:
ssh ubuntu@<hostname> -p 2200 -i <private-key> -L 54321:localhost:54321
You should be able to access H2O Flow on localhost:54321.
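Before opening the browser, it can help to confirm that the forwarded port is accepting connections. The following is a minimal sketch using only the Python standard library, meant to be run on the machine where the ssh command was started. The helper name `port_is_open` is hypothetical and is not part of H2O or Databricks.

```python
import socket


def port_is_open(host: str = "localhost", port: int = 54321, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if port_is_open():
        print("Port 54321 accepts connections; try http://localhost:54321 in a browser.")
    else:
        print("Port 54321 is not reachable; check that the ssh tunnel is still running.")
```

A successful check only shows that something is listening on the local end of the tunnel. It does not prove that H2O Flow itself is responding on the driver, so a page that fails to load still points to H2O not having started.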
Databricks Runtime for Machine Learning installs XGBoost, which conflicts with the XGBoost packaged in PySparkling. To use PySparkling on Databricks Runtime ML, you must remove XGBoost using an init script:
#!/bin/bash
rm /databricks/jars/spark--maven-trees--ml--xgboost*
scikit-learn, a well-known Python machine learning library, is included in Databricks Runtime. See Databricks Runtime Release Notes for the scikit-learn library version included with your cluster’s runtime.
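As a complement to the release notes, the installed version can also be checked from a notebook cell. The sketch below assumes Python 3.8 or later (for `importlib.metadata`), which may not match older runtimes. The helper name `installed_version` is hypothetical, and `scikit-learn` and `xgboost` are the PyPI distribution names.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report what is present in the current Python environment.
for name in ("scikit-learn", "xgboost"):
    print(name, installed_version(name) or "not installed")
```

Looking up the distribution metadata avoids importing the library itself, so the check also works when a package is missing or fails to import.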
The DataRobot modeling engine is a commercial product that supports massively parallel modeling applications, building and optimizing models of many different types, and evaluating and ranking their relative performance. This modeling engine exists in a variety of implementations, some cloud-based, accessed via the Internet, and others residing in customer-specific on-premises computing environments. Read more at DataRobot.
XGBoost is a popular machine learning library designed for training gradient-boosted decision trees; it can also train random forests. You can train XGBoost models on individual machines or in a distributed fashion. Read more in the XGBoost documentation.
XGBoost is included in Databricks Runtime ML. The Python package is included in Databricks Runtime 5.4 ML and above. You can use these libraries in Databricks Runtime ML without installing any packages. See Databricks Runtime for Machine Learning. The XGBoost versions included are:
| Databricks Runtime ML version | Python package version | Scala/Java package version |
|---|---|---|
| 5.1 - 5.3 | N/A | 0.81 |
To install a different version of the XGBoost Python package in Databricks Runtime ML, install XGBoost as a Databricks PyPI library. When specifying the package, replace <xgboost version> with the desired version.
Python package: Use Databricks Library Utilities by running the following command in a notebook cell, replacing <xgboost version> with the desired version:
dbutils.library.installPyPI("xgboost", version="<xgboost version>")
Scala/Java packages: Install as a Databricks library with the Spark Package name
The Python package supports only single-node training workloads.
To perform distributed training, you must use XGBoost's Scala/Java packages.