Third-Party Machine Learning Integrations

This section provides instructions and examples of how to install, configure, and run some of the most popular third-party ML tools in Databricks. Databricks provides these examples on a best-effort basis. Because they are external libraries, they may change in ways that are not easy to predict. If you need additional support for third-party tools, consult the documentation, mailing lists, forums, or other support options provided by the library vendor or maintainer.

H2O Sparkling Water

H2O is an open source project for distributed machine learning, much like Apache Spark™. These notebooks describe how to integrate with H2O using the Sparkling Water module.
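
As a quick orientation before the notebooks, a Sparkling Water session in a Python notebook typically begins by starting an H2OContext on the existing Spark cluster. The sketch below is illustrative only; it assumes the h2o-pysparkling library matching your Spark version is attached to the cluster, and exact API names can vary between Sparkling Water releases.

# Start an H2O cloud on top of the running Spark cluster
# (assumes the h2o-pysparkling library is installed on the cluster).
from pysparkling import H2OContext

hc = H2OContext.getOrCreate(spark)

# Convert a Spark DataFrame into an H2OFrame so H2O algorithms can use it.
spark_df = spark.range(100).toDF("x")
h2o_frame = hc.as_h2o_frame(spark_df)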

H2O Flow

The H2O Flow UI provides a user-friendly, point-and-click interface to machine learning, along with useful visualizations for monitoring your jobs. To enable H2O Flow on a Databricks cluster, you must set up an SSH tunnel to the Spark driver. After H2O starts, run the following from your local machine to open the tunnel:

ssh ubuntu@<hostname> -p 2200 -i <private-key> -L 54321:localhost:54321

You should then be able to access H2O Flow at http://localhost:54321.

scikit-learn

scikit-learn, a well-known Python machine learning library, is included in Databricks Runtime. See Databricks Runtime Release Notes for the scikit-learn library version included with your cluster’s runtime.
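
Because the library is preinstalled, you can import it directly in a Python notebook. Here is a minimal sketch that trains and evaluates a model on the driver; the dataset and model choice are just examples.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a small built-in dataset and hold out a test split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train on the driver; scikit-learn does not distribute training across the cluster.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))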

DataRobot

The DataRobot modeling engine is a commercial product that supports massively parallel modeling applications, building and optimizing models of many different types, and evaluating and ranking their relative performance. The engine is available in several deployments: cloud-based implementations accessed over the Internet, and installations that run in customer-specific on-premises environments. Read more at DataRobot.

XGBoost

XGBoost is a popular machine learning library for training gradient-boosted decision trees and related tree ensembles such as random forests. You can train XGBoost models on a single machine or in a distributed fashion. Read more in the XGBoost documentation.

XGBoost versions

There are two versions of XGBoost: a Python version, which is not distributed, and a Scala-based Spark version, which supports distributed training.

Single node training

To install the non-distributed Python version, run:

/databricks/python/bin/pip install xgboost --pre

This Python version supports only single node training.
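
For example, here is a minimal sketch of single node training with the Python package; the dataset and parameters are illustrative, and everything runs on the driver.

import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Prepare a small example dataset.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# XGBoost's native API trains from DMatrix objects held in driver memory.
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

params = {"objective": "multi:softmax", "num_class": 3, "max_depth": 4}
booster = xgb.train(params, dtrain, num_boost_round=20)
predictions = booster.predict(dtest)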

Distributed training

To perform distributed training, you must use XGBoost’s Scala/Java packages.

Install XGBoost

Note

XGBoost is included in Databricks Runtime ML (Beta), a machine learning runtime that provides a ready-to-go environment for machine learning and data science. Instead of installing XGBoost using the instructions below, you can simply create a cluster using Databricks Runtime ML. See Databricks Runtime ML.

We recommend that you install the XGBoost Spark Package, a version of XGBoost built for Databricks. If you require an XGBoost version that is not compatible with the provided package, you can build from source in a compatible Linux environment or inside an init script.

Install XGBoost Spark Package

To install the XGBoost library and attach it to your cluster, follow the instructions in Libraries, using xgboost-linux64 as the Spark Package name.

Build XGBoost from source and install

Important

Be sure to compile against the version of Spark that your cluster runs. You might need to modify the build configuration (for example, the Spark version declared in jvm-packages/pom.xml).

  1. Build XGBoost.

    You must build XGBoost on the same Linux version that you’re using on Databricks. Building on a macOS or Windows machine does not work, because the JAR that you upload to Databricks includes natively compiled dependencies. To build XGBoost, see the XGBoost documentation, specifically the JVM packages documentation. Here is an example of how to build it:

    # Install the build tools needed to compile XGBoost's JVM packages.
    sudo apt-get update
    sudo apt-get install -y maven
    sudo apt-get install -y cmake

    # Point the build at a Java 8 JDK.
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/

    # Fetch the XGBoost source, including its submodules.
    git clone --recursive https://github.com/dmlc/xgboost

    # Build the JVM packages, skipping the test suite.
    cd xgboost/jvm-packages
    mvn -DskipTests package
    

    If you build on an EC2 machine, you can copy the created JAR to your local machine like this:

    scp -P 2200 your-instance-info.amazonaws.com:/home/ubuntu/xgboost/jvm-packages/xgboost4j-spark/target/xgboost4j-spark-*-jar-with-dependencies.jar ./
    
  2. Upload the resulting xgboost4j-spark-{version}-jar-with-dependencies.jar file as a library and attach it to your cluster.

Build XGBoost from source and install using an init script

Important

Building XGBoost inside an init script can significantly increase your cluster startup time.
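
If you accept the longer startup time, one possible approach is an init script that repeats the build steps above and copies the resulting JAR onto the cluster classpath. The sketch below is illustrative only: the cluster name, script path, and destination directory are assumptions for this example, not fixed values, so adapt them to how init scripts are configured in your workspace.

# Hypothetical example: write a cluster-named init script to DBFS. The cluster name
# ("xgboost-cluster") and all paths below are assumptions, not required values.
dbutils.fs.put("dbfs:/databricks/init/xgboost-cluster/build-xgboost.sh", """#!/bin/bash
set -e

# Install build tools, fetch the source, and build the JVM packages.
apt-get update
apt-get install -y maven cmake
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
git clone --recursive https://github.com/dmlc/xgboost /tmp/xgboost
cd /tmp/xgboost/jvm-packages
mvn -DskipTests package

# Copy the built JAR where the cluster picks up extra JARs on its classpath.
cp xgboost4j-spark/target/xgboost4j-spark-*-jar-with-dependencies.jar /databricks/jars/
""", True)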