%md
# Automated MLflow tracking in MLlib
MLflow provides automated tracking for model tuning with MLlib. With automated MLflow tracking, when you run tuning code using `CrossValidator` or `TrainValidationSplit`, the specified hyperparameters and evaluation metrics are automatically logged, making it easy to identify the optimal model.
This notebook shows an example of automated MLflow tracking with MLlib.
This notebook uses the PySpark classes `DecisionTreeClassifier` and `CrossValidator` to train and tune a model. MLflow automatically tracks the learning process, saving the results of each run so you can examine how each hyperparameter affects the model's performance and find the optimal settings.
This notebook uses the MNIST handwritten digit recognition dataset, which is included with Databricks.
%md ## Load the training and test datasets
The dataset is already divided into training and test sets. Each dataset has two columns: an image, represented as a vector of 784 pixels, and a "label", or the actual number shown in the image.
The datasets are stored in the LIBSVM dataset format. Load them using the MLlib LIBSVM dataset reader utility.
training = spark.read.format("libsvm") \
.option("numFeatures", "784") \
.load("/databricks-datasets/mnist-digits/data-001/mnist-digits-train.txt")
test = spark.read.format("libsvm") \
.option("numFeatures", "784") \
.load("/databricks-datasets/mnist-digits/data-001/mnist-digits-test.txt")
training.cache()
test.cache()
print("There are {} training images and {} test images.".format(training.count(), test.count()))
display(training)
%md ## Define the ML pipeline
In this example, as with most ML applications, you must do some preprocessing of the data before you can use the data to train a model. MLlib provides **pipelines** that allow you to combine multiple steps into a single workflow. In this example, you build a two-step pipeline:
1. `StringIndexer` converts the labels from numeric values to categorical values.
2. `DecisionTreeClassifier` trains a decision tree that can predict the "label" column based on the data in the "features" column.
For more information:
[Pipelines](http://spark.apache.org/docs/latest/ml-pipeline.html#ml-pipelines)
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer

# StringIndexer: Convert the input column "label" (digits) to categorical values
indexer = StringIndexer(inputCol="label", outputCol="indexedLabel")
# DecisionTreeClassifier: Learn to predict column "indexedLabel" using the "features" column
dtc = DecisionTreeClassifier(labelCol="indexedLabel")
# Chain indexer + dtc together into a single ML Pipeline
pipeline = Pipeline(stages=[indexer, dtc])
%md ## Run the cross-validation
Now that you have defined the pipeline, you can run cross-validation to tune the model's hyperparameters. During this process, MLflow automatically tracks the models produced by `CrossValidator`, along with their evaluation metrics. This allows you to investigate how specific hyperparameters affect the model's performance.
In this example, you examine two hyperparameters in the cross-validation:
* `maxDepth`. This parameter determines how deep, and thus how large, the tree can grow.
* `maxBins`. For efficient distributed training of Decision Trees, MLlib discretizes (or "bins") continuous features into a finite number of values. The number of bins is controlled by `maxBins`. In this example, the number of bins corresponds to the number of grayscale levels; `maxBins=2` turns the images into black and white images.
For more information:
[maxBins](https://spark.apache.org/docs/latest/mllib-decision-tree.html#split-candidates)
[maxDepth](https://spark.apache.org/docs/latest/mllib-decision-tree.html#stopping-rule)
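%md The next cell assumes an evaluator, a hyperparameter grid, and a `CrossValidator` have been defined. The cell below is a minimal sketch of those definitions; the specific grid values are illustrative assumptions, chosen so the grid covers the `maxBins = 8` and varying `maxDepth` settings examined later.
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Evaluate candidate models by weighted precision on the indexed label.
evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel", metricName="weightedPrecision")

# Grid over the two hyperparameters discussed above (the values are assumptions).
grid = ParamGridBuilder() \
  .addGrid(dtc.maxDepth, [2, 4, 6, 8]) \
  .addGrid(dtc.maxBins, [2, 4, 8]) \
  .build()

# CrossValidator fits every parameter combination with 3-fold cross-validation.
cv = CrossValidator(estimator=pipeline, evaluator=evaluator, estimatorParamMaps=grid, numFolds=3)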
%md Run `CrossValidator`. If an MLflow tracking server is available, `CrossValidator` automatically logs each run to MLflow, along with the evaluation metric calculated on the held-out data, under the current active run. If no run is active, a new one is created.
# Explicitly create a new run.
# This allows this cell to be run multiple times.
# If you omit mlflow.start_run(), then this cell could run once, but a second run would hit conflicts when attempting to overwrite the first run.
import mlflow
import mlflow.spark
with mlflow.start_run():
  # Run the cross-validation on the training dataset. The cv.fit() call returns the best model it found.
  cvModel = cv.fit(training)

  # Evaluate the best model's performance on the test dataset and log the result.
  test_metric = evaluator.evaluate(cvModel.transform(test))
  mlflow.log_metric('test_' + evaluator.getMetricName(), test_metric)

  # Log the best model.
  mlflow.spark.log_model(spark_model=cvModel.bestModel, artifact_path='best-model')
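%md The model logged above can be reloaded later for scoring. The cell below is a minimal sketch, assuming the run just completed is the most recent run in this notebook's experiment:
import mlflow
import mlflow.spark

# Look up the most recent run and load the model logged under "best-model".
run_id = mlflow.search_runs(order_by=["attributes.start_time DESC"], max_results=1)["run_id"].iloc[0]
model = mlflow.spark.load_model("runs:/{}/best-model".format(run_id))
display(model.transform(test).select("label", "prediction"))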
%md To view the MLflow experiment associated with the notebook, click the **Experiment** icon in the notebook context bar on the upper right. All notebook runs appear in the sidebar.
To more easily compare their results, click the icon at the far right of **Experiment Runs** (it shows "View Experiment UI" when you hover over it). The Experiment page appears.
For example, to examine the effect of tuning `maxDepth`:
1. On the Experiment page, enter `params.maxBins = "8"` in the **Search Runs** box, and click **Search**.
1. Select the resulting runs and click **Compare**.
1. In the Scatter Plot, select X-axis **maxDepth** and Y-axis **avg_weightedPrecision**.
You can see that, when `maxBins` is held constant at 8, the average weighted precision increases with `maxDepth`.
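%md If you prefer to query the tracking data programmatically rather than through the UI, `mlflow.search_runs` accepts the same filter syntax. A minimal sketch of the equivalent query, assuming the notebook's active experiment contains the runs logged above:
import mlflow

# Fetch the runs where maxBins was 8, as a pandas DataFrame.
runs = mlflow.search_runs(filter_string='params.maxBins = "8"')
# Compare maxDepth against the averaged cross-validation metric for those runs.
display(runs[["params.maxDepth", "metrics.avg_weightedPrecision"]])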