hyperopt-spark-ml(Python)

Tuning distributed training algorithms: Hyperopt and Apache Spark MLlib

Databricks Runtime for Machine Learning includes Hyperopt, a library for ML hyperparameter tuning in Python, and Apache Spark MLlib (also often called "Spark ML"), a library of distributed algorithms for training ML models. This example notebook shows how to use them together.

Use case

Distributed machine learning workloads in Python for which you want to tune hyperparameters.

In this example notebook

The demo shows how to tune hyperparameters for an example machine learning workflow in MLlib. You can follow this example to tune other distributed machine learning algorithms from MLlib or from other libraries.

This guide includes two sections to illustrate the process you can follow to develop your own workflow:

  • Run distributed training using MLlib. In this section, you get the MLlib model training code working without hyperparameter tuning.
  • Use Hyperopt to tune hyperparameters in the distributed training workflow. In this section, you wrap the MLlib code with Hyperopt for tuning.

Requirements

This notebook requires Databricks Runtime for Machine Learning.

MLflow autologging

This notebook demonstrates how to track model training and tuning with MLflow. Starting with MLflow version 1.17.0, you can use MLflow autologging with pyspark.ml. If your cluster is running Databricks Runtime for ML 8.2 or below, you can upgrade the MLflow client to add this pyspark.ml support. Upgrading is not required to run the notebook.
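Whether an upgrade is needed depends on the installed MLflow version. The following is a minimal sketch of that check, assuming a plain dotted version string. The helper names `parse_version` and `supports_pyspark_ml_autolog` are hypothetical and are not part of MLflow; on a cluster, the version string would typically come from `mlflow.__version__`.

```python
import re

# Minimum MLflow version with pyspark.ml autologging, as stated above.
MIN_AUTOLOG_VERSION = (1, 17, 0)


def parse_version(version):
    """Turn a dotted version string into a (major, minor, patch) tuple.

    Simplifying assumption: only the leading digits of the first three
    components matter, so suffixes such as "rc1" are ignored.
    """
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def supports_pyspark_ml_autolog(version):
    """Return True if the given MLflow version is at least 1.17.0."""
    return parse_version(version) >= MIN_AUTOLOG_VERSION
```

Comparing tuples of integers rather than raw strings avoids mistakes such as treating "1.9.1" as newer than "1.17.0".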

To upgrade MLflow to a version that supports pyspark.ml autologging, uncomment and run the following cell.

    Part 1. Run distributed training using MLlib

    This section shows a simple example of distributed training using MLlib. For more information and examples, see these resources:

    Load data

    This notebook uses the classic MNIST handwritten digit recognition dataset. The examples are vectors of pixels representing images of handwritten digits. For example:

    [Figures: image of a digit; image of all 10 digits]

    These datasets are stored in the popular LibSVM dataset format. The following cell shows how to load them using MLlib's LibSVM dataset reader utility.
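To make the LibSVM format concrete, here is a minimal plain-Python sketch that parses one line of such a file. It is an illustration only, not the MLlib reader; the function name `parse_libsvm_line` and the sample line are hypothetical. Each line holds a label followed by `index:value` pairs, with 1-based indices in the file that are shifted to 0-based here, as Spark does when loading.

```python
def parse_libsvm_line(line):
    """Parse 'label idx:val idx:val ...' into (label, indices, values)."""
    tokens = line.split()
    label = float(tokens[0])
    indices, values = [], []
    for token in tokens[1:]:
        index, value = token.split(":")
        indices.append(int(index) - 1)  # file indices are 1-based
        values.append(float(value))
    return label, indices, values


# Hypothetical sample line: label 5 with three non-zero pixels.
label, indices, values = parse_libsvm_line("5 153:3 154:18 155:18")
```

In a Spark session, the equivalent load is usually written as `spark.read.format("libsvm").load(path)`, which returns a DataFrame with `label` and `features` columns.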

    There are 60000 training images and 10000 test images.

    Display the data. Each image has the true label (the label column) and a vector of features that represent pixel intensities.

      label, features (11 rows shown; each label line is followed by its sparse feature vector)
      0
      {"vectorType": "sparse", "length": 780, "indices": [93, 94, 95, 96, 98, 99, 100, 120, 121, 122, 123, 124, 127, 128, 129, 130, 148, 149, 150, 151, 152, 155, 156, 157, 158, 159, 175, 176, 177, 178, 179, 180, 183, 184, 185, 186, 187, 188, 203, 204, 205, 206, 207, 212, 213, 214, 215, 216, 217, 231, 232, 233, 234, 235, 241, 242, 243, 244, 245, 246, 258, 259, 260, 261, 262, 270, 271, 272, 273, 274, 275, 285, 286, 287, 288, 289, 298, 299, 300, 301, 302, 303, 313, 314, 315, 316, 327, 328, 329, 330, 331, 332, 341, 342, 343, 344, 356, 357, 358, 359, 360, 369, 370, 371, 372, 385, 386, 387, 388, 397, 398, 399, 400, 413, 414, 415, 416, 425, 426, 427, 428, 440, 441, 442, 443, 444, 453, 454, 455, 456, 466, 467, 468, 469, 470, 471, 472, 481, 482, 483, 484, 485, 492, 493, 494, 495, 496, 497, 498, 499, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526, 527, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 595, 596, 597, 598, 599, 600, 601, 602, 603, 604, 605, 606, 607, 608, 624, 625, 626, 627, 628, 629, 630, 631, 632, 633, 634], "values": [3, 106, 151, 2, 59, 134, 63, 5, 137, 253, 253, 47, 173, 243, 71, 20, 49, 253, 253, 253, 156, 13, 253, 253, 212, 28, 7, 145, 253, 253, 237, 69, 10, 209, 253, 253, 233, 40, 124, 253, 253, 253, 139, 7, 184, 253, 253, 198, 64, 157, 253, 253, 163, 12, 55, 229, 253, 253, 233, 62, 61, 235, 253, 192, 40, 72, 253, 253, 253, 235, 19, 11, 238, 253, 235, 45, 30, 226, 253, 253, 253, 70, 13, 253, 253, 205, 46, 228, 253, 253, 243, 10, 71, 253, 253, 205, 136, 253, 253, 253, 128, 134, 254, 254, 85, 207, 254, 254, 132, 133, 253, 253, 96, 206, 253, 253, 132, 117, 253, 253, 205, 63, 235, 253, 253, 115, 13, 253, 253, 205, 10, 49, 205, 253, 253, 232, 9, 13, 253, 253, 240, 81, 6, 79, 195, 253, 253, 253, 253, 116, 3, 171, 253, 253, 239, 194, 96, 73, 73, 189, 195, 199, 253, 253, 253, 253, 253, 225, 17, 47, 253, 
253, 253, 253, 253, 253, 253, 253, 254, 253, 253, 253, 253, 253, 184, 30, 4, 171, 253, 253, 253, 253, 253, 253, 253, 254, 253, 253, 253, 253, 225, 47, 15, 126, 253, 253, 253, 253, 253, 253, 254, 253, 253, 232, 144, 45, 3, 40, 225, 253, 253, 253, 195, 17, 63, 11, 9]}
      0
      {"vectorType": "sparse", "length": 780, "indices": [94, 95, 96, 97, 98, 122, 123, 124, 125, 126, 127, 149, 150, 151, 152, 153, 154, 155, 158, 159, 160, 177, 178, 179, 180, 181, 182, 183, 186, 187, 188, 189, 204, 205, 206, 207, 208, 209, 215, 216, 217, 218, 232, 233, 234, 235, 236, 237, 243, 244, 245, 246, 260, 261, 262, 263, 264, 265, 271, 272, 273, 274, 287, 288, 289, 290, 291, 292, 299, 300, 301, 302, 315, 316, 317, 318, 319, 320, 327, 328, 329, 330, 331, 343, 344, 345, 346, 347, 348, 355, 356, 357, 358, 359, 370, 371, 372, 373, 374, 375, 376, 383, 384, 385, 386, 387, 398, 399, 400, 401, 402, 403, 410, 411, 412, 413, 414, 415, 426, 427, 428, 429, 430, 438, 439, 440, 441, 442, 443, 454, 455, 456, 457, 458, 465, 466, 467, 468, 469, 470, 471, 482, 483, 484, 485, 486, 492, 493, 494, 495, 496, 497, 498, 510, 511, 512, 513, 514, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580, 595, 596, 597, 598, 599, 600, 601, 602, 603, 604, 605, 606, 607, 624, 625, 626, 627, 628, 629, 630, 631], "values": [7, 118, 252, 255, 111, 141, 253, 253, 253, 248, 115, 30, 237, 253, 253, 253, 212, 205, 105, 83, 10, 65, 253, 253, 253, 204, 91, 86, 98, 250, 200, 41, 8, 90, 253, 253, 253, 27, 241, 208, 33, 11, 63, 122, 248, 253, 253, 27, 166, 253, 120, 71, 192, 236, 204, 253, 196, 13, 170, 253, 223, 110, 33, 219, 253, 253, 253, 84, 104, 253, 253, 212, 199, 253, 253, 253, 253, 6, 104, 253, 253, 226, 34, 213, 253, 253, 253, 253, 6, 104, 253, 253, 253, 96, 19, 221, 253, 253, 253, 157, 2, 129, 253, 253, 253, 96, 112, 253, 253, 253, 244, 43, 14, 242, 253, 253, 253, 96, 234, 253, 253, 253, 161, 125, 253, 253, 253, 253, 96, 234, 253, 253, 253, 103, 4, 187, 253, 253, 253, 233, 49, 234, 253, 123, 239, 61, 83, 241, 253, 253, 253, 253, 156, 189, 253, 235, 185, 34, 30, 83, 183, 242, 253, 253, 253, 253, 154, 10, 73, 243, 253, 149, 139, 63, 196, 218, 253, 
253, 246, 226, 253, 253, 80, 12, 213, 253, 253, 253, 253, 253, 253, 253, 253, 241, 203, 253, 158, 10, 81, 204, 239, 245, 253, 233, 251, 253, 238, 207, 132, 165, 5, 26, 38, 70, 116, 3, 152, 141, 31]}
      0
      {"vectorType": "sparse", "length": 780, "indices": [95, 96, 97, 123, 124, 125, 126, 152, 153, 154, 155, 156, 180, 181, 182, 183, 184, 185, 186, 187, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 232, 233, 234, 235, 238, 239, 240, 241, 242, 243, 244, 245, 246, 259, 260, 261, 262, 268, 269, 270, 271, 272, 273, 274, 287, 288, 289, 290, 299, 300, 301, 302, 303, 315, 316, 317, 328, 329, 330, 331, 343, 344, 345, 356, 357, 358, 370, 371, 372, 373, 383, 384, 385, 386, 398, 399, 400, 401, 410, 411, 412, 413, 414, 426, 427, 428, 429, 438, 439, 440, 441, 442, 454, 455, 456, 457, 466, 467, 468, 469, 482, 483, 484, 485, 493, 494, 495, 496, 510, 511, 512, 513, 514, 520, 521, 522, 523, 524, 539, 540, 541, 542, 543, 544, 547, 548, 549, 550, 551, 552, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 595, 596, 597, 598, 599, 600, 601, 602, 603, 604, 605, 606, 625, 626, 627, 628, 629, 630, 631, 632, 633], "values": [56, 247, 121, 24, 242, 245, 122, 231, 253, 253, 104, 12, 90, 253, 253, 254, 221, 120, 120, 85, 67, 75, 36, 11, 56, 222, 254, 253, 253, 253, 245, 207, 36, 86, 245, 249, 105, 44, 224, 230, 253, 253, 253, 253, 214, 10, 8, 191, 253, 143, 29, 119, 119, 158, 253, 253, 94, 15, 253, 226, 48, 4, 183, 253, 248, 56, 42, 253, 178, 179, 253, 184, 14, 164, 253, 178, 179, 253, 163, 61, 254, 254, 179, 76, 254, 254, 164, 60, 253, 253, 178, 29, 206, 253, 253, 40, 60, 253, 253, 178, 120, 253, 253, 245, 13, 60, 253, 253, 178, 120, 253, 239, 63, 60, 253, 253, 178, 14, 238, 253, 179, 18, 190, 253, 231, 70, 43, 184, 253, 253, 74, 86, 253, 253, 239, 134, 8, 56, 163, 253, 253, 213, 35, 16, 253, 253, 253, 253, 240, 239, 239, 247, 253, 253, 210, 27, 4, 59, 204, 253, 253, 253, 253, 253, 254, 253, 250, 110, 31, 122, 253, 253, 253, 253, 255, 217, 98]}
      0
      {"vectorType": "sparse", "length": 780, "indices": [96, 97, 98, 99, 100, 101, 102, 103, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 150, 151, 152, 153, 154, 155, 157, 158, 159, 160, 176, 177, 178, 179, 180, 181, 182, 186, 187, 188, 189, 204, 205, 206, 207, 208, 215, 216, 217, 232, 233, 234, 235, 243, 244, 245, 246, 259, 260, 261, 262, 272, 273, 274, 286, 287, 288, 289, 300, 301, 302, 314, 315, 316, 317, 328, 329, 330, 331, 341, 342, 343, 344, 357, 358, 359, 369, 370, 371, 372, 384, 385, 386, 387, 388, 397, 398, 399, 412, 413, 414, 415, 416, 425, 426, 427, 439, 440, 441, 442, 443, 444, 453, 454, 455, 465, 466, 467, 468, 469, 470, 471, 481, 482, 483, 491, 492, 493, 494, 495, 496, 497, 498, 509, 510, 511, 512, 513, 514, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526, 537, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580, 596, 597, 598, 599, 600, 601, 602, 603, 604, 605, 606, 607, 628, 629, 630, 631, 632, 633], "values": [9, 93, 154, 196, 231, 149, 56, 7, 2, 83, 211, 253, 253, 249, 243, 248, 253, 155, 3, 121, 253, 253, 253, 130, 53, 41, 173, 253, 100, 2, 80, 250, 253, 227, 58, 1, 10, 228, 243, 64, 89, 253, 253, 165, 36, 156, 253, 148, 210, 253, 215, 7, 32, 245, 250, 58, 81, 244, 228, 44, 179, 253, 136, 6, 214, 253, 163, 125, 253, 163, 125, 253, 246, 50, 51, 246, 233, 7, 11, 218, 253, 135, 234, 253, 9, 97, 253, 252, 53, 2, 235, 253, 111, 105, 217, 253, 165, 80, 253, 253, 253, 253, 254, 253, 89, 24, 198, 253, 253, 233, 100, 254, 253, 89, 5, 98, 244, 253, 253, 193, 70, 254, 253, 89, 11, 88, 211, 253, 254, 253, 253, 163, 152, 253, 180, 49, 148, 120, 30, 100, 210, 253, 253, 253, 253, 253, 242, 73, 5, 185, 253, 246, 253, 251, 162, 229, 242, 253, 253, 253, 253, 253, 254, 242, 71, 3, 128, 241, 253, 253, 253, 253, 253, 253, 253, 253, 253, 253, 245, 73, 16, 109, 109, 109, 228, 253, 253, 253, 253, 227, 109, 20, 30, 54, 114, 113, 54, 30]}
      0
      {"vectorType": "sparse", "length": 780, "indices": [96, 97, 98, 99, 123, 124, 125, 126, 127, 151, 152, 153, 154, 155, 177, 178, 179, 180, 181, 182, 183, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 259, 260, 261, 262, 263, 264, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 287, 288, 289, 290, 291, 299, 300, 301, 302, 303, 314, 315, 316, 317, 318, 319, 328, 329, 330, 331, 342, 343, 344, 345, 346, 357, 358, 359, 360, 369, 370, 371, 372, 373, 385, 386, 387, 388, 397, 398, 399, 400, 401, 413, 414, 415, 416, 425, 426, 427, 428, 441, 442, 443, 444, 453, 454, 455, 456, 469, 470, 471, 472, 481, 482, 483, 484, 497, 498, 499, 500, 509, 510, 511, 512, 513, 524, 525, 526, 527, 528, 537, 538, 539, 540, 541, 542, 551, 552, 553, 554, 555, 556, 565, 566, 567, 568, 569, 570, 571, 572, 577, 578, 579, 580, 581, 582, 583, 594, 595, 596, 597, 598, 599, 600, 601, 602, 603, 604, 605, 606, 607, 608, 609, 610, 623, 624, 625, 626, 627, 628, 629, 630, 631, 632, 633, 634, 635, 636, 637], "values": [39, 228, 153, 32, 40, 227, 253, 253, 192, 209, 253, 253, 167, 43, 41, 151, 250, 253, 186, 5, 54, 148, 253, 253, 223, 226, 228, 241, 228, 228, 228, 228, 137, 138, 250, 253, 253, 128, 129, 232, 253, 253, 253, 253, 253, 250, 137, 38, 241, 253, 253, 137, 7, 30, 38, 38, 62, 168, 179, 253, 250, 141, 39, 151, 253, 253, 253, 38, 6, 173, 253, 253, 117, 95, 251, 253, 253, 143, 8, 46, 224, 253, 220, 151, 253, 253, 177, 8, 117, 253, 252, 128, 29, 249, 253, 253, 126, 20, 253, 253, 253, 151, 253, 253, 232, 20, 20, 253, 253, 253, 255, 253, 253, 59, 20, 253, 253, 253, 254, 253, 253, 19, 20, 253, 253, 253, 254, 253, 253, 19, 20, 253, 253, 236, 255, 253, 253, 110, 3, 3, 112, 253, 253, 123, 244, 253, 253, 253, 110, 3, 3, 112, 253, 253, 239, 51, 54, 243, 253, 253, 253, 157, 33, 10, 10, 34, 158, 253, 253, 246, 111, 115, 250, 253, 253, 253, 253, 217, 137, 137, 137, 137, 218, 253, 253, 253, 239, 51, 54, 244, 253, 253, 253, 253, 
253, 253, 253, 253, 253, 253, 253, 172, 51]}
      0
      {"vectorType": "sparse", "length": 780, "indices": [97, 98, 99, 100, 101, 102, 125, 126, 127, 128, 129, 130, 153, 154, 155, 156, 157, 158, 159, 181, 182, 183, 184, 185, 186, 187, 188, 209, 210, 211, 212, 213, 214, 215, 216, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 262, 263, 264, 265, 266, 267, 269, 270, 271, 272, 273, 289, 290, 291, 292, 293, 294, 298, 299, 300, 301, 302, 316, 317, 318, 319, 320, 326, 327, 328, 329, 330, 343, 344, 345, 346, 347, 348, 354, 355, 356, 357, 358, 370, 371, 372, 373, 374, 375, 382, 383, 384, 385, 386, 398, 399, 400, 401, 402, 403, 410, 411, 412, 413, 414, 426, 427, 428, 429, 430, 437, 438, 439, 440, 441, 454, 455, 456, 457, 458, 465, 466, 467, 468, 482, 483, 484, 485, 486, 487, 492, 493, 494, 495, 496, 510, 511, 512, 513, 514, 515, 516, 518, 519, 520, 521, 522, 523, 524, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 597, 598, 599, 600, 601, 602, 603, 604, 605, 606, 625, 626, 627, 628, 629, 630, 631, 632, 633], "values": [254, 252, 252, 90, 51, 31, 252, 250, 250, 250, 252, 149, 130, 250, 250, 250, 252, 210, 60, 10, 130, 250, 250, 252, 250, 221, 40, 92, 252, 252, 212, 254, 252, 252, 252, 62, 102, 252, 250, 189, 29, 171, 250, 250, 250, 82, 62, 211, 250, 252, 189, 40, 20, 160, 250, 250, 202, 203, 221, 250, 250, 212, 29, 102, 250, 250, 243, 121, 41, 254, 252, 252, 212, 102, 252, 252, 254, 150, 62, 221, 252, 250, 189, 29, 102, 250, 250, 252, 149, 62, 211, 250, 252, 250, 100, 102, 250, 250, 252, 149, 102, 250, 250, 252, 169, 20, 102, 250, 250, 212, 29, 103, 252, 252, 244, 121, 92, 252, 252, 252, 163, 102, 250, 250, 222, 61, 252, 250, 250, 250, 102, 250, 250, 252, 210, 60, 123, 252, 250, 250, 250, 102, 250, 250, 252, 250, 221, 40, 82, 202, 241, 252, 250, 250, 250, 82, 202, 243, 255, 252, 252, 252, 254, 252, 252, 252, 254, 232, 202, 40, 121, 252, 250, 250, 250, 252, 250, 250, 250, 252, 149, 252, 250, 250, 250, 252, 250, 250, 250, 222, 60, 49, 
130, 250, 250, 252, 250, 250, 88, 40]}
      0
      {"vectorType": "sparse", "length": 780, "indices": [98, 99, 100, 101, 125, 126, 127, 128, 129, 130, 152, 153, 154, 155, 156, 157, 158, 180, 181, 182, 183, 184, 185, 186, 187, 207, 208, 209, 210, 212, 213, 214, 215, 234, 235, 236, 237, 241, 242, 243, 262, 263, 264, 265, 269, 270, 271, 289, 290, 291, 292, 297, 298, 299, 317, 318, 319, 320, 325, 326, 327, 328, 344, 345, 346, 347, 353, 354, 355, 356, 372, 373, 374, 381, 382, 383, 384, 400, 401, 402, 409, 410, 411, 412, 427, 428, 429, 430, 437, 438, 439, 455, 456, 457, 458, 464, 465, 466, 467, 483, 484, 485, 486, 491, 492, 493, 494, 495, 511, 512, 513, 519, 520, 521, 522, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 595, 596, 597, 598, 599, 600, 601, 602, 603, 604, 624, 625, 626, 627, 628, 629, 630, 631], "values": [50, 237, 203, 75, 37, 232, 254, 254, 244, 15, 9, 156, 254, 209, 250, 254, 131, 31, 233, 163, 11, 143, 254, 233, 25, 9, 164, 192, 11, 74, 253, 254, 89, 3, 122, 254, 195, 95, 254, 89, 9, 254, 231, 44, 127, 254, 146, 11, 192, 254, 95, 65, 254, 207, 127, 254, 220, 35, 12, 254, 237, 45, 21, 233, 253, 86, 12, 255, 254, 71, 181, 254, 249, 12, 254, 254, 71, 208, 254, 148, 113, 254, 237, 45, 119, 250, 254, 129, 183, 254, 207, 190, 254, 254, 13, 112, 254, 254, 91, 190, 254, 192, 5, 12, 185, 254, 251, 78, 190, 254, 147, 173, 254, 252, 128, 190, 254, 225, 40, 66, 25, 16, 102, 243, 254, 212, 96, 254, 254, 230, 254, 221, 214, 254, 254, 217, 28, 10, 167, 254, 254, 254, 254, 254, 254, 217, 86, 6, 67, 194, 254, 254, 227, 100, 11]}
      0
      {"vectorType": "sparse", "length": 780, "indices": [98, 99, 100, 101, 126, 127, 128, 129, 130, 154, 155, 156, 157, 158, 159, 182, 183, 184, 185, 186, 187, 188, 208, 209, 210, 211, 212, 213, 214, 215, 216, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 261, 262, 263, 264, 269, 270, 271, 272, 273, 288, 289, 290, 291, 292, 297, 298, 299, 300, 301, 316, 317, 318, 319, 320, 326, 327, 328, 329, 344, 345, 346, 347, 348, 355, 356, 357, 358, 371, 372, 373, 374, 375, 383, 384, 385, 386, 399, 400, 401, 402, 411, 412, 413, 414, 427, 428, 429, 430, 438, 439, 440, 441, 442, 455, 456, 457, 458, 466, 467, 468, 469, 483, 484, 485, 486, 493, 494, 495, 496, 497, 511, 512, 513, 514, 520, 521, 522, 523, 524, 525, 539, 540, 541, 542, 547, 548, 549, 550, 551, 552, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580, 596, 597, 598, 599, 600, 601, 602, 603, 604, 605, 606, 625, 626, 627, 628, 629, 630, 631, 632], "values": [70, 255, 165, 114, 122, 253, 253, 253, 120, 165, 253, 253, 253, 234, 52, 99, 253, 253, 253, 253, 228, 26, 60, 168, 238, 202, 174, 253, 253, 253, 127, 91, 81, 1, 215, 128, 28, 12, 181, 253, 253, 175, 3, 18, 204, 253, 77, 7, 253, 253, 253, 54, 54, 248, 253, 253, 143, 1, 127, 253, 253, 188, 104, 253, 253, 253, 20, 81, 249, 253, 191, 192, 253, 253, 218, 5, 203, 253, 208, 21, 56, 237, 253, 250, 100, 104, 253, 253, 75, 76, 253, 253, 224, 119, 253, 253, 75, 80, 253, 253, 103, 4, 241, 253, 218, 32, 213, 253, 253, 103, 125, 253, 253, 191, 213, 253, 253, 103, 3, 176, 253, 253, 135, 213, 253, 253, 103, 9, 162, 253, 253, 226, 37, 179, 253, 253, 135, 46, 157, 253, 253, 253, 63, 23, 188, 253, 249, 179, 179, 179, 179, 233, 253, 253, 233, 156, 10, 51, 235, 253, 253, 253, 253, 253, 253, 251, 232, 120, 16, 124, 253, 253, 253, 253, 152, 104]}
      0
      {"vectorType": "sparse", "length": 780, "indices": [99, 100, 101, 102, 103, 126, 127, 128, 129, 130, 131, 152, 153, 154, 155, 156, 157, 158, 159, 160, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 260, 261, 262, 263, 264, 265, 270, 271, 272, 273, 286, 287, 288, 289, 290, 291, 292, 298, 299, 300, 301, 314, 315, 316, 317, 318, 319, 327, 328, 329, 330, 342, 343, 344, 345, 346, 347, 355, 356, 357, 358, 370, 371, 372, 373, 374, 383, 384, 385, 386, 397, 398, 399, 400, 401, 411, 412, 413, 414, 425, 426, 427, 428, 429, 439, 440, 441, 442, 453, 454, 455, 456, 467, 468, 469, 470, 481, 482, 483, 484, 494, 495, 496, 497, 498, 509, 510, 511, 512, 521, 522, 523, 524, 525, 526, 537, 538, 539, 540, 541, 543, 544, 547, 548, 549, 550, 551, 552, 553, 554, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 594, 595, 596, 597, 598, 599, 600, 601, 602, 603, 604, 605, 606, 607, 608, 609, 622, 623, 624, 625, 626, 627, 628, 629, 630, 631, 632, 633, 634, 635], "values": [25, 114, 254, 254, 74, 123, 220, 253, 253, 253, 74, 12, 104, 254, 253, 253, 253, 253, 190, 78, 43, 222, 253, 254, 253, 253, 253, 253, 253, 182, 7, 40, 227, 253, 253, 254, 235, 227, 253, 253, 253, 253, 14, 72, 92, 219, 253, 253, 235, 74, 57, 48, 74, 127, 253, 253, 32, 224, 253, 253, 253, 229, 49, 75, 253, 253, 163, 6, 161, 251, 253, 253, 221, 46, 4, 183, 253, 163, 15, 253, 253, 253, 253, 49, 179, 253, 179, 11, 104, 253, 253, 253, 173, 6, 179, 253, 253, 59, 165, 254, 254, 242, 71, 181, 254, 255, 121, 50, 185, 253, 253, 187, 179, 253, 253, 208, 209, 253, 253, 204, 26, 179, 253, 253, 208, 209, 253, 253, 178, 179, 253, 253, 208, 209, 253, 253, 46, 23, 201, 253, 253, 208, 209, 253, 253, 47, 15, 97, 253, 253, 253, 208, 209, 253, 253, 218, 120, 48, 32, 32, 134, 191, 253, 253, 253, 253, 128, 47, 222, 253, 253, 251, 239, 244, 243, 239, 241, 243, 253, 
253, 253, 253, 253, 168, 38, 138, 253, 253, 253, 253, 253, 253, 253, 254, 253, 253, 253, 253, 221, 208, 12, 7, 104, 147, 253, 253, 253, 253, 253, 255, 253, 253, 182, 104, 31]}
      0
      {"vectorType": "sparse", "length": 780, "indices": [99, 100, 101, 127, 128, 129, 130, 155, 156, 157, 158, 159, 182, 183, 184, 185, 186, 187, 208, 209, 210, 211, 213, 214, 215, 216, 235, 236, 237, 238, 239, 241, 242, 243, 244, 262, 263, 264, 265, 266, 269, 270, 271, 272, 290, 291, 292, 293, 294, 298, 299, 300, 318, 319, 320, 321, 326, 327, 328, 345, 346, 347, 348, 354, 355, 356, 373, 374, 375, 376, 382, 383, 384, 401, 402, 403, 404, 410, 411, 412, 429, 430, 431, 432, 437, 438, 439, 440, 457, 458, 459, 460, 465, 466, 467, 468, 485, 486, 487, 488, 492, 493, 494, 495, 513, 514, 515, 516, 519, 520, 521, 522, 523, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 569, 570, 571, 572, 573, 574, 575, 576, 577, 598, 599, 600, 601, 602, 603, 604, 627, 628, 629, 630, 631], "values": [147, 255, 155, 210, 254, 253, 103, 96, 156, 254, 245, 57, 7, 19, 10, 206, 254, 141, 7, 127, 210, 63, 55, 254, 240, 57, 3, 172, 254, 217, 12, 5, 254, 254, 111, 3, 173, 254, 254, 59, 2, 196, 254, 111, 9, 254, 254, 115, 1, 169, 254, 111, 113, 254, 237, 14, 169, 254, 111, 20, 246, 254, 135, 154, 254, 111, 34, 254, 254, 51, 81, 254, 111, 112, 254, 254, 4, 149, 254, 105, 144, 254, 254, 4, 1, 171, 254, 22, 201, 254, 254, 4, 33, 254, 180, 8, 177, 254, 254, 4, 59, 224, 254, 74, 112, 254, 254, 135, 87, 229, 254, 189, 4, 35, 254, 254, 244, 44, 52, 232, 254, 237, 42, 13, 214, 254, 254, 243, 248, 254, 242, 81, 31, 245, 254, 254, 254, 252, 76, 26, 188, 254, 238, 82]}
      0
      {"vectorType": "sparse", "length": 780, "indices": [101, 102, 103, 104, 129, 130, 131, 132, 157, 158, 159, 160, 183, 184, 185, 186, 187, 188, 211, 212, 213, 214, 215, 216, 237, 238, 239, 240, 241, 242, 243, 244, 264, 265, 266, 267, 268, 269, 270, 271, 272, 291, 292, 293, 294, 295, 297, 298, 299, 300, 319, 320, 321, 322, 325, 326, 327, 328, 346, 347, 348, 349, 350, 353, 354, 355, 356, 373, 374, 375, 376, 377, 381, 382, 383, 384, 400, 401, 402, 403, 404, 408, 409, 410, 411, 412, 428, 429, 430, 431, 435, 436, 437, 438, 439, 455, 456, 457, 458, 463, 464, 465, 466, 483, 484, 485, 486, 489, 490, 491, 492, 493, 494, 510, 511, 512, 513, 515, 516, 517, 518, 519, 520, 521, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 593, 594, 595, 596, 597, 598, 599, 600, 601, 621, 622, 623, 624, 625, 626, 627], "values": [24, 242, 181, 5, 174, 254, 231, 39, 206, 254, 204, 28, 10, 111, 248, 254, 207, 4, 172, 254, 254, 254, 236, 9, 5, 168, 247, 249, 180, 254, 254, 48, 5, 152, 254, 248, 91, 63, 254, 254, 96, 5, 126, 254, 237, 146, 55, 252, 254, 96, 145, 254, 254, 54, 51, 251, 254, 84, 42, 246, 254, 161, 7, 77, 254, 254, 12, 6, 169, 254, 232, 34, 161, 254, 242, 10, 5, 102, 254, 230, 77, 65, 243, 254, 174, 4, 81, 235, 247, 78, 12, 204, 254, 200, 35, 12, 200, 254, 146, 148, 254, 254, 77, 125, 254, 238, 59, 48, 190, 251, 254, 123, 10, 88, 243, 248, 56, 10, 149, 245, 254, 254, 102, 9, 184, 254, 155, 9, 110, 225, 254, 254, 217, 66, 2, 13, 251, 254, 182, 225, 254, 254, 254, 148, 18, 94, 254, 254, 254, 254, 254, 178, 51, 2, 69, 254, 254, 219, 107, 22, 1]}
      713 rows|Truncated data
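The feature vectors above are sparse: only the non-zero pixel positions (`indices`) and their intensities (`values`) are stored, along with the total `length`. As a minimal sketch of how that representation maps to a full pixel array, the hypothetical helper `sparse_to_dense` below expands the first four entries of the first displayed row.

```python
def sparse_to_dense(length, indices, values):
    """Expand a sparse (indices, values) pair into a dense list of floats."""
    dense = [0.0] * length
    for index, value in zip(indices, values):
        dense[index] = float(value)
    return dense


# First four non-zero entries of the first row shown above.
dense = sparse_to_dense(780, [93, 94, 95, 96], [3, 106, 151, 2])
```

All positions that are not listed in `indices` stay at 0.0, which is why the sparse form is compact for mostly blank images.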

      Create a function to train a model

      In this section, you define a function to train a decision tree. Wrapping the training code in a function is important for passing the function to Hyperopt for tuning later.

      Details: The tree algorithm needs to know that the labels are categories 0-9, rather than continuous values. This example uses the StringIndexer class to do this. A Pipeline ties this feature preprocessing together with the tree algorithm. ML Pipelines are tools that Spark provides for piecing together machine learning algorithms into workflows. To learn more about Pipelines, see the other ML example notebooks in Databricks and the ML Pipelines user guide.
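To illustrate what the label indexing step does, here is a plain-Python sketch of a frequency-based label-to-index mapping. It is not the Spark implementation; the function name `fit_label_index` is hypothetical, and breaking ties alphabetically is an assumption of this sketch. The most frequent label receives index 0.0, the next most frequent 1.0, and so on.

```python
from collections import Counter


def fit_label_index(labels):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending frequency, then alphabetically (sketch assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], str(label)))
    return {label: float(i) for i, label in enumerate(ordered)}


mapping = fit_label_index(["3", "1", "3", "7", "1", "3"])
```

In MLlib, the corresponding stage is typically created with `StringIndexer(inputCol="label", outputCol="indexedLabel")` and placed ahead of the classifier in a `Pipeline`; the output column name here is illustrative.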

      2021/06/29 22:55:36 WARNING mlflow.utils.autologging_utils: You are using an unsupported version of pyspark.ml. If you encounter errors during autologging, try upgrading / downgrading pyspark.ml to a supported version, or try upgrading MLflow.

      Run the training function to make sure it works. It's a good idea to make sure training code runs before adding in tuning.

      The trained decision tree achieved an F1 score of 0.6703580220505504 on the validation data
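For reference, the sketch below computes a multiclass F1 score in plain Python. It assumes the reported score is a label-frequency-weighted F1, which is what Spark's `MulticlassClassificationEvaluator` computes for `metricName="f1"`; the function name `weighted_f1` and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-label F1, averaged with weights equal to each label's share of y_true."""
    support = Counter(y_true)
    total = len(y_true)
    pairs = list(zip(y_true, y_pred))
    score = 0.0
    for label in support:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * support[label] / total
    return score


# Toy example: one of four predictions is wrong.
score = weighted_f1([0, 0, 1, 1], [0, 1, 1, 1])
```

A higher value is better, with 1.0 meaning every example was classified correctly.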

      Part 2. Use Hyperopt to tune hyperparameters

      In this section, you create the Hyperopt workflow.

      • Define a function to minimize
      • Define a search space over hyperparameters
      • Specify the search algorithm and use fmin() to tune the model

      For more information about the Hyperopt APIs, see the Hyperopt documentation.

      Define a function to minimize

      • Input: hyperparameters
      • Internally: Reuse the training function defined above.
      • Output: loss
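The three points above can be sketched as follows. Hyperopt minimizes the returned loss, so a score where higher is better, such as F1, is negated. This sketch assumes the training function takes the two hyperparameters and returns a `(model, f1_score)` pair; `fake_train` is a hypothetical stand-in for the MLlib training function, and `STATUS_OK` is defined locally with the same string value that `hyperopt.STATUS_OK` has.

```python
STATUS_OK = "ok"  # same string value as hyperopt.STATUS_OK


def make_objective(train_fn):
    """Wrap a training function so it can be passed to a minimizer."""

    def objective(params):
        # Sampled values arrive as floats; the tree parameters must be ints.
        min_instances = int(params["minInstancesPerNode"])
        max_bins = int(params["maxBins"])
        _model, f1_score = train_fn(min_instances, max_bins)
        # Negate F1 because the tuner minimizes the loss.
        return {"loss": -f1_score, "status": STATUS_OK}

    return objective


def fake_train(min_instances_per_node, max_bins):
    """Hypothetical stand-in that returns (model, f1_score)."""
    return None, 0.67


objective = make_objective(fake_train)
result = objective({"minInstancesPerNode": 49.28, "maxBins": 7.44})
```

Returning a dictionary with `loss` and `status` keys is one of the forms Hyperopt accepts from an objective function.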

      Define the search space over hyperparameters

      This example tunes two hyperparameters: minInstancesPerNode and maxBins. See the Hyperopt documentation for details on defining a search space and parameter expressions.
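As a plain-Python illustration of what a search space is, the sketch below describes each hyperparameter by a continuous range and draws one candidate from it. The ranges, the `SPACE` dictionary, and `sample_space` are illustrative assumptions, not values confirmed by this notebook; in Hyperopt, such ranges are typically written with expressions like `hp.uniform("maxBins", low, high)`.

```python
import random

# Illustrative (low, high) ranges; the notebook's actual ranges may differ.
SPACE = {
    "minInstancesPerNode": (10, 200),
    "maxBins": (2, 32),
}


def sample_space(space, rng):
    """Draw one uniformly distributed value per hyperparameter."""
    return {name: rng.uniform(low, high) for name, (low, high) in space.items()}


rng = random.Random(0)  # seeded for reproducibility
params = sample_space(SPACE, rng)
```

Because uniform sampling produces floats, integer-valued hyperparameters such as these need to be cast before they are passed to the training function.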

      Tune the model using Hyperopt fmin()

      • Set max_evals to the maximum number of points in hyperparameter space to test (the maximum number of models to fit and evaluate). Because this command evaluates many models, it can take several minutes to execute.
      • You must also specify which search algorithm to use. The two main choices are:
        • hyperopt.tpe.suggest: Tree of Parzen Estimators, a Bayesian approach that iteratively and adaptively selects new hyperparameter settings to explore based on previous results
        • hyperopt.rand.suggest: Random search, a non-adaptive approach that randomly samples the search space
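To show what a non-adaptive search does with `max_evals`, here is a minimal plain-Python random search. It is a stand-in for the idea behind `hyperopt.rand.suggest`, not Hyperopt's implementation; `random_search` and `toy_objective` are hypothetical names, and the toy objective replaces model training.

```python
import random


def random_search(objective, space, max_evals, seed=0):
    """Evaluate max_evals random points and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(max_evals):
        # Each point is sampled independently of earlier results.
        params = {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        loss = objective(params)["loss"]
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


def toy_objective(params):
    """Toy loss with its minimum at x = 3."""
    return {"loss": (params["x"] - 3.0) ** 2, "status": "ok"}


best_params, best_loss = random_search(toy_objective, {"x": (0.0, 10.0)}, max_evals=8)
```

With Hyperopt itself, the tuning call generally has the form `fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=8)`; an adaptive algorithm such as TPE differs from this sketch in that it uses earlier losses to decide where to sample next.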

      Important:
      When using Hyperopt with MLlib and other distributed training algorithms, do not pass a trials argument to fmin(). When you do not include the trials argument, Hyperopt uses the default Trials class, which runs on the cluster driver. Hyperopt needs to evaluate each trial on the driver node so that each trial can initiate distributed training jobs.

      Do not use the SparkTrials class with MLlib. SparkTrials is designed to distribute trials for algorithms that are not themselves distributed. MLlib uses distributed computing already and is not compatible with SparkTrials.

      100%|██████████| 8/8 [03:14<00:00, 24.26s/trial, best loss: -0.6827359145872123]

      Out[12]: {'maxBins': 7.438805606211824, 'minInstancesPerNode': 49.28366143367587}

      Retrain the model on the full training dataset

      For tuning, this workflow split the training dataset into training and validation subsets. Now, retrain the model using the "best" hyperparameters on the full training dataset.
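The best hyperparameters reported by the tuning run are floats, while `minInstancesPerNode` and `maxBins` are integer parameters. A minimal sketch of the conversion before retraining, assuming truncation with `int()` is acceptable; the helper name `to_int_params` is hypothetical.

```python
# Values copied from the tuning output above.
best_params = {"maxBins": 7.438805606211824, "minInstancesPerNode": 49.28366143367587}


def to_int_params(params):
    """Truncate each float hyperparameter to an integer."""
    return {name: int(value) for name, value in params.items()}


final_params = to_int_params(best_params)
```

Using the same conversion in the tuning objective and in the final retraining keeps the retrained model consistent with the trial that produced the best loss.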

      Use the test dataset to compare evaluation metrics for the initial and "best" models.

      On the test data, the initial (untuned) model achieved F1 score 0.6777770408635782, and the final (tuned) model achieved 0.6978177456822743.
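To put the two test scores in perspective, the short calculation below computes the absolute and relative improvement. The variable names are illustrative.

```python
# F1 scores on the test data, copied from the output above.
initial_f1 = 0.6777770408635782
tuned_f1 = 0.6978177456822743

absolute_gain = tuned_f1 - initial_f1
relative_gain = absolute_gain / initial_f1

print(f"Absolute gain: {absolute_gain:.4f}")
print(f"Relative gain: {relative_gain:.2%}")
```

The tuned model improves test F1 by about 0.02 in absolute terms, which is roughly 3% relative to the untuned model.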