Train Spark ML models on Databricks Connect with pyspark.ml.connect
Preview
This feature is in Public Preview.
This article provides an example that demonstrates how to use the pyspark.ml.connect module to perform distributed training of Spark ML models and run model inference on Databricks Connect.
What is pyspark.ml.connect?
Spark 3.5 introduces pyspark.ml.connect, which is designed to support Spark Connect mode and Databricks Connect. Learn more about Databricks Connect.
The pyspark.ml.connect module consists of common learning algorithms and utilities, including classification, feature transformers, ML pipelines, and cross validation. It provides interfaces similar to those of the legacy pyspark.ml module, but currently contains only a subset of the algorithms in pyspark.ml. The supported algorithms are listed below:
Classification algorithm: pyspark.ml.connect.classification.LogisticRegression
Feature transformers: pyspark.ml.connect.feature.MaxAbsScaler and pyspark.ml.connect.feature.StandardScaler
Evaluators: pyspark.ml.connect.evaluation.RegressionEvaluator, pyspark.ml.connect.evaluation.BinaryClassificationEvaluator, and pyspark.ml.connect.evaluation.MulticlassClassificationEvaluator
Pipeline: pyspark.ml.connect.pipeline.Pipeline
Model tuning: pyspark.ml.connect.tuning.CrossValidator
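To illustrate how the classes listed above fit together, the following is a minimal sketch of model tuning that combines LogisticRegression, BinaryClassificationEvaluator, and CrossValidator. It assumes the evaluators are imported from pyspark.ml.connect.evaluation and that the parameter grid is built with ParamGridBuilder from the legacy pyspark.ml.tuning module. The helper name build_cross_validator and the candidate values are hypothetical and chosen only for illustration.

```python
# Candidate values for maxIter; illustrative only, not tuned recommendations.
MAX_ITER_CANDIDATES = [10, 50]


def build_cross_validator(num_folds=3):
    """Build (but do not fit) a CrossValidator over LogisticRegression.

    Hypothetical helper for illustration. The pyspark imports are done
    inside the function so the module can be loaded without pyspark.
    """
    from pyspark.ml.connect.classification import LogisticRegression
    from pyspark.ml.connect.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.connect.tuning import CrossValidator
    from pyspark.ml.tuning import ParamGridBuilder  # assumption: grid builder from legacy module

    lr = LogisticRegression(learningRate=0.01)

    # One parameter map per candidate maxIter value.
    grid = ParamGridBuilder().addGrid(lr.maxIter, MAX_ITER_CANDIDATES).build()

    return CrossValidator(
        estimator=lr,
        estimatorParamMaps=grid,
        evaluator=BinaryClassificationEvaluator(),
        numFolds=num_folds,
    )
```

Calling fit on the returned object with a DataFrame that has features and label columns would then train one model per fold and parameter map, provided the cluster meets the requirements of this module.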
Requirements
Set up Databricks Connect on your clusters. See Compute configuration for Databricks Connect.
Databricks Runtime 14.0 ML or higher installed.
Cluster access mode of Assigned.
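With these requirements in place, client code typically obtains a Spark session through Databricks Connect. The following is a minimal sketch, assuming the databricks-connect Python package is installed and that connection details (workspace, credentials, and cluster) are already configured outside the code. The helper name get_spark_session is hypothetical.

```python
def get_spark_session():
    """Return a Spark session backed by Databricks Connect.

    Hypothetical helper for illustration. Assumes databricks-connect is
    installed and connection details are already configured, for example
    through a configuration profile or environment variables.
    """
    # Imported lazily so this module can be loaded without the package.
    from databricks.connect import DatabricksSession

    return DatabricksSession.builder.getOrCreate()
```

In a Databricks notebook, a spark session is usually already available, so a helper like this is mainly relevant when connecting from outside the workspace.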
Example notebook
The following notebook demonstrates how to use Distributed ML on Databricks Connect:
For reference information about APIs in pyspark.ml.connect, Databricks recommends the Apache Spark API reference.
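Independently of the notebook, the overall training and inference flow can be sketched as follows. This is a minimal illustration, not the notebook's contents: it assumes spark is a session already connected through Databricks Connect, that the cluster meets the requirements above, and that features are supplied as arrays of floats. The dataset values, the column name scaled_features, and the helper name train_and_predict are made up for the example.

```python
# Tiny illustrative dataset of (features, label) rows; values are made up.
ROWS = [
    ([1.0, 2.0], 1),
    ([2.0, -1.0], 1),
    ([-3.0, -2.0], 0),
    ([-1.0, -2.0], 0),
]


def train_and_predict(spark, rows=ROWS):
    """Fit a scaler + logistic regression pipeline and run inference.

    Hypothetical helper for illustration. `spark` is assumed to be a
    session that is already connected through Databricks Connect.
    """
    from pyspark.ml.connect.classification import LogisticRegression
    from pyspark.ml.connect.feature import StandardScaler
    from pyspark.ml.connect.pipeline import Pipeline

    dataset = spark.createDataFrame(rows, schema=["features", "label"])

    # Stage 1: standardize the raw feature arrays.
    scaler = StandardScaler(inputCol="features", outputCol="scaled_features")

    # Stage 2: train the classifier on the scaled features.
    lr = LogisticRegression(
        featuresCol="scaled_features",
        labelCol="label",
        maxIter=20,
        learningRate=0.01,
    )

    pipeline = Pipeline(stages=[scaler, lr])

    # Training: fits each stage in order on the dataset.
    model = pipeline.fit(dataset)

    # Inference: returns the input with the columns added by each stage.
    return model.transform(dataset)
```

A call such as train_and_predict(spark).show() would then display the transformed rows. Reusing the training data for inference is done here only to keep the sketch short.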