# Convert the pandas DataFrame to a Spark DataFrame
spark_df = spark.createDataFrame(df)

with mlflow.start_run() as run:
    # Create the vector assembler
    assembler = VectorAssembler(inputCols=[X1, X2], outputCol="features")

    # Create the linear regression
    lr = LinearRegression(featuresCol="features", labelCol=Y)

    # Put the vector assembler and the linear regression into a pipeline
    pipeline = Pipeline(stages=[assembler, lr])

    # Train the pipeline
    model = pipeline.fit(spark_df)

    # Log the fitted pipeline and register it in the Model Registry
    mlflow.spark.log_model(model, "model", registered_model_name="spark_linear_regression")
2023/04/28 18:58:38 INFO mlflow.spark: Inferring pip requirements by reloading the logged model from the databricks artifact repository, which can be time-consuming. To speed up, explicitly specify the conda_env or pip_requirements when calling log_model().
Successfully registered model 'spark_linear_regression'.
2023/04/28 18:59:24 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation. Model name: spark_linear_regression, version 1
Created version '1' of model 'spark_linear_regression'.
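Once a model version is registered, a serving endpoint receives the feature columns as a JSON request body. The following is a minimal sketch of building such a body, assuming the endpoint accepts MLflow's `dataframe_split` input format. The helper `build_scoring_payload` and the column names `x1` and `x2` are hypothetical; the notebook refers to the real feature names only through the variables `X1` and `X2`.

```python
import json


def build_scoring_payload(rows, columns):
    """Serialize feature rows as a JSON body in the `dataframe_split` layout.

    `rows` is a list of rows, each a list of feature values in the same
    order as `columns`. This helper is illustrative, not part of MLflow.
    """
    for row in rows:
        if len(row) != len(columns):
            raise ValueError(
                f"row {row!r} has {len(row)} values, expected {len(columns)}"
            )
    return json.dumps(
        {
            "dataframe_split": {
                "columns": list(columns),
                "data": [list(row) for row in rows],
            }
        }
    )


# Hypothetical feature names; replace them with the values of X1 and X2.
payload = build_scoring_payload([[1.0, 2.0], [3.5, 4.5]], ["x1", "x2"])
print(payload)
```

The label column is left out of the payload because the pipeline's `VectorAssembler` only reads the input feature columns when scoring.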
Serve a SparkML model
This notebook trains a SparkML Pipeline and logs it to MLflow for use in Model Serving (AWS | Azure).
Requirements