mlflow-mleap-deployment(Python)


MLflow: Deploying PySpark models saved as MLeap to SageMaker

NOTE: Databricks Runtime does not support open source MLeap. To use MLeap, you must create a cluster running Databricks Runtime 13.3 LTS ML or below. These versions of Databricks Runtime ML have a custom version of MLeap preinstalled.

This notebook is part 2 of the MLflow MLeap example. The first part, MLflow Deployment: Train PySpark Model and Log in MLeap Format, trains a PySpark model and logs the training metrics, parameters, and model in MLeap format to the MLflow tracking server.

Note: We do not recommend using Run All because it takes several minutes to deploy and update models in SageMaker; models cannot be queried until they are active.

The notebook contains the following sections:

Setup

  • Launch a Python 3 cluster configured with an IAM role for SageMaker deployment
  • Install the MLeap Scala libraries
  • Install MLflow and boto3

Deploy the model to SageMaker

  • Specify a Docker image URI for deployment
  • Use MLflow to deploy the model to SageMaker
  • Check the status of the deployed model
    • Determine if the deployed model is active and ready to be queried

Query the deployed model

  • Construct a query using test data
  • Evaluate the query using the deployed model

Clean up the deployment

  • Delete the model deployment using the MLflow API
  • Confirm that the deployment was terminated

Setup

Create a cluster and install MLflow and MLeap on your cluster

  1. Create a cluster with the following:
    • Python Version: Python 3
    • An attached IAM role that supports SageMaker deployment. For information about setting up a cluster IAM role for SageMaker deployment, see the SageMaker deployment guide.
  2. Install required libraries.
    1. Create a library with Maven as the source, using the fully qualified Maven artifact coordinate:
      • ml.combust.mleap:mleap-spark_2.11:0.13.0
    2. Install the libraries into the cluster.
  3. If you are running Databricks Runtime, run Cmd 4 to install mlflow and boto3 (a sketch of that cell appears after this list). If you are using Databricks Runtime ML, you can skip this step because the required libraries are already installed.
  4. Attach this notebook to the cluster.
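
For step 3, a minimal sketch of the install cell (Cmd 4), assuming pip-based installation is acceptable on your cluster:

%pip install mlflow boto3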

Load pipeline training data

Load the data that was used to train the PySpark Pipeline model. The model uses the 20 Newsgroups dataset, which consists of articles from 20 Usenet newsgroups.
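
A minimal sketch of the load step; the dataset path is an assumption, so substitute the location of your copy of the 20 Newsgroups data:

# Hypothetical path; replace with the location of your 20 Newsgroups training data.
training_data = spark.read.parquet("/tmp/20news/training.parquet")
display(training_data.limit(5))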

Specify the run ID associated with a PySpark training run from part 1. You can find the run ID and model path on the run details page.


Set region, run ID, model URI

Note: You must create a new SageMaker endpoint for each new region.
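
A minimal sketch of these assignments; the region and run ID shown are placeholders rather than values from this example:

region = "us-west-2"  # assumption: use the region of your cluster's IAM role
run_id = "<run-id-from-part-1>"  # copy from the run details page
model_uri = "runs:/" + run_id + "/model"  # assumes the model was logged under the artifact path "model"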

Deploy the model to SageMaker

Specify a Docker image in Amazon's Elastic Container Registry (ECR) that will be used by SageMaker to serve the model. There are two ways to obtain the container URL:

  • [Option 1] You can build your own mlflow-pyfunc image and upload it to an ECR repository using the MLflow CLI: mlflow sagemaker build-and-push-container.
  • [Option 2] Contact your Databricks representative for an mlflow-pyfunc image URL in ECR.

Define the ECR URL for the mlflow-pyfunc image that will be passed as an argument to MLflow's deploy function.
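
For example, the assignment might look like the following; the account ID, region, and tag are placeholders:

# Hypothetical ECR URL; replace each component with your repository's values.
image_ecr_url = "123456789012.dkr.ecr.us-west-2.amazonaws.com/mlflow-pyfunc:latest"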

Use MLflow to deploy the model to SageMaker

Using MLflow's SageMaker deployment API, deploy the trained model to SageMaker.
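
A sketch of the deployment call, assuming the MLflow 1.x-style mlflow.sagemaker API that ships with these runtimes (newer MLflow versions expose the same functionality through mlflow.deployments.get_deploy_client); the application name is hypothetical:

import mlflow.sagemaker as mfs

app_name = "mleap-20news"  # hypothetical name; must be unique within the region
mfs.deploy(
    app_name=app_name,
    model_uri=model_uri,
    image_url=image_ecr_url,
    region_name=region,
    mode="create",
)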

Check the status of the deployed model

Check the status of the new SageMaker endpoint using a simple function.

Note: The application status should be Creating. Wait until the status is InService before continuing; until then, query requests will fail.
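
One way to write this status check, using boto3's standard describe_endpoint call:

import boto3

def check_status(app_name, region):
    sage_client = boto3.client("sagemaker", region_name=region)
    endpoint_description = sage_client.describe_endpoint(EndpointName=app_name)
    return endpoint_description["EndpointStatus"]

print("Application status is: {}".format(check_status(app_name, region)))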

Query the deployed model

Construct a query using test data

Load data from the 20 Newsgroups dataset and construct a query DataFrame for the deployed model to evaluate.
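
A sketch of the query construction; the test-data path is an assumption about where part 1 stored the data:

# Hypothetical path; replace with the location of your 20 Newsgroups test split.
test_data = spark.read.parquet("/tmp/20news/test.parquet")
query_df = test_data.limit(10)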

Evaluate the query using the deployed model

Transform the query DataFrame into JSON format and evaluate it by posting the JSON to the deployed model.

Note: Deployed MLeap models only process JSON-serialized pandas DataFrames in the split orientation. You can convert a Spark DataFrame to this format as follows:

model_input_json = spark_dataframe.toPandas().to_json(orient='split')
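
A sketch of the request itself, using boto3's SageMaker runtime client; the content type shown is the pandas-split format accepted by MLflow 1.x scoring servers:

import json
import boto3

def query_endpoint(app_name, region, input_json):
    runtime = boto3.client("sagemaker-runtime", region_name=region)
    response = runtime.invoke_endpoint(
        EndpointName=app_name,
        Body=input_json,
        ContentType="application/json; format=pandas-split",
    )
    return json.loads(response["Body"].read().decode("utf-8"))

model_input_json = query_df.toPandas().to_json(orient="split")
predictions = query_endpoint(app_name, region, model_input_json)
print(predictions)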

Clean up the deployment

Finally, terminate the deployment using MLflow and confirm that the deployment has been terminated.

Delete the deployment using MLflow
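
A sketch of the teardown call, again assuming the MLflow 1.x-style mlflow.sagemaker API:

import mlflow.sagemaker as mfs

# archive=False deletes the associated SageMaker resources instead of archiving them.
mfs.delete(app_name=app_name, region_name=region, archive=False)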

Confirm that the deployment was terminated

By executing the following function, you should see that the SageMaker endpoints associated with the application have been removed.
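
One way to write that check with boto3, listing any remaining endpoints whose names contain the application name:

import boto3

def get_active_endpoints(app_name, region):
    sage_client = boto3.client("sagemaker", region_name=region)
    endpoints = sage_client.list_endpoints(MaxResults=100, NameContains=app_name)["Endpoints"]
    return [endpoint["EndpointName"] for endpoint in endpoints]

print("Endpoints still active for '{}': {}".format(app_name, get_active_endpoints(app_name, region)))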