Deploy a Hugging Face transformers model with Model Serving
This notebook demonstrates how to deploy a model logged with the Hugging Face transformers MLflow flavor to a model serving endpoint. This example deploys a GPT-2 model to a GPU endpoint, but the workflow outlined here can be adapted for deploying other types of models to either CPU or GPU endpoints.
Install and import libraries
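As a minimal sketch of this step, the following installs and imports the libraries used in the rest of the notebook. The package list and versions are illustrative; install only what your environment is missing.

```python
# In a Databricks notebook, install or upgrade libraries in their own cell, for example:
# %pip install --upgrade mlflow transformers torch accelerate
# dbutils.library.restartPython()

import mlflow
import requests
from transformers import AutoTokenizer, GPT2LMHeadModel, pipeline
```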
Initialize and configure your model
Define and configure your model using any popular ML framework.
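For example, a sketch that initializes GPT-2 as a Hugging Face text-generation pipeline (the model and task match this example; any framework with an MLflow flavor could be substituted):

```python
from transformers import AutoTokenizer, GPT2LMHeadModel, pipeline

# Download GPT-2 from the Hugging Face Hub and wrap it in a text-generation pipeline.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

text_generation_pipeline = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
)
```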
Log your model using MLflow
The following code defines the inference parameters to pass to the model at inference time and the model's input/output schema, then logs the model with the MLflow Hugging Face transformers flavor.
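A minimal sketch of that step, assuming the `text_generation_pipeline` defined above and a recent MLflow version that supports signature params; the parameter values, sample input, and registered model name are illustrative:

```python
import mlflow
from mlflow.models import infer_signature
from mlflow.transformers import generate_signature_output

# Inference parameters to pass to the model at inference time (values are illustrative).
inference_params = {"max_new_tokens": 50, "temperature": 0.7, "do_sample": True}

# Infer the model signature (input/output schema plus params) from a sample input.
sample_input = "Hello, my name is"
sample_output = generate_signature_output(text_generation_pipeline, sample_input)
signature = infer_signature(sample_input, sample_output, params=inference_params)

# Log the pipeline with the MLflow Hugging Face transformers flavor.
with mlflow.start_run():
    model_info = mlflow.transformers.log_model(
        transformers_model=text_generation_pipeline,
        artifact_path="gpt2-text-generation",
        signature=signature,
        input_example=sample_input,
        registered_model_name="gpt2-text-generation",  # hypothetical registered model name
    )
```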
Test your model in a notebook
The following command loads the model so you can generate a prediction with the given parameters.
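For example, a sketch that assumes the `model_info` and `inference_params` from the logging step above:

```python
import mlflow

# Load the logged model back as a pyfunc and generate a test prediction in the notebook.
loaded_model = mlflow.pyfunc.load_model(model_info.model_uri)
prediction = loaded_model.predict("Hello, my name is", params=inference_params)
print(prediction)
```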
Configure and create your model serving endpoint
The following variables set the values for configuring the model serving endpoint, such as the endpoint name, compute type, and which model to serve with the endpoint. After you call the create endpoint API, the logged model is deployed to the endpoint.
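A sketch of that configuration using the Databricks serving endpoints REST API; the endpoint name, workload settings, model name and version, and authentication values are assumptions you should replace with your own:

```python
import requests

# Endpoint configuration values (names and versions are illustrative).
endpoint_name = "gpt2-text-generation-endpoint"
registered_model_name = "gpt2-text-generation"
model_version = "1"
workload_type = "GPU_SMALL"   # choose a CPU workload type for CPU serving
workload_size = "Small"

databricks_host = "https://<your-workspace-url>"   # hypothetical: your workspace URL
token = "<your-access-token>"                      # hypothetical: your access token

endpoint_config = {
    "name": endpoint_name,
    "config": {
        "served_entities": [
            {
                "entity_name": registered_model_name,
                "entity_version": model_version,
                "workload_type": workload_type,
                "workload_size": workload_size,
                "scale_to_zero_enabled": False,
            }
        ]
    },
}

# Call the create-endpoint API to deploy the logged model to the endpoint.
response = requests.post(
    f"{databricks_host}/api/2.0/serving-endpoints",
    headers={"Authorization": f"Bearer {token}"},
    json=endpoint_config,
)
response.raise_for_status()
print(response.json())
```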
View your endpoint
For more information about your endpoint, go to the Serving UI and search for your endpoint name.
Query your endpoint
Once your endpoint is ready, you can query it by making an API request. Depending on the model's size and complexity, it can take 30 minutes or more for the endpoint to become ready.
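For example, a sketch of a query against the endpoint's invocations API, reusing the `databricks_host`, `token`, and `endpoint_name` values assumed above; the input text and parameters are illustrative:

```python
import requests

# Query the serving endpoint once its state is READY.
query_payload = {
    "inputs": ["Hello, my name is"],
    "params": {"max_new_tokens": 50, "temperature": 0.7},
}

response = requests.post(
    f"{databricks_host}/serving-endpoints/{endpoint_name}/invocations",
    headers={"Authorization": f"Bearer {token}"},
    json=query_payload,
)
response.raise_for_status()
print(response.json())
```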