Deploy a Hugging Face transformers model with Model Serving
This notebook demonstrates how to deploy a model logged with the Hugging Face transformers MLflow flavor to a model serving endpoint. This example deploys a GPT-2 model to a GPU endpoint, but the workflow outlined here can be adapted for deploying other types of models to either CPU or GPU endpoints.
Install and import libraries
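As a minimal sketch of this step, the following installs and imports the libraries used in the rest of the notebook. The package list and versions are illustrative; install only what your environment is missing.

```python
# In a Databricks notebook, install or upgrade libraries in their own cell, for example:
# %pip install --upgrade mlflow transformers torch accelerate
# dbutils.library.restartPython()

import mlflow
import requests
from transformers import AutoTokenizer, GPT2LMHeadModel, pipeline
```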
Initialize and configure your model
Define and configure your model using any popular ML framework.
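For example, a sketch that initializes GPT-2 as a Hugging Face text-generation pipeline (the model and task match this example; any framework with an MLflow flavor could be substituted):

```python
from transformers import AutoTokenizer, GPT2LMHeadModel, pipeline

# Download GPT-2 from the Hugging Face Hub and wrap it in a text-generation pipeline.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

text_generation_pipeline = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
)
```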
Log your model using MLflow
The following code defines the inference parameters to pass to the model at inference time and the model's input/output schema, then logs the model with the MLflow Hugging Face transformers flavor.
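A minimal sketch of that step, assuming the `text_generation_pipeline` defined above and a recent MLflow version that supports signature params; the parameter values, sample input, and registered model name are illustrative:

```python
import mlflow
from mlflow.models import infer_signature
from mlflow.transformers import generate_signature_output

# Inference parameters to pass to the model at inference time (values are illustrative).
inference_params = {"max_new_tokens": 50, "temperature": 0.7, "do_sample": True}

# Infer the model signature (input/output schema plus params) from a sample input.
sample_input = "Hello, my name is"
sample_output = generate_signature_output(text_generation_pipeline, sample_input)
signature = infer_signature(sample_input, sample_output, params=inference_params)

# Log the pipeline with the MLflow Hugging Face transformers flavor.
with mlflow.start_run():
    model_info = mlflow.transformers.log_model(
        transformers_model=text_generation_pipeline,
        artifact_path="gpt2-text-generation",
        signature=signature,
        input_example=sample_input,
        registered_model_name="gpt2-text-generation",  # hypothetical registered model name
    )
```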
Test your model in a notebook
The following command loads the model so you can generate a prediction with the given parameters.
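For example, a sketch that assumes the `model_info` and `inference_params` from the logging step above:

```python
import mlflow

# Load the logged model back as a pyfunc and generate a test prediction in the notebook.
loaded_model = mlflow.pyfunc.load_model(model_info.model_uri)
prediction = loaded_model.predict("Hello, my name is", params=inference_params)
print(prediction)
```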
Configure and create your model serving endpoint
The following variables set the values for configuring the model serving endpoint, such as the endpoint name, compute type, and which model to serve with the endpoint. After you call the create endpoint API, the logged model is deployed to the endpoint.
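A sketch of that configuration using the Databricks serving endpoints REST API; the endpoint name, workload settings, model name and version, and authentication values are assumptions you should replace with your own:

```python
import requests

# Endpoint configuration values (names and versions are illustrative).
endpoint_name = "gpt2-text-generation-endpoint"
registered_model_name = "gpt2-text-generation"
model_version = "1"
workload_type = "GPU_SMALL"   # choose a CPU workload type for CPU serving
workload_size = "Small"

databricks_host = "https://<your-workspace-url>"   # hypothetical: your workspace URL
token = "<your-access-token>"                      # hypothetical: your access token

endpoint_config = {
    "name": endpoint_name,
    "config": {
        "served_entities": [
            {
                "entity_name": registered_model_name,
                "entity_version": model_version,
                "workload_type": workload_type,
                "workload_size": workload_size,
                "scale_to_zero_enabled": False,
            }
        ]
    },
}

# Call the create-endpoint API to deploy the logged model to the endpoint.
response = requests.post(
    f"{databricks_host}/api/2.0/serving-endpoints",
    headers={"Authorization": f"Bearer {token}"},
    json=endpoint_config,
)
response.raise_for_status()
print(response.json())
```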
View your endpoint
For more information about your endpoint, go to the Serving UI and search for your endpoint name.
Query your endpoint
Once your endpoint is ready, you can query it by making an API request. Depending on the model's size and complexity, it can take 30 minutes or more for the endpoint to become ready.
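For example, a sketch of a query against the endpoint's invocations API, reusing the `databricks_host`, `token`, and `endpoint_name` values assumed above; the input text and parameters are illustrative:

```python
import requests

# Query the serving endpoint once its state is READY.
query_payload = {
    "inputs": ["Hello, my name is"],
    "params": {"max_new_tokens": 50, "temperature": 0.7},
}

response = requests.post(
    f"{databricks_host}/serving-endpoints/{endpoint_name}/invocations",
    headers={"Authorization": f"Bearer {token}"},
    json=query_payload,
)
response.raise_for_status()
print(response.json())
```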