Mosaic AI Model Serving concepts

This page provides definitions of key concepts that are used in Mosaic AI Model Serving for model deployments.

Endpoint

REST API that exposes one or more served models for inference.

Route-optimized endpoint

Endpoint property that enables an improved network path with faster, more direct communication between the user and the model during inference. For more information, see Route optimization on serving endpoints.

Provisioned concurrency

Endpoint property that specifies the maximum number of parallel requests an endpoint can handle. Estimate the required concurrency using the formula: provisioned concurrency = queries per second (QPS) × model execution time (s).
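As an illustration, the formula above can be sketched in a few lines of Python. The function name and the rounding up to a whole number of parallel requests are assumptions for this sketch, not part of the Model Serving API:

```python
from math import ceil

def required_concurrency(qps: float, model_execution_seconds: float) -> int:
    """Estimate provisioned concurrency as QPS * model execution time,
    rounded up to a whole number of parallel requests."""
    # Hypothetical helper, not a Model Serving API call.
    return ceil(qps * model_execution_seconds)

# e.g. 15 queries per second, each taking 0.25 s to execute
print(required_concurrency(15, 0.25))  # -> 4
```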

Scale to zero

Endpoint property that automatically reduces resource consumption to zero when the endpoint is not in use. Scale to zero is recommended for testing and development. However, it is not recommended for production endpoints, because latency is higher and capacity is not guaranteed while the endpoint is scaled to zero.
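For example, scale to zero is enabled per served entity with the scale_to_zero_enabled flag when creating an endpoint. This is a minimal sketch; the endpoint name and entity name here are hypothetical:

```json
POST /api/2.0/serving-endpoints
{
  "name": "dev-test-model",
  "config": {
    "served_entities": [
      {
        "entity_name": "ml.default.my_dev_model",
        "entity_version": "1",
        "workload_size": "Small",
        "scale_to_zero_enabled": true
      }
    ]
  }
}
```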

Served entity

Named deployment unit inside an endpoint that represents a specific model with its compute configuration that can receive routed traffic.

Traffic configuration

Specification for what percentage of traffic to an endpoint should go to each model. Traffic configuration is required for endpoints with more than one served model.

The following is an example where the endpoint named multi-pt-model hosts version 4 of meta_llama_v3_1_8b_instruct, which receives 60% of the endpoint traffic, and version 4 of meta_llama_v3_1_70b_instruct, which receives the remaining 40%. For more information, see Serve multiple models to a model serving endpoint.

Bash

POST /api/2.0/serving-endpoints
{
  "name": "multi-pt-model",
  "config": {
    "served_entities": [
      {
        "name": "meta_llama_v3_1_70b_instruct",
        "entity_name": "system.ai.meta_llama_v3_1_70b_instruct",
        "entity_version": "4",
        "min_provisioned_throughput": 0,
        "max_provisioned_throughput": 2400
      },
      {
        "name": "meta_llama_v3_1_8b_instruct",
        "entity_name": "system.ai.meta_llama_v3_1_8b_instruct",
        "entity_version": "4",
        "min_provisioned_throughput": 0,
        "max_provisioned_throughput": 1240
      }
    ],
    "traffic_config": {
      "routes": [
        {
          "served_model_name": "meta_llama_v3_1_8b_instruct",
          "traffic_percentage": 60
        },
        {
          "served_model_name": "meta_llama_v3_1_70b_instruct",
          "traffic_percentage": 40
        }
      ]
    }
  }
}