Provisioned Throughput Mistral serving example

Provisioned Throughput provides optimized inference for Foundation Models, with performance guarantees for production workloads. Currently, Databricks supports optimizations for the Llama2, Mosaic MPT, and Mistral families of models.

This example walks through:

  1. Downloading the model from Hugging Face using the transformers library
  2. Logging the model in a provisioned throughput supported format to Databricks Unity Catalog or the Workspace Model Registry
  3. Enabling provisioned throughput on the model

Prerequisites

  • Attach a cluster with sufficient memory to the notebook
  • Make sure to have MLflow version 2.11 or later installed
  • Make sure to enable Models in Unity Catalog (UC), especially when working with models larger than 7B parameters

Step 1: Log the model for optimized LLM serving

Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.
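The install cell itself isn't shown in this export. A minimal sketch of such a cell, assuming you only need to upgrade MLflow to meet the 2.11+ prerequisite:

%pip install -U mlflow

# In a separate cell, restart the Python process so the upgraded packages are picked up
dbutils.library.restartPython()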

(Download output: Hugging Face fetches config.json, model.safetensors.index.json, two model shards of about 9.94 GB and 4.54 GB, generation_config.json, and the tokenizer files.)
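The download cell is likewise omitted from this export. Below is a minimal sketch that would produce the downloads listed above; the exact checkpoint name is an assumption, so substitute the Mistral variant you intend to serve.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint; any Mistral 7B instruct model with sharded safetensors matches the output above
model_id = "mistralai/Mistral-7B-Instruct-v0.1"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # load in half precision to reduce memory
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)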

To enable optimized serving, include the following extra metadata dictionary when calling mlflow.transformers.log_model:

metadata = {"task": "llm/v1/chat"}

This specifies the API signature used for the model serving endpoint.
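The logging cell itself isn't included in this export. A minimal sketch of the call, assuming the model and tokenizer objects from the download step above and the Unity Catalog model name that appears in the output below (ml.llm-catalog.mistral7B); the input example is illustrative only:

import mlflow

# Register the model in Unity Catalog under <catalog>.<schema>.<model>
mlflow.set_registry_uri("databricks-uc")
registered_model_name = "ml.llm-catalog.mistral7B"

# Illustrative chat-style input used to infer the model signature
input_example = {
    "messages": [{"role": "user", "content": "What is Machine Learning?"}],
    "max_tokens": 75,
    "temperature": 0.0,
}

with mlflow.start_run():
    mlflow.transformers.log_model(
        transformers_model={"model": model, "tokenizer": tokenizer},  # from the download cell
        artifact_path="model",
        task="llm/v1/chat",                # chat task type (supported by MLflow 2.11+)
        metadata={"task": "llm/v1/chat"},  # enables provisioned throughput serving
        input_example=input_example,
        registered_model_name=registered_model_name,
    )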

Registered model 'ml.llm-catalog.mistral7B' already exists. Creating a new version of this model... Created version '2' of model 'ml.llm-catalog.mistral7b'.

Step 2: View optimization information for your model

Modify the cell below to change the model name. After calling the model optimization information API, you can retrieve the throughput chunk size for your model: the number of tokens per second that corresponds to one throughput unit for your specific model.
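The query cell isn't shown in this export. Here is a minimal sketch, assuming the serving-endpoints optimization-info REST route and placeholder DATABRICKS_HOST / DATABRICKS_TOKEN environment variables for the workspace URL and access token:

import os
import requests

API_ROOT = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
API_TOKEN = os.environ["DATABRICKS_TOKEN"]  # personal access token

model_name = "ml.llm-catalog.mistral7B"     # UC model registered in Step 1
model_version = 2

response = requests.get(
    f"{API_ROOT}/api/2.0/serving-endpoints/get-model-optimization-info/{model_name}/{model_version}",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
print(response.json())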

{ "optimizable": true, "model_type": "mistral", "throughput_chunk_size": 970, "dbus": 24 }

Step 3: Configure and create your model serving GPU endpoint

Modify the cell below to change the endpoint name. After calling the create endpoint API, the logged Mistral model is automatically deployed with optimized LLM serving.
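The creation cell isn't included in this export (only its output survives, below). A minimal sketch of an equivalent request, using the endpoint name, model version, and throughput chunk size that appear in that output:

import os
import requests

API_ROOT = os.environ["DATABRICKS_HOST"]
API_TOKEN = os.environ["DATABRICKS_TOKEN"]

# Values taken from the output below; the throughput bounds come from throughput_chunk_size in Step 2
endpoint_name = "mistral7B"
data = {
    "name": endpoint_name,
    "config": {
        "served_entities": [
            {
                "entity_name": "ml.llm-catalog.mistral7B",
                "entity_version": "2",
                "min_provisioned_throughput": 970,
                "max_provisioned_throughput": 970,
            }
        ]
    },
}

response = requests.post(
    f"{API_ROOT}/api/2.0/serving-endpoints",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json=data,
)
print(response.json())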

{ "name": "mistral7B", "creator": "ahmed.bilal@databricks.com", "creation_timestamp": 1711430931000, "last_updated_timestamp": 1711430931000, "state": { "ready": "NOT_READY", "config_update": "IN_PROGRESS" }, "pending_config": { "start_time": 1711430931000, "served_models": [ { "name": "mistral7B-2", "model_name": "ml.llm-catalog.mistral7B", "model_version": "2", "workload_size": "Small", "workload_type": "GPU_MEDIUM", "min_provisioned_throughput": 970, "max_provisioned_throughput": 970, "dbus": 24.0, "min_dbus": 24.0, "max_dbus": 24.0, "state": { "deployment": "DEPLOYMENT_CREATING", "deployment_state_message": "Creating resources for served entity." }, "creator": "ahmed.bilal@databricks.com", "creation_timestamp": 1711430931000 } ], "served_entities": [ { "name": "mistral7B-2", "entity_name": "ml.llm-catalog.mistral7B", "entity_version": "2", "workload_size": "Small", "workload_type": "GPU_MEDIUM", "min_provisioned_throughput": 970, "max_provisioned_throughput": 970, "dbus": 24.0, "min_dbus": 24.0, "max_dbus": 24.0, "state": { "deployment": "DEPLOYMENT_CREATING", "deployment_state_message": "Creating resources for served entity." }, "creator": "ahmed.bilal@databricks.com", "creation_timestamp": 1711430931000 } ], "config_version": 1, "traffic_config": { "routes": [ { "served_model_name": "mistral7B-2", "traffic_percentage": 100, "served_entity_name": "mistral7B-2" } ] } }, "id": "f7faef823cd94963b510510971bf5aeb", "permission_level": "CAN_MANAGE", "route_optimized": false }

View your endpoint

To see more information about your endpoint, go to Serving in the left navigation bar and search for your endpoint name.

Step 4: Query your endpoint

Depending on the model size and complexity, it can take 30 minutes or more for the endpoint to become ready. Once your endpoint is ready, you can query it by making an API request.
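The query cell is omitted from this export. A minimal sketch of a chat-format request, assuming the endpoint name from Step 3 and the same placeholder environment variables as above; the prompt shown is illustrative:

import os
import requests

API_ROOT = os.environ["DATABRICKS_HOST"]
API_TOKEN = os.environ["DATABRICKS_TOKEN"]

endpoint_name = "mistral7B"

# Chat-format payload matching the llm/v1/chat signature the model was logged with
data = {
    "messages": [{"role": "user", "content": "What is AI? Answer in 10 words."}],
    "max_tokens": 128,
}

response = requests.post(
    f"{API_ROOT}/serving-endpoints/{endpoint_name}/invocations",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json=data,
)
print(response.json())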

{ "id": "chatcmpl-e52b94bf7d4f4922b0988375bc1cdef4", "object": "chat.completion", "created": 1711431826, "choices": [ { "index": 0, "message": { "role": "assistant", "content": " AI: Simulating human intelligence through algorithms and data." }, "finish_reason": "stop" } ], "usage": { "prompt_tokens": 16, "completion_tokens": 12, "total_tokens": 28 } }