llama-3.2-multimodal (Python)


Llama 3.2 multimodal example

In this notebook, you learn how to use the Llama-3.2-11B-Vision-Instruct and Llama-3.2-90B-Vision-Instruct models with llama-stack to parse an image of a table into a JSON representation.

These two multimodal models take both text and images as input, and can be used for tasks requiring visual understanding. See the Llama 3.2 11B and Llama 3.2 90B model cards.

Meta Llama 3.2 is licensed under the LLAMA 3.2 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved. Customers are responsible for ensuring their compliance with the terms of this license and the Llama 3.2 Acceptable Use Policy.

Note: Meta does not grant rights to the multimodal models under the Llama 3.2 license for users domiciled in the European Union, or for companies with a principal place of business in the European Union. See the Llama 3.2 Acceptable Use Policy for more information.

Requirements

To run the Llama-3.2-11B-Vision-Instruct model, a single GPU with 30+ GB of VRAM is sufficient. The Llama-3.2-90B-Vision-Instruct model, by contrast, is set up for a single node with 8 GPUs, each with 30+ GB of VRAM.

Setup

First, see https://www.llama.com/llama-downloads/ for model licensing information and instructions to get your presigned download URL from Meta.

Next, install the necessary packages.

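A minimal install cell might look like the following. The llama CLI ships with the llama-stack package, and the llama-models package provides the reference implementation used by the script later in this notebook; the exact package set and versions for your environment may differ.

```python
# Install the llama CLI (provided by llama-stack) plus the packages the
# example script below depends on. Pin versions as needed for your setup.
!pip install llama-stack llama-models
!pip install torch fire Pillow
```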

You can use the Llama CLI to see which Llama series models are available for download.

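With the CLI installed, the listing cell is simply:

```python
# Show every model the llama CLI knows how to download, including the
# Llama 3.2 vision variants used in this notebook.
!llama model list
```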

Download the model

The following defines your download path:

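A sketch of such a cell, assuming the llama CLI's default checkpoint location under ~/.llama (which resolves to /root/.llama when running as root). The variable names MODEL_ID and CHECKPOINT_DIR are illustrative, not part of the CLI:

```python
import os

# Illustrative names; pick the 11B or 90B vision model as appropriate.
MODEL_ID = "Llama3.2-11B-Vision-Instruct"  # or "Llama3.2-90B-Vision-Instruct"

# Default location the llama CLI downloads checkpoints to.
CHECKPOINT_DIR = os.path.expanduser(f"~/.llama/checkpoints/{MODEL_ID}")
print(CHECKPOINT_DIR)
```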

Next, download the model to /root/.llama/checkpoints/&lt;model name&gt;. Copy and paste your presigned URL into the following:

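Using llama-stack's download command, the cell might look like this. META_URL is a placeholder for the presigned URL from Meta, and the {curly-brace} syntax interpolates the Python variables defined above into the shell command:

```python
# Paste the presigned URL you received from Meta between the quotes.
META_URL = "<your presigned download URL>"

# Download the checkpoint with the llama CLI; it lands in CHECKPOINT_DIR.
!llama download --source meta --model-id {MODEL_ID} --meta-url "{META_URL}"
```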

Write the script to disk

Next, write the following script to disk so you can run it with the torchrun command. The script is a modified version of the example scripts provided by Meta.

The example loads and processes an image of a table using the provided prompt. The hardcoded image path assumes that you have downloaded the image into the same folder where this notebook is located. Feel free to experiment with your own image and prompt.

The prompt is marked with ### PROMPT ###, and the image load is marked with ### IMAGE LOAD ###.

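The following is a sketch of such a script, modeled on the multimodal chat-completion example in Meta's llama-models repository. The file name vision_example.py, the image file table.png, and the prompt text are assumptions for illustration, and the ImageMedia/UserMessage datatypes and import paths have shifted between llama-models releases, so verify them against the version you installed. The %%writefile cell magic writes the cell contents to disk.

```python
%%writefile vision_example.py
# Sketch modeled on Meta's multimodal example scripts; import paths and
# datatypes may differ between llama-models versions.
from typing import Optional

import fire
from PIL import Image as PIL_Image

from llama_models.llama3.api.datatypes import ImageMedia, UserMessage
from llama_models.llama3.reference_impl.generation import Llama


def run_main(
    ckpt_dir: str,
    temperature: float = 0.6,
    top_p: float = 0.9,
    max_seq_len: int = 2048,
    max_batch_size: int = 1,
    max_gen_len: Optional[int] = None,
    model_parallel_size: Optional[int] = None,
):
    # Build the reference-implementation generator from the local checkpoint.
    generator = Llama.build(
        ckpt_dir=ckpt_dir,
        max_seq_len=max_seq_len,
        max_batch_size=max_batch_size,
        model_parallel_size=model_parallel_size,
    )

    ### IMAGE LOAD ###
    # Hardcoded path: assumes table.png sits next to this notebook.
    with open("table.png", "rb") as f:
        img = PIL_Image.open(f).convert("RGB")

    ### PROMPT ###
    prompt = "Parse the table in this image into a JSON representation."

    # A single-turn dialog whose user message carries both the image and text.
    dialog = [
        UserMessage(
            content=[
                ImageMedia(image=img),
                prompt,
            ],
        )
    ]

    result = generator.chat_completion(
        dialog,
        max_gen_len=max_gen_len,
        temperature=temperature,
        top_p=top_p,
    )
    out_message = result.generation
    print(f"> {out_message.role.capitalize()}: {out_message.content}")


def main():
    fire.Fire(run_main)


if __name__ == "__main__":
    main()
```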

Run the model

If using the 90B model, whose checkpoint is split 8 ways, you need to run on 8 GPUs. The --nproc_per_node 8 flag accomplishes this, assuming you are connected to a machine with 8 GPUs.

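A launch cell along these lines, assuming the script name and checkpoint path from the earlier cells (adjust both to match your setup):

```python
# One process per GPU; the 90B checkpoint is sharded across 8 ranks.
!torchrun --nproc_per_node 8 vision_example.py --ckpt_dir ~/.llama/checkpoints/Llama3.2-90B-Vision-Instruct
```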

If using the 11B model, you can run on a single GPU using the following:

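```python
# The 11B checkpoint fits on a single 30+ GB GPU, so one process suffices.
!torchrun --nproc_per_node 1 vision_example.py --ckpt_dir ~/.llama/checkpoints/Llama3.2-11B-Vision-Instruct
```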