llama-3.2-multimodal (Python)


Llama 3.2 multimodal example

In this notebook, you learn how to use the Llama-3.2-11B-Vision-Instruct and Llama-3.2-90B-Vision-Instruct models with llama-stack to parse an image of a table into a JSON representation.

These two multimodal models take both text and images as input, and can be used for tasks requiring visual understanding. See the Llama 3.2 11B and Llama 3.2 90B model cards.

Meta Llama 3.2 is licensed under the LLAMA 3.2 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved. Customers are responsible for ensuring their compliance with the terms of this license and the Llama 3.2 Acceptable Use Policy.

Note: Meta does not grant rights to the multimodal models under the Llama 3.2 license for users domiciled in the European Union, or for companies with a principal place of business in the European Union. See the Llama 3.2 Acceptable Use Policy for more information.

Requirements

To run the Llama-3.2-11B-Vision-Instruct model, a single GPU with 30+ GB of VRAM is sufficient. The Llama-3.2-90B-Vision-Instruct model, by contrast, is set up for a single node with 8 GPUs, each with 30+ GB of VRAM.

Setup

First, see https://www.llama.com/llama-downloads/ for model licensing information and instructions to get your presigned download URL from Meta.

Next, install the necessary packages.

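A minimal install cell might look like the following. The llama CLI ships with the llama-stack package, and the llama-models package provides the reference implementation used by the script later in this notebook; the exact package set and versions for your environment may differ.

```python
# Install the llama CLI (provided by llama-stack) plus the packages the
# example script below depends on. Pin versions as needed for your setup.
!pip install llama-stack llama-models
!pip install torch fire Pillow
```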

You can use the Llama CLI to see which Llama series models are available for download.

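With the CLI installed, the listing cell is simply:

```python
# Show every model the llama CLI knows how to download, including the
# Llama 3.2 vision variants used in this notebook.
!llama model list
```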

Download the model

The following defines your download path:

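A sketch of such a cell, assuming the llama CLI's default checkpoint location under ~/.llama (which resolves to /root/.llama when running as root). The variable names MODEL_ID and CHECKPOINT_DIR are illustrative, not part of the CLI:

```python
import os

# Illustrative names; pick the 11B or 90B vision model as appropriate.
MODEL_ID = "Llama3.2-11B-Vision-Instruct"  # or "Llama3.2-90B-Vision-Instruct"

# Default location the llama CLI downloads checkpoints to.
CHECKPOINT_DIR = os.path.expanduser(f"~/.llama/checkpoints/{MODEL_ID}")
print(CHECKPOINT_DIR)
```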

Next, download the model to /root/.llama/checkpoints/&lt;model name&gt;. Copy and paste your presigned URL into the following:

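Using llama-stack's download command, the cell might look like this. META_URL is a placeholder for the presigned URL from Meta, and the {curly-brace} syntax interpolates the Python variables defined above into the shell command:

```python
# Paste the presigned URL you received from Meta between the quotes.
META_URL = "<your presigned download URL>"

# Download the checkpoint with the llama CLI; it lands in CHECKPOINT_DIR.
!llama download --source meta --model-id {MODEL_ID} --meta-url "{META_URL}"
```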

Write the script to disk

Next, write the following script to disk so you can run it with the torchrun command. The script is a modified version of the example scripts provided by Meta.

The example loads and processes an image of a table using the provided prompt. The hardcoded image path assumes that you have downloaded the image into the same folder where this notebook is located. Feel free to experiment with your own image and prompt.

The prompt is marked with ### PROMPT ###, and the image load is marked with ### IMAGE LOAD ###.

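The following is a sketch of such a script, modeled on the multimodal chat-completion example in Meta's llama-models repository. The file name vision_example.py, the image file table.png, and the prompt text are assumptions for illustration, and the ImageMedia/UserMessage datatypes and import paths have shifted between llama-models releases, so verify them against the version you installed. The %%writefile cell magic writes the cell contents to disk.

```python
%%writefile vision_example.py
# Sketch modeled on Meta's multimodal example scripts; import paths and
# datatypes may differ between llama-models versions.
from typing import Optional

import fire
from PIL import Image as PIL_Image

from llama_models.llama3.api.datatypes import ImageMedia, UserMessage
from llama_models.llama3.reference_impl.generation import Llama


def run_main(
    ckpt_dir: str,
    temperature: float = 0.6,
    top_p: float = 0.9,
    max_seq_len: int = 2048,
    max_batch_size: int = 1,
    max_gen_len: Optional[int] = None,
    model_parallel_size: Optional[int] = None,
):
    # Build the reference-implementation generator from the local checkpoint.
    generator = Llama.build(
        ckpt_dir=ckpt_dir,
        max_seq_len=max_seq_len,
        max_batch_size=max_batch_size,
        model_parallel_size=model_parallel_size,
    )

    ### IMAGE LOAD ###
    # Hardcoded path: assumes table.png sits next to this notebook.
    with open("table.png", "rb") as f:
        img = PIL_Image.open(f).convert("RGB")

    ### PROMPT ###
    prompt = "Parse the table in this image into a JSON representation."

    # A single-turn dialog whose user message carries both the image and text.
    dialog = [
        UserMessage(
            content=[
                ImageMedia(image=img),
                prompt,
            ],
        )
    ]

    result = generator.chat_completion(
        dialog,
        max_gen_len=max_gen_len,
        temperature=temperature,
        top_p=top_p,
    )
    out_message = result.generation
    print(f"> {out_message.role.capitalize()}: {out_message.content}")


def main():
    fire.Fire(run_main)


if __name__ == "__main__":
    main()
```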

Run the model

If using the 90B model, whose checkpoint is split 8 ways, you need to run on 8 GPUs. The --nproc_per_node 8 flag accomplishes this, assuming you are connected to a machine with 8 GPUs.

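A launch cell along these lines, assuming the script name and checkpoint path from the earlier cells (adjust both to match your setup):

```python
# One process per GPU; the 90B checkpoint is sharded across 8 ranks.
!torchrun --nproc_per_node 8 vision_example.py --ckpt_dir ~/.llama/checkpoints/Llama3.2-90B-Vision-Instruct
```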

If using the 11B model, you can run on a single GPU using the following:

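```python
# The 11B checkpoint fits on a single 30+ GB GPU, so one process suffices.
!torchrun --nproc_per_node 1 vision_example.py --ckpt_dir ~/.llama/checkpoints/Llama3.2-11B-Vision-Instruct
```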