Query vision models

In this article, you learn how to write query requests for foundation models optimized for vision tasks, and send them to your model serving endpoint.

Mosaic AI Model Serving provides a unified API to understand and analyze images using a variety of foundation models, unlocking powerful multimodal capabilities. This functionality is available through select Databricks-hosted models as part of Foundation Model APIs and serving endpoints that serve external models.

Requirements

Query examples

To use the OpenAI client, specify the model serving endpoint name as the model input.

Python

from openai import OpenAI
import base64
import httpx

# Point the OpenAI client at your workspace's serving endpoints.
client = OpenAI(
    api_key="dapi-your-databricks-token",
    base_url="https://example.staging.cloud.databricks.com/serving-endpoints"
)

# Download the image and base64-encode it for use in a data URL.
image_url = "https://upload.wikimedia.org/wikipedia/commons/a/a7/Camponotus_flavomarginatus_ant.jpg"
image_data = base64.standard_b64encode(httpx.get(image_url).content).decode("utf-8")

# Send the text prompt and the image together in one user message.
completion = client.chat.completions.create(
    model="databricks-claude-3-7-sonnet",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "what's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_data}"},
                },
            ],
        }
    ],
)

print(completion.choices[0].message.content)
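
The example above hard-codes `image/jpeg` in the data URL, which only matches JPEG input. The following is a minimal sketch, not part of the Databricks API, of one way to pick the MIME type from the image bytes themselves. The helper names `sniff_image_mime` and `to_data_url` are hypothetical, and the sketch assumes only the four formats listed under Input image requirements (JPEG, PNG, GIF, WebP).

```python
import base64


def sniff_image_mime(data: bytes) -> str:
    """Guess the MIME type from the leading bytes of an image file."""
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "image/gif"
    # WebP files are RIFF containers with "WEBP" at byte offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    raise ValueError("unrecognized image format")


def to_data_url(data: bytes) -> str:
    """Return a base64 data URL for use in an image_url content part."""
    encoded = base64.standard_b64encode(data).decode("utf-8")
    return f"data:{sniff_image_mime(data)};base64,{encoded}"
```

With a helper like this, the `url` value in the request could be written as `to_data_url(httpx.get(image_url).content)` instead of a hand-built f-string.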

The Chat Completions API supports multiple image inputs, allowing the model to analyze each image and synthesize information from all inputs to generate a response to the prompt.

Python

from openai import OpenAI
import base64
import httpx

# Point the OpenAI client at your workspace's serving endpoints.
client = OpenAI(
    api_key="dapi-your-databricks-token",
    base_url="https://example.staging.cloud.databricks.com/serving-endpoints"
)

# Download and base64-encode each image.
image1_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
image1_data = base64.standard_b64encode(httpx.get(image1_url).content).decode("utf-8")

image2_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
image2_data = base64.standard_b64encode(httpx.get(image2_url).content).decode("utf-8")

# Send the text prompt and both images in one user message.
completion = client.chat.completions.create(
    model="databricks-claude-3-7-sonnet",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in these images? Is there any difference between them?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image1_data}"},
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image2_data}"},
                },
            ],
        }
    ],
)

print(completion.choices[0].message.content)
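
For more than a couple of images, writing each `image_url` part by hand gets repetitive. As a rough sketch, the `content` list can be built in a loop. The function name `build_vision_content` and its `max_images` parameter are hypothetical. The guard is a client-side convenience based on the per-model image counts listed under Input image requirements, not something the API requires.

```python
from typing import Dict, List, Optional


def build_vision_content(
    prompt: str,
    data_urls: List[str],
    max_images: Optional[int] = None,
) -> List[Dict]:
    """Build the `content` list for one user message: a text part plus one
    image_url part per data URL, in the order given."""
    if max_images is not None and len(data_urls) > max_images:
        raise ValueError(
            f"{len(data_urls)} images supplied, but the limit is {max_images}"
        )
    content: List[Dict] = [{"type": "text", "text": prompt}]
    for url in data_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return content
```

The result can be passed as `messages=[{"role": "user", "content": build_vision_content(...)}]` in the same `client.chat.completions.create` call shown above.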

Supported models

See Foundation model types for supported vision models.

Input image requirements

databricks-gemma-3-12b

  • Supported formats: JPEG, PNG, WebP, GIF
  • Multiple images per request: Up to 5 images for API requests. All provided images are processed in a request.
  • Image size limitations: File size limit of 10 MB total across all images per API request
  • Image resizing recommendations: N/A
  • Image quality considerations: N/A

databricks-llama-4-maverick

  • Supported formats: JPEG, PNG, WebP, GIF
  • Multiple images per request: Up to 5 images for API requests. All provided images are processed in a request.
  • Image size limitations: File size limit of 10 MB total across all images per API request
  • Image resizing recommendations: N/A
  • Image quality considerations: N/A

databricks-gpt-5

  • Supported formats: JPEG, PNG, WebP, GIF (non-animated)
  • Multiple images per request: Up to 500 individual image inputs per request
  • Image size limitations: File size limit of up to 10 MB total payload size per request
  • Image resizing recommendations: N/A
  • Image quality considerations:
    • No watermarks or logos
    • Clear enough for a human to understand

databricks-gpt-5-mini

  • Supported formats: JPEG, PNG, WebP, GIF (non-animated)
  • Multiple images per request: Up to 500 individual image inputs per request
  • Image size limitations: File size limit of up to 10 MB total payload size per request
  • Image resizing recommendations: N/A
  • Image quality considerations:
    • No watermarks or logos
    • Clear enough for a human to understand

databricks-gpt-5-nano

  • Supported formats: JPEG, PNG, WebP, GIF (non-animated)
  • Multiple images per request: Up to 500 individual image inputs per request
  • Image size limitations: File size limit of up to 10 MB total payload size per request
  • Image resizing recommendations: N/A
  • Image quality considerations:
    • No watermarks or logos
    • Clear enough for a human to understand

databricks-claude-3-7-sonnet

  • Supported formats: JPEG, PNG, GIF, WebP
  • Multiple images per request:
    • Up to 20 images for Claude.ai
    • Up to 100 images for API requests
    • All provided images are processed in a request, which is useful for comparing or contrasting them.
  • Image size limitations:
    • Images larger than 8000 x 8000 px are rejected.
    • If more than 20 images are submitted in one API request, the maximum allowed size per image is 2000 x 2000 px.
  • Image resizing recommendations: For optimal performance, resize images before uploading if they are too large.
    • If an image's long edge exceeds 1568 pixels or its size exceeds ~1,600 tokens, it is automatically scaled down while preserving aspect ratio.
    • Very small images (under 200 pixels on any edge) may degrade performance.
    • To reduce latency, keep images within 1.15 megapixels and at most 1568 pixels in both dimensions.
  • Image quality considerations:
    • Clarity: Avoid blurry or pixelated images.
    • Text in images:
      • Ensure text is legible and not too small.
      • Avoid cropping out key visual context just to enlarge the text.
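
The resizing guidance for databricks-claude-3-7-sonnet comes down to simple arithmetic on image dimensions. The sketch below computes a target size that satisfies both the 1568-pixel edge limit and the 1.15-megapixel limit while preserving aspect ratio. The names `target_size` and `is_too_small` are hypothetical, the ~1,600-token threshold is not modeled, and the code only computes dimensions. The actual resize would be done with an image library of your choice.

```python
import math

MAX_EDGE_PX = 1568      # longest edge, per the resizing recommendation
MAX_PIXELS = 1_150_000  # 1.15 megapixels
MIN_EDGE_PX = 200       # images with an edge under this may degrade performance


def target_size(width: int, height: int) -> tuple:
    """Return (width, height) scaled down, never up, to fit both limits."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = min(
        1.0,
        MAX_EDGE_PX / max(width, height),
        math.sqrt(MAX_PIXELS / (width * height)),
    )
    # Round down so the result never exceeds either limit.
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


def is_too_small(width: int, height: int) -> bool:
    """True if any edge is under the 200-pixel guideline."""
    return min(width, height) < MIN_EDGE_PX
```

For example, a 1000 x 800 image is already within both limits and is returned unchanged, while a 3136 x 3136 image is constrained by the megapixel limit rather than the edge limit.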

Image to token conversion

This section applies only to Foundation Model APIs. For external models, refer to the provider's documentation.

Each image in a request to a foundation model adds to your token usage. See the pricing calculator to estimate image pricing based on the token usage and model you are using.
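
One way to see how much an image adds is to inspect the token counts on the response. The sketch below assumes the endpoint returns the standard `usage` object that the OpenAI Python client exposes on chat completions. Whether a given endpoint populates it is an assumption here, so the helper returns an empty dictionary when it is absent. The name `summarize_usage` is hypothetical, and the numbers in the stand-in response are made-up placeholders, not measurements.

```python
from types import SimpleNamespace


def summarize_usage(completion) -> dict:
    """Extract token counts from a chat completion, if the endpoint sent them."""
    usage = getattr(completion, "usage", None)
    if usage is None:
        return {}
    return {
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "total_tokens": usage.total_tokens,
    }


# Stand-in for a real response so the sketch runs without a network call.
# The values are placeholders for illustration only.
fake_completion = SimpleNamespace(
    usage=SimpleNamespace(prompt_tokens=1200, completion_tokens=80, total_tokens=1280)
)
print(summarize_usage(fake_completion))
```

Comparing `prompt_tokens` for the same prompt sent with and without an image gives a rough sense of the image's contribution.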

Limitations of image understanding

This section applies only to Foundation Model APIs. For external models, refer to the provider's documentation.

The following are image understanding limitations for the supported Databricks-hosted foundation models:

databricks-claude-3-7-sonnet

The following are the limits for Claude models on Databricks:

  • Avoid using Claude for tasks requiring perfect precision or sensitive analysis without human oversight.
  • People identification: Cannot identify or name people in images.
  • Accuracy: May misinterpret low-quality, rotated, or very small images (under 200 px).
  • Spatial reasoning: Struggles with precise layouts, such as reading analog clocks or chess positions.
  • Counting: Provides approximate counts, but may be inaccurate for many small objects.
  • AI-generated images: Cannot reliably detect synthetic or fake images.
  • Inappropriate content: Blocks explicit or policy-violating images.
  • Healthcare: Not suited for complex medical scans (for example, CTs and MRIs). It's not a diagnostic tool.

Additional resources