
Query audio and video models

This page describes how to send audio and video inputs to Gemini foundation models served by Unity AI Gateway on Databricks. You can provide media as a URL or as base64-encoded inline data using the Chat Completions API or the Google Gemini API.

Requirements

Input methods

You can provide audio and video inputs using two methods:

  • URL: Pass a publicly accessible URL to the media file. For video, YouTube URLs are also supported.
  • Base64 inline data: Encode the file as a base64 string and pass it as a data URI (for example, data:video/mp4;base64,<encoded_data>).
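
As an illustration of the base64 method, the following minimal sketch builds a data URI in the format shown above. The helper names `to_data_uri` and `file_to_data_uri` are hypothetical and not part of any Databricks or OpenAI library; the MIME type is passed explicitly rather than guessed.

```python
import base64
from pathlib import Path


def to_data_uri(data: bytes, mime_type: str) -> str:
    """Return a data URI of the form data:<mime_type>;base64,<encoded_data>."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def file_to_data_uri(path: str, mime_type: str) -> str:
    """Read a local file (hypothetical path) and encode it as a data URI."""
    return to_data_uri(Path(path).read_bytes(), mime_type)
```

For example, `file_to_data_uri("video.mp4", "video/mp4")` would produce a string that can be used wherever a media URL is accepted.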

Note

The following examples are based on Unity AI Gateway and model services. If you use model serving endpoints instead of model services, replace the model service name with an endpoint name. See Databricks-hosted foundation models available in Foundation Model APIs for a list of available foundation models and their model service and endpoint names.

Chat Completions API

The Chat Completions API accepts video and audio input. Use the video_url and audio_url content types in the messages array to pass media. Each content item includes a url field that accepts either a web URL or a base64 data URI.

Video input

Python
import os
import base64
from openai import OpenAI

DATABRICKS_TOKEN = os.environ.get('DATABRICKS_TOKEN')
DATABRICKS_BASE_URL = os.environ.get('DATABRICKS_BASE_URL')

client = OpenAI(
    api_key=DATABRICKS_TOKEN,
    base_url=DATABRICKS_BASE_URL
)

# Encode a local video file as base64
with open("video.mp4", "rb") as f:
    video_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="system.ai.gemini-3-1-pro",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize what happens in these videos."},
            # Video referenced by URL
            {
                "type": "video_url",
                "video_url": {"url": "https://example.com/sample-video.mp4"}
            },
            # Video passed inline as a base64 data URI
            {
                "type": "video_url",
                "video_url": {"url": f"data:video/mp4;base64,{video_b64}"}
            },
        ]
    }],
    max_tokens=1024
)

print(response.choices[0].message.content)

Audio input

Python
import os
import base64
from openai import OpenAI

DATABRICKS_TOKEN = os.environ.get('DATABRICKS_TOKEN')
DATABRICKS_BASE_URL = os.environ.get('DATABRICKS_BASE_URL')

client = OpenAI(
    api_key=DATABRICKS_TOKEN,
    base_url=DATABRICKS_BASE_URL
)

# Encode a local audio file as base64
with open("audio.mp3", "rb") as f:
    audio_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="system.ai.gemini-3-1-pro",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe this audio and summarize the key points."},
            # Audio referenced by URL
            {
                "type": "audio_url",
                "audio_url": {"url": "https://example.com/sample-audio.mp3"}
            },
            # Audio passed inline as a base64 data URI
            {
                "type": "audio_url",
                "audio_url": {"url": f"data:audio/mp3;base64,{audio_b64}"}
            },
        ]
    }],
    max_tokens=1024
)

print(response.choices[0].message.content)
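
The video and audio requests above differ only in the content type name, so the message structure can be assembled by a small helper. The following is a sketch under that assumption; `media_item` and `build_messages` are hypothetical names introduced here for illustration, not part of the OpenAI client.

```python
from typing import Iterable, List, Tuple


def media_item(kind: str, url: str) -> dict:
    """Build one content item; kind is "video" or "audio"."""
    if kind not in ("video", "audio"):
        raise ValueError(f"unsupported media kind: {kind!r}")
    key = f"{kind}_url"  # "video_url" or "audio_url"
    return {"type": key, key: {"url": url}}


def build_messages(prompt: str, media: Iterable[Tuple[str, str]]) -> List[dict]:
    """Return a messages array with one user turn: the prompt, then each media item."""
    content = [{"type": "text", "text": prompt}]
    content.extend(media_item(kind, url) for kind, url in media)
    return [{"role": "user", "content": content}]


messages = build_messages(
    "Summarize this clip and its soundtrack.",
    [
        ("video", "https://example.com/sample-video.mp4"),
        ("audio", "https://example.com/sample-audio.mp3"),
    ],
)
```

The resulting list can be passed as the `messages` argument of `client.chat.completions.create`, as in the examples above. The `url` value can be either a web URL or a base64 data URI.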

Google Gemini API

Use the Google Gemini API to pass media as inlineData (base64-encoded) or fileData (URL reference) within the parts array.
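
To make the inlineData and fileData shapes concrete, the following sketch builds the request body as a plain dictionary, assuming the gateway accepts the standard Gemini generateContent JSON layout with camelCase field names. The helper names `file_data_part` and `inline_data_part` are hypothetical; the SDK examples in this section express the same structure with typed objects.

```python
import base64
import json


def file_data_part(uri: str, mime_type: str) -> dict:
    """A part that references media by URL."""
    return {"fileData": {"mimeType": mime_type, "fileUri": uri}}


def inline_data_part(data: bytes, mime_type: str) -> dict:
    """A part that carries the media itself, base64-encoded."""
    encoded = base64.b64encode(data).decode("ascii")
    return {"inlineData": {"mimeType": mime_type, "data": encoded}}


body = {
    "contents": [
        {
            "role": "user",
            "parts": [
                {"text": "Summarize what happens in these videos."},
                file_data_part("https://example.com/sample-video.mp4", "video/mp4"),
                # Placeholder bytes stand in for real file contents.
                inline_data_part(b"hello", "video/mp4"),
            ],
        }
    ]
}

print(json.dumps(body, indent=2))
```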

Video input

Python
from google import genai
from google.genai import types
import base64
import os

DATABRICKS_TOKEN = os.environ.get('DATABRICKS_TOKEN')

client = genai.Client(
    api_key="databricks",
    http_options=types.HttpOptions(
        base_url="https://example.staging.cloud.databricks.com/serving-endpoints/gemini",
        headers={
            "Authorization": f"Bearer {DATABRICKS_TOKEN}",
        },
    ),
)

# Encode a local video file as base64
with open("video.mp4", "rb") as f:
    video_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.models.generate_content(
    model="databricks-gemini-3-1-pro",
    contents=[
        types.Content(
            role="user",
            parts=[
                types.Part(text="Summarize what happens in these videos."),
                # Video referenced by URL
                types.Part(
                    file_data=types.FileData(
                        mime_type="video/mp4",
                        file_uri="https://example.com/sample-video.mp4",
                    )
                ),
                # Video passed inline as base64-encoded data
                types.Part(
                    inline_data=types.Blob(
                        mime_type="video/mp4",
                        data=video_b64,
                    )
                ),
            ],
        ),
    ],
    config=types.GenerateContentConfig(
        max_output_tokens=1024,
    ),
)

print(response.text)

Audio input

Python
from google import genai
from google.genai import types
import base64
import os

DATABRICKS_TOKEN = os.environ.get('DATABRICKS_TOKEN')

client = genai.Client(
    api_key="databricks",
    http_options=types.HttpOptions(
        base_url="https://example.staging.cloud.databricks.com/serving-endpoints/gemini",
        headers={
            "Authorization": f"Bearer {DATABRICKS_TOKEN}",
        },
    ),
)

# Encode a local audio file as base64
with open("audio.mp3", "rb") as f:
    audio_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.models.generate_content(
    model="databricks-gemini-3-1-pro",
    contents=[
        types.Content(
            role="user",
            parts=[
                types.Part(text="Transcribe this audio and summarize the key points."),
                # Audio referenced by URL
                types.Part(
                    file_data=types.FileData(
                        mime_type="audio/mp3",
                        file_uri="https://example.com/sample-audio.mp3",
                    )
                ),
                # Audio passed inline as base64-encoded data
                types.Part(
                    inline_data=types.Blob(
                        mime_type="audio/mp3",
                        data=audio_b64,
                    )
                ),
            ],
        ),
    ],
    config=types.GenerateContentConfig(
        max_output_tokens=1024,
    ),
)

print(response.text)

Supported models

Audio and video inputs are supported on the following Gemini pay-per-token foundation models. See Databricks-hosted foundation models available in Foundation Model APIs for region availability.

  • databricks-gemini-3-1-pro
  • databricks-gemini-3-pro
  • databricks-gemini-2-5-pro
  • databricks-gemini-3-1-flash-lite
  • databricks-gemini-3-flash
  • databricks-gemini-2-5-flash

Limitations

  • Audio and video inputs are only available on Gemini pay-per-token foundation models. Provisioned throughput endpoints are not supported.
  • Multiple audio or video inputs can be included in a single request, but large files increase latency and token usage.
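
Because base64 encoding expands data by roughly one third, it can help to estimate the encoded size before sending a file inline. The sketch below shows that arithmetic. The `MAX_INLINE_BYTES` threshold is a hypothetical value chosen for illustration, not a documented Databricks limit, and `should_inline` is a hypothetical helper.

```python
import os


def base64_length(n_bytes: int) -> int:
    """Length in characters of padded base64 output for n_bytes of input."""
    # Every 3 input bytes become 4 output characters; a partial group is padded.
    return 4 * ((n_bytes + 2) // 3)


# Hypothetical threshold for illustration only; not a documented limit.
MAX_INLINE_BYTES = 20 * 1024 * 1024


def should_inline(path: str, limit: int = MAX_INLINE_BYTES) -> bool:
    """Return True if the encoded file would fit under the chosen threshold."""
    return base64_length(os.path.getsize(path)) <= limit
```

For a file that exceeds the chosen threshold, passing a URL instead of inline data avoids the encoding overhead in the request body.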

Additional resources