# Prompt caching
MLflow automatically caches loaded prompts in memory to improve performance and reduce repeated API calls. The caching strategy varies based on whether you access prompts by version or alias.
## Prerequisites

- Prompt caching requires MLflow 3.8 or later.
## Default caching behavior

MLflow applies different caching policies depending on how a prompt is referenced (illustrated in the sketch after this list):
- Version-based prompts: cached indefinitely, because prompt versions are immutable after creation. For example, `prompts:/summarization-prompt/1`.
- Alias-based prompts: cached for 60 seconds by default, because aliases can point to different versions over time. For example, `prompts:/summarization-prompt@production`.
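A minimal sketch of the two defaults in action, assuming a registered prompt named `summarization-prompt` that has at least one version and a `production` alias:

```python
import mlflow

# Version URI: versions are immutable, so this result is cached indefinitely.
v1 = mlflow.genai.load_prompt("prompts:/summarization-prompt/1")

# Alias URI: the alias may be repointed, so this result is cached for
# 60 seconds; a load after the TTL expires fetches the alias's current
# target again.
prod = mlflow.genai.load_prompt("prompts:/summarization-prompt@production")
```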
## Per-request cache control

Override the caching behavior using the `cache_ttl_seconds` parameter of `load_prompt()`:
```python
import mlflow

# Custom TTL: cache for 5 minutes
prompt = mlflow.genai.load_prompt(
    "prompts:/summarization-prompt/1",
    cache_ttl_seconds=300,
)

# Bypass the cache: always fetch fresh data
prompt = mlflow.genai.load_prompt(
    "prompts:/summarization-prompt@production",
    cache_ttl_seconds=0,
)

# Infinite caching for an alias-based prompt
prompt = mlflow.genai.load_prompt(
    "prompts:/summarization-prompt@production",
    cache_ttl_seconds=float("inf"),
)
```
## Global cache configuration

Set system-wide cache defaults using environment variables:

- `MLFLOW_ALIAS_PROMPT_CACHE_TTL_SECONDS`: default TTL for alias-based prompts
- `MLFLOW_VERSION_PROMPT_CACHE_TTL_SECONDS`: default TTL for version-based prompts
To disable caching globally, set the relevant variable to `0`.
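As an illustration, the same defaults can be set from Python via `os.environ`; this sketch assumes MLflow reads the variables when a prompt is loaded, so they are set before any `load_prompt()` call:

```python
import os

# Set before loading any prompts so the new defaults take effect.
# Cache alias-based prompts for 5 minutes instead of the 60-second default.
os.environ["MLFLOW_ALIAS_PROMPT_CACHE_TTL_SECONDS"] = "300"

# Disable caching for version-based prompts (always fetch fresh data).
os.environ["MLFLOW_VERSION_PROMPT_CACHE_TTL_SECONDS"] = "0"

import mlflow

prompt = mlflow.genai.load_prompt("prompts:/summarization-prompt@production")
```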
## Cache invalidation

The cache automatically clears when a prompt is modified (see the example after this list), including:
- Tag updates
- Alias changes
- Version deletions
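For example, repointing an alias should be visible on the very next load, even inside the alias TTL window, because the alias change invalidates the cached entry. A minimal sketch, assuming version 2 of the prompt already exists:

```python
import mlflow

# Cached on first load (60-second TTL by default for aliases).
prompt = mlflow.genai.load_prompt("prompts:/summarization-prompt@production")

# Repointing the alias invalidates the cached entry...
mlflow.genai.set_prompt_alias("summarization-prompt", alias="production", version=2)

# ...so this load returns version 2 rather than a stale cached version.
prompt = mlflow.genai.load_prompt("prompts:/summarization-prompt@production")
```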