Distributed LLM batch inference
Beta
This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Databricks previews.
This page provides notebook examples for running LLM batch inference on serverless GPU compute using Ray Data, a scalable data processing library for AI workloads.
Batch inference using vLLM with Ray Data
This notebook demonstrates how to run LLM inference at scale using Ray Data and vLLM on serverless GPU compute. It uses the distributed serverless GPU API to automatically provision and manage multi-node A10 GPUs for distributed inference.
vLLM Batch Inference
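For orientation, the following is a minimal sketch of the pattern the notebook implements, using Ray Data's built-in LLM processor API (`ray.data.llm`, available in recent Ray releases). The model name, sampling parameters, and toy dataset are placeholders; multi-node GPU provisioning is handled in the notebook and not shown here.

```python
import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

# Configure the vLLM engine; model and engine settings are illustrative.
config = vLLMEngineProcessorConfig(
    model_source="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    engine_kwargs={"max_model_len": 8192},
    concurrency=1,  # number of vLLM engine replicas
    batch_size=64,  # rows per inference batch
)

# Build a processor that maps each row to a chat request and back.
processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params=dict(temperature=0.0, max_tokens=256),
    ),
    postprocess=lambda row: dict(answer=row["generated_text"], **row),
)

ds = ray.data.from_items([{"prompt": "Write a haiku about GPUs."}])
ds = processor(ds)
ds.show(limit=1)
```

Scaling out is then a matter of raising `concurrency` for more engine replicas, or setting `tensor_parallel_size` in `engine_kwargs` so each replica spans multiple GPUs.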
Batch inference using SGLang with Ray Data
SGLang is a high-performance serving framework for LLMs. This notebook demonstrates how to run LLM batch inference using SGLang and Ray Data on Databricks serverless GPU compute.
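As with the vLLM example, here is a minimal sketch of the pattern, assuming SGLang's offline `Engine` API and a Ray Data actor pool created with `map_batches`. The model name, batch sizing, and sampling parameters are placeholders; the notebook covers the serverless GPU provisioning details.

```python
import ray

class SGLangPredictor:
    """Stateful Ray Data UDF: one SGLang engine per actor/GPU."""

    def __init__(self):
        import sglang as sgl  # imported inside the worker process
        self.llm = sgl.Engine(
            model_path="meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
        )

    def __call__(self, batch):
        # Ray Data batches arrive as a dict of NumPy arrays by default.
        outputs = self.llm.generate(
            batch["prompt"].tolist(),
            {"temperature": 0.0, "max_new_tokens": 256},  # illustrative sampling params
        )
        batch["generated_text"] = [out["text"] for out in outputs]
        return batch

ds = ray.data.from_items([{"prompt": "Explain SGLang in one sentence."}])
ds = ds.map_batches(
    SGLangPredictor,
    batch_size=64,  # rows handed to each engine call
    num_gpus=1,     # one GPU per actor
    concurrency=2,  # actor pool size; scale to the GPUs available
)
ds.show(limit=1)
```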