Distributed LLM batch inference
Beta
This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Databricks previews.
This page provides notebook examples for running LLM batch inference on serverless GPU compute using Ray Data, a scalable data processing library for AI workloads.
Batch inference using vLLM with Ray Data
This notebook demonstrates how to run LLM inference at scale using Ray Data and vLLM on serverless GPU compute. It uses the distributed serverless GPU API to automatically provision and manage multi-node A10 GPUs for distributed inference.
vLLM Batch Inference
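For orientation, the following is a minimal sketch of the pattern the notebook implements, using Ray Data's built-in LLM processor API (`ray.data.llm`, available in recent Ray releases). The model name, sampling parameters, and toy dataset are placeholders; multi-node GPU provisioning is handled in the notebook and not shown here.

```python
import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

# Configure the vLLM engine; model and engine settings are illustrative.
config = vLLMEngineProcessorConfig(
    model_source="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    engine_kwargs={"max_model_len": 8192},
    concurrency=1,  # number of vLLM engine replicas
    batch_size=64,  # rows per inference batch
)

# Build a processor that maps each row to a chat request and back.
processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params=dict(temperature=0.0, max_tokens=256),
    ),
    postprocess=lambda row: dict(answer=row["generated_text"], **row),
)

ds = ray.data.from_items([{"prompt": "Write a haiku about GPUs."}])
ds = processor(ds)
ds.show(limit=1)
```

Scaling out is then a matter of raising `concurrency` for more engine replicas, or setting `tensor_parallel_size` in `engine_kwargs` so each replica spans multiple GPUs.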
Batch inference using SGLang with Ray Data
SGLang is a high-performance serving framework for LLMs. This notebook demonstrates how to run LLM batch inference using SGLang and Ray Data on Databricks serverless GPU compute.
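As with the vLLM example, here is a minimal sketch of the pattern, assuming SGLang's offline `Engine` API and a Ray Data actor pool created with `map_batches`. The model name, batch sizing, and sampling parameters are placeholders; the notebook covers the serverless GPU provisioning details.

```python
import ray

class SGLangPredictor:
    """Stateful Ray Data UDF: one SGLang engine per actor/GPU."""

    def __init__(self):
        import sglang as sgl  # imported inside the worker process
        self.llm = sgl.Engine(
            model_path="meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
        )

    def __call__(self, batch):
        # Ray Data batches arrive as a dict of NumPy arrays by default.
        outputs = self.llm.generate(
            batch["prompt"].tolist(),
            {"temperature": 0.0, "max_new_tokens": 256},  # illustrative sampling params
        )
        batch["generated_text"] = [out["text"] for out in outputs]
        return batch

ds = ray.data.from_items([{"prompt": "Explain SGLang in one sentence."}])
ds = ds.map_batches(
    SGLangPredictor,
    batch_size=64,  # rows handed to each engine call
    num_gpus=1,     # one GPU per actor
    concurrency=2,  # actor pool size; scale to the GPUs available
)
ds.show(limit=1)
```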