Distributed LLM batch inference

Beta

This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Databricks previews.

This page provides notebook examples for LLM batch inference using Ray Data, a scalable data processing library for AI workloads, on serverless GPU compute.

Batch inference using vLLM with Ray Data

This notebook demonstrates how to run LLM inference at scale using Ray Data and vLLM on serverless GPU compute. It uses the distributed serverless GPU API to automatically provision and manage multi-node A10 GPUs for inference.

vLLM Batch Inference
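
The core pattern in the notebook can be sketched as follows. This is a minimal illustration rather than the notebook's exact code: the model name, Unity Catalog volume paths, batch size, and actor pool sizing are placeholder assumptions, and a Ray cluster on serverless GPU compute is assumed to already be running.

```python
import numpy as np
import ray
from vllm import LLM, SamplingParams


class VLLMPredictor:
    """Each Ray actor loads one copy of the model onto its assigned GPU."""

    def __init__(self):
        self.llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
        self.sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

    def __call__(self, batch: dict) -> dict:
        # Generate completions for one batch of prompts.
        outputs = self.llm.generate(batch["prompt"].tolist(), self.sampling_params)
        batch["generated_text"] = np.array([o.outputs[0].text for o in outputs])
        return batch


# Read prompts, fan them out across a pool of GPU actors, and write results.
ds = ray.data.read_parquet("/Volumes/main/default/prompts")  # placeholder path
results = ds.map_batches(
    VLLMPredictor,
    concurrency=4,   # size of the actor pool; one vLLM replica per actor
    num_gpus=1,      # GPUs reserved for each actor
    batch_size=64,   # prompts per call; tune to the model and GPU memory
)
results.write_parquet("/Volumes/main/default/outputs")  # placeholder path
```

For models that don't fit on a single A10, vLLM's tensor_parallel_size together with a matching num_gpus reservation per actor is the usual adjustment; the distributed serverless GPU API used in the notebook handles the underlying multi-node provisioning.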

Batch inference using SGLang with Ray Data

SGLang is a high-performance serving framework for LLMs. This notebook demonstrates how to run LLM batch inference using SGLang and Ray Data on Databricks serverless GPU compute.

SGLang Batch Inference
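
The pattern mirrors the vLLM example, swapping in SGLang's offline Engine API rather than its HTTP server. As before, this is an illustrative sketch under placeholder assumptions (model name, volume paths, pool sizing, and sampling parameters), not the notebook's exact code.

```python
import numpy as np
import ray
import sglang as sgl


class SGLangPredictor:
    """Each Ray actor runs one SGLang offline engine on its assigned GPU."""

    def __init__(self):
        self.engine = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
        self.sampling_params = {"temperature": 0.0, "max_new_tokens": 256}

    def __call__(self, batch: dict) -> dict:
        # The offline engine accepts a list of prompts and returns one dict
        # per prompt with the completion under the "text" key.
        outputs = self.engine.generate(batch["prompt"].tolist(), self.sampling_params)
        batch["generated_text"] = np.array([o["text"] for o in outputs])
        return batch


ds = ray.data.read_parquet("/Volumes/main/default/prompts")  # placeholder path
results = ds.map_batches(
    SGLangPredictor,
    concurrency=4,   # actor pool size; one engine per GPU
    num_gpus=1,
    batch_size=64,
)
results.write_parquet("/Volumes/main/default/outputs")  # placeholder path
```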