Retrieval Augmented Generation (RAG) on Databricks
This article provides an overview of retrieval augmented generation (RAG) and describes RAG application support in Databricks.
What is Retrieval Augmented Generation?
RAG is a generative AI design pattern that combines a large language model (LLM) with external knowledge retrieval. RAG connects your own, up-to-date data to your generative AI applications, which improves the accuracy and quality of responses by providing that data as context to the LLM at inference time.
The Databricks platform provides an integrated set of tools that supports the following RAG scenarios.
| Type of RAG | Description | Example use case |
|---|---|---|
| Unstructured data | Use of documents: PDFs, wikis, website contents, Google or Microsoft Office documents, and so on. | Chatbot over product documentation |
| Structured data | Use of tabular data: Delta tables, data from existing application APIs. | Chatbot to check order status |
| Tools & function calling | Call third-party or internal APIs to perform specific tasks or update statuses. For example, performing calculations or triggering a business workflow. | Chatbot to place an order |
| Agents | Dynamically decide how to respond to a user's query by using an LLM to choose a sequence of actions. | Chatbot that replaces a customer service agent |
RAG application architecture
The following illustrates the components that make up a RAG application.
RAG applications require a pipeline and a chain component to perform the following:
Indexing: A pipeline that ingests data from a source and indexes it. This data can be structured or unstructured.
Retrieval and generation: The actual RAG chain. It takes the user query, retrieves similar data from the index, and then passes the data, along with the query, to the LLM.
The following diagram shows these core components:
Unstructured data RAG example
The following sections describe the details of the indexing pipeline and RAG chain in the context of an unstructured data RAG example.
Indexing pipeline in a RAG app
The following steps describe the indexing pipeline:
Ingest data from your proprietary data source.
Split the data into chunks that can fit into the context window of the foundational LLM. This step also includes parsing the data and extracting metadata. The chunked data is commonly referred to as the knowledge base; it supplements the foundational LLM with information it was not trained on.
Use an embedding model to create vector embeddings for the data chunks.
Store the embeddings and metadata in a vector database to make them accessible for querying by the RAG chain.
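The following minimal Python sketch illustrates these steps in a Databricks notebook. The table names, chunking logic, and embedding endpoint name are illustrative assumptions rather than prescribed APIs; production pipelines typically run these steps with Databricks Workflows or Delta Live Tables instead of a single loop.

```python
# Minimal indexing-pipeline sketch (table names, chunk size, and endpoint name are placeholders).
from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")

def chunk(text: str, max_chars: int = 1000) -> list[str]:
    """Naive fixed-size chunking; real pipelines usually split on document structure."""
    return [text[i : i + max_chars] for i in range(0, len(text), max_chars)]

def embed(texts: list[str]) -> list[list[float]]:
    """Call an embedding model served on Databricks (endpoint name is an assumption)."""
    response = client.predict(
        endpoint="databricks-bge-large-en",   # placeholder embedding endpoint
        inputs={"input": texts},
    )
    return [row["embedding"] for row in response["data"]]

# 1. Ingest raw documents from a source table (placeholder name).
docs = spark.read.table("main.docs.raw_documents").collect()

# 2-3. Chunk each document and compute a vector embedding per chunk.
rows = []
for doc in docs:
    for i, piece in enumerate(chunk(doc["text"])):
        rows.append((f'{doc["id"]}-{i}', piece, embed([piece])[0]))

# 4. Store chunks, metadata, and embeddings in a Delta table for Vector Search to index.
spark.createDataFrame(rows, "id string, text string, embedding array<float>") \
    .write.mode("overwrite").saveAsTable("main.docs.doc_chunks")
```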
Retrieval using the RAG chain
After the index is prepared, the RAG chain of the application can be served to respond to questions. The following steps and diagram describe how the RAG application responds to an incoming request.
Embed the request using the same embedding model that was used to embed the data in the knowledge base.
Query the vector database to do a similarity search between the embedded request and the embedded data chunks in the vector database.
Retrieve the data chunks that are most relevant to the request.
Feed the relevant data chunks and the request to a customized LLM. The data chunks provide context that helps the LLM generate an appropriate response. Often, the LLM has a template for how to format the response.
Generate a response.
The following diagram illustrates this process:
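In addition to the diagram, the same chain logic can be seen in a toy, self-contained Python sketch. The hashing "embedding" below is only a stand-in so the example runs anywhere; a real chain calls an embedding model and a vector database, as shown in the Databricks-specific examples later in this article.

```python
# Toy illustration of the retrieval-and-generation steps (not a Databricks API).
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in embedding: a real chain would call an embedding model here."""
    vec = np.zeros(256)
    for token in text.lower().split():
        vec[hash(token) % 256] += 1.0
    return vec / (np.linalg.norm(vec) or 1.0)

# Pretend knowledge base: chunks and their precomputed embeddings.
chunks = ["Resetting a password requires admin rights.",
          "Orders ship within two business days."]
chunk_vecs = np.stack([embed(c) for c in chunks])

def retrieve(question: str, k: int = 1) -> list[str]:
    """Similarity search: rank chunks by cosine similarity to the embedded question."""
    scores = chunk_vecs @ embed(question)
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

question = "How do I reset my password?"
prompt = f"Context:\n{retrieve(question)[0]}\n\nQuestion: {question}"
# The prompt (retrieved context plus the question) would now be sent to the LLM
# to generate the final response.
print(prompt)
```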
Develop RAG applications with Databricks
Databricks provides the following capabilities to help you develop RAG applications.
Unity Catalog for governance, discovery, versioning, and access control for data, features, models, and functions.
Notebooks and workflows for data pipeline creation and orchestration.
Delta tables for storing structured data and unstructured data chunks and embeddings.
Vector search provides a queryable vector database that stores embedding vectors and can be configured to automatically sync to your knowledge base.
Databricks model serving for deploying LLMs and hosting your RAG chain. You can configure a dedicated model serving endpoint specifically for accessing state-of-the-art open LLMs with Foundation Model APIs or third-party models with External models.
MLflow for RAG chain development tracking and LLM evaluation.
Feature engineering and serving. This typically applies for structured data RAG scenarios.
Online Tables. You can serve online tables as a low-latency API to include the data in RAG applications.
Lakehouse Monitoring for data monitoring and tracking model prediction quality and drift using automatic payload logging with inference tables.
AI Playground. A chat-based UI to test and compare LLMs.
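For example, the Model Serving and Foundation Model APIs capabilities can be exercised from a notebook in a few lines of Python, similar to what AI Playground does interactively. The endpoint name below is a placeholder for any chat model endpoint available in your workspace.

```python
# Sketch: query a chat model served by Model Serving / Foundation Model APIs.
from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")
response = client.predict(
    endpoint="databricks-meta-llama-3-3-70b-instruct",  # placeholder endpoint name
    inputs={"messages": [{"role": "user", "content": "Summarize what RAG is in one sentence."}]},
)
print(response["choices"][0]["message"]["content"])
```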
RAG architecture with Databricks
The following architecture diagrams demonstrate where each Databricks feature fits in the RAG workflow. For an example, see the Deploy Your LLM Chatbot With Retrieval Augmented Generation demo.
Process unstructured data and Databricks-managed embeddings
For processing unstructured data and Databricks-managed embeddings, the following steps and diagram show:
Data ingestion from your proprietary data source. You can store this data in a Delta Table or Unity Catalog Volume.
The data is then split into chunks that can fit into the context window of the foundational LLM. This step also includes parsing the data and extracting metadata. You can use Databricks Workflows, Databricks notebooks, and Delta Live Tables to perform these tasks. The chunked data is commonly referred to as the knowledge base; it supplements the foundational LLM with information it was not trained on.
The parsed and chunked data is then consumed by an embedding model to create vector embeddings. In this scenario, Databricks computes the embeddings for you as part of the Vector Search functionality which uses Model Serving to provide an embedding model.
After Vector Search computes embeddings, Databricks stores them in a Delta table.
Also as part of Vector Search, the embeddings and metadata are indexed and stored in a vector database to make them accessible for querying by the RAG chain. Vector Search automatically computes embeddings for new data that is added to the source data table and updates the vector search index.
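A minimal sketch of configuring this with the Vector Search Python client follows. The endpoint, index, table, and model names are placeholders; because the index is given an embedding source column and an embedding model endpoint, Databricks computes and syncs the embeddings for you.

```python
# Sketch: Delta Sync index where Databricks computes the embeddings (names are placeholders).
from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()

index = vsc.create_delta_sync_index(
    endpoint_name="rag_endpoint",                      # existing Vector Search endpoint
    index_name="main.docs.doc_chunks_index",           # index to create
    source_table_name="main.docs.doc_chunks",          # Delta table with parsed chunks
    pipeline_type="TRIGGERED",                         # sync on demand; "CONTINUOUS" is also possible
    primary_key="id",
    embedding_source_column="text",                    # Databricks embeds this column for you
    embedding_model_endpoint_name="databricks-bge-large-en",  # embedding model behind Model Serving
)
```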
Process unstructured data and customer-managed embeddings
For processing unstructured data and customer-managed embeddings, the following steps and diagram show:
Data ingestion from your proprietary data source. You can store this data in a Delta table or Unity Catalog Volume.
You can then split the data into chunks that can fit into the context window of the foundational LLM. This step also includes parsing the data and extracting metadata. You can use Databricks Workflows, Databricks notebooks, and Delta Live Tables to perform these tasks. The chunked data is commonly referred to as the knowledge base; it supplements the foundational LLM with information it was not trained on.
Next, the parsed and chunked data can be consumed by an embedding model to create vector embeddings. In this scenario, you compute the embeddings yourself and can use Model Serving to serve an embedding model.
After you compute embeddings, you can store them in a Delta table that can be synced with Vector Search.
As part of Vector Search, the embeddings and metadata are indexed and stored in a vector database to make them accessible for querying by the RAG chain. Vector Search automatically syncs new embeddings that are added to your Delta table and updates the vector search index.
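The corresponding sketch for customer-managed embeddings points the index at the embedding column you computed yourself; again, the names and the embedding dimension are placeholders.

```python
# Sketch: Delta Sync index over a table that already contains self-managed embeddings.
from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()

index = vsc.create_delta_sync_index(
    endpoint_name="rag_endpoint",                  # existing Vector Search endpoint
    index_name="main.docs.doc_chunks_index",
    source_table_name="main.docs.doc_chunks",      # Delta table with a precomputed embedding column
    pipeline_type="TRIGGERED",
    primary_key="id",
    embedding_vector_column="embedding",           # column holding the precomputed vectors
    embedding_dimension=1024,                      # must match your embedding model's output size
)
```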
Process structured data
For processing structured data, the following steps and diagram show:
Data ingestion from your proprietary data source. You can store this data in a Delta table or Unity Catalog Volume.
For feature engineering you can use Databricks notebooks, Databricks workflows, and Delta Live Tables.
Create a feature table. A feature table is a Delta table in Unity Catalog that has a primary key.
Create an online table and host it on a feature serving endpoint. The endpoint automatically stays synced with the feature table.
For an example notebook illustrating the use of online tables and feature serving for RAG applications, see the Databricks online tables and feature serving endpoints for RAG example notebook.
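As a minimal sketch, the feature table can be created with the Feature Engineering Python client; the catalog, schema, and column names below are placeholders. The online table and feature serving endpoint are then typically created on top of this table from Catalog Explorer or programmatically.

```python
# Sketch: create a Unity Catalog feature table for structured RAG context (names are placeholders).
from databricks.feature_engineering import FeatureEngineeringClient

fe = FeatureEngineeringClient()

orders_df = spark.read.table("main.sales.orders")    # structured source data

fe.create_table(
    name="main.sales.order_status_features",         # feature table in Unity Catalog
    primary_keys=["order_id"],                        # feature tables require a primary key
    df=orders_df,
    description="Order status features served to the RAG chain via an online table",
)
```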
RAG chain
After the index is prepared, the RAG chain of the application can be served to respond to questions. The following steps and diagram describe how the RAG chain operates in response to an incoming question.
The incoming question gets embedded using the same embedding model that was used to embed the data in the knowledge base. Model Serving is used to serve the embedding model.
After the question is embedded, you can use Vector Search to do a similarity search between the embedded question and the embedded data chunks in the vector database.
After Vector Search retrieves the data chunks that are most relevant to the request, those data chunks, along with relevant features from Feature Serving and the question, are passed to a customized LLM before a response is generated.
The data chunks and features provide context that helps the LLM generate an appropriate response. Often, the LLM has a template for how to format the response. Once again, Model Serving is used to serve the LLM. You can also use Unity Catalog and Lakehouse Monitoring to store logs and monitor the chain workflow, respectively.
Generate a response.
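Put together, a minimal sketch of this chain uses the Vector Search client for retrieval and a Model Serving chat endpoint for generation. The endpoint, index, and model names are placeholders, and a structured-data chain would also query a feature serving endpoint for additional context.

```python
# Sketch of the Databricks-backed RAG chain (endpoint, index, and model names are placeholders).
from databricks.vector_search.client import VectorSearchClient
from mlflow.deployments import get_deploy_client

vsc = VectorSearchClient()
serving = get_deploy_client("databricks")

def answer(question: str) -> str:
    # Retrieval: with managed embeddings, passing query_text lets Vector Search embed the
    # question with the same model used for the index, then run the similarity search.
    index = vsc.get_index(endpoint_name="rag_endpoint", index_name="main.docs.doc_chunks_index")
    hits = index.similarity_search(query_text=question, columns=["text"], num_results=3)
    context = "\n\n".join(row[0] for row in hits["result"]["data_array"])
    # A structured-data chain would also look up features from a feature serving endpoint here.

    # Generation: pass the retrieved context and the question to an LLM behind Model Serving.
    response = serving.predict(
        endpoint="databricks-meta-llama-3-3-70b-instruct",   # placeholder chat model endpoint
        inputs={"messages": [
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": question},
        ]},
    )
    return response["choices"][0]["message"]["content"]
```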
Region availability
The features that support RAG application development on Databricks are available in the same regions as model serving.
If you plan on using Foundation Model APIs as part of your RAG application development, you are limited to the supported regions for Foundation Model APIs.