Retrieval Augmented Generation (RAG) on Databricks

This article provides an overview of retrieval augmented generation (RAG) and describes RAG application support in Databricks.

What is Retrieval Augmented Generation?

RAG is a generative AI design pattern that combines a large language model (LLM) with external knowledge retrieval. RAG lets you connect real-time data to your generative AI applications. Doing so improves the accuracy and quality of the application by providing your data as context to the LLM at inference time.
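
Conceptually, the pattern has two parts: retrieve data that is relevant to the request, then include that data in the prompt sent to the LLM. The following minimal Python sketch illustrates the idea; `retrieve_relevant_chunks` and `call_llm` are hypothetical placeholders for a retriever and a model call, not specific Databricks APIs.

```python
def answer_with_rag(question: str) -> str:
    # Retrieval: look up data relevant to the question (hypothetical helper).
    chunks = retrieve_relevant_chunks(question)
    context = "\n\n".join(chunks)
    # Augmentation: provide that data as context to the LLM at inference time.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    # Generation: call the LLM (hypothetical helper for a model serving endpoint).
    return call_llm(prompt)
```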

The Databricks platform provides an integrated set of tools that supports the following RAG scenarios.

| Type of RAG | Description | Example use case |
| --- | --- | --- |
| Unstructured data | Use of documents: PDFs, wikis, website contents, Google or Microsoft Office documents, and so on. | Chatbot over product documentation |
| Structured data | Use of tabular data: Delta tables, data from existing application APIs. | Chatbot to check order status |
| Tools & function calling | Call third-party or internal APIs to perform specific tasks or update statuses, for example performing calculations or triggering a business workflow. | Chatbot to place an order |
| Agents | Dynamically decide how to respond to a user’s query by using an LLM to choose a sequence of actions. | Chatbot that replaces a customer service agent |
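
To make the tools & function calling scenario concrete, the sketch below asks a tool-calling-capable chat model, served through an OpenAI-compatible endpoint, to decide whether to call an internal order-status API. The workspace URL, token, model name, and `get_order_status` tool are all illustrative placeholders, not guaranteed Databricks names.

```python
import json
from openai import OpenAI

# Placeholders: substitute your workspace URL, a Databricks token, and a
# tool-calling-capable model served on Model Serving.
client = OpenAI(
    api_key="<your-databricks-token>",
    base_url="https://<your-workspace>/serving-endpoints",
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",  # hypothetical wrapper around an internal API
        "description": "Look up the status of a customer order by its ID.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="<tool-calling-chat-model>",  # illustrative endpoint name
    messages=[{"role": "user", "content": "Where is order 1234?"}],
    tools=tools,
)

# If the model chose to call the tool, parse the arguments and invoke your API.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```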

RAG application architecture

The following diagram illustrates the components that make up a RAG application.

[Diagram: overall RAG application architecture.]

RAG applications require a pipeline and a chain component to perform the following:

  • Indexing: A pipeline that ingests data from a source and indexes it. This data can be structured or unstructured.

  • Retrieval and generation: This is the actual RAG chain. It takes the user query, retrieves similar data from the index, and then passes the data, along with the query, to the LLM.

The following diagram shows these core components:

[Diagram: the indexing pipeline and the retrieval and generation (RAG chain) components. The top section shows the RAG chain consuming the query and the subsequent steps of query processing, query expansion, retrieval and re-ranking, prompt engineering, initial response generation, and post-processing before a response is returned. The bottom section shows the RAG chain connected to separate data pipelines: an unstructured data pipeline that parses, chunks, and embeds data and stores it in a vector search database or index, using embedding and foundation models, and a structured data pipeline that consumes already-embedded data chunks and performs ETL tasks and feature engineering before serving the data to the RAG chain.]

Unstructured data RAG example

The following sections describe the details of the indexing pipeline and RAG chain in the context of an unstructured data RAG example.

Indexing pipeline in a RAG app

The following steps describe the indexing pipeline:

  1. Ingest data from your proprietary data source.

  2. Split the data into chunks that can fit into the context window of the foundational LLM. This step also includes parsing the data and extracting metadata. This data is commonly referred to as a knowledge base, and it is separate from the data the foundational LLM was trained on.

  3. Use an embedding model to create vector embeddings for the data chunks.

  4. Store the embeddings and metadata in a vector database to make them accessible for querying by the RAG chain.
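
The following Python sketch walks through steps 2 through 4 in a generic way. The chunking logic is deliberately simple, and `embedding_model.embed` and `vector_store.add` are hypothetical stand-ins for an embedding model endpoint and a vector database client, not specific Databricks APIs.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split a document into overlapping chunks that fit the LLM context window."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def index_documents(documents: list[dict], embedding_model, vector_store) -> None:
    """Parse, chunk, embed, and store documents plus metadata for later retrieval."""
    for doc in documents:
        for i, chunk in enumerate(chunk_text(doc["text"])):
            vector = embedding_model.embed(chunk)  # hypothetical embedding call
            vector_store.add(                      # hypothetical vector database call
                id=f"{doc['id']}-{i}",
                vector=vector,
                metadata={"source": doc["source"], "text": chunk},
            )
```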

Retrieval using the RAG chain

After the index is prepared, the RAG chain of the application can be served to respond to questions. The following steps and diagram describe how the RAG application responds to an incoming request.

  1. Embed the request using the same embedding model that was used to embed the data in the knowledge base.

  2. Query the vector database to do a similarity search between the embedded request and the embedded data chunks in the vector database.

  3. Retrieve the data chunks that are most relevant to the request.

  4. Feed the relevant data chunks and the request to a customized LLM. The data chunks provide context that helps the LLM generate an appropriate response. Often, the LLM has a template for how to format the response.

  5. Generate a response.

The following diagram illustrates this process:

[Diagram: RAG workflow after a request.]
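
A generic sketch of this request flow, reusing the hypothetical `embedding_model` and `vector_store` objects from the indexing sketch above and adding a hypothetical `llm` client; none of these are specific Databricks APIs.

```python
PROMPT_TEMPLATE = (
    "You are a helpful assistant. Answer using only the context below.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}"
)

def respond(question: str, embedding_model, vector_store, llm, k: int = 3) -> str:
    # 1. Embed the request with the same model used for the knowledge base.
    query_vector = embedding_model.embed(question)
    # 2-3. Similarity search returns the k most relevant chunks (hypothetical call).
    hits = vector_store.similarity_search(vector=query_vector, num_results=k)
    context = "\n\n".join(hit["metadata"]["text"] for hit in hits)
    # 4-5. Feed the chunks and the request to the LLM to generate a response.
    return llm.generate(PROMPT_TEMPLATE.format(context=context, question=question))
```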

Develop RAG applications with Databricks

Databricks provides the following capabilities to help you develop RAG applications.

RAG architecture with Databricks

The following architecture diagrams demonstrate where each Databricks feature fits in the RAG workflow. For an example, see the Deploy Your LLM Chatbot With Retrieval Augmented Generation demo.

Process unstructured data and Databricks-managed embeddings

For processing unstructured data with Databricks-managed embeddings, the following steps and diagram show:

  1. Data ingestion from your proprietary data source. You can store this data in a Delta table or Unity Catalog Volume.

  2. The data is then split into chunks that can fit into the context window of the foundational LLM. This step also includes parsing the data and extracting metadata. You can use Databricks Workflows, Databricks notebooks, and Delta Live Tables to perform these tasks. This data is commonly referred to as a knowledge base, and it is separate from the data the foundational LLM was trained on.

  3. The parsed and chunked data is then consumed by an embedding model to create vector embeddings. In this scenario, Databricks computes the embeddings for you as part of the Vector Search functionality, which uses Model Serving to provide an embedding model.

  4. After Vector Search computes embeddings, Databricks stores them in a Delta table.

  5. Also as part of Vector Search, the embeddings and metadata are indexed and stored in a vector database to make them accessible for querying by the RAG chain. Vector Search automatically computes embeddings for new data that is added to the source data table and updates the vector search index.

[Diagram: the RAG indexing pipeline for unstructured data with Databricks-managed embeddings.]
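
A sketch of steps 3 through 5 using the databricks-vectorsearch Python client to create a Delta Sync index with Databricks-managed embeddings. The endpoint, table, index, and embedding model names are placeholders, and you should verify the parameters against the current Vector Search client documentation.

```python
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

# Delta Sync index with Databricks-managed embeddings: Vector Search computes
# embeddings for the text column and keeps the index in sync with the source table.
index = client.create_delta_sync_index(
    endpoint_name="rag_vs_endpoint",                  # placeholder Vector Search endpoint
    index_name="catalog.schema.docs_index",           # placeholder index name
    source_table_name="catalog.schema.docs_chunked",  # chunked source Delta table
    pipeline_type="TRIGGERED",
    primary_key="chunk_id",
    embedding_source_column="chunk_text",
    embedding_model_endpoint_name="databricks-bge-large-en",  # illustrative embedding endpoint
)
```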

Process unstructured data and customer-managed embeddings

For processing unstructured data with customer-managed embeddings, the following steps and diagram show:

  1. Data ingestion from your proprietary data source. You can store this data in a Delta table or Unity Catalog Volume.

  2. You can then split the data into chunks that can fit into the context window of the foundational LLM. This step also includes parsing the data and extracting metadata. You can use Databricks Workflows, Databricks notebooks, and Delta Live Tables to perform these tasks. This data is commonly referred to as a knowledge base, and it is separate from the data the foundational LLM was trained on.

  3. Next, the parsed and chunked data can be consumed by an embedding model to create vector embeddings. In this scenario, you compute the embeddings yourself and can use Model Serving to serve an embedding model.

  4. After you compute embeddings, you can store them in a Delta table that can be synced with Vector Search.

  5. As part of Vector Search, the embeddings and metadata are indexed and stored in a vector database to make them accessible for querying by the RAG chain. Vector Search automatically syncs new embeddings that are added to your Delta table and updates the vector search index.

[Diagram: RAG with Databricks for unstructured data and self-managed embeddings.]
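
A sketch of steps 3 through 5 under the same caveats: query a Model Serving embedding endpoint yourself, write the vectors to a Delta table, and sync that table into a Vector Search index over a self-managed embedding column. Endpoint, table, and column names are placeholders, and the embedding dimension is illustrative.

```python
import mlflow.deployments
from databricks.vector_search.client import VectorSearchClient

# Step 3: compute embeddings yourself by querying an embedding model on
# Model Serving (the endpoint name is illustrative).
deploy_client = mlflow.deployments.get_deploy_client("databricks")
embeddings = deploy_client.predict(
    endpoint="databricks-bge-large-en",
    inputs={"input": ["first chunk of text", "second chunk of text"]},
)

# Step 4: write the vectors, chunks, and metadata to a Delta table (for example
# with Spark), then sync that table into Vector Search.

# Step 5: Delta Sync index over self-managed embeddings stored in a vector column.
client = VectorSearchClient()
index = client.create_delta_sync_index(
    endpoint_name="rag_vs_endpoint",                   # placeholder Vector Search endpoint
    index_name="catalog.schema.docs_index_self",       # placeholder index name
    source_table_name="catalog.schema.docs_embedded",  # Delta table with an embedding column
    pipeline_type="TRIGGERED",
    primary_key="chunk_id",
    embedding_dimension=1024,        # illustrative; must match your embedding model
    embedding_vector_column="embedding",
)
```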

Process structured data

For processing structured data, the following steps and diagram show:

  1. Data ingestion from your proprietary data source. You can store this data in a Delta table or Unity Catalog Volume.

  2. For feature engineering, you can use Databricks notebooks, Databricks Workflows, and Delta Live Tables.

  3. Create a feature table. A feature table is a Delta table in Unity Catalog that has a primary key.

  4. Create an online table and host it on a feature serving endpoint. The endpoint automatically stays synced with the feature table.

For an example notebook illustrating the use of online tables and feature serving for RAG applications, see the Databricks online tables and feature serving endpoints for RAG example notebook.

[Diagram: RAG with Databricks for structured data.]
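
A minimal sketch of step 3, creating the feature table as a Unity Catalog Delta table with a primary key; the online table and feature serving endpoint in step 4 can then be created on top of it (for example, through Catalog Explorer or the Databricks SDK). The catalog, schema, table, and column names are placeholders.

```python
# Run in a Databricks notebook, where `spark` is available.
# Step 3: a feature table is a Unity Catalog Delta table with a primary key.
spark.sql("""
    CREATE TABLE IF NOT EXISTS catalog.schema.order_features (
        order_id STRING NOT NULL,
        order_status STRING,
        estimated_delivery DATE,
        CONSTRAINT order_features_pk PRIMARY KEY (order_id)
    )
""")
# Step 4: create an online table backed by this feature table and serve it through
# a feature serving endpoint so the RAG chain can look up features at low latency.
```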

RAG chain

After the index is prepared, the RAG chain of the application can be served to respond to questions. The following steps and diagram describe how the RAG chain operates in response to an incoming question.

  1. The incoming question gets embedded using the same embedding model that was used to embed the data in the knowledge base. Model Serving is used to serve the embedding model.

  2. After the question is embedded, you can use Vector Search to do a similarity search between the embedded question and the embedded data chunks in the vector database.

  3. After Vector Search retrieves the data chunks that are most relevant to the request, those data chunks, along with relevant features from Feature Serving and the embedded question, are passed to a customized LLM for post-processing before a response is generated.

  4. The data chunks and features provide context that helps the LLM generate an appropriate response. Often, the LLM has a template for how to format the response. Once again, Model Serving is used to serve the LLM. You can also use Unity Catalog and Lakehouse Monitoring to store logs and monitor the chain workflow, respectively.

  5. Generate a response.

[Diagram: running the RAG chain.]
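
A sketch of steps 1 through 5 using the databricks-vectorsearch client for retrieval and an OpenAI-compatible Model Serving endpoint for generation. The endpoint, index, column, model, workspace URL, and token values are placeholders; verify the calls and response format against the current client documentation.

```python
from databricks.vector_search.client import VectorSearchClient
from openai import OpenAI

question = "How do I check my order status?"

# Steps 1-3: for a Databricks-managed-embedding index, similarity_search embeds the
# question with the same embedding model and returns the most relevant chunks.
vs_client = VectorSearchClient()
index = vs_client.get_index(
    endpoint_name="rag_vs_endpoint",          # placeholder Vector Search endpoint
    index_name="catalog.schema.docs_index",   # placeholder index name
)
results = index.similarity_search(
    query_text=question,
    columns=["chunk_text"],
    num_results=3,
)
context = "\n\n".join(row[0] for row in results["result"]["data_array"])

# Steps 4-5: pass the retrieved chunks and the question to an LLM served on
# Model Serving through its OpenAI-compatible interface.
llm = OpenAI(
    api_key="<your-databricks-token>",
    base_url="https://<your-workspace>/serving-endpoints",
)
completion = llm.chat.completions.create(
    model="<chat-model-endpoint>",  # illustrative serving endpoint name
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(completion.choices[0].message.content)
```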

Region availability

The features that support RAG application development on Databricks are available in the same regions as model serving.

If you plan on using Foundation Model APIs as part of your RAG application development, you are limited to the supported regions for Foundation Model APIs.