RAG data pipeline description and processing steps
Understand how to prepare unstructured data for RAG applications. Unstructured data includes anything without a specific structure or organization, such as PDF files with text and images, or multimedia content such as audio and video.
Prepare unstructured data for retrieval
The unstructured data pipeline prepares data for retrieval using semantic search. Semantic search interprets the meaning and intent behind a user query to deliver more relevant results. Semantic search is just one approach for implementing the retrieval component of a RAG application.
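Semantic search typically ranks content by the similarity between the embedding of the query and the embeddings of the stored content. Computing these similarities can be resource-intensive. Vector indexes, such as Mosaic AI Vector Search, optimize this process by organizing and navigating embeddings efficiently, often using approximation methods that avoid comparing the query against every embedding individually.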
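To make the cost concrete, the following is a minimal sketch of exhaustive (brute-force) similarity search, the baseline that a vector index is designed to avoid. It assumes cosine similarity as the metric and uses made-up three-dimensional embeddings. The function names and document IDs are hypothetical and are not part of any Databricks API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Convention chosen here: a zero vector matches nothing.
    return dot / (norm_a * norm_b)


def brute_force_search(query, embeddings, k=2):
    """Score the query against every stored embedding and return the top k.

    The work grows linearly with the number of stored embeddings, which is
    the cost that approximate vector indexes try to reduce.
    """
    scored = [
        (doc_id, cosine_similarity(query, vector))
        for doc_id, vector in embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings; real ones usually have hundreds of dimensions.
embeddings = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}

results = brute_force_search([1.0, 0.1, 0.0], embeddings, k=2)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Every query in this sketch touches every stored vector. An approximate index trades a small amount of recall for visiting only a fraction of them.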
Build a RAG application data pipeline
Each step in the data pipeline involves decisions that impact the RAG application’s quality. For more information and a runnable notebook example, see Build an unstructured data pipeline for RAG.
The following are the typical steps of a data pipeline in a RAG application using unstructured data:
- Corpus composition and ingestion: Select the right data sources and content based on the specific use case.
- Data preprocessing: Transform raw data into a clean, consistent format suitable for embedding and retrieval.
  - Parsing: Extract relevant information from the raw data using appropriate parsing techniques.
  - Enrichment: Enrich data with additional metadata and remove noise.
- Chunking: Break down the parsed data into smaller, manageable chunks for efficient retrieval.
- Embedding: Convert the chunked text data into a numerical vector representation that captures its semantic meaning.
- Indexing and storage: Create efficient vector indexes over the embeddings for optimized search performance.
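The steps above can be sketched end to end in a few lines. This is an illustrative toy under stated assumptions, not the pipeline from the linked notebook: parsing is reduced to stripping tags, chunking uses fixed-size word windows with overlap, the "embedding" is a hashed bag-of-words stand-in for a real embedding model, and the "index" is a plain Python list. All function names, parameters, and the sample document are hypothetical.

```python
import re
import zlib


def parse(raw):
    """Parsing: drop simple markup-like tags and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", text).strip()


def chunk(text, size=8, overlap=2):
    """Chunking: fixed-size word windows that overlap by `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # The last window already reached the end of the text.
    return chunks


def embed(text, dim=16):
    """Embedding stand-in: hashed bag-of-words counts, NOT a real model."""
    vector = [0.0] * dim
    for word in re.findall(r"\w+", text.lower()):
        vector[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    return vector


def build_index(documents):
    """Run parse, enrich, chunk, and embed, then store the results in a list."""
    index = []
    for source, raw in documents.items():
        text = parse(raw)
        for position, piece in enumerate(chunk(text)):
            index.append({
                "source": source,       # Enrichment: metadata kept per chunk.
                "position": position,
                "text": piece,
                "embedding": embed(piece),
            })
    return index


# Hypothetical one-document corpus.
documents = {
    "guide.html": "<p>Semantic search interprets the   meaning and intent "
                  "behind a user query to deliver more relevant results.</p>",
}

index = build_index(documents)
for entry in index:
    print(entry["position"], entry["text"])
```

In a real pipeline each function would be replaced by a deliberate choice (a format-aware parser, a chunking strategy suited to the content, an embedding model, and a vector index), and each of those choices affects retrieval quality.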