Creating a 🗃️ Data Processor version

Conceptual overview

The 🗃️ Data Processor is a data pipeline that parses, chunks, and embeds unstructured documents from a 📥 Data Ingestor destination UC Volume. The resulting chunks are stored in a Delta Table and synced to a Unity Catalog Vector Index. A 🗃️ Data Processor is associated with one or more 📥 Data Ingestors and can be associated with any number of 🔍 Retrievers.

A 🗃️ Data Processor consists of:

  1. Configuration stored in the data_processors section of rag-config.yml

  2. Code stored in app-directory/src/process_data.py

To parse and chunk documents, you can write any custom Python code, including code that uses LangChain TextSplitters.

To simplify experimentation with different settings, Databricks suggests parameterizing your 🗃️ Data Processor using the key:value configuration settings in rag-config.yml. By default, Databricks provides chunk_size and chunk_overlap settings, but you can create any custom parameter.
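As an illustration of this kind of parameterization, here is a minimal sketch of a character-based chunker driven by chunk_size and chunk_overlap. It is not the default code that ships with the product: the function name `chunk_text` and the in-memory `configurations` dict are hypothetical, and the dict stands in for values that would be read from rag-config.yml.

```python
from typing import Dict, List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into character windows of chunk_size that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Hypothetical stand-in for the `configurations` block of rag-config.yml.
configurations: Dict[str, int] = {"chunk_size": 500, "chunk_overlap": 50}

chunks = chunk_text("x" * 1200, **configurations)
print([len(c) for c in chunks])
```

Because the settings arrive as keyword arguments, trying a different chunk size only requires changing the configuration values, not the code.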

To embed documents, configure an embedding model in rag-config.yml. This embedding model can be any Foundation Model APIs pay-per-token endpoint, Foundation Model APIs provisioned throughput endpoint, or External Model Endpoint that supports the `llm/v1/embeddings` task.

The downstream 🔍 Retrievers and 🔗 Chains reference the 🗃️ Data Processor's configuration to access this embedding model.
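To make the shared embedding configuration more concrete, the sketch below builds request bodies for an embeddings endpoint without sending them. It rests on assumptions to verify against your endpoint's documentation: that the endpoint accepts an OpenAI-style body with an `input` list, and that it accepts an optional `instruction` string. The helper name `build_embeddings_request` is hypothetical, and the query instruction is copied from the sample configuration on this page.

```python
import json
from typing import Any, Dict, List


def build_embeddings_request(texts: List[str], instruction: str = "") -> Dict[str, Any]:
    """Build a JSON-serializable request body for an embeddings endpoint.

    Assumption: the endpoint takes an `input` list of strings and an optional
    `instruction` string. Neither field is confirmed by this page.
    """
    body: Dict[str, Any] = {"input": list(texts)}
    if instruction:
        # Only include the instruction when one is configured.
        body["instruction"] = instruction
    return body


# Documents are embedded with an empty instruction in the sample configuration.
doc_body = build_embeddings_request(["Spark is a unified analytics engine."])

# Queries use the configured query instruction.
query_body = build_embeddings_request(
    ["What is Spark?"],
    instruction="Represent this sentence for searching relevant passages:",
)

print(json.dumps(query_body, indent=2))
```

Keeping the instruction strings in one configuration lets the processor and the retriever embed text in a consistent way.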

Tip

🚧 Roadmap 🚧 Support for multiple 🗃️ Data Processors per RAG Application. In v2024-01-19, only one 🗃️ Data Processor can be created per RAG Application.

Data flows

[Figure: data flow diagram with legend]

Step-by-step instructions

  1. Open the rag-config.yml in your IDE/code editor.

  2. Edit the data_processors configuration.

    data_processors:
      - name: spark-docs-processor
        description: Parse, chunk, embed Spark documentation
        # explicit link to the data ingestors that this processor uses.
        data_ingestors:
          - name: spark-docs-ingestor
        # Optional. The Unity Catalog table where the embedded, chunked docs are stored.
        # If not specified, defaults to `{name}__embedded_docs__{version_number}`
        # If specified, the table is named `{provided_value}__{version_number}`
        destination_table:
          name: databricks_docs_chunked
        destination_vector_index:
          databricks_vector_search:
            # Optional. The Unity Catalog Vector Index that the embedded, chunked docs are synced to.
            # If not specified, defaults to `{name}__embedded_docs_index__{version_number}`
            # If specified, the index is named `{provided_value}__{version_number}`
            index_name: databricks_docs_index
        embedding_model:
          endpoint_name: databricks-bge-large-en
          instructions:
            embedding: ""
            query: "Represent this sentence for searching relevant passages:"
        # You can specify arbitrary key-value pairs as `configurations`
        configurations:
          chunk_size: 500
          chunk_overlap: 50
    
  3. Edit src/my_rag_builder/document_processor.py to modify the default code or add custom code.

    Note

    You can modify this file in any way you see fit, as long as, after the code finishes running, the table named in destination_table.name contains the following columns:

    • chunk_id - A unique identifier of the chunk, typically a UUID.

    • doc_uri - A unique identifier of the source document, for example a URL.

    The data in this table must also be synchronized to the index named in databricks_vector_search.index_name.

  4. To test the processor, run document_processor.py in a Databricks Notebook or with Databricks Connect.
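The output contract from step 3 and the naming rule from the configuration comments in step 2 can be sketched in plain Python. This is an illustration under stated assumptions, not the shipped processor code: the helper names, the `chunk_text` column, and the integer version number are hypothetical, and a real processor would write these rows to a Delta Table rather than keep them in a list.

```python
import uuid
from typing import Dict, Iterable, List, Optional, Set

# Columns that the destination table must contain, per the note in step 3.
REQUIRED_COLUMNS = {"chunk_id", "doc_uri"}


def resolve_table_name(processor_name: str, version_number: int,
                       provided_name: Optional[str] = None) -> str:
    """Apply the naming rule described in the rag-config.yml comments."""
    base = provided_name if provided_name else f"{processor_name}__embedded_docs"
    return f"{base}__{version_number}"


def build_chunk_rows(doc_uri: str, chunks: Iterable[str]) -> List[Dict[str, str]]:
    """Create one row per chunk with a UUID chunk_id and the source doc_uri."""
    return [
        # `chunk_text` is a hypothetical column name for the chunk content.
        {"chunk_id": str(uuid.uuid4()), "doc_uri": doc_uri, "chunk_text": chunk}
        for chunk in chunks
    ]


def missing_columns(rows: List[Dict[str, str]]) -> Set[str]:
    """Return the required columns that are absent from at least one row."""
    missing: Set[str] = set()
    for row in rows:
        missing |= REQUIRED_COLUMNS - row.keys()
    return missing


rows = build_chunk_rows("https://example.com/docs/intro.html", ["first chunk", "second chunk"])
table_name = resolve_table_name("spark-docs-processor", 1, "databricks_docs_chunked")
print(table_name, missing_columns(rows))
```

A check like `missing_columns` can be run at the end of a test run to confirm that custom changes to the processor still produce the required columns.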