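Build and trace retriever tools for unstructured data

Use the Mosaic AI Agent Framework to build tools that let AI agents query unstructured data, such as a collection of documents. This page describes how to:

Develop a retriever locally
Create a retriever using Unity Catalog functions
Query external vector indexes
Add MLflow tracing for observability

To learn more about agent tools, see "AI agent tools".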

Locally develop Vector Search retriever tools with AI Bridge

The fastest way to start building a Databricks Vector Search retriever tool is to develop and test it locally using Databricks AI Bridge packages such as databricks-langchain and databricks-openai.

LangChain/LangGraph

Install the latest version of databricks-langchain, which includes Databricks AI Bridge.

Bash
%pip install --upgrade databricks-langchain

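The following code prototypes a retriever tool that queries a hypothetical vector search index and binds it to an LLM locally so that you can test its tool-calling behavior.

Provide a descriptive tool_description to help the agent understand the tool and decide when to invoke it.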

Python
from databricks_langchain import VectorSearchRetrieverTool, ChatDatabricks

# Initialize the retriever tool.
vs_tool = VectorSearchRetrieverTool(
  index_name="catalog.schema.my_databricks_docs_index",
  tool_name="databricks_docs_retriever",
  tool_description="Retrieves information about Databricks products from official Databricks documentation."
)

# Run a query against the vector search index locally for testing
vs_tool.invoke("Databricks Agent Framework?")

# Bind the retriever tool to your Langchain LLM of choice
llm = ChatDatabricks(endpoint="databricks-claude-3-7-sonnet")
llm_with_tools = llm.bind_tools([vs_tool])

# Chat with your LLM to test the tool calling functionality
llm_with_tools.invoke("Based on the Databricks documentation, what is Databricks Agent Framework?")

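For scenarios that use a direct-access index or a Delta Sync index with self-managed embeddings, you must configure the VectorSearchRetrieverTool and specify a custom embedding model and text column. See "Options for providing embeddings".

The following example shows how to configure a VectorSearchRetrieverTool with the columns and embedding keys.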

Python
from databricks_langchain import VectorSearchRetrieverTool
from databricks_langchain import DatabricksEmbeddings

embedding_model = DatabricksEmbeddings(
    endpoint="databricks-bge-large-en",
)

vs_tool = VectorSearchRetrieverTool(
  index_name="catalog.schema.index_name", # Index name in the format 'catalog.schema.index'
  num_results=5, # Max number of documents to return
  columns=["primary_key", "text_column"], # List of columns to include in the search
  filters={"text_column LIKE": "Databricks"}, # Filters to apply to the query
  query_type="ANN", # Query type ("ANN" or "HYBRID").
  tool_name="name of the tool", # Used by the LLM to understand the purpose of the tool
  tool_description="Purpose of the tool", # Used by the LLM to understand the purpose of the tool
  text_column="text_column", # Specify text column for embeddings. Required for direct-access index or delta-sync index with self-managed embeddings.
  embedding=embedding_model # The embedding model. Required for direct-access index or delta-sync index with self-managed embeddings.
)

For more details, see the API documentation for VectorSearchRetrieverTool.

OpenAI

Install the latest version of databricks-openai, which includes Databricks AI Bridge.

Bash
%pip install --upgrade databricks-openai

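The following code prototypes a retriever that queries a hypothetical vector search index and integrates it with OpenAI's GPT models.

Provide a descriptive tool_description to help the agent understand the tool and decide when to invoke it.

For more information about OpenAI's recommendations for tools, see the OpenAI function calling documentation.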

Python
from databricks_openai import VectorSearchRetrieverTool
from openai import OpenAI
import json

# Initialize OpenAI client
client = OpenAI(api_key="<your_API_key>")  # Replace the placeholder with your OpenAI API key

# Initialize the retriever tool
dbvs_tool = VectorSearchRetrieverTool(
  index_name="catalog.schema.my_databricks_docs_index",
  tool_name="databricks_docs_retriever",
  tool_description="Retrieves information about Databricks products from official Databricks documentation"
)

messages = [
  {"role": "system", "content": "You are a helpful assistant."},
  {
    "role": "user",
    "content": "Using the Databricks documentation, answer what is Spark?"
  }
]
first_response = client.chat.completions.create(
  model="gpt-4o",
  messages=messages,
  tools=[dbvs_tool.tool]
)

# Execute function code and parse the model's response and handle function calls.
tool_call = first_response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)
result = dbvs_tool.execute(query=args["query"])  # For self-managed embeddings, optionally pass in openai_client=client

# Supply model with results – so it can incorporate them into its final response.
messages.append(first_response.choices[0].message)
messages.append({
  "role": "tool",
  "tool_call_id": tool_call.id,
  "content": json.dumps(result)
})
second_response = client.chat.completions.create(
  model="gpt-4o",
  messages=messages,
  tools=[dbvs_tool.tool]
)
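
The example above reads only the first entry of tool_calls, while a model may return none or several. The following is a minimal sketch of a loop that runs every tool call and builds the matching "tool" messages. The helper name run_tool_calls, the stand-in objects, and fake_execute are all hypothetical; in real code you would pass first_response.choices[0].message.tool_calls and dbvs_tool.execute instead.

```python
import json
from types import SimpleNamespace
from typing import Any, Callable, Dict, List, Optional


def run_tool_calls(
    tool_calls: Optional[List[Any]], execute: Callable[..., Any]
) -> List[Dict[str, str]]:
    """Run every tool call and return one "tool" message per call."""
    tool_messages = []
    for call in tool_calls or []:  # tool_calls may be None when no tool is requested
        # Arguments arrive as a JSON string, as in the example above.
        args = json.loads(call.function.arguments)
        result = execute(query=args["query"])
        tool_messages.append(
            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)}
        )
    return tool_messages


# Stand-in objects shaped like the tool_calls list used in the example above.
fake_calls = [
    SimpleNamespace(
        id="call_1",
        function=SimpleNamespace(arguments='{"query": "What is Spark?"}'),
    ),
    SimpleNamespace(
        id="call_2",
        function=SimpleNamespace(arguments='{"query": "What is Delta?"}'),
    ),
]


def fake_execute(query: str) -> List[Dict[str, str]]:
    # Hypothetical replacement for dbvs_tool.execute: returns one canned document.
    return [{"page_content": f"Result for: {query}"}]


tool_messages = run_tool_calls(fake_calls, fake_execute)
print(len(tool_messages))
```

Each returned message can be appended to messages before the second chat completion request, in the same way as the single tool message above.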

The following example shows how to configure a VectorSearchRetrieverTool with the columns and embedding keys.

Python
from databricks_openai import VectorSearchRetrieverTool

vs_tool = VectorSearchRetrieverTool(
    index_name="catalog.schema.index_name", # Index name in the format 'catalog.schema.index'
    num_results=5, # Max number of documents to return
    columns=["primary_key", "text_column"], # List of columns to include in the search
    filters={"text_column LIKE": "Databricks"}, # Filters to apply to the query
    query_type="ANN", # Query type ("ANN" or "HYBRID").
    tool_name="name of the tool", # Used by the LLM to understand the purpose of the tool
    tool_description="Purpose of the tool", # Used by the LLM to understand the purpose of the tool
    text_column="text_column", # Specify text column for embeddings. Required for direct-access index or delta-sync index with self-managed embeddings.
    embedding_model_name="databricks-bge-large-en" # The embedding model. Required for direct-access index or delta-sync index with self-managed embeddings.
)

For more details, see the API documentation for VectorSearchRetrieverTool.

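Once your local tool is ready, you can productionize it directly as part of your agent code, or migrate it to a Unity Catalog function, which offers better discoverability and governance but has certain limitations.

The following section shows how to migrate the retriever to a Unity Catalog function.

Vector Search retriever tool with Unity Catalog functions

You can create a Unity Catalog function that wraps a Mosaic AI Vector Search index query. This approach:

Supports production use cases with governance and discoverability
Uses the vector_search() SQL function under the hood
Supports automatic MLflow tracing
- You must align the function's output with the MLflow retriever schema by using the page_content and metadata aliases.
- Add any additional metadata columns to the metadata column using the SQL map function, rather than as top-level output keys.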

Run the following code in a notebook or SQL editor to create the function.

SQL
CREATE OR REPLACE FUNCTION main.default.databricks_docs_vector_search (
  -- The agent uses this comment to determine how to generate the query string parameter.
  query STRING
  COMMENT 'The query string for searching Databricks documentation.'
) RETURNS TABLE
-- The agent uses this comment to determine when to call this tool. It describes the types of documents and information contained within the index.
COMMENT 'Executes a search on Databricks documentation to retrieve text documents most relevant to the input query.' RETURN
SELECT
  chunked_text as page_content,
  map('doc_uri', url, 'chunk_id', chunk_id) as metadata
FROM
  vector_search(
    -- Specify your Vector Search index name here
    index => 'catalog.schema.databricks_docs_index',
    query => query,
    num_results => 5
  )
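
To illustrate the row shape that the function above returns, the following Python sketch reshapes a flat row into the two expected keys, page_content and metadata. It is only an illustration of the output layout, not part of Unity Catalog or MLflow; the helper name to_retriever_row and the sample row are hypothetical.

```python
from typing import Any, Dict


def to_retriever_row(
    row: Dict[str, Any], text_column: str, metadata_columns: Dict[str, str]
) -> Dict[str, Any]:
    """Return a row that has only the page_content and metadata keys.

    metadata_columns maps each metadata key to the source column it is read from,
    mirroring map('doc_uri', url, 'chunk_id', chunk_id) in the SQL above.
    """
    return {
        "page_content": row[text_column],
        "metadata": {key: row[source] for key, source in metadata_columns.items()},
    }


# Hypothetical flat row, as it might come back from the index.
raw_row = {
    "chunked_text": "Example chunk text.",
    "url": "https://example.com/doc",
    "chunk_id": "42",
}

shaped = to_retriever_row(
    raw_row,
    text_column="chunked_text",
    metadata_columns={"doc_uri": "url", "chunk_id": "chunk_id"},
)
print(shaped)
```

Extra columns end up inside metadata rather than as top-level keys, which matches the requirement in the list above.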

To use this retriever tool in your AI agent, wrap it with UCFunctionToolkit. This enables automatic tracing through MLflow by automatically generating RETRIEVER span types in MLflow logs.

Python
from unitycatalog.ai.langchain.toolkit import UCFunctionToolkit

toolkit = UCFunctionToolkit(
    function_names=[
        "main.default.databricks_docs_vector_search"
    ]
)
tools = toolkit.tools

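Unity Catalog retriever tools have the following caveats:

SQL clients might limit the maximum number of rows or bytes returned. To prevent data truncation, truncate the column values returned by the UDF. For example, you can use substring(chunked_text, 0, 8192) to reduce the size of large content columns and avoid row truncation during execution.
Because this tool is a wrapper for the vector_search() function, it is subject to the same limitations as the vector_search() function. See "Limitations".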
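
To illustrate the effect of that truncation outside SQL, the following Python sketch caps every string value in a row at a fixed length. The helper name truncate_text_columns and the sample rows are hypothetical; in a Unity Catalog function you would apply substring() in the SELECT list instead.

```python
from typing import Any, Dict, List

MAX_CHARS = 8192  # Same limit as the substring() example above.


def truncate_text_columns(
    rows: List[Dict[str, Any]], max_chars: int = MAX_CHARS
) -> List[Dict[str, Any]]:
    """Return copies of the rows with every string value cut to max_chars characters."""
    truncated = []
    for row in rows:
        truncated.append(
            {
                key: (value[:max_chars] if isinstance(value, str) else value)
                for key, value in row.items()
            }
        )
    return truncated


# Hypothetical rows: one oversized text value and one short one.
rows = [
    {"chunk_id": 1, "chunked_text": "x" * 10000},
    {"chunk_id": 2, "chunked_text": "short text"},
]
safe_rows = truncate_text_columns(rows)
print([len(row["chunked_text"]) for row in safe_rows])
```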

For more information about UCFunctionToolkit, see the Unity Catalog documentation.

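Retriever that queries a vector index hosted outside of Databricks

If your vector index is hosted outside of Databricks, you can create a Unity Catalog connection to the external service and use that connection in your agent code. See "Connect AI agent tools to external services".

The following example creates a retriever that calls a vector index hosted outside of Databricks for a PyFunc-flavored agent.

Create a Unity Catalog connection to the external service, in this case Azure.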

SQL
CREATE CONNECTION ${connection_name}
TYPE HTTP
OPTIONS (
  host 'https://example.search.windows.net',
  base_path '/',
  bearer_token secret ('<secret-scope>','<secret-key>')
);

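Define the retriever tool in your agent code using the Unity Catalog connection. This example uses MLflow decorators to enable agent tracing.

Note

To conform to the MLflow retriever schema, the retriever function should return a List[Document] object and use the metadata field of the Document class to add attributes such as doc_uri and similarity_score to the returned documents. See the MLflow Document documentation.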

Python
import mlflow
import json

from mlflow.entities import Document
from typing import List, Dict, Any
from dataclasses import asdict

class VectorSearchRetriever:
  """
  Retrieves relevant documents from an Azure AI Search vector index through a Unity Catalog connection.
  """

  def __init__(self):
    self.azure_search_index = "hotels_vector_index"

  @mlflow.trace(span_type="RETRIEVER", name="vector_search")
  def __call__(self, query_vector: List[Any], score_threshold=None) -> List[Document]:
    """
    Performs vector search to retrieve relevant chunks.
    Args:
      query_vector: Embedding vector of the search query.
      score_threshold: Minimum score a result must have to be returned.

    Returns:
      List of retrieved Documents.
    """
    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service.serving import ExternalFunctionRequestHttpMethod

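    # Request body for the search call. Named request_body so it does not shadow the json module.
    request_body = {
      "count": True,
      "select": "HotelId, HotelName, Description, Category",
      "vectorQueries": [
        {
          "vector": query_vector,
          "k": 7,
          "fields": "DescriptionVector",
          "kind": "vector",
          "exhaustive": True,
        }
      ],
    }

    # connection_name must hold the name of the Unity Catalog connection created above.
    response_text = (
      WorkspaceClient()
      .serving_endpoints.http_request(
        conn=connection_name,
        method=ExternalFunctionRequestHttpMethod.POST,
        path=f"indexes/{self.azure_search_index}/docs/search?api-version=2023-07-01-Preview",
        json=request_body,
      )
      .text
    )
    # Parse the JSON text so the parser can read the "value" list.
    response = json.loads(response_text)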

    documents = self.convert_vector_search_to_documents(response, score_threshold)
    return [asdict(doc) for doc in documents]

  @mlflow.trace(span_type="PARSER")
  def convert_vector_search_to_documents(
    self, vs_results, score_threshold
  ) -> List[Document]:
    docs = []

    for item in vs_results.get("value", []):
      score = item.get("@search.score", 0)

      if score_threshold is None or score >= score_threshold:
        metadata = {
          "score": score,
          "HotelName": item.get("HotelName"),
          "Category": item.get("Category"),
        }

        doc = Document(
          page_content=item.get("Description", ""),
          metadata=metadata,
          id=item.get("HotelId"),
        )
        docs.append(doc)

    return docs
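
To see what the parsing step does without calling the external service, the following sketch applies the same kind of score filtering to a hand-written response. The sample payload and the SimpleDoc dataclass are illustrative stand-ins: SimpleDoc only mimics the page_content, metadata, and id fields of mlflow.entities.Document so that the sketch runs without MLflow. As a design choice of this sketch, a missing score_threshold keeps every result.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SimpleDoc:
    """Stand-in for mlflow.entities.Document (assumption: same three fields)."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    id: Optional[str] = None


def parse_search_response(
    response: Dict[str, Any], score_threshold: Optional[float] = None
) -> List[SimpleDoc]:
    """Convert search hits to documents, dropping hits below score_threshold."""
    docs = []
    for item in response.get("value", []):
        score = item.get("@search.score", 0)
        if score_threshold is not None and score < score_threshold:
            continue  # Skip low-scoring hits.
        docs.append(
            SimpleDoc(
                page_content=item.get("Description", ""),
                metadata={
                    "score": score,
                    "HotelName": item.get("HotelName"),
                    "Category": item.get("Category"),
                },
                id=item.get("HotelId"),
            )
        )
    return docs


# Hand-written payload in the shape the example above expects (hypothetical values).
sample_response = {
    "value": [
        {"@search.score": 0.82, "HotelId": "1", "HotelName": "Hotel A",
         "Description": "Quiet rooms near the park.", "Category": "Boutique"},
        {"@search.score": 0.05, "HotelId": "2", "HotelName": "Hotel B",
         "Description": "Budget rooms downtown.", "Category": "Budget"},
    ]
}

docs = parse_search_response(sample_response, score_threshold=0.1)
print([asdict(doc) for doc in docs])
```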

To run the retriever, run the following Python code. You can optionally include Vector Search filters in the request to filter results.
Python
```
retriever = VectorSearchRetriever()
query = [0.01944167, 0.0040178085 . . .  TRIMMED FOR BREVITY 010858015, -0.017496133]
results = retriever(query, score_threshold=0.1)
```

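Add tracing to a retriever

Add MLflow tracing to monitor and debug your retriever. Tracing lets you view the inputs, outputs, and metadata for each step of execution.

The previous example adds the @mlflow.trace decorator to both the __call__ method and the parsing method. The decorator creates a span that starts when the function is invoked and ends when it returns. MLflow automatically records the function's input and output, as well as any exceptions raised.

Note

Users of the LangChain, LlamaIndex, and OpenAI libraries can use MLflow autologging in addition to manually defining traces with the decorator. See "Instrument your app: Tracing approaches".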

Python
import mlflow
from mlflow.entities import Document

## This code snippet has been truncated for brevity, see the full retriever example above
class VectorSearchRetriever:
  ...

  # Create a RETRIEVER span. The span name must match the retriever schema name.
  @mlflow.trace(span_type="RETRIEVER", name="vector_search")
  def __call__(...) -> List[Document]:
    ...

  # Create a PARSER span.
  @mlflow.trace(span_type="PARSER")
  def convert_vector_search_to_documents(...) -> List[Document]:
    ...
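
The span behavior described above (start on call, end on return, record inputs, outputs, and exceptions) can be illustrated with a plain-Python decorator. This is a simplified sketch and not MLflow's implementation: the trace function, the SPANS list, and the retrieve function are all hypothetical, and real spans carry more information than shown here.

```python
import functools
import time
from typing import Any, Callable, Dict, List, Optional

SPANS: List[Dict[str, Any]] = []  # Recorded spans, oldest first.


def trace(span_type: str, name: Optional[str] = None) -> Callable:
    """Decorator that records one span per call of the wrapped function."""

    def decorator(func: Callable) -> Callable:
        span_name = name or func.__name__

        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            span = {
                "name": span_name,
                "span_type": span_type,
                "inputs": {"args": args, "kwargs": kwargs},
                "outputs": None,
                "exception": None,
            }
            start = time.perf_counter()  # The span starts when the function is called.
            try:
                span["outputs"] = func(*args, **kwargs)
                return span["outputs"]
            except Exception as exc:
                span["exception"] = repr(exc)  # Record the error, then re-raise it.
                raise
            finally:
                # The span ends when the function returns or raises.
                span["duration_s"] = time.perf_counter() - start
                SPANS.append(span)

        return wrapper

    return decorator


@trace(span_type="RETRIEVER", name="vector_search")
def retrieve(query: str) -> List[Dict[str, str]]:
    if not query:
        raise ValueError("query must not be empty")
    return [{"page_content": f"Document about {query}"}]


retrieve("tracing")
print(SPANS[0]["name"], SPANS[0]["span_type"])
```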

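To make sure downstream applications such as Agent Evaluation and AI Playground render the retriever trace correctly, verify that the decorator meets the following requirements:

Use the MLflow retriever span schema (https://mlflow.org/docs/latest/tracing/tracing-schema.html#retriever-spans) and verify that the function returns a List[Document] object.
The trace name and the retriever_schema name must match for the trace to be configured correctly. See the following section to learn how to set the retriever schema.

Set the retriever schema to ensure MLflow compatibility

If the trace returned from the retriever or span_type="RETRIEVER" does not conform to MLflow's standard retriever schema, you must manually map the returned schema to MLflow's expected fields. This ensures that MLflow can properly trace your retriever and render traces in downstream applications.

To set the retriever schema manually:

Call mlflow.models.set_retriever_schema when you define your agent. Use set_retriever_schema to map the column names in the returned table to MLflow's expected fields, such as primary_key, text_column, and doc_uri.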
Python
```
# Define the retriever's schema by providing your column names
mlflow.models.set_retriever_schema(
  name="vector_search",
  primary_key="chunk_id",
  text_column="text_column",
  doc_uri="doc_uri"
  # other_columns=["column1", "column2"],
)
```
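To specify additional columns in your retriever's schema, provide a list of column names in the other_columns field.
If you have multiple retrievers, you can define multiple schemas by using a unique name for each retriever schema.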
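
To make the idea of mapping column names concrete, the following plain-Python sketch keeps one mapping per retriever name and uses it to rename the columns of a returned row. It does not call MLflow and is not how set_retriever_schema is implemented; the helper apply_retriever_schema, the second schema named hotel_search, and the HotelUrl column are hypothetical.

```python
from typing import Any, Dict, List, Optional


def apply_retriever_schema(
    row: Dict[str, Any],
    primary_key: str,
    text_column: str,
    doc_uri: str,
    other_columns: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Rename a row's columns to the standard field names; keep other_columns as-is."""
    standardized = {
        "primary_key": row[primary_key],
        "text_column": row[text_column],
        "doc_uri": row[doc_uri],
    }
    for column in other_columns or []:
        standardized[column] = row[column]
    return standardized


# One schema per retriever, keyed by a unique name.
SCHEMAS = {
    "vector_search": {
        "primary_key": "chunk_id",
        "text_column": "text_column",
        "doc_uri": "doc_uri",
    },
    "hotel_search": {  # Hypothetical second retriever.
        "primary_key": "HotelId",
        "text_column": "Description",
        "doc_uri": "HotelUrl",
        "other_columns": ["Category"],
    },
}

hotel_row = {
    "HotelId": "1",
    "Description": "Quiet rooms near the park.",
    "HotelUrl": "https://example.com/hotels/1",
    "Category": "Boutique",
}
standardized = apply_retriever_schema(hotel_row, **SCHEMAS["hotel_search"])
print(standardized)
```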

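The retriever schema set when you create the agent affects downstream applications and workflows, such as the review app and evaluation sets. Specifically, the doc_uri column serves as the primary identifier for documents returned by the retriever.

The review app displays the doc_uri to help reviewers assess responses and trace document origins. See "Review App UI".
Evaluation sets use doc_uri to compare retriever results against predefined evaluation datasets to determine the retriever's recall and precision. See "Evaluation sets (MLflow 2)".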
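
As a rough illustration of how doc_uri values can be compared, the following sketch computes recall and precision from a list of retrieved URIs and a list of expected URIs. It uses the standard set-based definitions; the exact metrics that Agent Evaluation computes may differ, and the function name and sample URIs are hypothetical.

```python
from typing import Iterable, Tuple


def doc_recall_precision(
    retrieved_uris: Iterable[str], expected_uris: Iterable[str]
) -> Tuple[float, float]:
    """Return (recall, precision) of retrieved doc_uri values against expected ones."""
    retrieved = set(retrieved_uris)
    expected = set(expected_uris)
    hits = len(retrieved & expected)  # Documents that were both expected and retrieved.
    recall = hits / len(expected) if expected else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


recall, precision = doc_recall_precision(
    retrieved_uris=["docs/a", "docs/b", "docs/c", "docs/d"],
    expected_uris=["docs/a", "docs/b", "docs/e"],
)
print(f"recall={recall:.2f}, precision={precision:.2f}")
```

In this sample, two of the three expected documents are retrieved, and two of the four retrieved documents are expected.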

Next steps

After you build the retriever, the final step is to integrate it into an AI agent definition. To learn how to add a tool to an agent, see "Add Unity Catalog tools to agents".
