Vector Search Python SDK example usage

This notebook shows how to use the Vector Search Python SDK, which provides a VectorSearchClient as the primary API for working with Vector Search.

Alternatively, you can call the REST API directly.
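As a rough sketch of that alternative, the following builds (but does not send) a similarity query request using only the Python standard library. The path /api/2.0/vector-search/indexes/{index_name}/query, the payload field names, the workspace URL, and the token are assumptions made for illustration; check them against the REST API reference for your workspace before relying on them.

```python
import json
import urllib.request

def build_query_request(host, token, index_name, query_text, columns, num_results=2):
    """Build (without sending) a POST request that queries a vector index."""
    # Assumed path and payload shape; verify against the REST API reference.
    url = f"{host.rstrip('/')}/api/2.0/vector-search/indexes/{index_name}/query"
    payload = {"query_text": query_text, "columns": columns, "num_results": num_results}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        method="POST",
    )

# Placeholder host and token; nothing is sent over the network here.
request = build_query_request(
    host="https://example.cloud.databricks.com",
    token="<personal-access-token>",
    index_name="main.default.en_wiki_index",
    query_text="Greek myths",
    columns=["id", "text"],
)
print(request.get_method(), request.full_url)
```

Passing the request object to urllib.request.urlopen would send it. The rest of this notebook uses the SDK, which handles authentication and request construction for you.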

Requirements

This notebook assumes that a model serving endpoint named databricks-gte-large-en exists. To create that endpoint, see the notebook Call a GTE embeddings model using Mosaic AI Model Serving.

Python
%pip install --upgrade --force-reinstall databricks-vectorsearch langchain
dbutils.library.restartPython()

Python
from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()

Python
help(VectorSearchClient)

Load sample dataset into the source Delta table

The following cells create the source Delta table. Change data feed is enabled on the table because a Delta Sync index requires it.

Python

# Specify the catalog and schema to use. You must have USE_CATALOG privilege on the catalog and USE_SCHEMA and CREATE_TABLE privileges on the schema.
# Change the catalog and schema here if necessary.

catalog_name = "main"
schema_name = "default"

Python
source_table_name = "en_wiki"
source_table_fullname = f"{catalog_name}.{schema_name}.{source_table_name}"

Python
# Uncomment if you want to start from scratch.
# IF EXISTS avoids an error when the table has not been created yet.

# spark.sql(f"DROP TABLE IF EXISTS {source_table_fullname}")

Python
source_df = spark.read.parquet("/databricks-datasets/wikipedia-datasets/data-001/en_wikipedia/articles-only-parquet").limit(10)
display(source_df)

Python
source_df.write.format("delta").option("delta.enableChangeDataFeed", "true").saveAsTable(source_table_fullname)

Python
display(spark.sql(f"SELECT * FROM {source_table_fullname}"))

Create vector search endpoint

Python
vector_search_endpoint_name = "vector-search-demo-endpoint"

Python
vsc.create_endpoint(
    name=vector_search_endpoint_name,
    endpoint_type="STANDARD" # or "STORAGE_OPTIMIZED"
)

Python
endpoint = vsc.get_endpoint(
  name=vector_search_endpoint_name)
endpoint

Create vector index

Python
# Vector index
vs_index = "en_wiki_index"
vs_index_fullname = f"{catalog_name}.{schema_name}.{vs_index}"

embedding_model_endpoint = "databricks-gte-large-en"

Python
index = vsc.create_delta_sync_index(
  endpoint_name=vector_search_endpoint_name,
  source_table_name=source_table_fullname,
  index_name=vs_index_fullname,
  pipeline_type='TRIGGERED',
  primary_key="id",
  embedding_source_column="text",
  embedding_model_endpoint_name=embedding_model_endpoint
)
index.describe()

Get a vector index

Use get_index() to retrieve the vector index object by its name. You can also call describe() on the index object to see a summary of the index's configuration.

Python
index = vsc.get_index(endpoint_name=vector_search_endpoint_name, index_name=vs_index_fullname)

index.describe()

Python
# Wait for index to come online. Expect this command to take several minutes.
import time
while not index.describe().get('status').get('detailed_state').startswith('ONLINE'):
  print("Waiting for index to be ONLINE...")
  time.sleep(5)
print("Index is ONLINE")
index.describe()
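The loop above runs until the index reports ONLINE, so it never ends if the index gets stuck in another state. A minimal sketch of the same polling with a timeout follows. The helper name wait_until_online, its parameters, and the state strings in the stand-in are hypothetical; only the status / detailed_state lookup comes from the cell above.

```python
import time

def wait_until_online(describe, timeout_s=1800, poll_s=5,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll describe() until detailed_state starts with ONLINE, or raise on timeout."""
    deadline = clock() + timeout_s
    while True:
        status = describe().get("status") or {}
        state = status.get("detailed_state") or ""
        if state.startswith("ONLINE"):
            return state
        if clock() >= deadline:
            raise TimeoutError(f"Index not ONLINE after {timeout_s}s (last state: {state!r})")
        sleep(poll_s)

# Stand-in for index.describe with made-up states: ONLINE on the third call.
_states = iter(["PENDING", "PENDING", "ONLINE"])

def fake_describe():
    return {"status": {"detailed_state": next(_states)}}

final_state = wait_until_online(fake_describe, sleep=lambda seconds: None)
print(final_state)
```

In the notebook you would call wait_until_online(index.describe) and pick a timeout that matches how long your index usually takes to build.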

Similarity search

Query the vector index to find similar documents.

Python
# Returns [col1, col2, ...]
# You can set this to any subset of the columns.
all_columns = spark.table(source_table_fullname).columns

results = index.similarity_search(
  query_text="Greek myths",
  columns=all_columns,
  num_results=2)

results

Python
# Search with a filter. Note that the syntax depends on the endpoint type.

# Standard endpoint syntax
results = index.similarity_search(
  query_text="Greek myths",
  columns=all_columns,
  filters={"id NOT": ("13770", "88231")},
  num_results=2)

# Storage-optimized endpoint syntax
# results = index.similarity_search(
#   query_text="Greek myths",
#   columns=all_columns,
#   filters='id NOT IN ("13770", "88231")',
#   num_results=2)

results
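Because the two endpoint types expect different filter syntax, it can help to build the filter in one place. The helper below is a hypothetical sketch (not_in_filter is not part of the SDK) that only reproduces the two forms shown in the cell above. It interpolates values without escaping, so it is suitable only for trusted values.

```python
def not_in_filter(column, values, endpoint_type="STANDARD"):
    """Return a filter that excludes `values` from `column` for the given endpoint type."""
    if endpoint_type == "STANDARD":
        # Standard endpoints: dictionary with a "<column> NOT" key.
        return {f"{column} NOT": tuple(values)}
    if endpoint_type == "STORAGE_OPTIMIZED":
        # Storage-optimized endpoints: SQL-like filter string.
        quoted = ", ".join(f'"{value}"' for value in values)
        return f"{column} NOT IN ({quoted})"
    raise ValueError(f"Unknown endpoint type: {endpoint_type!r}")

standard_filter = not_in_filter("id", ["13770", "88231"])
storage_filter = not_in_filter("id", ["13770", "88231"], "STORAGE_OPTIMIZED")
print(standard_filter)
print(storage_filter)
```

Either value can then be passed as the filters argument of similarity_search, matching the endpoint type chosen when the endpoint was created.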

Convert results to LangChain documents

The first column retrieved is loaded into page_content, and the rest into metadata.

Python
from typing import List

from langchain_core.documents import Document

def convert_vector_search_to_documents(results) -> List[Document]:
  # Each manifest entry is a dict with a "name" key.
  column_names = [column["name"] for column in results["manifest"]["columns"]]

  langchain_docs = []
  for item in results["result"]["data_array"]:
    # item[0] becomes page_content; item[-1] is the similarity score and is skipped.
    metadata = dict(zip(column_names[1:-1], item[1:-1]))
    langchain_docs.append(Document(page_content=item[0], metadata=metadata))
  return langchain_docs

langchain_docs = convert_vector_search_to_documents(results)

langchain_docs
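To see the mapping without LangChain installed, the sketch below applies the same row layout to a made-up payload and keeps the score, which the function above discards. The helper split_rows, the sample column names, and the sample values are all hypothetical; only the manifest / result / data_array structure is taken from the function above.

```python
def split_rows(results):
    """Split each result row into page content, metadata, and similarity score."""
    names = [column["name"] for column in results["manifest"]["columns"]]
    rows = []
    for item in results["result"]["data_array"]:
        rows.append({
            "page_content": item[0],                         # first requested column
            "metadata": dict(zip(names[1:-1], item[1:-1])),  # middle columns
            "score": item[-1],                               # last value in the row
        })
    return rows

# Made-up payload for illustration; a real response comes from similarity_search.
sample_results = {
    "manifest": {"columns": [{"name": "text"}, {"name": "id"},
                             {"name": "title"}, {"name": "score"}]},
    "result": {"data_array": [["Sample article text", "1", "Sample title", 0.5]]},
}

rows = split_rows(sample_results)
print(rows[0])
```

Since page_content is taken from whichever column is requested first, one option is to pass columns to similarity_search with the text column in first position when the article text is what you want as page_content.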

Delete vector index

Python
vsc.delete_index(index_name=vs_index_fullname)
