External embedding model example for AI Search (OpenAI)


This notebook shows how to use the AI Search Python SDK, which provides AISearchClient as the primary API for working with AI Search.

This notebook uses Databricks support for external models to access an OpenAI embedding model and generate embeddings.

Python
%pip install --upgrade --force-reinstall databricks-ai-search tiktoken
dbutils.library.restartPython()

Python
from databricks.ai_search.client import AISearchClient

vsc = AISearchClient(disable_notice=True)

Python
# Display help
help(AISearchClient)

Load sample dataset into source Delta table

The following cells create the source Delta table.

Python
# Specify the catalog and schema to use. You must have USE_CATALOG privilege on the catalog and USE_SCHEMA and CREATE_TABLE privileges on the schema.
# Change the catalog and schema here if necessary.

catalog_name = "main"
schema_name = "default"

Python

source_table_name = "wiki_articles_demo"
source_table_fullname = f"{catalog_name}.{schema_name}.{source_table_name}"

Python
# Uncomment the following line if you want to start from scratch.

# spark.sql(f"DROP TABLE {source_table_fullname}")

Python
source_df = spark.read.parquet("/databricks-datasets/wikipedia-datasets/data-001/en_wikipedia/articles-only-parquet").limit(10)
display(source_df)

Chunk sample dataset

Chunking the sample dataset helps avoid exceeding the context limit of the embedding model. The OpenAI model supports up to 8192 tokens. However, Databricks recommends splitting the data into smaller context chunks so that you can feed a wider variety of examples into the reasoning model for your RAG application.

Python
import tiktoken
import pandas as pd


max_chunk_tokens = 1024
encoding = tiktoken.get_encoding("cl100k_base")


def chunk_text(text):
    # Encode the text to tokens, slice the tokens into fixed-size chunks,
    # and decode each chunk back to text
    tokens = encoding.encode(text)
    chunks = []
    while tokens:
        chunk_tokens = tokens[:max_chunk_tokens]
        chunks.append(encoding.decode(chunk_tokens))
        tokens = tokens[max_chunk_tokens:]
    return chunks

# Process the data and store in a new list
pandas_df = source_df.toPandas()
processed_data = []
for index, row in pandas_df.iterrows():
    text_chunks = chunk_text(row['text'])
    chunk_no = 0
    for chunk in text_chunks:
        row_data = row.to_dict()

        # Replace the id column with a new unique chunk id
        # and the text column with the text chunk
        row_data['id'] = f"{row['id']}_{chunk_no}"
        row_data['text'] = chunk

        processed_data.append(row_data)
        chunk_no += 1

chunked_pandas_df = pd.DataFrame(processed_data)
chunked_spark_df = spark.createDataFrame(chunked_pandas_df)

# Write the chunked DataFrame to a Delta table
spark.sql(f"DROP TABLE IF EXISTS {source_table_fullname}")
chunked_spark_df.write.format("delta") \
    .option("delta.enableChangeDataFeed", "true") \
    .saveAsTable(source_table_fullname)

Python
display(spark.sql(f"SELECT * FROM {source_table_fullname}"))
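
The slicing logic in the chunking cell can be checked without loading a tokenizer. The following is a minimal sketch, assuming whitespace-separated words stand in for `tiktoken` token IDs; `chunk_tokens` is a hypothetical helper and not part of any SDK. It illustrates that fixed-size slicing yields chunks no longer than the limit and that concatenating the chunks recovers the original token sequence.

```python
def chunk_tokens(tokens, max_chunk_tokens):
    """Split a token list into consecutive chunks of at most max_chunk_tokens."""
    if max_chunk_tokens <= 0:
        raise ValueError("max_chunk_tokens must be positive")
    return [
        tokens[start:start + max_chunk_tokens]
        for start in range(0, len(tokens), max_chunk_tokens)
    ]


# Assumption: whitespace-separated words stand in for real token IDs.
sample_tokens = "the quick brown fox jumps over the lazy dog".split()
chunks = chunk_tokens(sample_tokens, max_chunk_tokens=4)

# Every chunk respects the limit, and no tokens are lost or reordered.
chunk_lengths = [len(chunk) for chunk in chunks]
rejoined = [token for chunk in chunks for token in chunk]
print(chunk_lengths)
```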

Create endpoint

Python
ai_search_endpoint_name = "ai-search-demo-endpoint"

Python
vsc.create_endpoint(
    name=ai_search_endpoint_name,
    endpoint_type="STANDARD" # or "STORAGE_OPTIMIZED"
)

Python
vsc.get_endpoint(
  name=ai_search_endpoint_name
)

Register the OpenAI embedding model endpoint

For detailed usage information, see the external models documentation on configuring an OpenAI endpoint.

To provide credentials, use the Databricks secrets manager.

Python
embedding_model_endpoint_name = "openai-embedding-endpoint"

Python
import mlflow.deployments

mlflow_deploy_client = mlflow.deployments.get_deploy_client("databricks")

# Configure the secret manager with the OpenAI API key and provide the
# correct scope and key name below.

mlflow_deploy_client.create_endpoint(
    name=embedding_model_endpoint_name,
    config={
        "served_entities": [{
            "external_model": {
                "name": "text-embedding-ada-002",
                "provider": "openai",
                "task": "llm/v1/embeddings",
                "openai_config": {
                    "openai_api_key": "{{secrets/demo/openai-api-key}}" # CHANGE ME
                }
            }
        }]
    }
)
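
The endpoint configuration in the preceding cell references the API key through a `{{secrets/<scope>/<key>}}` placeholder rather than embedding the key itself. As a minimal sketch, the hypothetical helper below (`build_openai_embedding_config`, not part of MLflow or the Databricks SDK) assembles the same dictionary from a scope and key name, which may make the placeholder harder to mistype. The scope `demo` and key `openai-api-key` are the placeholder values from that cell; replace them with your own.

```python
def build_openai_embedding_config(secret_scope, secret_key,
                                  model_name="text-embedding-ada-002"):
    """Return an external-model config dict that points at a Databricks secret."""
    # Secret references use the {{secrets/<scope>/<key>}} form.
    secret_ref = "{{secrets/" + secret_scope + "/" + secret_key + "}}"
    return {
        "served_entities": [{
            "external_model": {
                "name": model_name,
                "provider": "openai",
                "task": "llm/v1/embeddings",
                "openai_config": {"openai_api_key": secret_ref},
            }
        }]
    }


config = build_openai_embedding_config("demo", "openai-api-key")
```

The resulting dictionary could then be passed as the `config` argument of `mlflow_deploy_client.create_endpoint`.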

Create index

Python
# Create index
vs_index = f"{source_table_name}_openai_index"
vs_index_fullname = f"{catalog_name}.{schema_name}.{vs_index}"

Python
index = vsc.create_delta_sync_index(
  endpoint_name=ai_search_endpoint_name,
  source_table_name=source_table_fullname,
  index_name=vs_index_fullname,
  pipeline_type='TRIGGERED',
  primary_key="id",
  embedding_source_column="text",
  embedding_model_endpoint_name=embedding_model_endpoint_name
)
index.describe()['status']['message']

Python
# Wait for index to come online. Expect this command to take several minutes.
# You can also track the status of the index build in Catalog Explorer in the
# Overview tab for the index.

import time
index = vsc.get_index(endpoint_name=ai_search_endpoint_name,index_name=vs_index_fullname)
while not index.describe().get('status')['ready']:
  print("Waiting for index to be ready...")
  time.sleep(30)
print("Index is ready!")
index.describe()
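
The wait loop in the preceding cell polls indefinitely, so a failed index build would leave the cell running. The following is a minimal sketch of the same polling pattern with a timeout; `wait_until_ready` and its parameters are hypothetical names, not part of the AI Search SDK, and the readiness check here is a stand-in for illustration.

```python
import time


def wait_until_ready(is_ready, timeout_s=1800, poll_interval_s=30, sleep=time.sleep):
    """Poll is_ready() until it returns True or timeout_s seconds have elapsed."""
    deadline = time.monotonic() + timeout_s
    while not is_ready():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Not ready after {timeout_s} seconds")
        sleep(poll_interval_s)


# Stand-in readiness check for illustration: reports ready on the third poll.
poll_results = iter([False, False, True])
wait_until_ready(lambda: next(poll_results), poll_interval_s=0)
```

In the notebook, the readiness check could be `lambda: index.describe().get('status')['ready']`, which is the same expression the loop in the preceding cell evaluates.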

Similarity search

The following cells show how to query the index to find similar documents.

Python
results = index.similarity_search(
  query_text="Greek myths",
  columns=["id", "text", "title"],
  num_results=5
  )
rows = results['result']['data_array']
for (id, text, title, score) in rows:
  if len(text) > 32:
    # trim text output for readability
    text = text[0:32] + "..."
  print(f"id: {id}  title: {title} text: '{text}' score: {score}")
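
Each row in `data_array` holds the requested columns in order, followed by the similarity score, which is why the loop in the preceding cell unpacks four values for three columns. As a minimal sketch, assuming that layout, the hypothetical helper `rows_to_dicts` turns the rows into dictionaries keyed by column name. The response below is a made-up placeholder with the same shape, not real query output.

```python
def rows_to_dicts(results, columns):
    """Convert similarity_search rows into dicts, naming the trailing score."""
    keys = list(columns) + ["score"]
    return [dict(zip(keys, row)) for row in results["result"]["data_array"]]


# Placeholder response for illustration only; real rows come from the index.
example_results = {
    "result": {
        "data_array": [
            ["1_0", "example chunk text", "Example title", 0.5],
        ]
    }
}
records = rows_to_dicts(example_results, ["id", "text", "title"])
print(records[0]["title"], records[0]["score"])
```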

Python
# Search with a filter. Note that the syntax depends on the endpoint type.

# Standard endpoint syntax
results = index.similarity_search(
  query_text="Greek myths",
  columns=["id", "text", "title"],
  num_results=5,
  filters={"title NOT": "Hercules"}
)

# Storage-optimized endpoint syntax
# results = index.similarity_search(
#   query_text="Greek myths",
#   columns=["id", "text", "title"],
#   num_results=5,
#   filters='title != "Hercules"'
#   )

rows = results['result']['data_array']
for (id, text, title, score) in rows:
  if len(text) > 32:
    # trim text output for readability
    text = text[0:32] + "..."
  print(f"id: {id}  title: {title} text: '{text}' score: {score}")
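
Because the filter syntax depends on the endpoint type, one option is to build the filter in a single place. The following is a minimal sketch that covers only the "title is not equal to" case shown in the preceding cell; `title_exclusion_filter` is a hypothetical helper, not part of the AI Search SDK, and it does not escape quotation marks inside the title.

```python
def title_exclusion_filter(endpoint_type, title):
    """Build a 'title is not <title>' filter in the syntax of the endpoint type."""
    if endpoint_type == "STANDARD":
        # Standard endpoints take a dict keyed by "<column> NOT".
        return {"title NOT": title}
    if endpoint_type == "STORAGE_OPTIMIZED":
        # Storage-optimized endpoints take a SQL-like string.
        return f'title != "{title}"'
    raise ValueError(f"Unknown endpoint type: {endpoint_type}")


standard_filter = title_exclusion_filter("STANDARD", "Hercules")
storage_optimized_filter = title_exclusion_filter("STORAGE_OPTIMIZED", "Hercules")
```

Either value could then be passed as the `filters` argument of `index.similarity_search`, matching the type of the endpoint that hosts the index.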

Delete index

Python
vsc.delete_index(
  endpoint_name=ai_search_endpoint_name,
  index_name=vs_index_fullname
)
