simple_structured_data_extraction_ai_query (Python)

Example: Structured data extraction, batch inference & evaluation

This notebook demonstrates how to perform basic structured data extraction using ai_query (AWS | Azure).

The process illustrates how to effectively transform raw, unstructured data into organized, actionable information through automated extraction techniques.

This notebook also shows how to use Mosaic AI Agent Evaluation (AWS | Azure) to evaluate extraction accuracy when ground-truth data is available.

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.

Perform batch inference using ai_query

To demonstrate how to use ai_query for structured data extraction, this notebook creates a simulated dataset of employment contracts. This dummy dataset serves as a testbed for entity extraction, focusing on key information such as employer and employee names. It also includes the ground truth for the data to be extracted, which is used later for evaluation.

This notebook then uses this dataset to conduct batch inference with ai_query (AWS | Azure).

Dummy contract data
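As a rough sketch of what this cell might contain, the snippet below builds a small simulated dataset of employment contracts with the ground-truth entities alongside each contract. The field names, contract texts, and company/person names are illustrative assumptions, not the notebook's actual data.

```python
# Hypothetical dummy dataset: each record holds the raw contract text plus
# the ground-truth entities we expect the LLM to extract.
contracts = [
    {
        "contract_id": 1,
        "contract_text": (
            "This Employment Agreement is made between Acme Corp (the Employer) "
            "and Jane Doe (the Employee), effective January 1, 2024."
        ),
        "expected_employer": "Acme Corp",
        "expected_employee": "Jane Doe",
    },
    {
        "contract_id": 2,
        "contract_text": (
            "Globex Inc. (the Employer) hereby employs John Smith (the Employee) "
            "under the terms set forth below, effective March 15, 2024."
        ),
        "expected_employer": "Globex Inc.",
        "expected_employee": "John Smith",
    },
]

# On Databricks, convert the records to a Spark DataFrame for batch inference:
# df = spark.createDataFrame(contracts)
```

Keeping the expected values in the same rows as the inputs makes the later evaluation step a simple column-wise comparison.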

Structured data extraction with ai_query

The next cell defines the main input required to perform structured data extraction with ai_query:

  • The LLM endpoint name
  • The prompt instructing the LLM to perform data extraction and to use JSON as response format
  • The JSON schema of the response
Define prompt and response format
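The three inputs listed above might be defined as follows. The endpoint name, prompt wording, and schema field names (`employer_name`, `employee_name`) are assumptions for illustration; the response-format structure follows the JSON-schema style that ai_query's structured-output option accepts.

```python
import json

# Assumed Foundation Model serving endpoint; replace with your own.
ENDPOINT_NAME = "databricks-meta-llama-3-3-70b-instruct"

# Prompt instructing the model to extract entities and reply in JSON.
PROMPT = (
    "Extract the employer name and the employee name from the following "
    "employment contract. Respond only with JSON matching the provided schema. "
    "Contract: "
)

# JSON schema for the response, serialized so it can be passed to
# ai_query's responseFormat argument.
RESPONSE_FORMAT = json.dumps({
    "type": "json_schema",
    "json_schema": {
        "name": "contract_entities",
        "schema": {
            "type": "object",
            "properties": {
                "employer_name": {"type": "string"},
                "employee_name": {"type": "string"},
            },
            "required": ["employer_name", "employee_name"],
        },
        "strict": True,
    },
})
```

Marking both fields as `required` with `strict` schema enforcement nudges the model to always return both entities, which simplifies downstream parsing.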

Batch inference

Below, ai_query is applied to the Spark DataFrame as a SQL expression using the inputs defined above. The LLM's response, a JSON string, is parsed to extract the individual data points.

Use ai_query for batch inference
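A minimal sketch of this step, assuming the endpoint, prompt, and response format from the previous cell: the ai_query call is built as a SQL expression string, applied with `selectExpr`, and the returned JSON string is parsed into columns with `from_json`. The helper names and column names are hypothetical, and the Spark portion is wrapped in a function since it only runs on a cluster.

```python
# Assumed endpoint name, carried over from the earlier cell.
ENDPOINT_NAME = "databricks-meta-llama-3-3-70b-instruct"

def build_ai_query_expr(endpoint: str, text_col: str,
                        prompt: str, response_format: str) -> str:
    """Build the SQL expression string passed to selectExpr.

    Single quotes in the prompt/schema are escaped so the literal
    stays valid inside the SQL string.
    """
    escaped_prompt = prompt.replace("'", "\\'")
    escaped_format = response_format.replace("'", "\\'")
    return (
        f"ai_query('{endpoint}', CONCAT('{escaped_prompt}', {text_col}), "
        f"responseFormat => '{escaped_format}') AS response"
    )

def run_batch_inference(df, expr: str):
    """Apply ai_query and parse its JSON response (Databricks only)."""
    from pyspark.sql.functions import col, from_json
    result = df.selectExpr("*", expr)
    # Parse the JSON string into typed columns matching the schema.
    parsed_schema = "employer_name STRING, employee_name STRING"
    return result.withColumn("extracted", from_json(col("response"), parsed_schema))
```

Because ai_query runs as a SQL function, the batch is distributed across the cluster without any explicit loop over rows.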

Evaluate the agent with Agent Evaluation

To assess the agent's quality, we'll use the Agent Evaluation framework (AWS | Azure). This approach employs a correctness judge to compare expected entities (or facts) with the actual response, providing a comprehensive evaluation of the agent's performance.

Note: An alternative approach would be to compute metrics such as recall and precision for individual entities, though this would require additional data transformations or custom metrics.
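The evaluation call might look like the sketch below: the requests, actual responses, and expected responses are assembled into the column layout Agent Evaluation reads, then passed to `mlflow.evaluate` with `model_type="databricks-agent"`. The sample values are illustrative, and the evaluation itself is wrapped in a function because it needs a Databricks workspace with the agent-evaluation packages installed.

```python
import pandas as pd

# Hypothetical evaluation set in the shape Agent Evaluation expects:
# a request, the model's actual response, and the ground-truth answer.
eval_df = pd.DataFrame({
    "request": [
        "Extract the employer and employee names from contract 1.",
    ],
    "response": [
        '{"employer_name": "Acme Corp", "employee_name": "Jane Doe"}',
    ],
    "expected_response": [
        '{"employer_name": "Acme Corp", "employee_name": "Jane Doe"}',
    ],
})

def evaluate_extractions(data: pd.DataFrame):
    """Run Agent Evaluation's LLM judges (correctness, etc.) over the
    extraction results. Requires mlflow and databricks-agents on Databricks."""
    import mlflow
    return mlflow.evaluate(data=data, model_type="databricks-agent")
```

Supplying `expected_response` is what enables the correctness judge to compare the extracted entities against the ground truth.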

Next steps

For further insights and related examples of structured data extraction on Databricks, explore the Databricks technical blog.