simple_structured_data_extraction_ai_query (Python)

Example: Structured data extraction, batch inference & evaluation

This notebook demonstrates how to perform basic structured data extraction using ai_query (AWS | Azure).

The process illustrates how to effectively transform raw, unstructured data into organized, actionable information through automated extraction techniques.

This notebook also shows how to use Mosaic AI Agent Evaluation (AWS | Azure) to evaluate extraction accuracy when ground-truth data is available.

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.

Perform batch inference using ai_query

To demonstrate how to use ai_query for structured data extraction, this notebook creates a simulated dataset of employment contracts. This dummy dataset serves as a testbed for entity extraction, focusing on key information such as employer and employee names. It also includes the ground truth for the data to be extracted, which is used later for evaluation.

This notebook then uses this dataset to conduct batch inference with ai_query (AWS | Azure).

Dummy contract data
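As a rough sketch of what this cell might contain, the snippet below builds a small simulated dataset of employment contracts with the ground-truth entities alongside each contract. The field names, contract texts, and company/person names are illustrative assumptions, not the notebook's actual data.

```python
# Hypothetical dummy dataset: each record holds the raw contract text plus
# the ground-truth entities we expect the LLM to extract.
contracts = [
    {
        "contract_id": 1,
        "contract_text": (
            "This Employment Agreement is made between Acme Corp (the Employer) "
            "and Jane Doe (the Employee), effective January 1, 2024."
        ),
        "expected_employer": "Acme Corp",
        "expected_employee": "Jane Doe",
    },
    {
        "contract_id": 2,
        "contract_text": (
            "Globex Inc. (the Employer) hereby employs John Smith (the Employee) "
            "under the terms set forth below, effective March 15, 2024."
        ),
        "expected_employer": "Globex Inc.",
        "expected_employee": "John Smith",
    },
]

# On Databricks, convert the records to a Spark DataFrame for batch inference:
# df = spark.createDataFrame(contracts)
```

Keeping the expected values in the same rows as the inputs makes the later evaluation step a simple column-wise comparison.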

Structured data extraction with ai_query

The next cell defines the main input required to perform structured data extraction with ai_query:

  • The LLM endpoint name
  • The prompt instructing the LLM to perform data extraction and to use JSON as response format
  • The JSON schema of the response
Define prompt and response format
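The three inputs listed above might be defined as follows. The endpoint name, prompt wording, and schema field names (`employer_name`, `employee_name`) are assumptions for illustration; the response-format structure follows the JSON-schema style that ai_query's structured-output option accepts.

```python
import json

# Assumed Foundation Model serving endpoint; replace with your own.
ENDPOINT_NAME = "databricks-meta-llama-3-3-70b-instruct"

# Prompt instructing the model to extract entities and reply in JSON.
PROMPT = (
    "Extract the employer name and the employee name from the following "
    "employment contract. Respond only with JSON matching the provided schema. "
    "Contract: "
)

# JSON schema for the response, serialized so it can be passed to
# ai_query's responseFormat argument.
RESPONSE_FORMAT = json.dumps({
    "type": "json_schema",
    "json_schema": {
        "name": "contract_entities",
        "schema": {
            "type": "object",
            "properties": {
                "employer_name": {"type": "string"},
                "employee_name": {"type": "string"},
            },
            "required": ["employer_name", "employee_name"],
        },
        "strict": True,
    },
})
```

Marking both fields as `required` with `strict` schema enforcement nudges the model to always return both entities, which simplifies downstream parsing.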

Batch inference

Below, ai_query is applied to the Spark DataFrame as a SQL expression using the inputs defined above. The LLM's response, a JSON string, is parsed to extract the individual data points.

Use ai_query for batch inference
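A minimal sketch of this step, assuming the endpoint, prompt, and response format from the previous cell: the ai_query call is built as a SQL expression string, applied with `selectExpr`, and the returned JSON string is parsed into columns with `from_json`. The helper names and column names are hypothetical, and the Spark portion is wrapped in a function since it only runs on a cluster.

```python
# Assumed endpoint name, carried over from the earlier cell.
ENDPOINT_NAME = "databricks-meta-llama-3-3-70b-instruct"

def build_ai_query_expr(endpoint: str, text_col: str,
                        prompt: str, response_format: str) -> str:
    """Build the SQL expression string passed to selectExpr.

    Single quotes in the prompt/schema are escaped so the literal
    stays valid inside the SQL string.
    """
    escaped_prompt = prompt.replace("'", "\\'")
    escaped_format = response_format.replace("'", "\\'")
    return (
        f"ai_query('{endpoint}', CONCAT('{escaped_prompt}', {text_col}), "
        f"responseFormat => '{escaped_format}') AS response"
    )

def run_batch_inference(df, expr: str):
    """Apply ai_query and parse its JSON response (Databricks only)."""
    from pyspark.sql.functions import col, from_json
    result = df.selectExpr("*", expr)
    # Parse the JSON string into typed columns matching the schema.
    parsed_schema = "employer_name STRING, employee_name STRING"
    return result.withColumn("extracted", from_json(col("response"), parsed_schema))
```

Because ai_query runs as a SQL function, the batch is distributed across the cluster without any explicit loop over rows.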

Evaluate the agent with Agent Evaluation

To assess the agent's quality, we'll use the Agent Evaluation framework (AWS | Azure). This approach employs a correctness judge to compare expected entities (or facts) with the actual response, providing a comprehensive evaluation of the agent's performance.

Note: An alternative approach would be to compute metrics such as recall and precision for individual entities, though this would require additional data transformations or custom metrics.
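The evaluation call might look like the sketch below: the requests, actual responses, and expected responses are assembled into the column layout Agent Evaluation reads, then passed to `mlflow.evaluate` with `model_type="databricks-agent"`. The sample values are illustrative, and the evaluation itself is wrapped in a function because it needs a Databricks workspace with the agent-evaluation packages installed.

```python
import pandas as pd

# Hypothetical evaluation set in the shape Agent Evaluation expects:
# a request, the model's actual response, and the ground-truth answer.
eval_df = pd.DataFrame({
    "request": [
        "Extract the employer and employee names from contract 1.",
    ],
    "response": [
        '{"employer_name": "Acme Corp", "employee_name": "Jane Doe"}',
    ],
    "expected_response": [
        '{"employer_name": "Acme Corp", "employee_name": "Jane Doe"}',
    ],
})

def evaluate_extractions(data: pd.DataFrame):
    """Run Agent Evaluation's LLM judges (correctness, etc.) over the
    extraction results. Requires mlflow and databricks-agents on Databricks."""
    import mlflow
    return mlflow.evaluate(data=data, model_type="databricks-agent")
```

Supplying `expected_response` is what enables the correctness judge to compare the extracted entities against the ground truth.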

Next steps

For further insights and related examples of structured data extraction on Databricks, explore the Databricks technical blog.