structured_data_extraction_spark_udf (Python)

Example: Structured data extraction & batch inference

This notebook demonstrates the development, logging, and evaluation of a simple agent for structured data extraction. While the agent itself is simple, the same approach supports arbitrarily complex custom agents for batch inference via MLflow's PythonModel class.

This example showcases the application of a custom agent in batch inference across a set of unstructured documents. The process illustrates how to effectively transform raw, unstructured data into organized, actionable information through automated extraction techniques.

This notebook shows how to leverage Mosaic AI Agent Evaluation (AWS | Azure) to evaluate accuracy when ground-truth data is available.


Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.

Define the extraction agent

Define your agent code in a single cell below. This lets you write it to a local Python file with the %%writefile magic command for subsequent logging and deployment.

The extraction agent implements MLflow's PythonModel interface, so it can then be easily used as a Spark user-defined function (UDF) for batch inference.

Define agent
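A minimal sketch of what the agent file could look like. The prompt, entity schema, and serving endpoint name below are illustrative assumptions, not the notebook's exact code; the essential parts are subclassing `mlflow.pyfunc.PythonModel` and calling `mlflow.models.set_model()` so the file can be logged as code.

```python
# %%writefile extractor.py   <- uncomment on Databricks to write the file
# Hypothetical extraction agent sketch; prompt, endpoint, and schema are
# assumptions for illustration.
import json

PROMPT = (
    "Extract the employer and employee names from the contract below. "
    "Answer only with a JSON object with keys 'employer' and 'employee'.\n\n"
)

def parse_entities(raw: str) -> dict:
    """Pull the first JSON object out of the LLM's reply."""
    start, end = raw.find("{"), raw.rfind("}") + 1
    return json.loads(raw[start:end])

try:
    import mlflow

    class ExtractionAgent(mlflow.pyfunc.PythonModel):
        def predict(self, context, model_input):
            from mlflow.deployments import get_deploy_client
            client = get_deploy_client("databricks")
            out = []
            for doc in model_input["text"]:
                resp = client.predict(
                    endpoint="databricks-meta-llama-3-1-70b-instruct",  # assumed
                    inputs={"messages": [
                        {"role": "user", "content": PROMPT + doc}]},
                )
                out.append(json.dumps(
                    parse_entities(resp["choices"][0]["message"]["content"])))
            return out

    # Models-from-code: tell MLflow which object in this file is the model.
    mlflow.models.set_model(ExtractionAgent())
except ImportError:
    pass  # mlflow is preinstalled on Databricks; guarded for local use
```

Returning JSON strings (rather than dicts) keeps the output compatible with a string-typed Spark UDF later on.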


Log the agent as an MLflow Model

Log the agent as code from the extractor.py file. See MLflow Models from Code.
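The logging step might be sketched as follows; the artifact path and the helper name are assumptions. Passing a file path (not a Python object) as `python_model` is what triggers MLflow's models-from-code behavior.

```python
def log_extraction_agent(model_file: str = "extractor.py"):
    """Sketch: log the agent with MLflow's models-from-code approach.

    The file passed as `python_model` must call mlflow.models.set_model().
    """
    import mlflow  # lazy import; preinstalled on Databricks clusters

    with mlflow.start_run() as run:
        info = mlflow.pyfunc.log_model(
            artifact_path="extractor",   # assumed artifact path
            python_model=model_file,     # path string -> models from code
        )
    return run.info.run_id, info.model_uri
```

Keeping the `run_id` is useful later: the tracing section below passes it to the Spark UDF.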


Log the agent model

Batch inference & evaluation

To assess the agent's performance, create a simulated dataset of employment contracts. This dummy dataset will serve as a testbed for entity extraction, focusing on key information such as employer and employee names.

This notebook then utilizes this dataset to conduct batch inference testing, employing the logged agent model as a Spark User-Defined Function (UDF). This approach allows us to evaluate the agent's effectiveness in processing multiple documents simultaneously and extracting relevant entities at scale.

Dummy contract data
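The dummy data might be generated along these lines; the company and employee names, contract wording, and column names are invented for illustration.

```python
# Hypothetical dummy data: synthetic employment contracts paired with the
# ground-truth entities they contain, for use in evaluation.
ground_truth = [
    ("Acme Corp", "Jane Doe"),
    ("Globex GmbH", "John Smith"),
    ("Initech Ltd", "Maria Garcia"),
]

contracts = [
    {
        "text": (
            "EMPLOYMENT CONTRACT\n\n"
            f"This agreement is entered into between {employer} "
            f"(the 'Employer') and {employee} (the 'Employee'). "
            "The Employee agrees to perform the duties assigned."
        ),
        "expected_facts": [f"Employer: {employer}", f"Employee: {employee}"],
    }
    for employer, employee in ground_truth
]
# On Databricks: df = spark.createDataFrame(contracts)
```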

Load agent as Spark UDF


Tracing

Note: In the next cell, pass the run_id of the active experiment run to the Spark UDF. The run_id is used by the agent model to log the traces. Navigate to the MLflow experiment run to inspect the traces for each LLM request. In addition to the full request and response, these contain the token statistics defined in the agent model.

Batch inference
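The batch inference step might be sketched as below; the column names and the way run_id is threaded through to the agent are assumptions based on the tracing note above.

```python
def run_batch_extraction(spark, df, model_uri: str, run_id: str):
    """Sketch: apply the logged agent as a Spark UDF for batch inference.

    `run_id` is added as a column so the agent can attach its MLflow traces
    to the active experiment run; column names are assumptions.
    """
    import mlflow
    from pyspark.sql import functions as F

    extract = mlflow.pyfunc.spark_udf(spark, model_uri, result_type="string")
    return (df.withColumn("run_id", F.lit(run_id))
              .withColumn("extracted", extract(F.struct("text", "run_id"))))
```

Loading the model once via `mlflow.pyfunc.spark_udf` lets Spark distribute the agent to the executors, so documents are processed in parallel rather than one request at a time.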


Evaluate the agent with Agent Evaluation

To assess the agent's quality, use the Agent Evaluation framework (AWS | Azure). This approach employs a correctness judge to compare expected entities (or facts) with the actual response, providing a comprehensive evaluation of the agent's performance.

Note: An alternative approach would be to compute metrics such as recall and precision for individual entities, though this would require additional data transformations or custom metrics.
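The evaluation call might look like the following sketch; the input column layout is an assumption based on what Agent Evaluation's correctness judge expects.

```python
def evaluate_agent(results_pdf):
    """Sketch: score responses with Agent Evaluation's built-in judges.

    `results_pdf` is assumed to be a pandas DataFrame with `request`,
    `response`, and `expected_facts` columns.
    """
    import mlflow

    return mlflow.evaluate(
        data=results_pdf,
        model_type="databricks-agent",  # enables the LLM correctness judge
    )
```

The returned result exposes per-row judgments and aggregate metrics, which appear alongside the traces in the MLflow experiment run.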



Next steps

If the evaluation is successful, the next step would be to register the model in Unity Catalog for use in production. For more information about deploying generative AI and machine learning models to production, refer to the Big Book of MLOps.

For further insights and related examples of structured data extraction on Databricks, explore the related technical blog posts on the Databricks blog.