Information Extraction

Preview

This feature is in Public Preview and is HIPAA compliant.

This page covers the new version of Information Extraction. For information about the previous version, see Use Information Extraction (legacy).

Information Extraction transforms unstructured documents and text into key, structured insights using a defined schema. This allows information embedded in unstructured text, PDFs, images, or tables to be directly used for analysis, reporting, or downstream agents and applications.

Examples of information extraction include:

  • Extracting legal parties and terms from contracts.
  • Extracting line items and payment terms from invoices.
  • Pulling key details from medical records and notes.

Information Extraction is built on top of the AI function ai_extract and provides a visual UI to customize and optimize the function with a defined schema for extraction.

Information Extraction uses default storage to store temporary data transformations, model checkpoints, and internal metadata that power each agent. On agent deletion, all data associated with the agent is removed from default storage.

Requirements

Create an information extraction agent

In the left navigation pane of your workspace, click Agents. Then click Create Agent > Information Extraction.

Step 1. Select the data to extract information from

  1. Select the files or data you want to extract information from. You can upload files, select a Unity Catalog volume that contains supported file types, or select a table that contains text data.

  2. Click Create Agent.

Step 2. Configure and refine your extraction schema

After Information Extraction processes your data, configure and refine what data you want to extract from your documents.

  1. Under Configuration, define your extraction schema. There are several ways to do this:

    • Enter natural language that describes the information you want to extract and click Generate Schema. Information Extraction intelligently auto-generates a JSON schema with field names and definitions for you. Edit these descriptions as needed.
    • Alternatively, click Define manually to define your schema manually:
      1. Click Add field.
      2. Enter your field name, type, and description.
      3. Click Confirm.
      4. Repeat for each field you want to extract.
      5. Click Save and run extraction.
    • You can also click JSON to edit the JSON schema directly. Click Apply Changes when complete.

    Each time you update your schema and click Save and run extraction, Information Extraction updates the extraction agent, runs the extraction, and shows the results for each input.
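As an illustration, a generated extraction schema pairs each field name with a type and a description. The field names, types, and exact JSON layout below are hypothetical, a sketch of an invoice schema rather than the precise format the UI produces:

```python
import json

# Hypothetical extraction schema for an invoice use case.
# Field names, types, and descriptions are illustrative only;
# the JSON shown in the Information Extraction UI may differ.
invoice_schema = {
    "fields": [
        {
            "name": "vendor_name",
            "type": "string",
            "description": "Legal name of the vendor issuing the invoice.",
        },
        {
            "name": "invoice_date",
            "type": "date",
            "description": "Date the invoice was issued, in ISO 8601 format.",
        },
        {
            "name": "line_items",
            "type": "array",
            "description": "Billed line items with description, quantity, and amount.",
        },
        {
            "name": "payment_terms",
            "type": "string",
            "description": "Payment terms, for example 'Net 30'.",
        },
    ]
}

print(json.dumps(invoice_schema, indent=2))
```

Because the agent auto-tunes field descriptions from your feedback, precise descriptions like the ones above tend to yield more consistent extractions than bare field names.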

  2. On the left, review the parsed document and the agent's extraction. You can iterate on the extraction results in two ways: provide natural language feedback on one or more inputs, which intelligently auto-tunes your field descriptions, or manually revise the schema descriptions yourself. Either change takes effect the next time you click Save and run extraction.

  3. Use versions to compare or revert to a previous configuration. Click Versions, then click Compare to compare the schema definition of a previous version with the current version. Click Restore to restore a previous version.

Step 3. Use your extraction agent

Once you're happy with the agent's performance, use the agent to extract information.

Click Use Agent in the upper-right. You can choose either:

  • Run in SQL to use the agent to extract information from all your data. This opens a SQL query that uses ai_extract to extract information from your volume or table using the schema defined. For more information on using ai_extract in SQL queries, see ai_extract function.
  • Create a Spark Declarative Pipeline to deploy an ETL pipeline that runs on scheduled intervals to invoke your agent on new data. This creates a Lakeflow Spark Declarative Pipeline that updates a streaming table with your extracted data. You can configure the pipeline's schedule to run when new data arrives. For more information, see Lakeflow Spark Declarative Pipelines.
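The Run in SQL option generates a query in roughly the spirit of the sketch below. Here the table name (`documents`), column name (`text`), and the array-of-labels argument form are hypothetical placeholders; the query generated in your workspace uses your own data and the schema you defined, and its exact shape may differ. The Python helper only constructs the SQL string, since executing it requires a Databricks SQL context:

```python
# Sketch of the kind of ai_extract query the "Run in SQL" option opens.
# Table, column, and label names are placeholders, not the real generated query.
def build_extract_query(table: str, column: str, labels: list[str]) -> str:
    """Build a SELECT that applies ai_extract to one text column."""
    label_list = ", ".join(f"'{label}'" for label in labels)
    return (
        f"SELECT {column}, ai_extract({column}, array({label_list})) AS extracted\n"
        f"FROM {table}"
    )

query = build_extract_query("documents", "text", ["vendor_name", "invoice_date"])
print(query)
```

Running the generated query over a whole volume or table is a one-off batch extraction; the pipeline option is the better fit when new documents arrive continuously.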

Limitations

  • Information Extraction agents have a maximum context length of 128k tokens.
  • Workspaces that have Enhanced Security and Compliance enabled are not supported.
  • Union schema types are not supported.