
AI Builder: Information extraction

This feature is in Beta.

This article describes how to create a generative AI agent for information extraction using AI Builder: Information extraction.

What is AI Builder: Information extraction?

AI Builder provides a simple, no-code approach to building and optimizing domain-specific, high-quality AI agent systems for common AI use cases. AI Builder supports information extraction and simplifies the process of transforming a large volume of unlabeled text documents into a structured table with extracted information for each document.

Examples of information extraction include:

  • Extracting prices and lease information from contracts.
  • Organizing data from customer notes.
  • Getting important details from news articles.

AI Builder: Information extraction uses automated evaluation capabilities, including MLflow and Agent Evaluation, so you can quickly assess the cost-quality tradeoff for your specific extraction task and make an informed decision about the balance between accuracy and resource investment.

Requirements

  • Serverless-supported workspace that includes the following:
    • Unity Catalog enabled on your workspace.
    • A workspace in one of the supported regions: us-east-1 or us-west-2.
    • Access to foundation models in Unity Catalog through the system.ai schema.
    • Access to a serverless budget policy with a nonzero budget.
  • Ability to use the ai_query SQL function (a quick check is sketched after this list).
  • Files that you want to extract data from. The files must be in a Unity Catalog volume or table.
    • To build your agent you need at least 10 unlabeled documents in your Unity Catalog volume or 10 rows in your table.
    • To optimize your agent (see (Optional) Step 4: Review and deploy an optimized agent), you must have at least 75 unlabeled documents in your Unity Catalog volume or at least 75 rows in your table.
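
If you want to confirm that ai_query is available before you start, you can run a minimal query against a foundation model serving endpoint. The following is a sketch only: the endpoint name is a placeholder, so substitute an endpoint that is available in your workspace.

SQL
-- Minimal check that the ai_query SQL function is available.
-- The endpoint name below is a placeholder; substitute a foundation model
-- serving endpoint that exists in your workspace.
SELECT ai_query(
  'databricks-meta-llama-3-3-70b-instruct',
  'Reply with the single word OK.'
) AS ai_query_check;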

Create an information extraction agent

Go to AI Builder in the left navigation pane of your workspace and click Information Extraction.


Step 1: Add input data and output example

On the Configure tab, click Show an example > to expand an example input and model response for an information extraction agent.

In the pane below, configure your agent:

  1. In the Source documents field, select the folder or table you want to use from your Unity Catalog volume. If you selected a table, select the column containing your text data from the dropdown.

    The folder must contain documents in a supported document format, and the table column must contain data in a supported data format (see Supported document formats and Supported data formats). This dataset is used to create your agent.

    The following is an example volume:

    /Volumes/main/info-extraction/bbc_articles/

  2. In the Sample output field, provide an example response:

    JSON
    {
      "title": "Economy Slides to Recession",
      "category": "Politics",
      "paragraphs": [
        {
          "summary": "GDP fell by 0.1% in the last three months of 2004.",
          "word_count": 38
        },
        {
          "summary": "Consumer spending had been depressed by one-off factors such as the unseasonably mild winter.",
          "word_count": 42
        }
      ],
      "tags": ["Recession", "Economy", "Consumer Spending"],
      "estimate_time_to_read_min": 1,
      "published_date": "2005-01-15",
      "needs_review": false
    }
  3. Provide a name for your agent, or keep the default name.

  4. Select Create agent.

Supported document formats

The following document file types are supported for your source documents if you provide a Unity Catalog volume.

  • Code files: .c, .cc, .cpp, .cs, .css, .cxx, .go, .h, .hpp, .htm, .html, .java, .js, .json, .jsonl, .jsx, .lua, .md, .php, .pl, .py, .rb, .sh, .swift, .tex, .ts, .tsx
  • Document files: .md, .rst, .tex, .txt, .xml, .xsd, .xsl
  • Log files: .diff, .err, .log, .out, .patch

Supported data formats

AI Builder supports the following data types and schemas for your source documents if you provide a Unity Catalog table. AI Builder can also extract these data types from each document.

  • str
  • int
  • float
  • boolean
  • Custom nested fields
  • Arrays of the above data types
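
As a point of reference, a table-based source might look like the following minimal sketch, with one document per row in a single STRING column. The catalog, schema, table, and column names are illustrative placeholders, not required names.

SQL
-- Hypothetical source table: one document per row, with the document text
-- in a single STRING column that you select in the Source documents field.
-- All names below are illustrative placeholders.
CREATE TABLE IF NOT EXISTS main.info_extraction.news_articles (
  article_id BIGINT,
  article_text STRING
);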

Step 2: Build and improve your agent

On the Agent configuration pane,

  1. (Optional) Add global instructions for your agent, such as a prompt that can apply to all fields.

  2. Adjust the descriptions of the schema fields that you want your agent to use for output responses. These descriptions are what the agent relies on to understand what you want to extract. For example, the description for the published_date field in the sample output above might specify that dates use the YYYY-MM-DD format.

    Agent configuration pane on the Build tab of AI Builder: Information Extraction.

On the Improve your agent pane,

  1. Review model output examples based on the specifications provided for each field.

  2. Review the Databricks recommendations for optimizing agent performance.

  3. Apply recommendations and adjust your descriptions and instructions on the Agent configuration pane as needed.

    Improve your agent pane on the Build tab of AI Builder: Information Extraction.

  4. After you apply changes and recommendations, select Update agent to save your changes. The Improve your agent pane updates to show new example model output, but the recommendations on this pane do not update.

Now you have an agent for information extraction.

Step 3: Use your agent

You can use your agent in workflows across Databricks.

On the Use tab,

  1. Select Start extraction to open the SQL editor and use ai_query to send requests to your new information extraction agent (a sample query sketch follows at the end of this step).

  2. (Optional) Select Optimize if you want to optimize your agent for cost.

    • Optimization can take about an hour.
    • Making changes to your currently active agent is blocked when optimization is in progress.

When optimization completes, you are directed to the Review tab to view a comparison of your currently active agent and an agent optimized for cost. See (Optional) Step 4: Review and deploy an optimized agent.

Extract data for all documents tile and the Optimize agent performance tile on the Use tab of AI Builder: Information extraction
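
As a rough illustration of what such an ai_query call might look like, the following sketch uses hypothetical placeholder names for the agent endpoint, table, and column; adapt it to your own agent and source data.

SQL
-- Hypothetical sketch: run the extraction agent over every row of a source table.
-- 'my_info_extraction_agent' and the table and column names are placeholders;
-- adapt them to your own agent and source data.
SELECT
  article_id,
  ai_query('my_info_extraction_agent', article_text) AS extracted_info
FROM main.info_extraction.news_articles;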

(Optional) Step 4: Review and deploy an optimized agent

When you select Optimize on the Use tab, Databricks compares multiple optimization strategies to build and recommend an optimized agent. These strategies include Foundation Model Fine-tuning, which uses Databricks Geos.

On the Review tab,

  1. In Evaluation results, you can visually compare your currently active agent and the agent optimized for cost. Databricks chooses an evaluation metric based on each field's data type and evaluates both agents on an evaluation data set drawn from a subset of the data you used to create your original agent.

    1. Metrics are logged to your MLflow run per-field (aggregated to the top-level field).
    2. Select the overall_score and is_schema_match columns from the Columns drop-down.
  2. After you review these results, click Deploy if you want to deploy this optimized agent instead of your currently active agent.

Limitations

  • Databricks recommends at least 1000 documents to optimize your agent. Adding more documents increases the knowledge base the agent can learn from, which improves agent quality and extraction accuracy.
  • If your source documents include a file larger than 3 MB, agent creation will fail.
  • Documents larger than 64 KB might be skipped during agent building.
  • The input and output limit is 128K tokens.
  • Workspaces that use PrivateLink, including storage behind PrivateLink, are not supported.
  • Union schema types are not supported.