Use Agent Bricks: Information Extraction
This feature is in Beta.
This article describes how to create a generative AI agent for information extraction using Agent Bricks: Information Extraction.
Agent Bricks provides a simple approach to build and optimize domain-specific, high-quality AI agent systems for common AI use cases.
What is Agent Bricks: Information Extraction?
Agent Bricks supports information extraction and simplifies the process of transforming a large volume of unlabeled text documents into a structured table with extracted information for each document.
Examples of information extraction include:
- Extracting prices and lease information from contracts.
- Organizing data from customer notes.
- Getting important details from news articles.
Agent Bricks: Information Extraction uses automated evaluation capabilities, including MLflow and Agent Evaluation, to enable rapid assessment of the cost-quality tradeoff for your specific extraction task, so you can make informed decisions about the balance between accuracy and cost.
Requirements
- A workspace that includes the following:
- Mosaic AI Agent Bricks Preview (Beta) enabled. See Manage Databricks Previews.
- Serverless compute enabled. See Enable serverless compute.
- Unity Catalog enabled. See Enable a workspace for Unity Catalog.
- A workspace in one of the supported regions: us-east-1 or us-west-2.
- Access to foundation models in Unity Catalog through the system.ai schema.
- Access to a serverless budget policy with a nonzero budget.
- Ability to use the ai_query SQL function.
- Files that you want to extract data from. The files must be in a Unity Catalog volume or table.
- If you want to use PDFs, convert them to a Unity Catalog table first. See Use PDFs in Agent Bricks.
- To build your agent, you need at least 1 unlabeled document in your Unity Catalog volume or 1 row in your table.
- To optimize your agent (see (Optional) Step 4: Review and deploy an optimized agent), you must have at least 75 unlabeled documents in your Unity Catalog volume or at least 75 rows in your table.
Create an information extraction agent
Go to Agents in the left navigation pane of your workspace and click Information Extraction.
Step 1: Configure your agent
On the Configure tab, click Show an example > to expand an example input and model response for an information extraction agent.
In the pane below, configure your agent:
- In the Name field, enter a name for your agent.
- Select the type of data you want to provide. You can choose either Unlabeled dataset or Labeled dataset.
- Select the dataset to provide.
If you select Unlabeled dataset:
- In the Dataset location field, select the folder or table you want to use from your Unity Catalog volume. If you select a folder, the folder must contain documents in a supported document format.
- If you're providing a table, select the column containing your text data from the dropdown. The table column must contain data in a supported data format.
If you want to use PDFs, convert them to a Unity Catalog table first. See Use PDFs in Agent Bricks.
The following is an example volume:
/Volumes/main/info-extraction/bbc_articles/
If you select Labeled dataset:
- In the Labeled training dataset field, select the Unity Catalog table you want to use.
- In the Input column field, select the column containing the text you want the agent to process. The data in this column must be in str format.
- In the Labeled response column field, select the column containing the labeled response you want the agent to generate. The data in this column must be a JSON string, and each row must follow the same JSON format. Rows with additional or missing keys are not accepted. A spot-check sketch for this requirement follows these steps.
During optimization, Agent Bricks uses the labeled data to improve the quality of the Information Extraction endpoint.
- If you provided an unlabeled dataset, Agent Bricks automatically infers and generates a sample JSON output containing data extracted from your dataset in the Sample JSON output field. You can accept the sample output, edit it, or replace it with an example of your desired JSON output. The agent returns extracted information using this format.
If you provided a labeled dataset, the Sample JSON output field shows the first row of data from the labeled response column. Verify this JSON output matches the expected format.
For example, the following sample JSON output might be used to extract information from a set of news articles:
```json
{
  "title": "Economy Slides to Recession",
  "category": "Politics",
  "paragraphs": [
    {
      "summary": "GDP fell by 0.1% in the last three months of 2004.",
      "word_count": 38
    },
    {
      "summary": "Consumer spending had been depressed by one-off factors such as the unseasonably mild winter.",
      "word_count": 42
    }
  ],
  "tags": ["Recession", "Economy", "Consumer Spending"],
  "estimate_time_to_read_min": 1,
  "published_date": "2005-01-15",
  "needs_review": false
}
```
- Click Create agent.
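If you supplied a labeled dataset, it can help to spot-check the labeled response column before creating the agent, because every row must parse to the same JSON shape. The following query is a minimal sketch, assuming a hypothetical table main.info_extraction.labeled_data with a response column; adjust the schema string to match your own format.

```sql
-- Minimal spot-check sketch; the table, column, and schema are hypothetical.
-- from_json returns NULL for rows that are not parseable JSON, and individual
-- NULL fields point at rows that are missing a key.
SELECT
  count_if(parsed IS NULL) AS unparseable_rows,
  count_if(parsed.title IS NULL OR parsed.category IS NULL) AS rows_missing_keys
FROM (
  SELECT from_json(response, 'title STRING, category STRING, tags ARRAY<STRING>') AS parsed
  FROM main.info_extraction.labeled_data
) AS t;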
Supported document formats
If you provide a Unity Catalog volume, your source documents can be code files, document files, or log files.
Supported data formats
Agent Bricks: Information Extraction supports the following data types and schemas for your source documents if you provide a Unity Catalog table. Agent Bricks can also extract these data types from each document.
- str
- int
- float
- boolean
- Custom nested fields
- Arrays of the above data types
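As a point of reference, these types line up with standard SQL column types. The following table definition is purely illustrative (the names are invented, not part of the product) and mirrors the sample JSON output above.

```sql
-- Illustrative only: one possible shape for a table that holds extracted fields.
-- STRING, INT, DOUBLE, and BOOLEAN correspond to str, int, float, and boolean;
-- STRUCT and ARRAY cover custom nested fields and arrays of the above types.
CREATE TABLE main.info_extraction.extracted_articles (
  title STRING,
  category STRING,
  paragraphs ARRAY<STRUCT<summary: STRING, word_count: INT>>,
  tags ARRAY<STRING>,
  estimate_time_to_read_min INT,
  published_date STRING,
  needs_review BOOLEAN
);
```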
Step 2: Build and improve your agent
On the Build tab, in the Agent configuration pane, refine your schema definition for better results.
- (Optional) Add global instructions for your agent, such as a prompt that can apply to all fields.
- Adjust the descriptions of the schema fields that you want your agent to use for output responses. These descriptions are what the agent relies on to understand what you want to extract.
- Click Update agent.
On the left side of the Build tab, review recommendations and sample outputs.
- Review model output examples based on the specifications provided for each field.
- Review the Databricks recommendations for optimizing agent performance.
- Apply recommendations and adjust your descriptions and instructions in the Agent configuration pane as needed.
- After you apply changes and recommendations, select Update agent to save those changes to your agent. The Improve your agent pane updates to show new example model output. The recommendations on this pane do not update.
Now you have an agent for information extraction.
Step 3: Use your agent
You can use your agent in workflows across Databricks. By default, Agent Bricks endpoints scale to zero after 3 days of inactivity, so you're billed only for the time the endpoint is up.
On the Use tab,
- Select Start extraction to open the SQL editor and use ai_query to send requests to your new information extraction agent. An illustrative query appears after this list.
- (Optional) Select Optimize if you want to optimize your agent for cost.
- Optimization requires at least 75 files.
- Optimization can take about an hour.
- You can't make changes to your currently active agent while optimization is in progress.
When optimization completes, you are directed to the Review tab to view a comparison of your currently active agent and an agent optimized for cost. See (Optional) Step 4: Review and deploy an optimized agent.
- (Optional) Select Create pipeline to deploy a pipeline that runs at scheduled intervals to use your agent on new data. See Lakeflow Declarative Pipelines for more information about pipelines.
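For illustration, an ad-hoc request from the SQL editor might look like the following sketch. The endpoint name is hypothetical; use the name Agent Bricks shows on the Use tab.

```sql
-- Hypothetical endpoint name; substitute the one shown on the Use tab.
SELECT ai_query(
  'my-info-extraction-endpoint',
  'Economy Slides to Recession. GDP fell by 0.1% in the last three months of 2004 ...'
) AS extracted_json;
```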
(Optional) Step 4: Review and deploy an optimized agent
When you select Optimize on the Use tab, Databricks compares multiple optimization strategies to build and recommend an optimized agent. These strategies include Foundation Model Fine-tuning, which uses Databricks Geos.
On the Review tab,
- In Evaluation results, you can visually compare the optimized agent and your active agent. To perform evaluation, Databricks chooses a metric based on each field's data type and uses an evaluation data set to compare your active agent and the agent optimized for cost. This evaluation set is based on a subset of the data you used to create your original agent.
  - Metrics are logged to your MLflow run per field (aggregated to the top-level field).
  - Select the overall_score and is_schema_match columns from the Columns drop-down.
- After you review these results, click Deploy if you want to deploy this optimized agent instead of your currently active agent.
Query the agent endpoint
There are multiple ways to query the created information extraction endpoint. Use the code examples provided in AI Playground as a starting point.
- On the Configure tab, click Open in playground.
- From Playground, click Get code.
- Choose how you want to use the endpoint:
- Select Apply on data to create a SQL query that applies the agent to a specific table column.
- Select Curl API for a code example to query the endpoint using curl.
- Select Python API for a code example to interact with the endpoint using Python.
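For example, an Apply on data query might resemble the following sketch, which runs the agent over a table column and saves the output to a new table. All table, column, and endpoint names here are hypothetical; the Playground-generated code is the authoritative starting point.

```sql
-- Hypothetical names throughout; use the code generated by AI Playground as
-- your starting point.
CREATE OR REPLACE TABLE main.info_extraction.extraction_results AS
SELECT
  article_id,
  ai_query('my-info-extraction-endpoint', article_text) AS extracted_json
FROM main.info_extraction.bbc_articles;
```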
Use PDFs in Agent Bricks
PDFs are not yet supported natively in Agent Bricks: Information Extraction and Custom LLM. However, you can use the Agent Bricks UI workflow to convert a folder of PDF files into markdown, then use the resulting Unity Catalog table as input when building your agent. The workflow uses ai_parse_document for the conversion (a rough SQL sketch appears after the steps below). Follow these steps:
- Click Agents in the left navigation pane to open Agent Bricks in Databricks.
- Within the Information Extraction or Custom LLM use cases, click Use PDFs.
- In the side panel that opens, complete the following fields to create a new workflow to convert your PDFs:
  - Select folder with PDFs: Select the Unity Catalog folder containing the PDFs you want to use.
  - Select destination table: Select the destination schema for the converted markdown table and, optionally, adjust the table name in the field below.
  - Select active SQL warehouse: Select the SQL warehouse to run the workflow.
- Click Start import.
- You are redirected to the All workflows tab, which lists all of your PDF workflows. Use this tab to monitor the status of your jobs. If your workflow fails, click the job name to open it and view error messages to help you debug.
- When your workflow completes successfully, click the job name to open the table in Catalog Explorer and explore the columns.
- Use the Unity Catalog table as input data in Agent Bricks when configuring your agent.
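The UI workflow handles the conversion for you, but as a rough idea of what it does, the following sketch reads PDFs from a volume as binary files and parses them with ai_parse_document. The paths and table names are hypothetical, and the actual query the workflow generates may differ.

```sql
-- Rough, hypothetical equivalent of the PDF conversion workflow; the generated
-- job may differ. read_files loads each PDF as binary content, and
-- ai_parse_document extracts its contents.
CREATE OR REPLACE TABLE main.info_extraction.parsed_pdfs AS
SELECT
  path,
  ai_parse_document(content) AS parsed
FROM read_files('/Volumes/main/info-extraction/pdf_contracts/', format => 'binaryFile');
```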
Limitations
- Databricks requires at least 75 documents to optimize your agent. For better optimization results, at least 1,000 documents are recommended. More documents give the agent a larger knowledge base to learn from, which improves agent quality and extraction accuracy.
- Information Extraction agents have a maximum context length of 128k tokens.
- Workspaces that use PrivateLink, including storage behind PrivateLink, are not supported.
- Workspaces that have Enhanced Security and Compliance enabled are not supported.
- Union schema types are not supported.