PDF Field Extraction Configuration Design

Design the extraction configuration for AI-based data extraction from insurance and mortgage PDFs. Covers document type identification, field definition, confidence thresholds, model selection, pre-processing requirements, and structured output schema.

Step 1: Document Type Identification

Before extracting fields, the system must identify what type of document is being processed. Route to the correct extraction model based on document type.

Document classification first: Run a classifier (see form-classifier skill) before the extraction model. The classifier returns a document type label that selects the correct field extraction configuration.

Document types and extraction priority:

Document Type	Extraction Priority	Typical Use
ACORD 90 (personal auto application)	High	Insurance new business
ACORD 125 (commercial applicant)	High	Commercial lines
Policy declarations page	High	Policy verification
Loss run	High	Underwriting
Pay stub	High	Mortgage income verification
W-2	High	Mortgage income verification
1040 tax return	High	Mortgage income + self-employed
1003 Uniform Residential Loan Application	High	Mortgage processing
Bank statement	High	Mortgage asset verification
Explanation of Benefits (EOB)	Medium	Claims / medical
Certificate of insurance	Medium	Proof of coverage
MVR (Motor Vehicle Record)	Medium	Insurance underwriting
Inspection report	Medium	Property insurance
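
The routing described above can be sketched as a lookup from classifier label to extraction configuration. This is a minimal Python sketch under stated assumptions: the label strings, the ExtractionConfig type, and select_config are hypothetical names, and the real label set depends on what the form-classifier skill returns.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExtractionConfig:
    document_type: str
    priority: str  # "high" or "medium", taken from the table above


# Hypothetical classifier labels; only a few document types are shown.
CONFIGS = {
    "pay_stub": ExtractionConfig("pay_stub", "high"),
    "w2": ExtractionConfig("w2", "high"),
    "bank_statement": ExtractionConfig("bank_statement", "high"),
    "certificate_of_insurance": ExtractionConfig("certificate_of_insurance", "medium"),
}


def select_config(label: str) -> Optional[ExtractionConfig]:
    """Return the extraction config for a classifier label, or None if unknown."""
    return CONFIGS.get(label)
```

Returning None for an unrecognized label leaves the caller free to send the document to manual handling rather than guessing a configuration.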

Step 2: Field Definition by Document Type

Pay Stub

Field Name	Data Type	Required	Validation Rule	Extraction Note
employer_name	string	Required	Non-empty	Top of stub; sometimes in logo area
employee_name	string	Required	Non-empty	Matches borrower name on 1003
employee_ssn_last4	string	Optional	4 digits	Often masked; extract if visible
pay_period_start	date	Required	Valid date	Format: varies (MM/DD/YYYY, etc.)
pay_period_end	date	Required	Valid date	Must be more recent than prior stub
pay_date	date	Required	Valid date	Used to determine how current the stub is
pay_frequency	enum	Required	Weekly/Bi-Weekly/Semi-Monthly/Monthly	Calculate annual equivalent
gross_pay_this_period	currency	Required	Positive number	Current period gross before deductions
ytd_gross_pay	currency	Required	Positive number	Year-to-date gross
federal_tax_withheld	currency	Optional	Non-negative number	Cross-check against W-2
net_pay	currency	Optional	Positive number	Informational only
hourly_rate	currency	Optional	Positive number	If hourly employee; used to calculate annual income
hours_worked	number	Optional	Positive number	If hourly employee
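
The validation rules in the pay stub table could be applied after extraction roughly as follows. This is a sketch, not a definitive implementation: validate_pay_stub is a hypothetical helper, the fields dict is assumed to hold already-parsed values, and the period-order check is an added assumption that the table does not state.

```python
from datetime import date

PAY_FREQUENCIES = {"Weekly", "Bi-Weekly", "Semi-Monthly", "Monthly"}


def validate_pay_stub(fields: dict) -> list:
    """Return a list of validation error messages (empty if all checks pass)."""
    errors = []
    if fields.get("pay_frequency") not in PAY_FREQUENCIES:
        errors.append("pay_frequency is not a recognized value")

    # Assumption: a pay period cannot end before it starts.
    start, end = fields.get("pay_period_start"), fields.get("pay_period_end")
    if start and end and end < start:
        errors.append("pay_period_end precedes pay_period_start")

    # Year-to-date gross should be at least the current period's gross.
    gross, ytd = fields.get("gross_pay_this_period"), fields.get("ytd_gross_pay")
    if gross is not None and ytd is not None and ytd < gross:
        errors.append("ytd_gross_pay is less than gross_pay_this_period")
    return errors
```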

Annual income calculation:

Bi-weekly: ytd_gross / (pay_period_end_week_of_year / 2) × 26
Semi-monthly: ytd_gross / pay_period_number × 24 (pay_period_number = semi-monthly pay periods elapsed year to date)
Monthly: gross_pay_this_period × 12
Hourly: hourly_rate × hours_per_week × 52 (use 2-year average if variable)
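
The formulas above translate directly into code. The sketch below assumes the ISO week number of pay_period_end can stand in for pay_period_end_week_of_year; the function names are illustrative and not part of any existing API.

```python
from datetime import date


def annualize_biweekly(ytd_gross: float, pay_period_end: date) -> float:
    # Assumption: ISO week number approximates "week of year".
    week_of_year = pay_period_end.isocalendar()[1]
    periods_elapsed = week_of_year / 2
    return ytd_gross / periods_elapsed * 26


def annualize_semimonthly(ytd_gross: float, pay_period_number: int) -> float:
    # pay_period_number: semi-monthly periods elapsed year to date.
    return ytd_gross / pay_period_number * 24


def annualize_monthly(gross_pay_this_period: float) -> float:
    return gross_pay_this_period * 12


def annualize_hourly(hourly_rate: float, hours_per_week: float) -> float:
    return hourly_rate * hours_per_week * 52
```

ISO week numbers can misbehave at year boundaries (an early-January date may fall in week 52 or 53 of the prior ISO year), so a production version would likely count actual pay periods instead.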

W-2

Field Name	Box	Data Type	Required	Validation
employee_name	Employee name box	string	Required	Match to application
employer_name	Employer name box	string	Required	Match to pay stubs
employer_ein	b	string	Required	XX-XXXXXXX format
wages_tips_other	1	currency	Required	Primary income figure
federal_income_tax	2	currency	Optional	Cross-check with 1040
social_security_wages	3	currency	Optional	May differ from Box 1
medicare_wages	5	currency	Optional	May differ from Box 1
state	15	string	Optional	State of employment
state_wages	16	currency	Optional	State income
tax_year	top of form	year	Required	Validate: should be one of the two most recent tax years
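
A few of the W-2 validations above can be expressed as simple checks. This sketch assumes the tax-year rule means the year immediately before the current year or the one before that; validate_w2 and the fields dict shape are hypothetical.

```python
import re

# XX-XXXXXXX: two digits, a hyphen, seven digits.
EIN_PATTERN = re.compile(r"\d{2}-\d{7}")


def validate_w2(fields: dict, current_year: int) -> list:
    """Return a list of validation error messages for extracted W-2 fields."""
    errors = []
    ein = fields.get("employer_ein") or ""
    if not EIN_PATTERN.fullmatch(ein):
        errors.append("employer_ein must match XX-XXXXXXX")

    # Assumption: acceptable tax years are the two preceding the current year.
    if fields.get("tax_year") not in (current_year - 1, current_year - 2):
        errors.append("tax_year is outside the two most recent tax years")

    if fields.get("wages_tips_other") is None:
        errors.append("wages_tips_other (Box 1) is required")
    return errors
```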

1040 Tax Return (Federal)

Field Name	Line	Data Type	Required	Note
tax_year	Top of form	year	Required
filing_status	Filing status checkbox	enum	Required	Single/MFJ/MFS/HOH/QSS (QW on returns before tax year 2022)
total_income	9	currency	Required	Total income before adjustments
agi	11	currency	Required	Adjusted Gross Income
wages_salaries	1a	currency	Required for W-2 employees
business_income_loss	Schedule 1 line 3	currency	Required for self-employed	Carried from Schedule C
schedule_c_gross_revenue	Schedule C line 1	currency	Self-employed
schedule_c_net_profit	Schedule C line 31	currency	Self-employed	After expenses
rental_income	Schedule E	currency	If applicable
k1_income	Schedule E Part II	currency	Partnership/S-Corp
depreciation_added_back	Schedule C + E	currency	Self-employed	Non-cash expense added back
depletion_added_back	Schedule C + E	currency	Self-employed	Non-cash expense added back
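
The add-back fields exist because depreciation and depletion reduce taxable profit without reducing cash. A simplified sketch of how the extracted values might be combined is shown below; self_employed_cash_flow is a hypothetical helper, and actual underwriting guidelines typically apply further adjustments that are out of scope here.

```python
from decimal import Decimal


def self_employed_cash_flow(
    schedule_c_net_profit: Decimal,
    depreciation_added_back: Decimal = Decimal("0"),
    depletion_added_back: Decimal = Decimal("0"),
) -> Decimal:
    """Net profit plus non-cash expenses added back (simplified)."""
    return schedule_c_net_profit + depreciation_added_back + depletion_added_back
```

Decimal is used instead of float so that currency values are not subject to binary rounding error.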

Bank Statement

Field Name	Data Type	Required	Validation
account_holder_name	string	Required	Match to borrower
institution_name	string	Required	Non-empty
account_number_last4	string	Optional	4 digits (masked)
account_type	enum	Required	Checking/Savings/Money Market
statement_period_start	date	Required	Valid date
statement_period_end	date	Required	Valid date; should be no more than 60 days old
beginning_balance	currency	Required
ending_balance	currency	Required	Used for asset verification
total_deposits	currency	Required	Identifies large/unusual deposits
large_deposits	list	Required	Deposits > $[threshold]; itemized
nsf_count	integer	Optional	Count of NSF/returned items
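
The large_deposits field is an itemized list filtered by a configurable threshold. A minimal sketch follows, assuming transactions have already been extracted into rows; the Transaction type and the sign convention (positive amounts are deposits) are assumptions, and the threshold is left as a parameter because the table does not fix its value.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass
class Transaction:
    posted: date
    description: str
    amount: Decimal  # assumption: positive = deposit, negative = withdrawal


def find_large_deposits(transactions: list, threshold: Decimal) -> list:
    """Return deposits strictly greater than the threshold, in statement order."""
    return [t for t in transactions if t.amount > threshold]
```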

Policy Declarations Page

Field Name	Data Type	Required	Validation
insured_name	string	Required	Match to client record
policy_number	string	Required	Format varies by carrier
carrier_name	string	Required
lob	enum	Required	Auto/Home/Commercial/GL/etc.
effective_date	date	Required	Valid date
expiration_date	date	Required	After effective date
premium_annual	currency	Required
liability_limit	currency	Required for auto/GL
deductible	currency	Required
property_address	string	Required for property	Match to risk address
vehicle_info	object	Required for auto	Year/Make/Model/VIN
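
Several declarations-page fields are required only for certain lines of business. One way to encode that is a per-LOB list of conditionally required fields, sketched below. The mapping of "Home" to the property rule is an assumption, as is the validate_declarations helper itself.

```python
from datetime import date

# Conditionally required fields by line of business (partly assumed).
CONDITIONAL_REQUIRED = {
    "Auto": ["liability_limit", "vehicle_info"],
    "GL": ["liability_limit"],
    "Home": ["property_address"],
}


def validate_declarations(fields: dict) -> list:
    """Check date ordering and LOB-specific required fields."""
    errors = []
    effective = fields.get("effective_date")
    expiration = fields.get("expiration_date")
    if effective and expiration and expiration <= effective:
        errors.append("expiration_date must be after effective_date")

    lob = fields.get("lob")
    for name in CONDITIONAL_REQUIRED.get(lob, []):
        if fields.get(name) is None:
            errors.append(f"{name} is required for lob={lob}")
    return errors
```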

Step 3: Extraction Confidence Thresholds

Confidence Level	Threshold	Handling
High confidence	≥ 0.90	Auto-accept; proceed without human review
Medium confidence	0.70 – 0.89	Flag for human verification; highlight field in review UI
Low confidence	< 0.70	Route to human review queue; display extracted value as suggestion, not fact
Not found	N/A	Mark field as missing; trigger missing-field condition

Field-specific thresholds:

Critical fields (wrong value has significant downstream impact) should have higher thresholds:

Currency amounts: raise auto-accept to ≥ 0.93
Dates: raise auto-accept to ≥ 0.93 (off-by-one errors on dates can affect compliance)
SSN / EIN: raise auto-accept to ≥ 0.95; prefer human verification for all
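
The threshold table and the field-specific overrides combine into a single status decision per field. The sketch below uses the status strings "auto_accepted" and "flagged_for_review" from the output schema; "human_review_queue" and "missing" are hypothetical labels, as are the value-type keys.

```python
from typing import Optional

DEFAULT_AUTO_ACCEPT = 0.90
REVIEW_FLOOR = 0.70

# Raised auto-accept thresholds for critical field types.
AUTO_ACCEPT_OVERRIDES = {"currency": 0.93, "date": 0.93, "ssn": 0.95, "ein": 0.95}


def field_status(confidence: Optional[float], value_type: str = "string") -> str:
    """Map a field-level confidence score to a handling status."""
    if confidence is None:
        return "missing"  # field not found
    if confidence >= AUTO_ACCEPT_OVERRIDES.get(value_type, DEFAULT_AUTO_ACCEPT):
        return "auto_accepted"
    if confidence >= REVIEW_FLOOR:
        return "flagged_for_review"
    return "human_review_queue"
```

Note that a currency field at 0.91 lands in the review band even though it clears the default threshold; raising the auto-accept bar widens the medium band for that field type.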

Confidence aggregation:

Document-level confidence = weighted average of field-level confidence scores (weight by field importance)
If document-level confidence < 0.75: route entire document to human review, not field-by-field
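
The aggregation rule can be sketched as a weighted mean followed by a threshold check. The weight values are not specified in this design, so the sketch takes them as input and defaults unlisted fields to a weight of 1.0 (an assumption); the function names and return labels are illustrative.

```python
def document_confidence(scores: dict, weights: dict) -> float:
    """Weighted average of field-level confidence scores."""
    total_weight = sum(weights.get(name, 1.0) for name in scores)
    if total_weight == 0:
        raise ValueError("no weighted fields to aggregate")
    weighted_sum = sum(score * weights.get(name, 1.0) for name, score in scores.items())
    return weighted_sum / total_weight


def review_scope(doc_confidence: float, floor: float = 0.75) -> str:
    """Whole-document review below the floor, otherwise field-by-field handling."""
    return "document_review" if doc_confidence < floor else "field_level"
```

For example, scores of 0.9 and 0.6 with weights 3 and 1 give (2.7 + 0.6) / 4 = 0.825, which stays above the 0.75 floor.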

Step 4: Model Selection

Document Type	Recommended Model	Rationale
Machine-printed structured forms (W-2, 1099)	Azure Document Intelligence (Form Recognizer) — prebuilt W-2/1099 model	Pre-built models for standard IRS forms; high accuracy
Semi-structured machine print (pay stubs, bank statements, dec pages)	Azure Document Intelligence — custom trained model OR AWS Textract with custom adapter	Requires training on carrier/issuer-specific layouts
Handwritten fields (ACORD applications, older inspection reports)	Azure Document Intelligence — read model for handwriting	Handles mixed print/handwrite; lower accuracy than machine print
Tables (bank statement transactions, loss run schedules)	AWS Textract — Tables API OR Azure DI table extraction	Preserves row/column structure; critical for transaction lists
Complex multi-page documents (1040 with schedules)	Azure Document Intelligence — custom model with schedule awareness	Multi-page layout with dynamic presence of schedules
Low-quality scans (high noise, skew, faded)	Pre-process then OCR (Tesseract or Azure DI read)	Pre-processing pipeline required before model

Model selection criteria:

Use Azure Document Intelligence prebuilt models first (W-2, 1099, invoice) — highest accuracy for covered types
For document types without a prebuilt model: train a custom Azure DI model using at least 50 labeled examples per document type (100+ preferred)
Use AWS Textract as an alternative if the client is AWS-native and Azure DI is not preferred
Never use general-purpose OCR (Tesseract) alone for structured financial documents — accuracy is insufficient for regulatory contexts
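
The selection criteria above amount to a short decision procedure. In the sketch below, the returned strings are hypothetical labels for model families, not real model identifiers from either vendor, and applying the prebuilt rule before the AWS preference is one possible reading of the criteria.

```python
# Document types with a prebuilt model, per the criteria above.
PREBUILT_TYPES = {"w2", "1099", "invoice"}


def select_model_family(document_type: str, prefer_aws: bool = False) -> str:
    """Pick a model family label for a document type (labels are illustrative)."""
    if document_type in PREBUILT_TYPES:
        return "azure-di-prebuilt"
    if prefer_aws:
        return "aws-textract-custom"
    # No prebuilt model: train a custom model (50+ labeled examples per type).
    return "azure-di-custom"
```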

Step 5: Pre-Processing Requirements

Before sending to extraction model:

Pre-Processing Step	When Required	Tool / Action
DPI check	If source is a scanned document	Reject if < 150 DPI; warn if < 200 DPI; optimal 300+ DPI
De-skew	If page is rotated or tilted	OpenCV deskew, or rely on Azure DI's internal handling
Contrast enhancement	If page is faded or low contrast	Adaptive histogram equalization
De-noise	If scanned with heavy grain	Gaussian blur or median filter
Color normalization	If source is a color scan	Convert to grayscale or enhance; improves OCR accuracy
Page splitting	If source is a multi-document packet	Detect and split at page boundaries between documents
Page rotation	If individual pages are upside-down	Auto-detect and rotate using text direction
Watermark removal	If watermarks obscure content	Detect and suppress watermark layer

Quality gate: Any document that fails a minimum quality threshold (DPI < 150, or confidence after pre-processing below the configured floor) is routed to the manual entry queue with a specific quality-failure message.
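
The DPI portion of the quality gate can be sketched as follows. The decision labels and message wording are hypothetical, and the DPI value is assumed to have been read from the scan's metadata upstream.

```python
from typing import Optional, Tuple

MIN_DPI = 150   # below this: reject to manual entry
WARN_DPI = 200  # below this: proceed, but warn


def dpi_gate(dpi: int) -> Tuple[str, Optional[str]]:
    """Return (decision, message) for a scanned document's resolution."""
    if dpi < MIN_DPI:
        return "manual_entry", f"Scan resolution {dpi} DPI is below the {MIN_DPI} DPI minimum"
    if dpi < WARN_DPI:
        return "proceed_with_warning", f"Scan resolution {dpi} DPI is below the recommended {WARN_DPI} DPI"
    return "proceed", None
```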

Step 6: Structured Output Schema

All extraction results are output in a consistent JSON format for consumption by downstream systems.

{
  "extraction_id": "uuid",
  "document_id": "uuid",
  "document_type": "pay_stub",
  "extraction_timestamp": "2024-01-15T14:32:00Z",
  "model_name": "azure-di-custom-pay-stub-v2",
  "model_version": "2.1.0",
  "document_confidence": 0.91,
  "page_count": 2,
  "fields": {
    "employer_name": {
      "value": "Acme Corporation",
      "confidence": 0.97,
      "bounding_box": {"page": 1, "x": 120, "y": 45, "width": 200, "height": 20},
      "status": "auto_accepted"
    },
    "gross_pay_this_period": {
      "value": 4250.00,
      "value_type": "currency",
      "confidence": 0.89,
      "bounding_box": {"page": 1, "x": 450, "y": 310, "width": 80, "height": 18},
      "status": "flagged_for_review"
    }
  },
  "missing_fields": ["employee_ssn_last4"],
  "validation_results": {
    "pay_period_end_vs_today": "within_60_days",
    "ytd_gross_ge_period_gross": "pass",
    "cross_field_consistency": "pass"
  },
  "routing": "human_review",
  "review_reasons": ["gross_pay_confidence_below_threshold"]
}
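
A downstream consumer might derive the routing decision from the field statuses rather than trusting it blindly. The sketch below assumes a result shaped like the schema above; "auto_process" and "auto_accepted" come from the schema, while "human_review" and the derive_routing helper are hypothetical.

```python
def derive_routing(result: dict) -> str:
    """Route to auto-processing only if every field was auto-accepted."""
    statuses = [field.get("status") for field in result.get("fields", {}).values()]
    needs_review = any(status != "auto_accepted" for status in statuses)
    if needs_review or result.get("review_reasons"):
        return "human_review"
    return "auto_process"


# Small illustrative payload, not real extraction output.
sample_result = {
    "fields": {
        "employer_name": {"value": "Acme Corporation", "confidence": 0.97,
                          "status": "auto_accepted"},
        "gross_pay_this_period": {"value": 4250.00, "confidence": 0.89,
                                  "status": "flagged_for_review"},
    },
    "review_reasons": ["gross_pay_confidence_below_threshold"],
}
```

Checking both the per-field statuses and review_reasons guards against a payload where the two disagree.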

Output Format

Deliver two artifacts:

Extraction Configuration Specification — For each document type in scope: field definition table (name, type, required, validation, extraction note), confidence thresholds, model selection, and pre-processing requirements
Output Schema Documentation — JSON schema for extraction results with field definitions, status enum values, routing logic, and validation rule definitions