PDF Field Extraction Configuration Design
Design the extraction configuration for AI-based data extraction from insurance and mortgage PDFs. Covers document type identification, field definition, confidence thresholds, model selection, pre-processing requirements, and structured output schema.
Step 1: Document Type Identification
Before extracting fields, the system must identify what type of document is being processed. Route to the correct extraction model based on document type.
Document classification first: Run a classifier (see form-classifier skill) before the extraction model. The classifier returns a document type label that selects the correct field extraction configuration.
Document types and extraction priority:
| Document Type | Extraction Priority | Typical Use |
|---|---|---|
| ACORD 90 (personal auto application) | High | Insurance new business |
| ACORD 125 (commercial applicant) | High | Commercial lines |
| Policy declarations page | High | Policy verification |
| Loss run | High | Underwriting |
| Pay stub | High | Mortgage income verification |
| W-2 | High | Mortgage income verification |
| 1040 tax return | High | Mortgage income + self-employed |
| 1003 Uniform Residential Loan Application | High | Mortgage processing |
| Bank statement | High | Mortgage asset verification |
| Explanation of Benefits (EOB) | Medium | Claims / medical |
| Certificate of insurance | Medium | Proof of coverage |
| MVR (Motor Vehicle Record) | Medium | Insurance underwriting |
| Inspection report | Medium | Property insurance |
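The routing step can be sketched as a lookup from the classifier's label to an extraction configuration. This is a minimal illustration only: the `ExtractionConfig` class, the `CONFIGS` registry, the label strings, and the model names are all hypothetical placeholders, not part of any real deployment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExtractionConfig:
    document_type: str
    priority: str    # "high" or "medium", per the table above
    model_name: str  # hypothetical identifier, not a real deployment


# Hypothetical registry keyed by classifier label; entries are placeholders.
CONFIGS = {
    "pay_stub": ExtractionConfig("pay_stub", "high", "custom-pay-stub"),
    "w2": ExtractionConfig("w2", "high", "prebuilt-w2"),
    "eob": ExtractionConfig("eob", "medium", "custom-eob"),
}


def select_config(label: str) -> Optional[ExtractionConfig]:
    """Return the config for a classifier label, or None if the type is unknown."""
    return CONFIGS.get(label.strip().lower())
```

Returning `None` for an unrecognized label lets the caller decide how to handle unknown document types instead of silently extracting with the wrong field set.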
Step 2: Field Definition by Document Type
Pay Stub
| Field Name | Data Type | Required | Validation Rule | Extraction Note |
|---|---|---|---|---|
| employer_name | string | Required | Non-empty | Top of stub; sometimes in logo area |
| employee_name | string | Required | Non-empty | Matches borrower name on 1003 |
| employee_ssn_last4 | string | Optional | 4 digits | Often masked; extract if visible |
| pay_period_start | date | Required | Valid date | Source format varies (MM/DD/YYYY, etc.) |
| pay_period_end | date | Required | Valid date; on or after pay_period_start | Must be more recent than prior stub |
| pay_date | date | Required | Valid date | Used to determine how current the stub is |
| pay_frequency | enum | Required | Weekly/Bi-Weekly/Semi-Monthly/Monthly | Calculate annual equivalent |
| gross_pay_this_period | currency | Required | Positive number | Current period gross before deductions |
| ytd_gross_pay | currency | Required | Positive number | Year-to-date gross |
| federal_tax_withheld | currency | Optional | Non-negative number | Cross-check against W-2 |
| net_pay | currency | Optional | Positive number | Informational only |
| hourly_rate | currency | Optional | Positive number | If hourly employee; used to calculate annual income |
| hours_worked | number | Optional | Positive number | If hourly employee |
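Annual income calculation:
- Weekly: ytd_gross / pay_period_end_week_of_year × 52
- Bi-weekly: ytd_gross / (pay_period_end_week_of_year / 2) × 26
- Semi-monthly: ytd_gross / pay_period_number × 24, where pay_period_number is the number of semi-monthly periods elapsed year to date
- Monthly: gross_pay_this_period × 12
- Hourly: hourly_rate × hours_per_week × 52 (use a 2-year average if hours are variable)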
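A minimal Python sketch of the bi-weekly, semi-monthly, monthly, and hourly formulas above. The function names, the frequency strings, and the `week_of_year` helper are assumptions made for illustration; in particular, the week number here is a simple day-of-year week, which may differ from whatever week convention the production system uses.

```python
from datetime import date
from typing import Optional


def week_of_year(d: date) -> int:
    # Assumption: simple day-of-year weeks (Jan 1-7 is week 1), not ISO weeks.
    return (d.timetuple().tm_yday + 6) // 7


def annual_income(frequency: str, ytd_gross: float, gross_this_period: float,
                  pay_period_end: date,
                  pay_period_number: Optional[int] = None) -> float:
    """Annualize pay-stub income using the formulas listed above."""
    if frequency == "bi_weekly":
        periods_elapsed = week_of_year(pay_period_end) / 2
        return ytd_gross / periods_elapsed * 26
    if frequency == "semi_monthly":
        if not pay_period_number:
            raise ValueError("pay_period_number is required for semi-monthly pay")
        return ytd_gross / pay_period_number * 24
    if frequency == "monthly":
        return gross_this_period * 12
    raise ValueError(f"unsupported pay frequency: {frequency}")


def annual_income_hourly(hourly_rate: float, hours_per_week: float) -> float:
    """Hourly formula; hours_per_week must be supplied by the caller."""
    return hourly_rate * hours_per_week * 52
```

Note that `hours_per_week` is not an extracted field: `hours_worked` on the stub covers a pay period, so the caller would need to convert it before using the hourly formula.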
W-2
| Field Name | Box | Data Type | Required | Validation |
|---|---|---|---|---|
| employee_name | Employee name box | string | Required | Match to application |
| employer_name | Employer name box | string | Required | Match to pay stubs |
| employer_ein | b | string | Required | XX-XXXXXXX format |
| wages_tips_other | 1 | currency | Required | Primary income figure |
| federal_income_tax | 2 | currency | Optional | Cross-check with 1040 |
| social_security_wages | 3 | currency | Optional | May differ from Box 1 |
| medicare_wages | 5 | currency | Optional | May differ from Box 1 |
| state | 15 | string | Optional | State of employment |
| state_wages | 16 | currency | Optional | State income |
| tax_year | Top of form | year | Required | Must be one of the two most recent prior tax years |
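The W-2 validation rules in the table could be checked along these lines. This is a sketch under stated assumptions: `validate_w2` and its error strings are hypothetical, and the tax-year rule is interpreted here as "one of the two calendar years before the current year".

```python
import re
from datetime import date
from typing import Any, Dict, List

EIN_PATTERN = re.compile(r"^\d{2}-\d{7}$")  # XX-XXXXXXX
REQUIRED = ("employee_name", "employer_name", "employer_ein",
            "wages_tips_other", "tax_year")


def validate_w2(fields: Dict[str, Any], today: date) -> List[str]:
    """Return a list of validation errors; an empty list means the W-2 passes."""
    errors = [f"missing required field: {name}" for name in REQUIRED
              if fields.get(name) in (None, "")]
    ein = fields.get("employer_ein")
    if ein and not EIN_PATTERN.match(str(ein)):
        errors.append("employer_ein: expected XX-XXXXXXX")
    year = fields.get("tax_year")
    # Assumption: the rule means the two calendar years before today's year.
    if year is not None and not (today.year - 2 <= int(year) <= today.year - 1):
        errors.append("tax_year: outside the prior two years")
    return errors
```

Passing `today` explicitly keeps the check deterministic, which makes it easier to test and to re-run against archived documents.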
1040 Tax Return (Federal)
| Field Name | Line | Data Type | Required | Note |
|---|---|---|---|---|
| tax_year | Top of form | year | Required | |
| filing_status | Filing status checkbox | enum | Required | Single/MFJ/MFS/HOH/QSS (QW on pre-2022 forms) |
| total_income | 9 | currency | Required | Total income before adjustments to income |
| agi | 11 | currency | Required | Adjusted Gross Income |
| wages_salaries | 1a | currency | Required for W-2 employees | |
| business_income_loss | Schedule 1, line 3 | currency | Required for self-employed | Carried from Schedule C |
| schedule_c_gross_revenue | Schedule C line 1 | currency | Self-employed | |
| schedule_c_net_profit | Schedule C line 31 | currency | Self-employed | After expenses |
| rental_income | Schedule E | currency | If applicable | |
| k1_income | Schedule E Part II | currency | Partnership/S-Corp | |
| depreciation_added_back | Schedule C + E | currency | Self-employed | Non-cash expense added back |
| depletion_added_back | Schedule C + E | currency | Self-employed | Non-cash expense added back |
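The add-back fields above suggest a simple cash-flow figure for self-employed borrowers: Schedule C net profit plus the non-cash expenses. The sketch below is illustrative only; `self_employed_cash_flow` is a hypothetical helper, and real underwriting guidelines typically apply further adjustments that are outside the scope of this table.

```python
from typing import Dict

# Field names taken from the 1040 table above.
ADD_BACK_FIELDS = ("depreciation_added_back", "depletion_added_back")


def self_employed_cash_flow(fields: Dict[str, float]) -> float:
    """Schedule C net profit plus the non-cash add-backs listed above."""
    net_profit = fields["schedule_c_net_profit"]
    # Missing add-back fields are treated as zero (an assumption).
    add_backs = sum(fields.get(name, 0.0) for name in ADD_BACK_FIELDS)
    return net_profit + add_backs
```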
Bank Statement
| Field Name | Data Type | Required | Validation |
|---|---|---|---|
| account_holder_name | string | Required | Match to borrower |
| institution_name | string | Required | Non-empty |
| account_number_last4 | string | Optional | 4 digits (masked) |
| account_type | enum | Required | Checking/Savings/Money Market |
| statement_period_start | date | Required | Valid date |
| statement_period_end | date | Required | Valid date; should fall within the last 60 days |
| beginning_balance | currency | Required | |
| ending_balance | currency | Required | Used for asset verification |
| total_deposits | currency | Required | Identifies large/unusual deposits |
| large_deposits | list | Required | Deposits > $[threshold]; itemized |
| nsf_count | integer | Optional | Count of NSF/returned items |
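Populating `large_deposits` amounts to filtering the statement's transactions against the configured threshold. A minimal sketch, assuming a hypothetical transaction shape with `type` and `amount` keys; the threshold itself is left as a parameter because the source does not fix a value.

```python
from typing import Any, Dict, List


def find_large_deposits(transactions: List[Dict[str, Any]],
                        threshold: float) -> List[Dict[str, Any]]:
    """Return deposit transactions whose amount exceeds the threshold."""
    return [
        t for t in transactions
        # Hypothetical shape: {"type": "deposit" | "withdrawal", "amount": float, ...}
        if t["type"] == "deposit" and t["amount"] > threshold
    ]
```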
Policy Declarations Page
| Field Name | Data Type | Required | Validation |
|---|---|---|---|
| insured_name | string | Required | Match to client record |
| policy_number | string | Required | Format varies by carrier |
| carrier_name | string | Required | |
| lob | enum | Required | Auto/Home/Commercial/GL/etc. |
| effective_date | date | Required | Valid date |
| expiration_date | date | Required | After effective date |
| premium_annual | currency | Required | |
| liability_limit | currency | Required for auto/GL | |
| deductible | currency | Required | |
| property_address | string | Required for property | Match to risk address |
| vehicle_info | object | Required for auto | Year/Make/Model/VIN |
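The conditional requirements in this table (fields required only for certain lines of business) can be expressed as a small rule set. The sketch below is hypothetical throughout: `validate_dec_page` is not an existing function, and treating only `Home` as a property line is an assumption, since the source does not say which `lob` values count as property.

```python
from datetime import date
from typing import Any, Dict, List

ALWAYS_REQUIRED = ["insured_name", "policy_number", "carrier_name", "lob",
                   "effective_date", "expiration_date", "premium_annual",
                   "deductible"]
# Assumption: "Home" is the only property line; adjust to the real lob enum.
CONDITIONAL = {
    "liability_limit": {"Auto", "GL"},
    "property_address": {"Home"},
    "vehicle_info": {"Auto"},
}


def validate_dec_page(fields: Dict[str, Any]) -> List[str]:
    """Return validation errors for a declarations page; empty list means pass."""
    lob = fields.get("lob")
    required = ALWAYS_REQUIRED + [n for n, lobs in CONDITIONAL.items() if lob in lobs]
    errors = [f"missing required field: {n}" for n in required
              if fields.get(n) in (None, "")]
    eff, exp = fields.get("effective_date"), fields.get("expiration_date")
    if eff and exp and exp <= eff:
        errors.append("expiration_date must be after effective_date")
    return errors
```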
Step 3: Extraction Confidence Thresholds
| Confidence Level | Threshold | Handling |
|---|---|---|
| High confidence | ≥ 0.90 | Auto-accept; proceed without human review |
| Medium confidence | 0.70 to < 0.90 | Flag for human verification; highlight field in review UI |
| Low confidence | < 0.70 | Route to human review queue; display extracted value as suggestion, not fact |
| Not found | N/A | Mark field as missing; trigger missing-field condition |
Field-specific thresholds:
Critical fields (wrong value has significant downstream impact) should have higher thresholds:
- Currency amounts: raise auto-accept to ≥ 0.93
- Dates: raise auto-accept to ≥ 0.93 (off-by-one errors on dates can affect compliance)
- SSN / EIN: raise auto-accept to ≥ 0.95; prefer human verification for every extracted SSN or EIN regardless of score
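The default and field-specific thresholds can be combined into one status function. This is a sketch, not the system's actual logic: the `auto_accepted` and `flagged_for_review` labels are borrowed from the output example in Step 6, while `routed_to_review_queue`, `missing`, and the `value_type` keys are hypothetical names.

```python
from typing import Optional

DEFAULT_AUTO_ACCEPT = 0.90
REVIEW_FLOOR = 0.70
# Stricter auto-accept thresholds for critical value types, per the list above.
AUTO_ACCEPT_BY_TYPE = {"currency": 0.93, "date": 0.93, "ssn": 0.95, "ein": 0.95}


def field_status(confidence: Optional[float], value_type: str) -> str:
    """Map a field's confidence score to a handling status."""
    if confidence is None:
        return "missing"  # field not found on the document
    if confidence >= AUTO_ACCEPT_BY_TYPE.get(value_type, DEFAULT_AUTO_ACCEPT):
        return "auto_accepted"
    if confidence >= REVIEW_FLOOR:
        return "flagged_for_review"
    return "routed_to_review_queue"
```

With these rules a currency field at 0.92 is flagged for review even though a plain string at the same score would be auto-accepted.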
Confidence aggregation:
- Document-level confidence = weighted average of field-level confidence scores (weight by field importance)
- If document-level confidence < 0.75: route entire document to human review, not field-by-field
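A minimal sketch of the aggregation rule, assuming field weights are supplied as a plain dictionary and default to 1.0 when absent. The function names and the two routing labels are hypothetical.

```python
from typing import Dict

DOCUMENT_REVIEW_THRESHOLD = 0.75


def document_confidence(scores: Dict[str, float],
                        weights: Dict[str, float]) -> float:
    """Weighted average of field confidences; unlisted fields get weight 1.0."""
    total_weight = sum(weights.get(name, 1.0) for name in scores)
    if total_weight == 0:
        raise ValueError("no weighted fields to aggregate")
    weighted_sum = sum(score * weights.get(name, 1.0)
                       for name, score in scores.items())
    return weighted_sum / total_weight


def route_document(doc_confidence: float) -> str:
    """Send the whole document to review when it falls below the threshold."""
    if doc_confidence < DOCUMENT_REVIEW_THRESHOLD:
        return "document_human_review"
    return "field_level_handling"
```

For example, scores of 0.9 (weight 3) and 0.6 (weight 1) average to 0.825, which stays above the 0.75 threshold, so only the weak field would be reviewed.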
Step 4: Model Selection
| Document Type | Recommended Model | Rationale |
|---|---|---|
| Machine-printed structured forms (W-2, 1099) | Azure Document Intelligence (formerly Form Recognizer): prebuilt W-2/1099 models | Prebuilt models for standard IRS forms; high accuracy |
| Semi-structured machine print (pay stubs, bank statements, dec pages) | Azure Document Intelligence — custom trained model OR AWS Textract with custom adapter | Requires training on carrier/issuer-specific layouts |
| Handwritten fields (ACORD applications, older inspection reports) | Azure Document Intelligence: read model for handwriting | Handles mixed print and handwriting; lower accuracy than machine print |
| Tables (bank statement transactions, loss run schedules) | AWS Textract — Tables API OR Azure DI table extraction | Preserves row/column structure; critical for transaction lists |
| Complex multi-page documents (1040 with schedules) | Azure Document Intelligence — custom model with schedule awareness | Multi-page layout with dynamic presence of schedules |
| Low-quality scans (high noise, skew, faded) | Pre-process then OCR (Tesseract or Azure DI read) | Pre-processing pipeline required before model |
Model selection criteria:
- Use Azure Document Intelligence prebuilt models first (W-2, 1099, invoice) — highest accuracy for covered types
- For document types without a prebuilt model: train a custom Azure DI model using at least 50 labeled examples per document type (100+ preferred)
- Use AWS Textract as the alternative when the client is AWS-native and Azure DI is not preferred
- Never use general-purpose OCR (Tesseract) alone for structured financial documents — accuracy is insufficient for regulatory contexts
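The selection criteria above can be read as a short decision procedure. The sketch below is one possible reading, not a definitive rule: `choose_model`, its return labels, and the `prefer_aws` flag are hypothetical, and the labels do not correspond to real model identifiers.

```python
# Document types the criteria above list as covered by prebuilt models.
PREBUILT_TYPES = {"w2", "1099", "invoice"}
MIN_LABELED_EXAMPLES = 50  # minimum per document type; 100+ preferred


def choose_model(document_type: str, labeled_examples: int,
                 prefer_aws: bool = False) -> str:
    """Pick an extraction approach following the criteria listed above."""
    if prefer_aws:
        # Client is AWS-native and Azure DI is not preferred.
        return "aws_textract"
    if document_type in PREBUILT_TYPES:
        return "azure_di_prebuilt"
    if labeled_examples >= MIN_LABELED_EXAMPLES:
        return "azure_di_custom"
    # Not enough training data yet for a custom model.
    return "collect_more_labels"
```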
Step 5: Pre-Processing Requirements
Before sending to extraction model:
| Pre-Processing Step | When Required | Tool / Rule |
|---|---|---|
| DPI check | If source is a scanned document | Reject if < 150 DPI; warn if 150–199 DPI; optimal 300+ DPI |
| De-skew | If page is rotated or tilted | OpenCV deskew, or rely on Azure DI's internal handling |
| Contrast enhancement | If page is faded or low contrast | Adaptive histogram equalization |
| De-noise | If scanned with heavy grain | Gaussian blur or median filter |
| Color normalization | Color scans | Convert to grayscale or enhance; improves OCR accuracy |
| Page splitting | Multi-document packets | Detect and split at page boundaries between documents |
| Page rotation | If individual pages are upside-down | Auto-detect and rotate using text direction |
| Watermark removal | If watermarks obscure content | Detect and suppress watermark layer |
Quality gate: Any document that fails a minimum quality threshold (DPI < 150, or confidence after pre-processing below the configured floor) is routed to the manual entry queue with a specific quality-failure message.
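The DPI portion of the quality gate is straightforward to sketch. The function name, the outcome labels, and the message wording below are hypothetical; only the 150 and 200 DPI cut-offs come from the table above.

```python
from typing import Tuple

MIN_DPI = 150   # below this, reject and route to manual entry
WARN_DPI = 200  # below this, accept but warn


def dpi_quality_gate(dpi: int) -> Tuple[str, str]:
    """Return (outcome, message) for a scanned page's resolution."""
    if dpi < MIN_DPI:
        return ("reject", f"Scan resolution {dpi} DPI is below the {MIN_DPI} DPI minimum")
    if dpi < WARN_DPI:
        return ("warn", f"Scan resolution {dpi} DPI is below the recommended {WARN_DPI} DPI")
    return ("pass", "")
```

Returning the message alongside the outcome gives the manual entry queue the specific quality failure to display.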
Step 6: Structured Output Schema
All extraction results are emitted in a consistent JSON format for consumption by downstream systems.
{
  "extraction_id": "uuid",
  "document_id": "uuid",
  "document_type": "pay_stub",
  "extraction_timestamp": "2024-01-15T14:32:00Z",
  "model_name": "azure-di-custom-pay-stub-v2",
  "model_version": "2.1.0",
  "document_confidence": 0.91,
  "page_count": 2,
  "fields": {
    "employer_name": {
      "value": "Acme Corporation",
      "value_type": "string",
      "confidence": 0.97,
      "bounding_box": {"page": 1, "x": 120, "y": 45, "width": 200, "height": 20},
      "status": "auto_accepted"
    },
    "gross_pay_this_period": {
      "value": 4250.00,
      "value_type": "currency",
      "confidence": 0.89,
      "bounding_box": {"page": 1, "x": 450, "y": 310, "width": 80, "height": 18},
      "status": "flagged_for_review"
    }
  },
  "missing_fields": ["employee_ssn_last4"],
  "validation_results": {
    "pay_period_end_vs_today": "within_60_days",
    "ytd_gross_ge_period_gross": "pass",
    "cross_field_consistency": "pass"
  },
  "routing": "human_review",
  "review_reasons": ["gross_pay_confidence_below_threshold"]
}
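A downstream consumer might read this structure as follows. The sketch assumes only the `fields` and `status` keys shown in the example; `fields_needing_review` is a hypothetical helper, and the sample payload is a trimmed copy of the example rather than real data.

```python
import json
from typing import Any, Dict, List


def fields_needing_review(result: Dict[str, Any]) -> List[str]:
    """List field names whose status is anything other than auto_accepted."""
    return [name for name, field in result["fields"].items()
            if field.get("status") != "auto_accepted"]


# Trimmed copy of the example above, used only for illustration.
sample = json.loads("""
{
  "document_type": "pay_stub",
  "fields": {
    "employer_name": {"value": "Acme Corporation", "confidence": 0.97,
                      "status": "auto_accepted"},
    "gross_pay_this_period": {"value": 4250.00, "confidence": 0.89,
                              "status": "flagged_for_review"}
  }
}
""")

print(fields_needing_review(sample))
```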
Output Format
Deliver two artifacts:
- Extraction Configuration Specification: for each document type in scope, the field definition table (name, type, required, validation, extraction note), confidence thresholds, model selection, and pre-processing requirements
- Output Schema Documentation: JSON schema for extraction results with field definitions, status enum values, routing logic, and validation rule definitions