
pdf-extractor

Design AI extraction configurations for structured and semi-structured insurance and mortgage PDFs. Use when configuring a document AI model to extract data fields from applications, policy documents, EOBs, pay stubs, tax returns, or bank statements.

Stars: 12
Source: markus41/claude
Updated: 2026-05-11
Slug: markus41--claude--pdf-extractor

// install — copy + paste into any project

mkdir -p .claude/skills && curl -fsSL https://raw.githubusercontent.com/markus41/claude/HEAD/plugins/lobbi-document-intelligence/skills/pdf-extractor/SKILL.md -o .claude/skills/pdf-extractor.md

Drops the SKILL.md into .claude/skills/pdf-extractor.md. Works with Claude Code, Cursor, and any agent that loads SKILL.md files from .claude/skills/.

PDF Field Extraction Configuration Design

Design the extraction configuration for AI-based data extraction from insurance and mortgage PDFs. Covers document type identification, field definition, confidence thresholds, model selection, pre-processing requirements, and structured output schema.


Step 1: Document Type Identification

Before extracting fields, the system must identify what type of document is being processed. Route to the correct extraction model based on document type.

Document classification first: Run a classifier (see form-classifier skill) before the extraction model. The classifier returns a document type label that selects the correct field extraction configuration.

Document types and extraction priority:

| Document Type | Extraction Priority | Typical Use |
|---|---|---|
| ACORD 80 (personal auto application) | High | Insurance new business |
| ACORD 125 (commercial applicant) | High | Commercial lines |
| Policy declarations page | High | Policy verification |
| Loss run | High | Underwriting |
| Pay stub | High | Mortgage income verification |
| W-2 | High | Mortgage income verification |
| 1040 tax return | High | Mortgage income + self-employed |
| 1003 Uniform Residential Loan Application | High | Mortgage processing |
| Bank statement | High | Mortgage asset verification |
| Explanation of Benefits (EOB) | Medium | Claims / medical |
| Certificate of insurance | Medium | Proof of coverage |
| MVR (Motor Vehicle Record) | Medium | Insurance underwriting |
| Inspection report | Medium | Property insurance |
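
The routing described in this step can be sketched as a lookup from classifier label to extraction configuration. This is a minimal illustration, not the skill's implementation: the label strings, the `ExtractionConfig` class, and the `config_name` values are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExtractionConfig:
    document_type: str
    priority: str      # "high" or "medium", as in the table above
    config_name: str   # hypothetical identifier for the field configuration


# Hypothetical registry keyed by classifier label; labels are illustrative.
CONFIGS = {
    "pay_stub": ExtractionConfig("pay_stub", "high", "pay-stub-fields"),
    "w2": ExtractionConfig("w2", "high", "w2-fields"),
    "bank_statement": ExtractionConfig("bank_statement", "high", "bank-statement-fields"),
    "eob": ExtractionConfig("eob", "medium", "eob-fields"),
}


def select_config(label: str) -> Optional[ExtractionConfig]:
    """Return the extraction config for a classifier label, or None if unrecognized."""
    return CONFIGS.get(label)


def processing_order(labels):
    """Order known labels so high-priority document types are extracted first."""
    rank = {"high": 0, "medium": 1}
    known = [label for label in labels if label in CONFIGS]
    return sorted(known, key=lambda label: rank[CONFIGS[label].priority])
```

Returning `None` for an unknown label leaves the caller to decide what happens to unclassified documents, for example sending them to manual review.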

Step 2: Field Definition by Document Type

Pay Stub

| Field Name | Data Type | Required | Validation Rule | Extraction Note |
|---|---|---|---|---|
| employer_name | string | Required | Non-empty | Top of stub; sometimes in logo area |
| employee_name | string | Required | Non-empty | Should match borrower name on 1003 |
| employee_ssn_last4 | string | Optional | 4 digits | Often masked; extract if visible |
| pay_period_start | date | Required | Valid date | Format varies (MM/DD/YYYY, etc.) |
| pay_period_end | date | Required | Valid date | Must be more recent than prior stub |
| pay_date | date | Required | Valid date | Used to determine how current the stub is |
| pay_frequency | enum | Required | Weekly/Bi-Weekly/Semi-Monthly/Monthly | Used to calculate annual equivalent |
| gross_pay_this_period | currency | Required | Positive number | Current period gross before deductions |
| ytd_gross_pay | currency | Required | Positive number | Year-to-date gross |
| federal_tax_withheld | currency | Optional | Positive number | Cross-check against W-2 |
| net_pay | currency | Optional | Positive number | Informational only |
| hourly_rate | currency | Optional | Positive number | If hourly employee; used to calculate annual income |
| hours_worked | number | Optional | Positive number | If hourly employee |

Annual income calculation:

Bi-weekly: ytd_gross / (pay_period_end_week_of_year / 2) × 26
Semi-monthly: ytd_gross / (pay_period_number) × 24
Monthly: gross_pay_this_period × 12
Hourly: hourly_rate × hours_per_week × 52 (use 2-year average if variable)
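
The formulas above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, "week of year" is taken as the day of year divided by 7 and rounded up, and floats are used for brevity (a production pipeline would more likely use `Decimal` for currency). No formula is given above for weekly pay, so the sketch raises an error for it instead of guessing.

```python
from datetime import date


def periods_elapsed_biweekly(pay_period_end: date) -> float:
    # Assumption: week of year = ceil(day_of_year / 7); two weeks per pay period.
    week_of_year = (pay_period_end.timetuple().tm_yday + 6) // 7
    return week_of_year / 2


def annualize(pay_frequency, *, gross_pay_this_period=None, ytd_gross_pay=None,
              pay_period_end=None, pay_period_number=None):
    """Annualize salaried income using the formulas listed above."""
    if pay_frequency == "Bi-Weekly":
        return ytd_gross_pay / periods_elapsed_biweekly(pay_period_end) * 26
    if pay_frequency == "Semi-Monthly":
        return ytd_gross_pay / pay_period_number * 24
    if pay_frequency == "Monthly":
        return gross_pay_this_period * 12
    # No formula is listed above for other frequencies (e.g. Weekly).
    raise ValueError(f"no annualization formula for {pay_frequency!r}")


def annualize_hourly(hourly_rate, hours_per_week):
    """Hourly: rate x hours per week x 52."""
    return hourly_rate * hours_per_week * 52
```

For example, a bi-weekly stub ending 2024-06-28 with 26,000 year-to-date falls in week 26, so 13 periods have elapsed and the annual figure is 26,000 / 13 × 26 = 52,000.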

W-2

| Field Name | Box | Data Type | Required | Validation |
|---|---|---|---|---|
| employee_name | Employee name box | string | Required | Match to application |
| employer_name | Employer name box | string | Required | Match to pay stubs |
| employer_ein | b | string | Required | XX-XXXXXXX format |
| wages_tips_other | 1 | currency | Required | Primary income figure |
| federal_income_tax | 2 | currency | Optional | Cross-check with 1040 |
| social_security_wages | 3 | currency | Optional | May differ from Box 1 |
| medicare_wages | 5 | currency | Optional | May differ from Box 1 |
| state | 15 | string | Optional | State of employment |
| state_wages | 16 | currency | Optional | State income |
| tax_year | Top of form | year | Required | Should be one of the two most recent tax years |
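
Two of the W-2 validations above are simple enough to sketch directly. The function names are hypothetical, and the tax-year rule assumes that "prior 2 years" means the two calendar years before the current one.

```python
import re

# EIN format from the table above: two digits, a hyphen, seven digits.
EIN_PATTERN = re.compile(r"\d{2}-\d{7}")


def is_valid_ein(ein: str) -> bool:
    """Check the XX-XXXXXXX format only; this does not verify the EIN exists."""
    return EIN_PATTERN.fullmatch(ein.strip()) is not None


def is_acceptable_tax_year(tax_year: int, current_year: int, lookback: int = 2) -> bool:
    """Assumption: acceptable years are the `lookback` years before current_year."""
    return current_year - lookback <= tax_year < current_year
```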

1040 Tax Return (Federal)

| Field Name | Line | Data Type | Required | Note |
|---|---|---|---|---|
| tax_year | Top of form | year | Required | |
| filing_status | Filing status checkbox | enum | Required | Single/MFJ/MFS/HOH/QSS (QW on returns before tax year 2022) |
| total_income | 9 | currency | Required | Total income before adjustments |
| agi | 11 | currency | Required | Adjusted Gross Income |
| wages_salaries | 1a | currency | Required for W-2 employees | |
| business_income_loss | Schedule C | currency | Required for self-employed | From Schedule C |
| schedule_c_gross_revenue | Schedule C line 1 | currency | Self-employed | |
| schedule_c_net_profit | Schedule C line 31 | currency | Self-employed | After expenses |
| rental_income | Schedule E | currency | If applicable | |
| k1_income | Schedule E Part II | currency | Partnership/S-Corp | |
| depreciation_added_back | Schedule C + E | currency | Self-employed | Non-cash expense added back |
| depletion_added_back | Schedule C + E | currency | Self-employed | Non-cash expense added back |
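
The add-back fields above exist because depreciation and depletion reduce Schedule C net profit without reducing cash. A minimal sketch of that arithmetic follows; the function name is hypothetical, and real underwriting worksheets may apply further adjustments that are not covered here.

```python
from decimal import Decimal


def self_employed_cash_flow(
    schedule_c_net_profit: Decimal,
    depreciation_added_back: Decimal = Decimal("0"),
    depletion_added_back: Decimal = Decimal("0"),
) -> Decimal:
    """Net profit plus the non-cash expenses listed in the table above."""
    return schedule_c_net_profit + depreciation_added_back + depletion_added_back
```

For instance, a net profit of 50,000.00 with 7,500.00 depreciation and 1,000.00 depletion gives 58,500.00.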

Bank Statement

| Field Name | Data Type | Required | Validation |
|---|---|---|---|
| account_holder_name | string | Required | Match to borrower |
| institution_name | string | Required | Non-empty |
| account_number_last4 | string | Optional | 4 digits (masked) |
| account_type | enum | Required | Checking/Savings/Money Market |
| statement_period_start | date | Required | Valid date |
| statement_period_end | date | Required | Valid date; should fall within the last 60 days |
| beginning_balance | currency | Required | |
| ending_balance | currency | Required | Used for asset verification |
| total_deposits | currency | Required | Identifies large/unusual deposits |
| large_deposits | list | Required | Deposits > $[threshold]; itemized |
| nsf_count | integer | Optional | Count of NSF/returned items |
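
The `large_deposits` list can be derived from the itemized transactions once a threshold is configured. In this sketch the `Deposit` class and function name are hypothetical, and the threshold is a required argument because the table leaves its value open.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class Deposit:
    posted: date
    amount: Decimal
    description: str


def find_large_deposits(deposits, threshold: Decimal):
    """Return deposits strictly greater than the caller-supplied threshold."""
    return [d for d in deposits if d.amount > threshold]
```

The comparison is strict (`>`), matching the "Deposits > $[threshold]" rule, so a deposit exactly equal to the threshold is not itemized.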

Policy Declarations Page

| Field Name | Data Type | Required | Validation |
|---|---|---|---|
| insured_name | string | Required | Match to client record |
| policy_number | string | Required | Format varies by carrier |
| carrier_name | string | Required | |
| lob | enum | Required | Auto/Home/Commercial/GL/etc. |
| effective_date | date | Required | Valid date |
| expiration_date | date | Required | After effective date |
| premium_annual | currency | Required | |
| liability_limit | currency | Required for auto/GL | |
| deductible | currency | Required | |
| property_address | string | Required for property | Match to risk address |
| vehicle_info | object | Required for auto | Year/Make/Model/VIN |
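
The conditional requirements in this table (fields required only for certain lines of business) can be expressed as a small validator. This is a sketch with hypothetical names and error strings; mapping "required for property" to the `Home` line of business is an assumption made here for illustration.

```python
from datetime import date

ALWAYS_REQUIRED = (
    "insured_name", "policy_number", "carrier_name", "lob",
    "effective_date", "expiration_date", "premium_annual", "deductible",
)

# Field -> lines of business for which it is required.
CONDITIONAL_REQUIRED = {
    "liability_limit": {"Auto", "GL"},
    "vehicle_info": {"Auto"},
    "property_address": {"Home"},  # assumption: "property" means the Home LOB
}


def validate_dec_page(fields: dict) -> list:
    """Return a list of validation error codes; an empty list means valid."""
    errors = []
    for name in ALWAYS_REQUIRED:
        if fields.get(name) in (None, ""):
            errors.append(f"missing:{name}")
    lob = fields.get("lob")
    for name, lobs in CONDITIONAL_REQUIRED.items():
        if lob in lobs and fields.get(name) in (None, ""):
            errors.append(f"missing:{name}")
    effective = fields.get("effective_date")
    expiration = fields.get("expiration_date")
    if effective and expiration and expiration <= effective:
        errors.append("expiration_date_not_after_effective_date")
    return errors
```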

Step 3: Extraction Confidence Thresholds

| Confidence Level | Threshold | Handling |
|---|---|---|
| High confidence | ≥ 0.90 | Auto-accept; proceed without human review |
| Medium confidence | 0.70 – 0.89 | Flag for human verification; highlight field in review UI |
| Low confidence | < 0.70 | Route to human review queue; display extracted value as suggestion, not fact |
| Not found | N/A | Mark field as missing; trigger missing-field condition |

Field-specific thresholds:

Critical fields (wrong value has significant downstream impact) should have higher thresholds:

  • Currency amounts: raise auto-accept to ≥ 0.93
  • Dates: raise auto-accept to ≥ 0.93 (off-by-one errors on dates can affect compliance)
  • SSN / EIN: raise auto-accept to ≥ 0.95; prefer human verification for all
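
The general and field-specific thresholds above combine into a single status decision per field. The sketch below assumes each field carries a value type; the status strings `auto_accepted` and `flagged_for_review` follow the output schema in Step 6, while `human_review_queue` and `missing` are hypothetical names.

```python
DEFAULT_AUTO_ACCEPT = 0.90
REVIEW_FLOOR = 0.70

# Raised auto-accept thresholds for critical field types, from the list above.
AUTO_ACCEPT_BY_TYPE = {
    "currency": 0.93,
    "date": 0.93,
    "ssn": 0.95,
    "ein": 0.95,
}


def field_status(confidence, value_type: str = "string") -> str:
    """Map a field's confidence score to a handling status."""
    if confidence is None:
        return "missing"  # field not found by the extraction model
    auto_accept = AUTO_ACCEPT_BY_TYPE.get(value_type, DEFAULT_AUTO_ACCEPT)
    if confidence >= auto_accept:
        return "auto_accepted"
    if confidence >= REVIEW_FLOOR:
        return "flagged_for_review"
    return "human_review_queue"
```

A currency field at 0.91 is flagged for review even though a plain string at the same score would be auto-accepted. The list above also prefers human verification for all SSN and EIN values, which this sketch does not enforce.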

Confidence aggregation:

  • Document-level confidence = weighted average of field-level confidence scores (weight by field importance)
  • If document-level confidence < 0.75: route entire document to human review, not field-by-field
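
The aggregation rule can be sketched as a weighted mean followed by a floor check. The function names, the default weight of 1.0, and the routing labels are assumptions for illustration; the source does not specify how field importance weights are chosen.

```python
def document_confidence(confidences: dict, weights: dict, default_weight: float = 1.0) -> float:
    """Weighted average of field-level confidences, weighted by field importance."""
    total_weight = sum(weights.get(name, default_weight) for name in confidences)
    if total_weight == 0:
        raise ValueError("no weighted fields to aggregate")
    weighted_sum = sum(
        score * weights.get(name, default_weight)
        for name, score in confidences.items()
    )
    return weighted_sum / total_weight


def review_scope(doc_confidence: float, floor: float = 0.75) -> str:
    """Below the floor, review the whole document instead of individual fields."""
    return "document_review" if doc_confidence < floor else "field_level"
```

With confidences of 0.9 (weight 3) and 0.6 (weight 1), the document-level score is (2.7 + 0.6) / 4 = 0.825, which stays above the 0.75 floor.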

Step 4: Model Selection

| Document Type | Recommended Model | Rationale |
|---|---|---|
| Machine-printed structured forms (W-2, 1099) | Azure Document Intelligence (formerly Form Recognizer), prebuilt W-2/1099 model | Pre-built models for standard IRS forms; high accuracy |
| Semi-structured machine print (pay stubs, bank statements, dec pages) | Azure Document Intelligence custom trained model, or AWS Textract with custom adapter | Requires training on carrier/issuer-specific layouts |
| Handwritten fields (ACORD applications, older inspection reports) | Azure Document Intelligence read model for handwriting | Handles mixed print/handwriting; lower accuracy than machine print |
| Tables (bank statement transactions, loss run schedules) | AWS Textract Tables API, or Azure DI table extraction | Preserves row/column structure; critical for transaction lists |
| Complex multi-page documents (1040 with schedules) | Azure Document Intelligence custom model with schedule awareness | Multi-page layout with dynamic presence of schedules |
| Low-quality scans (high noise, skew, faded) | Pre-process, then OCR (Tesseract or Azure DI read) | Pre-processing pipeline required before model |

Model selection criteria:

  • Use Azure Document Intelligence prebuilt models first (W-2, 1099, invoice) — highest accuracy for covered types
  • For document types without a prebuilt model: train a custom Azure DI model using at least 50 labeled examples per document type (100+ preferred)
  • Use AWS Textract as an alternative if the client is AWS-native and Azure DI is not preferred
  • Never use general-purpose OCR (Tesseract) alone for structured financial documents — accuracy is insufficient for regulatory contexts
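
The selection criteria above reduce to a short decision order. The sketch below encodes only that order; the returned labels and document type keys are hypothetical strings, not identifiers from either vendor's API.

```python
# Document types the criteria above list as having a prebuilt model.
PREBUILT_TYPES = frozenset({"w2", "1099", "invoice"})

MIN_LABELED_EXAMPLES = 50         # minimum per document type for a custom model
PREFERRED_LABELED_EXAMPLES = 100  # "100+ preferred"


def recommend_model(document_type: str, aws_native: bool = False) -> str:
    """Prebuilt first, then a custom model on the client's preferred platform."""
    if document_type in PREBUILT_TYPES:
        return "azure_di_prebuilt"
    if aws_native:
        return "aws_textract_custom"
    return "azure_di_custom"


def training_set_status(labeled_examples: int) -> str:
    """Classify a labeled training set against the sizes given above."""
    if labeled_examples < MIN_LABELED_EXAMPLES:
        return "insufficient"
    if labeled_examples < PREFERRED_LABELED_EXAMPLES:
        return "minimum_met"
    return "preferred_met"
```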

Step 5: Pre-Processing Requirements

Before sending a document to the extraction model:

| Pre-Processing Step | When Required | Tool or Rule |
|---|---|---|
| DPI check | If source is a scanned document | Reject if < 150 DPI; warn if < 200 DPI; optimal 300+ DPI |
| De-skew | If page is rotated or tilted | OpenCV deskew, or Azure DI handles internally |
| Contrast enhancement | If page is faded or low contrast | Adaptive histogram equalization |
| De-noise | If scanned with heavy grain | Gaussian blur or median filter |
| Color normalization | Color scans | Convert to grayscale or enhance; improves OCR accuracy |
| Page splitting | Multi-document packets | Detect and split at page boundaries between documents |
| Page rotation | If individual pages are upside-down | Auto-detect and rotate using text direction |
| Watermark removal | If watermarks obscure content | Detect and suppress watermark layer |

Quality gate: Any document that fails the minimum quality thresholds (DPI < 150, or confidence after pre-processing below the floor) is routed to the manual entry queue with a specific quality failure message.
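
The DPI portion of the quality gate can be sketched as below. The function names and result labels are hypothetical, and estimating DPI from pixel width and physical page width is an assumption about how the scan resolution would be obtained when it is not stored in the file metadata.

```python
def effective_dpi(width_px: int, page_width_in: float) -> float:
    """Estimate scan resolution from pixel width and physical page width."""
    if page_width_in <= 0:
        raise ValueError("page width must be positive")
    return width_px / page_width_in


def dpi_gate(dpi: float) -> str:
    """Apply the DPI thresholds from the pre-processing table."""
    if dpi < 150:
        return "reject"  # route to manual entry with a quality failure message
    if dpi < 200:
        return "warn"    # usable, but accuracy may suffer
    return "ok"
```

A US Letter page (8.5 in wide) scanned at 1275 pixels across is exactly 150 DPI, which passes the reject threshold but still triggers a warning.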


Step 6: Structured Output Schema

All extraction results are output in a consistent JSON format for consumption by downstream systems.

{
  "extraction_id": "uuid",
  "document_id": "uuid",
  "document_type": "pay_stub",
  "extraction_timestamp": "2024-01-15T14:32:00Z",
  "model_name": "azure-di-custom-pay-stub-v2",
  "model_version": "2.1.0",
  "document_confidence": 0.91,
  "page_count": 2,
  "fields": {
    "employer_name": {
      "value": "Acme Corporation",
      "confidence": 0.97,
      "bounding_box": {"page": 1, "x": 120, "y": 45, "width": 200, "height": 20},
      "status": "auto_accepted"
    },
    "gross_pay_this_period": {
      "value": 4250.00,
      "value_type": "currency",
      "confidence": 0.89,
      "bounding_box": {"page": 1, "x": 450, "y": 310, "width": 80, "height": 18},
      "status": "flagged_for_review"
    }
  },
  "missing_fields": ["employee_ssn_last4"],
  "validation_results": {
    "pay_period_end_vs_today": "within_60_days",
    "ytd_gross_ge_period_gross": "pass",
    "cross_field_consistency": "pass"
  },
  "routing": "human_review",
  "review_reasons": ["gross_pay_confidence_below_threshold"]
}
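
One way to keep the top-level `routing` value consistent with the per-field `status` values is to derive it from them instead of setting it independently. This is a sketch under assumptions: `auto_process` comes from the schema above, `human_review` is a hypothetical routing value, and the rule (any non-accepted field or any review reason blocks auto-processing) is one plausible policy, not one the schema mandates.

```python
import json


def derive_routing(result: dict) -> str:
    """Auto-process only when every field was auto-accepted and nothing needs review."""
    statuses = [field["status"] for field in result["fields"].values()]
    all_accepted = all(status == "auto_accepted" for status in statuses)
    if all_accepted and not result.get("review_reasons"):
        return "auto_process"
    return "human_review"


# Trimmed-down payload using the field names from the schema above.
sample = json.loads("""
{
  "fields": {
    "employer_name": {"status": "auto_accepted"},
    "gross_pay_this_period": {"status": "flagged_for_review"}
  },
  "review_reasons": ["gross_pay_confidence_below_threshold"]
}
""")
print(derive_routing(sample))
```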

Output Format

Deliver two artifacts:

  1. Extraction Configuration Specification — For each document type in scope: field definition table (name, type, required, validation, extraction note), confidence thresholds, model selection, and pre-processing requirements

  2. Output Schema Documentation — JSON schema for extraction results with field definitions, status enum values, routing logic, and validation rule definitions