spirituslab/private-credit-intel

Private Credit Intelligence Engine

An agentic document intelligence system that ingests credit agreement PDFs, extracts normalized covenant and pricing terms into structured data, self-verifies uncertain fields through an LLM-as-judge correction loop, grounds every field in source evidence, compares documents across deals, and generates portfolio-level risk analytics.

Origin story: This project started as an internship project in 2024, where I built an early LLM-based system for analyzing private credit agreements. Context windows were smaller, structured extraction was unreliable, and the prototype relied on prompt-heavy parsing, brittle chunking, and loosely structured outputs — but it proved the core workflow had real value.

Now, with more capable AI models and AI-assisted coding tools, I've rebuilt the entire system from the ground up — designed for reliability, traceability, and quantitative analytics. I'm curious to see how far this can go as both the underlying AI models and development tooling continue to improve.

Disclaimer: This is a personal project and is not affiliated with, endorsed by, or representative of any employer, past or present. The system may produce errors, hallucinations, or inaccurate extractions — outputs should always be verified by a qualified professional before use in any decision-making. All benchmark documents used are publicly available SEC EDGAR filings. This project is provided as-is for educational and research purposes.


Architecture

ASCII version
┌─────────────────────────────────────────────────────────────┐
│                     Credit Agreement PDF                     │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│  LAYER A — INGESTION                                        │
│  ┌──────────┐  ┌───────────────┐  ┌──────────────────────┐  │
│  │ PDF Load │→ │ TOC Detection │→ │ Section/Heading Parse │  │
│  │ (PyMuPDF)│  │ (regex)       │  │ (regex + LLM fallback)│  │
│  └──────────┘  └───────────────┘  └──────────────────────┘  │
└─────────────────────┬───────────────────────────────────────┘
                      │ pages + sections
                      ▼
┌─────────────────────────────────────────────────────────────┐
│  LAYER B — HYBRID RETRIEVAL                                  │
│  ┌────────────────┐  ┌────────────────┐  ┌──────────────┐   │
│  │ Section-Aware  │→ │ OpenAI         │→ │ Semantic     │   │
│  │ Chunking       │  │ Embeddings     │  │ Search       │   │
│  │ (overlap)      │  │ (text-embed-3) │  │ (cosine sim) │   │
│  └────────────────┘  └────────────────┘  └──────┬───────┘   │
│                                                  │ + merge   │
│                                          ┌──────┴───────┐   │
│                                          │ Keyword      │   │
│                                          │ Search       │   │
│                                          │ (exact term) │   │
│                                          └──────────────┘   │
└─────────────────────┬───────────────────────────────────────┘
                      │ relevant chunks per query
                      ▼
┌─────────────────────────────────────────────────────────────┐
│  LAYER C — EXTRACTION  (6 schema families)                  │
│                                                             │
│  ┌──────────┐ ┌───────────┐ ┌─────────┐ ┌───────────────┐  │
│  │ Pricing  │ │ Leverage  │ │  Debt   │ │    EBITDA     │  │
│  │ Terms    │ │ Covenants │ │ Baskets │ │ Definitions   │  │
│  └──────────┘ └───────────┘ └─────────┘ └───────────────┘  │
│  ┌──────────────────┐ ┌─────────────────────────────────┐   │
│  │ Collateral /     │ │ Events of Default               │   │
│  │ Guarantor Package│ │                                 │   │
│  └──────────────────┘ └─────────────────────────────────┘   │
│                                                             │
│  Each extractor uses OpenAI or Anthropic Structured Outputs │
│  to return typed JSON with value + evidence + page citations│
│  All 6 families run concurrently via ThreadPoolExecutor     │
└─────────────────────┬───────────────────────────────────────┘
                      │ ExtractedField objects
                      ▼
┌─────────────────────────────────────────────────────────────┐
│  LAYER D — VALIDATION                                       │
│  ┌──────────────────┐  ┌────────────────────────────────┐   │
│  │ Numeric Range    │  │ Citation Verification          │   │
│  │ Checks (%, x,   │  │ (evidence text ↔ source chunks)│   │
│  │ USD, bps, dates) │  │                                │   │
│  └──────────────────┘  └────────────────────────────────┘   │
└─────────────────────┬───────────────────────────────────────┘
                      │ validated results + warnings
                      ▼
┌─────────────────────────────────────────────────────────────┐
│  LAYER D.5 — STAGED SELF-VERIFICATION (Agentic Loop)        │
│                                                             │
│  ┌──────────────────┐                                       │
│  │ Issue Detection  │  3 triggers: validation warnings,     │
│  │                  │  low confidence, ambiguous status      │
│  └────────┬─────────┘                                       │
│           ▼                                                 │
│  ┌──────────────────┐  Deterministic                        │
│  │ Stage A: Informed│  evidence check   ──→ ACCEPT          │
│  │ (temp=0.2)       │  score >= 3?          (1 LLM call)    │
│  │ "X is hypothesis,│                                       │
│  │  prove from text"│  score < 3 ↓                          │
│  └──────────────────┘                                       │
│           ▼                                                 │
│  ┌──────────────────┐  Deterministic                        │
│  │ Stage B: Blind   │  evidence check   ──→ ACCEPT          │
│  │ (temp=0.7)       │  score >= 3?          (2 LLM calls)   │
│  │ No prior answer, │                                       │
│  │ reversed chunks  │  score < 3 ↓                          │
│  └──────────────────┘                                       │
│           ▼                                                 │
│  ┌──────────────────┐                                       │
│  │ Stage C: Judge   │  Sees all candidates                  │
│  │ (temp=0.0)       │  + actual source text ──→ ACCEPT      │
│  │ Verify quotes    │  Can extract corrected    (3 calls)   │
│  │ against source   │  answer itself                        │
│  └──────────────────┘                                       │
│                                                             │
│  Repeats up to N rounds; permanently uncertain fields       │
│  are not retried to avoid wasting cost                      │
└─────────────────────┬───────────────────────────────────────┘
                      │ corrected results
                      ▼
┌─────────────────────────────────────────────────────────────┐
│  LAYER E — ANALYTICS                                        │
│  ┌──────────────────┐ ┌─────────────┐ ┌─────────────────┐  │
│  │ Risk Scoring     │ │ Cross-Doc   │ │ Portfolio-Level │  │
│  │ • Covenant       │ │ Comparison  │ │ • Distributions │  │
│  │   Tightness      │ │ • Field Diff│ │ • Rankings      │  │
│  │ • Add-back       │ │ • Direction │ │ • Summary Table │  │
│  │   Aggressiveness │ │   (tighter/ │ │                 │  │
│  │ • Lender         │ │    looser)  │ │                 │  │
│  │   Protection     │ │             │ │                 │  │
│  └──────────────────┘ └─────────────┘ └─────────────────┘  │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│  OUTPUT — Streamlit Demo UI                                  │
│  • Single document extraction view with evidence panes      │
│  • Auto-generated term sheet with download                  │
│  • Gauge charts and radar profile for risk scores           │
│  • Side-by-side document comparison with direction badges   │
│  • Cross-model consensus view (OpenAI vs Claude)            │
│  • Portfolio analytics dashboard with charts and rankings   │
└─────────────────────────────────────────────────────────────┘

Extraction Schema

Every extracted field returns a structured record:

{
  "field_name": "max_first_lien_leverage",
  "value": "First Lien Leverage Ratio does not exceed 2.75:1.00 at incurrence",
  "numeric_value": 2.75,
  "operator": "<=",
  "units": "x",
  "scope": "incurrence test, pro forma",
  "page_numbers": [100, 101],
  "evidence_text": "Incremental ... may be incurred only if the First Lien Leverage Ratio ...",
  "status": "found",
  "confidence": 0.85,
  "exceptions": ["ratio tested on pro forma basis without netting cash proceeds"]
}
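
The record above maps naturally onto a typed model. A minimal sketch of the shape using standard-library dataclasses (the project itself uses Pydantic v2 in src/pci/models.py; this version just keeps the example self-contained):

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative dataclass mirroring the JSON record above.
# The real ExtractedField is a Pydantic v2 model; names follow the JSON keys.
@dataclass
class ExtractedField:
    field_name: str
    value: Optional[str] = None
    numeric_value: Optional[float] = None
    operator: Optional[str] = None        # e.g. "<=", ">="
    units: Optional[str] = None           # "x", "%", "USD", "bps"
    scope: Optional[str] = None
    page_numbers: list = field(default_factory=list)
    evidence_text: Optional[str] = None
    status: str = "not_found"             # "found" | "not_found" | "ambiguous"
    confidence: float = 0.0
    exceptions: list = field(default_factory=list)

f = ExtractedField(
    field_name="max_first_lien_leverage",
    numeric_value=2.75, operator="<=", units="x",
    status="found", confidence=0.85,
)
```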

The six extraction families cover 44 fields total:

| Family | Fields | Examples |
| --- | --- | --- |
| Pricing Terms | 6 | Base rate, applicable margin, default rate, commitment fee, rate floor |
| Leverage Covenants | 7 | Max total/secured/first-lien leverage, min interest/fixed-charge coverage, testing frequency |
| Debt Baskets | 8 | Incremental capacity, ratio-based basket, general debt, permitted liens, RP basket, builder basket |
| EBITDA Definitions | 8 | Definition, cost savings/restructuring/transaction/stock-comp add-backs, pro forma, cap, run-rate period |
| Collateral Package | 7 | Collateral description, pledge percentages, excluded assets, guarantor coverage |
| Events of Default | 8 | Payment/covenant/cross-default, bankruptcy, judgment, change of control, cure period, MAE |

Benchmark Results

Tested on three publicly filed SEC EDGAR credit agreements spanning different deal types, with both OpenAI and Anthropic models:

| Document | Pages | Model | Fields Found | Time | Tokens | Cost | Verified | Corrected |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Daseke Term Loan (leveraged) | 182 | GPT-5.1 | 38/44 | 80s | 246,606 | $0.74 | 6 flagged | 3 fixed |
| Daseke Term Loan (leveraged) | 182 | Claude Opus 4.6 | 30/44 | 270s | 164,192 | $3.18 | | |
| Daseke Term Loan (leveraged) | 182 | Claude Sonnet 4 | 30/44 | 224s | 159,354 | $0.58 | | |
| Royal Wolf A$125M Syndicated | 184 | GPT-5.1 | 23/44 | 74s | 107,931 | $0.32 | | |
| Pepco Holdings $200M Term Loan | 85 | GPT-5.1 | 15/44 | 55s | 100,443 | $0.29 | | |

Key observations:

  • Hybrid retrieval (semantic + keyword search) improved Daseke from 29/44 → 38/44 fields found (+31%). Pure embedding similarity misses exact defined terms (e.g., "Applicable Rate") buried in dense, multi-topic Definition section chunks. Keyword search catches these with zero additional API cost.
  • Domain-aware prompts teach the LLM about covenant-lite structures (incurrence tests vs. maintenance covenants), standard pledge conventions (domestic = 100% when only foreign is excluded at 65%), and terminology variants ("Applicable Rate" / "Applicable Margin" / "Applicable Spread"). These are general credit agreement patterns, not document-specific hacks.
  • Concurrent extraction cut Daseke processing from 135s → 48s (2.8x speedup) by running all 6 extractors in parallel.
  • Staged self-verification uses a three-stage escalation (informed → blind → source-grounded judge). On the Daseke deal: 5 fields resolved at Stage A, 1 required Stage C. Only 1 out of 6 flagged fields needed a judge call — 83% resolved by deterministic checks alone.
  • Claude produces fewer validation warnings (1 vs 3-5 for GPT) at higher cost, suggesting more careful citation grounding.
  • The investment-grade deal (Pepco) correctly returns not_found for EBITDA add-backs, debt baskets, and collateral — exactly what you'd expect from an unsecured utility term loan.
  • The 6 remaining not_found on Daseke are genuinely absent: no coverage covenants or cure rights (covenant-lite), no explicit guarantor threshold, no springing lien.
  • Cross-model consensus on the Daseke deal shows 39.5% strict text agreement (expected — models phrase the same facts differently) with 15/38 fields in exact agreement and 16 needing review for wording differences.

Verification & Evaluation

Extraction correctness is enforced through a five-layer verification system:

| Layer | What | How |
| --- | --- | --- |
| 1. Schema Enforcement | No malformed JSON | OpenAI Structured Outputs / Anthropic Tool Use |
| 2. Domain Validation | No out-of-range values or hallucinated citations | Numeric range checks + evidence substring matching |
| 3. Self-Verification | Agentic self-correction of flagged fields | 3-stage escalation: informed → blind → judge |
| 4. Evidence Traceability | Every field is auditable | page_numbers + evidence_text in every extraction |
| 5. Labeled Evaluation | Systematic accuracy measurement | 132 ground-truth fields, precision/recall/F1 |

Eval benchmark (3 documents, 132 fields): Precision 100% | Recall 94.7% | F1 97.3% | Numeric accuracy 96.5%

Design philosophy: precision over recall. A credit analyst would rather see "not found" than a wrong value.

Layer 2 — Domain Validation details
  • Numeric range checks (validate/numeric_rules.py): market-calibrated bounds per field type — leverage ratios must be 0–20x, margins 0–20%, pledge percentages 0–100%, run-rate periods 0–60 months. These catch gross errors like a dollar amount misread as a leverage ratio.
  • Citation verification (validate/citation_checks.py): for every found field, verifies that the evidence_text actually exists in the source chunks at the claimed page numbers. Uses exact substring matching first, then a 60% word-overlap fallback for minor LLM paraphrasing. This catches hallucinated citations — the model cannot claim evidence from text that doesn't exist in the document.
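
The citation check can be sketched in a few lines — exact substring first, word-overlap fallback second. Function names here are illustrative, not the actual validate/citation_checks.py API:

```python
def word_overlap(evidence: str, source: str) -> float:
    """Fraction of evidence words that also appear in the source text."""
    ev_words = evidence.lower().split()
    src_words = set(source.lower().split())
    if not ev_words:
        return 0.0
    return sum(w in src_words for w in ev_words) / len(ev_words)

def citation_ok(evidence: str, chunks: list[str], threshold: float = 0.6) -> bool:
    """Exact substring match first, then the 60% word-overlap fallback."""
    if any(evidence in chunk for chunk in chunks):   # quoted real text verbatim
        return True
    return any(word_overlap(evidence, c) >= threshold for c in chunks)

chunks = ["The Applicable Rate means 4.00% per annum for Term Benchmark Loans."]
citation_ok("Applicable Rate means 4.00%", chunks)      # exact substring → True
citation_ok("rate of 4.00% applies annually", chunks)   # overlap 0.4 < 0.6 → False
```

Because the fallback requires most of the quote's words to exist in the source, a fully fabricated citation scores near zero even if it sounds plausible.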
Layer 3 — Staged Self-Verification Loop details

This is the system's key agentic capability: after validation, the pipeline inspects its own output and decides what to re-examine. The design is informed by self-correction literature (surveyed in Kamoi et al. 2024, Huang et al. 2024, Dhuliawala et al. 2023) which shows that naive "please reconsider" prompts suffer from anchoring bias, while structured evidence-based verification improves accuracy reliably.

Three triggers identify fields needing review: validation warnings (failed numeric/citation checks), low confidence (< 0.7 on found fields), and ambiguous status.

For each flagged field, the system escalates through three stages:

Stage A — Informed verification (temperature 0.2, precise): The prior answer is treated as a falsifiable hypothesis, not a fact to confirm. The prompt asks the LLM to find the exact verbatim quote from the source text that proves or disproves the hypothesis. The response is then validated with deterministic evidence checks — the same citation verification and numeric range checks from Layer 2. If the returned quote is an exact substring of the source chunks (evidence score ≥ 3), the field is resolved without further LLM calls.

Stage B — Blind re-extraction (temperature 0.7, exploratory): Only reached when Stage A's evidence score is too low. The field is extracted from scratch with no prior answer shown and reversed chunk ordering to mitigate the "lost in the middle" positional bias documented in long-context retrieval research. Higher temperature encourages the model to explore different readings rather than repeating the same systematic error. The result goes through the same deterministic evidence checks.

Stage C — Source-grounded judge (temperature 0.0, deterministic): Only reached when both A and B produce weak evidence. This is not a generic "pick A or B" comparison — it is an informed re-extraction that sees all prior candidates AND the actual source text. The judge verifies each candidate's claimed evidence quote against the source, identifies which (if any) is correct, and can extract a corrected answer itself. This avoids the LLM judge reliability problems documented in the literature (position bias, verbosity bias) by grounding the decision in verifiable source text.
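
The escalation reduces to a short control loop. A sketch with the three stage prompts stubbed out as callables and the deterministic evidence score passed in as `score_fn` (all names illustrative):

```python
ACCEPT_SCORE = 3  # deterministic evidence threshold for a fast-pass accept

def verify_field(field, chunks, stage_a, stage_b, stage_c, score_fn):
    """Escalate one flagged field through informed -> blind -> judge stages.

    stage_a/b/c stand in for the real LLM calls at temps 0.2 / 0.7 / 0.0;
    score_fn is the deterministic evidence-scoring rubric.
    """
    candidate_a = stage_a(field, chunks)                 # informed verification
    if score_fn(candidate_a, chunks) >= ACCEPT_SCORE:
        return candidate_a, "stage_a"                    # 1 LLM call total

    candidate_b = stage_b(field, list(reversed(chunks)))  # blind, reversed order
    if score_fn(candidate_b, chunks) >= ACCEPT_SCORE:
        return candidate_b, "stage_b"                    # 2 LLM calls total

    judged = stage_c([candidate_a, candidate_b], chunks)  # source-grounded judge
    return judged, "stage_c"                             # 3 LLM calls total
```

Easy cases exit at the first `return`; only fields that fail two deterministic checks ever pay for the judge call.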

Evidence scoring rubric (deterministic, no LLM involved):

| Check | Points | Rationale |
| --- | --- | --- |
| Evidence quote is exact substring of source chunk | +3 | Strongest signal — the model quoted real text |
| Evidence quote has ≥ 60% word overlap with source | +2 | Allows minor paraphrasing while catching fabrications |
| Extracted value is non-null | +1 | Field has content |
| Page numbers match source chunk pages | +1 | Citation points to correct location |
| Evidence quote contains the numeric value | +1 | Internal consistency between value and quote |
| Status "found" but no evidence provided | -1 | Suspicious — claims to find something but shows no proof |

A score of ≥ 3 means the evidence is grounded in the actual document text — safe to accept without an LLM judge.
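
A direct transcription of the rubric (assuming the +3 and +2 checks are mutually exclusive with +3 taking precedence, and that the page check is a set intersection — both are this sketch's assumptions, not confirmed implementation details):

```python
def _overlap(quote: str, source: str) -> float:
    words = quote.lower().split()
    src = set(source.lower().split())
    return sum(w in src for w in words) / len(words) if words else 0.0

def evidence_score(result: dict, chunks: list[str], chunk_pages: set) -> int:
    """Deterministic evidence score per the rubric above — no LLM involved."""
    quote = result.get("evidence_text") or ""
    score = 0
    if quote and any(quote in c for c in chunks):
        score += 3                                    # exact substring of source
    elif quote and any(_overlap(quote, c) >= 0.6 for c in chunks):
        score += 2                                    # >= 60% word overlap
    if result.get("value") is not None:
        score += 1                                    # field has content
    if set(result.get("page_numbers", [])) & chunk_pages:
        score += 1                                    # citation points to right pages
    nv = result.get("numeric_value")
    if nv is not None and str(nv) in quote:
        score += 1                                    # quote contains the value
    if result.get("status") == "found" and not quote:
        score -= 1                                    # "found" with no proof
    return score
```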

Why three stages with different temperatures?

  • Low temperature (0.2) for Stage A: be precise about evidence, stick to what the text says.
  • High temperature (0.7) for Stage B: explore different interpretations, break out of self-consistent errors.
  • Zero temperature (0.0) for Stage C: make the most deterministic and consistent final judgment possible.

Why deterministic fast-pass filtering? The literature (Kamoi et al. 2024) shows that self-correction works reliably when feedback is external and concrete rather than the model's own belief. Our citation checks and numeric validators are exactly this: deterministic external signals that don't depend on LLM judgment. By using them as fast-pass filters, most fields are resolved cheaply (1 LLM call, 0 judge calls), and the expensive judge only runs for genuinely hard cases.

Benchmark on Daseke term loan (GPT-5.1):

  • 5 fields flagged → 3 resolved at Stage A, 1 at Stage B, 1 at Stage C
  • Only 1 out of 5 fields required a judge call (80% resolved by deterministic checks)
  • 97K extra tokens ($0.26) — significantly cheaper than a naive retry-everything approach
  • Fields that failed both A and B are marked permanently uncertain to avoid wasting cost on genuinely ambiguous provisions

The --no-verify flag disables verification when speed or cost matters more than accuracy. The verification_summary in the output JSON records exactly what was flagged, which stage resolved it, and the evidence scores, making the self-correction process fully auditable.

Layer 5 — Labeled Evaluation Framework details
  • Ground-truth annotations (eval/ground_truth.py): human-annotated expected values for all 44 fields across 3 documents (132 total), with expected status, numeric value, page numbers, and tolerance thresholds.
  • Per-field graders (eval/graders.py): status match (found/not_found), numeric accuracy (within configurable tolerance), and page overlap (F1-style harmonic mean with +/-3 page tolerance).
  • Evaluation reporter (eval/run_eval.py): computes precision, recall, F1 per extraction family and overall, with detailed mismatch listings.
python eval/run_eval.py

| Metric | Score | Notes |
| --- | --- | --- |
| Status accuracy | 89.4% | 118/132 fields correct |
| Precision | 100.0% | Zero false positives — when the system says "found", it's always right |
| Recall | 94.7% | Misses are mostly "ambiguous" instead of "found" |
| F1 | 97.3% | |
| Numeric accuracy | 96.5% | Extracted values within tolerance of ground truth |
| Page overlap | 88.7% | Page citations reliably point to correct source text |
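
The page-overlap grade described for eval/graders.py can be sketched as an F1-style harmonic mean over tolerant page matches (the ±3 tolerance is from the text; treating two empty lists as a perfect match is this sketch's convention):

```python
def page_overlap_f1(predicted: list, expected: list, tol: int = 3) -> float:
    """F1-style page-citation grade: a page 'hits' if within +/-tol of a match."""
    if not predicted and not expected:
        return 1.0          # nothing expected, nothing cited (sketch convention)
    if not predicted or not expected:
        return 0.0
    hit_pred = sum(any(abs(p - e) <= tol for e in expected) for p in predicted)
    hit_exp = sum(any(abs(e - p) <= tol for p in predicted) for e in expected)
    precision = hit_pred / len(predicted)
    recall = hit_exp / len(expected)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

page_overlap_f1([100, 101], [101])   # both citations within tolerance → 1.0
```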

Risk Scoring Methodology

The scoring module converts extracted terms into three quantitative 0–100 scores — Covenant Tightness, EBITDA Add-back Aggressiveness, and Lender Protection — with deal-type-aware weighting (leveraged vs. IG unsecured vs. syndicated, auto-detected from extraction patterns).

Overall = W_cov x Covenant + W_add x (100 - Addback) + W_prot x Protection

| Document | Type | Covenant | Add-back | Protection | Overall |
| --- | --- | --- | --- | --- | --- |
| Pepco Holdings (IG utility) | ig_unsecured | 45 | 20 | 25 | 41 |
| Royal Wolf (syndicated) | syndicated | 29 | 43 | 54 | 46 |
| Daseke (leveraged) | leveraged | 5 | 71 | 50 | 29 |

Scoring component breakdown

Covenant Tightness (0 = very loose, 100 = very tight)

| Component | Max Points | Logic |
| --- | --- | --- |
| Max total leverage ratio | 25 | Linear: 2x (tight) = 25, 8x (loose) = 0 |
| Min interest coverage | 20 | Linear: 4x (tight) = 20, 1x (loose) = 0 |
| Min fixed charge coverage | 15 | Linear: 3x (tight) = 15, 1x (loose) = 0 |
| Testing frequency | 15 | Maintenance (quarterly) = 15, incurrence-only = 5 |
| Equity cure rights | 10 | No cure = 10 (tighter), cure available = 3 |
| Margin signal | 15 | Higher spread implies riskier credit → tighter package |

EBITDA Add-back Aggressiveness (0 = conservative, 100 = aggressive)

| Component | Max Points | Logic |
| --- | --- | --- |
| Add-back cap | 30 | 0% cap = 0, 35% cap = 30; no cap = 20 (aggressive default) |
| Run-rate period | 20 | 12 months = 0, 36 months = 20 |
| Add-back breadth | 30 | Counts 4 add-back types (cost savings, restructuring, transaction, stock comp) |
| Pro-forma adjustments | 20 | Present = 15, absent = 0 |

Lender Protection (0 = weak, 100 = very strong)

| Component | Max Points | Logic |
| --- | --- | --- |
| Collateral coverage | 20 | "Substantially all assets" = 20, partial = 10, none = 0 |
| Domestic pledge % | 15 | Linear 0–100% |
| Events-of-default breadth | 25 | Counts 6 EoD trigger types |
| Cross-default strictness | 15 | $10M threshold (strict) = 15, $100M (loose) = 0 |
| Basket restrictions | 25 | Counts 4 basket types (defined baskets = restrictions exist) |

Deal-Type-Aware Weighting

| Deal Type | Covenant | Add-back | Protection | Rationale |
| --- | --- | --- | --- | --- |
| Leveraged | 30% | 35% | 35% | Add-back aggressiveness is the primary risk |
| IG Unsecured | 50% | 10% | 40% | Covenants dominate; add-backs are irrelevant |
| Syndicated | 35% | 30% | 35% | Balanced (default) |

Deal type is auto-detected from extraction patterns: has EBITDA add-backs + collateral = leveraged; no collateral + no EBITDA = IG unsecured. Missing fields receive contextual annotations — e.g., "IG unsecured: no collateral expected (not a protection gap)" — so scores reflect deal economics rather than extraction gaps.
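
Combining the formula with the deal-type weights reproduces the published Overall scores to within rounding. A sketch, with weights and inputs taken from the tables in this section:

```python
# Deal-type weights (covenant, add-back, protection) from the weighting table.
WEIGHTS = {
    "leveraged":    (0.30, 0.35, 0.35),
    "ig_unsecured": (0.50, 0.10, 0.40),
    "syndicated":   (0.35, 0.30, 0.35),
}

def overall_score(deal_type, covenant, addback, protection):
    w_cov, w_add, w_prot = WEIGHTS[deal_type]
    # Add-back aggressiveness is inverted: aggressive add-backs reduce Overall.
    return w_cov * covenant + w_add * (100 - addback) + w_prot * protection

overall_score("leveraged", 5, 71, 50)       # ≈ 29.15 (table: 29)
overall_score("ig_unsecured", 45, 20, 25)   # = 40.5  (table: 41)
overall_score("syndicated", 29, 43, 54)     # ≈ 46.15 (table: 46)
```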


Project Structure
private-credit-intel/
├── app/
│   └── demo_ui.py              # Streamlit demo (single doc, compare, portfolio)
│
├── src/pci/
│   ├── models.py               # Pydantic domain models (44 extraction fields)
│   ├── config.py               # Settings (API keys, model params, chunking)
│   ├── pipeline.py             # Main orchestrator: ingest → retrieve → extract → validate → verify
│   │
│   ├── ingest/
│   │   ├── loaders.py          # PDF text extraction (PyMuPDF)
│   │   └── section_parser.py   # TOC detection, regex heading parse, LLM fallback
│   │
│   ├── retrieve/
│   │   ├── chunking.py         # Section-aware text chunking with overlap
│   │   └── search.py           # Hybrid search index (semantic + keyword)
│   │
│   ├── extract/
│   │   ├── base.py             # Dual-backend extraction (OpenAI + Anthropic)
│   │   ├── schemas.py          # JSON schemas + system prompts for all 6 families
│   │   ├── pricing.py          # Pricing terms extractor
│   │   ├── covenants.py        # Leverage / coverage covenants extractor
│   │   ├── baskets.py          # Debt / lien / investment / RP baskets extractor
│   │   ├── ebitda.py           # EBITDA definitions & add-backs extractor
│   │   ├── collateral.py       # Collateral & guarantor package extractor
│   │   ├── defaults.py         # Events of default extractor
│   │   └── verifier.py         # Self-verification loop (agentic re-extraction + LLM-as-judge)
│   │
│   ├── validate/
│   │   ├── numeric_rules.py    # Range checks per family (%, x, USD, bps)
│   │   └── citation_checks.py  # Evidence text ↔ source chunk verification
│   │
│   ├── analytics/
│   │   ├── scoring.py          # Covenant tightness, add-back aggressiveness, lender protection
│   │   ├── compare.py          # Cross-document field diff with direction analysis
│   │   ├── consensus.py        # Cross-model consensus (OpenAI vs Anthropic agreement)
│   │   └── portfolio.py        # Portfolio-level distributions, rankings, summary table
│   │
│   ├── termsheet/
│   │   └── generator.py        # Auto-generate markdown term sheet from extractions
│   │
│   └── storage/
│       └── repo.py             # JSON persistence (provider + model in filenames)
│
├── eval/
│   ├── ground_truth.py        # Human-annotated labels for 3 documents (132 fields)
│   ├── graders.py             # Status, numeric, and page-overlap graders
│   └── run_eval.py            # Evaluation runner with precision/recall/F1 reporting
│
├── scripts/
│   ├── run_single_doc_openai.py    # OpenAI extraction runner
│   ├── run_single_doc_anthropic.py # Anthropic extraction runner
│   └── run_consensus.py        # Cross-model consensus runner
│
├── tests/
├── input/                      # Source PDFs
├── output/                     # Extraction results (JSON + term sheets)
└── pyproject.toml

Quick Start

1. Install dependencies

pip install -e .

2. Set your API keys

cp .env.example .env
# Edit .env — set at minimum OPENAI_API_KEY (required for embeddings)
# Optionally set ANTHROPIC_API_KEY for Claude-based extraction
# Set LLM_PROVIDER=openai or LLM_PROVIDER=anthropic

3. Run extraction on a credit agreement

# With OpenAI (default) — includes self-verification loop
python scripts/run_single_doc_openai.py path/to/agreement.pdf --score --termsheet

# With Anthropic Claude
python scripts/run_single_doc_anthropic.py path/to/agreement.pdf --score

# Skip self-verification for faster/cheaper runs
python scripts/run_single_doc_openai.py path/to/agreement.pdf --no-verify

# Cross-model consensus (runs both, compares field-by-field)
python scripts/run_consensus.py path/to/agreement.pdf

4. Launch the demo UI

streamlit run app/demo_ui.py

Key Design Decisions

Why Structured Outputs over prompt-only JSON?

The 2024 prototype used prompt-based extraction with create_extraction_chain, which frequently produced malformed JSON, hallucinated fields, and required brittle parser workarounds. OpenAI's Structured Outputs guarantee schema adherence — the response is always valid, typed, and parseable. This eliminated an entire class of runtime errors.

Why hybrid retrieval (semantic + keyword)?

Pure embedding-based semantic search has a known blind spot: when a short, high-value definition (e.g., "Applicable Rate" means 4.00% for Term Benchmark Loans) is embedded in a 4000-character chunk containing 20+ other unrelated definitions, the embedding becomes a diluted average of all topics. The target definition's semantic signal is overwhelmed. In testing, the chunk containing "Applicable Rate" ranked 19th out of 475 chunks for our best semantic query — well below the top-10 cutoff.

Keyword search solves this with zero additional API cost: it scans chunk text for exact term matches (e.g., "Applicable Rate", "Material Adverse Effect", "Incremental Facility") and returns chunks ranked by keyword density. Each extractor defines a set of domain-standard keywords that cover the terminology variants used across credit agreements. The semantic and keyword results are merged and deduplicated before being sent to the LLM.

This is a lightweight implementation of the hybrid retrieval pattern (dense + sparse) widely used in production RAG systems (e.g., BM25 + dense retrieval with reciprocal rank fusion). The keyword component adds ~0ms latency (simple string matching over hundreds of chunks) and typically supplements the semantic results with 2-5 additional chunks containing exact defined terms that embeddings miss.
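
A minimal sketch of the merge step. Keyword-density scoring and append-after-semantic merging are illustrative simplifications of retrieve/search.py; the real index also carries embedding vectors:

```python
def keyword_score(chunk: str, keywords: list) -> int:
    """Count exact (case-insensitive) keyword hits in a chunk."""
    text = chunk.lower()
    return sum(text.count(k.lower()) for k in keywords)

def hybrid_retrieve(chunks: list, semantic_top: list, keywords: list, k: int = 10):
    """Merge semantic results with keyword hits, dedup, cap at k.

    semantic_top: chunk indices already ranked by cosine similarity.
    """
    kw_ranked = sorted(
        (i for i, c in enumerate(chunks) if keyword_score(c, keywords) > 0),
        key=lambda i: keyword_score(chunks[i], keywords),
        reverse=True,
    )
    merged = list(semantic_top)
    for i in kw_ranked:                  # supplement with exact-term chunks
        if i not in merged:
            merged.append(i)
    return merged[:k]

docs = ["General provisions ...", '"Applicable Rate" means 4.00% ...', "Notices ..."]
hybrid_retrieve(docs, semantic_top=[0], keywords=["Applicable Rate"])  # → [0, 1]
```

The keyword pass recovers the defined-term chunk that pure embedding ranking buried, at the cost of a string scan.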

Why section-aware chunking?

Fixed-character chunking (the 2024 approach) splits text at arbitrary positions, often breaking a covenant definition across two chunks. Section-aware chunking respects document structure: chunks never cross Article or Section boundaries, so retrieval returns topically coherent context.
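
The idea can be sketched as: split within each section only, carrying overlap between consecutive chunks but never across a section boundary (sizes and function names are illustrative):

```python
def chunk_section(text: str, max_chars: int = 4000, overlap: int = 200) -> list:
    """Split one section into overlapping chunks; never crosses the section."""
    if len(text) <= max_chars:
        return [text]
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break
        start += max_chars - overlap     # overlap with the previous chunk
    return chunks

def chunk_document(sections: dict) -> list:
    """Chunk each {heading: body} section independently, tagging by section."""
    return [(name, c) for name, body in sections.items()
            for c in chunk_section(body)]
```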

Why evidence grounding?

Every ExtractedField carries evidence_text and page_numbers. This serves two purposes:

  1. Validation: the citation check layer verifies that the evidence text actually appears in the source chunks.
  2. Trust: users can click "View Evidence" in the UI to see exactly what the model read.
Why risk scoring?

Raw extracted terms are useful for individual deal review, but portfolio-level analysis requires quantitative comparison. The scoring module converts extracted terms into normalized 0–100 scores across three dimensions (covenant tightness, add-back aggressiveness, lender protection), enabling cross-deal ranking and distribution analysis.

Why dual-provider (OpenAI + Anthropic)?

Both providers achieve structured extraction through different mechanisms: OpenAI uses JSON Schema via text.format, while Claude uses tool use with input_schema. The same JSON schema works for both — no duplication needed. Running both providers on the same document enables cross-model consensus: fields where both models agree have high confidence, while disagreements flag fields for human review. This provides a confidence signal without requiring ground-truth labels.
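
To illustrate the "one schema, two providers" point, both request payloads can be built from the same JSON Schema dict. The payload shapes below are simplified approximations of each provider's structured-output mechanism, not verbatim SDK calls:

```python
FIELD_SCHEMA = {
    "type": "object",
    "properties": {
        "field_name": {"type": "string"},
        "numeric_value": {"type": ["number", "null"]},
        "evidence_text": {"type": ["string", "null"]},
    },
    "required": ["field_name", "numeric_value", "evidence_text"],
    "additionalProperties": False,
}

# OpenAI: the schema goes under the response text format (Structured Outputs).
openai_format = {"type": "json_schema", "name": "extracted_field",
                 "schema": FIELD_SCHEMA, "strict": True}

# Anthropic: the same schema object becomes a tool's input_schema (Tool Use).
anthropic_tool = {"name": "extracted_field",
                  "description": "Return one extracted field",
                  "input_schema": FIELD_SCHEMA}

# One schema object, referenced twice — no duplication to keep in sync.
assert openai_format["schema"] is anthropic_tool["input_schema"]
```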

Why concurrent extraction?

The 6 extraction families are independent — each retrieves its own chunks and makes its own LLM call. Running them concurrently via ThreadPoolExecutor cuts extraction time by ~2.5x (135s → 48s for a 182-page document). The ChunkIndex is made thread-safe with a threading.Lock on lazy embedding initialization, and embeddings are pre-warmed before concurrent access.
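
The fan-out is plain ThreadPoolExecutor usage. A sketch with the extractor functions stubbed out (the real ones each do retrieval plus an LLM call):

```python
from concurrent.futures import ThreadPoolExecutor

def run_extractors(extractors: dict, chunks: list) -> dict:
    """Run all extraction families concurrently; each makes its own LLM call."""
    with ThreadPoolExecutor(max_workers=len(extractors)) as pool:
        futures = {name: pool.submit(fn, chunks) for name, fn in extractors.items()}
        return {name: fut.result() for name, fut in futures.items()}

# Stubs standing in for two of the six real families:
extractors = {
    "pricing":   lambda chunks: {"family": "pricing", "n_chunks": len(chunks)},
    "covenants": lambda chunks: {"family": "covenants", "n_chunks": len(chunks)},
}
run_extractors(extractors, ["chunk-1", "chunk-2"])
```

Since each family is I/O-bound (waiting on the API), threads are sufficient; no multiprocessing needed.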

Why a staged verification loop instead of just running twice?

A naive approach to improving accuracy would be to run extraction twice and take the consensus. But this doubles cost without targeting the actual problems. The self-correction literature also shows that simply asking a model to "reconsider" its answer often reinforces mistakes through anchoring bias (Huang et al. 2024), and that blind retries can repeat systematic errors when retrieval and context are unchanged.

Our staged design addresses both problems:

  1. Selective: only re-extracts fields that failed validation, have low confidence, or are ambiguous — typically 5-15% of all fields, not 100%.
  2. Evidence-constrained: Stage A treats the prior answer as a hypothesis to be proven from the text, not a fact to confirm. This converts self-correction from an introspective task ("am I right?") into a verification task ("can I find the evidence?"), which the literature shows is more reliable.
  3. Independent second opinion: Stage B uses blind re-extraction with reversed chunk ordering and higher temperature — maximally independent from the first attempt.
  4. Deterministic fast-pass: most fields are resolved by checking whether the evidence quote actually exists in the source text. No LLM judge needed for clear-cut cases.
  5. Source-grounded judge as fallback: when deterministic checks can't resolve, the judge sees all candidates AND the actual source text, so it can verify quotes rather than just comparing "which sounds better."

This is the core agentic behavior in the system: the pipeline inspects its own output, identifies weaknesses, and takes corrective action — the hallmark of an agent versus a static pipeline. The escalation design means easy cases are cheap (1 LLM call) and only hard cases pay for the full 3-call pipeline.

Why not a multi-agent framework?

Agent orchestration adds complexity without proportional value for this pipeline. The architecture is a clean sequential pipeline (ingest → retrieve → extract → validate → verify → score) with no branching or dynamic routing needed. The self-verification loop adds targeted agentic self-correction where it matters most, without requiring a full agent orchestration framework. Keeping it simple makes the code auditable and the cost predictable.


Technical Stack

| Component | Technology | Why |
| --- | --- | --- |
| PDF parsing | PyMuPDF (fitz) | Fast, handles scanned PDFs, preserves page structure |
| Embeddings | OpenAI text-embedding-3-small | Cost-effective, good retrieval quality |
| Extraction | OpenAI Structured Outputs + Anthropic Tool Use | Dual-provider, same JSON schema, schema-guaranteed |
| Models | Pydantic v2 | Type safety, JSON serialization, schema generation |
| Search | NumPy cosine similarity + keyword matching | Hybrid retrieval, no external vector DB dependency |
| Validation | Custom rule engine | Domain-specific numeric and citation checks |
| UI | Streamlit + Plotly | Interactive demo with charts, no frontend build step |
| Config | pydantic-settings + dotenv | Type-safe settings, no hardcoded credentials |
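The hybrid retrieval row can be illustrated with a minimal sketch: cosine similarity over embedding vectors blended with a keyword hit-rate. The function names, `alpha` weight, and toy 2-d vectors are illustrative assumptions, not the project's actual API:

```python
import numpy as np

def cosine_scores(query_vec: np.ndarray, chunk_vecs: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query embedding and each row of chunk_vecs."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    return c @ q

def hybrid_scores(query_vec, chunk_vecs, chunks, keywords, alpha=0.7):
    """Blend semantic similarity with keyword hit-rate (alpha weights the semantic side)."""
    sem = cosine_scores(query_vec, chunk_vecs)
    kw = np.array([
        sum(k.lower() in chunk.lower() for k in keywords) / max(len(keywords), 1)
        for chunk in chunks
    ])
    return alpha * sem + (1 - alpha) * kw

chunks = ["The Total Net Leverage Ratio covenant shall...", "Events of Default include..."]
keywords = ["leverage", "covenant"]  # domain-specific terms for this field
query_vec = np.array([1.0, 0.0])    # toy 2-d stand-ins for real embeddings
chunk_vecs = np.array([[0.9, 0.1], [0.1, 0.9]])
scores = hybrid_scores(query_vec, chunk_vecs, chunks, keywords)
assert scores[0] > scores[1]  # the covenant chunk wins on both signals
```

Keeping retrieval in plain NumPy avoids an external vector-database dependency for document-scale corpora.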
What Changed from the 2024 Prototype

| Aspect | 2024 Prototype | 2026 Rebuild |
| --- | --- | --- |
| Section detection | LLM-only TOC parsing (5+ API calls) | Regex-first with TOC skip + LLM fallback |
| Chunking | Fixed 4000-char splits | Section-aware with configurable overlap |
| Retrieval | Semantic-only embedding search | Hybrid (semantic + keyword) with domain-specific term lists |
| Extraction | create_extraction_chain + prompt batching | Structured Outputs (schema-guaranteed), concurrent |
| LLM Provider | OpenAI only | Dual-provider (OpenAI + Anthropic) with consensus |
| Output | Summary paragraphs, no source links | 44 normalized fields with evidence + auto-generated term sheets |
| Validation | None | Numeric range checks + citation verification + agentic self-correction |
| Evaluation | None | Labeled benchmark (132 fields), precision/recall/F1 reporting |
| Analytics | None | Risk scoring + cross-doc comparison + cross-model consensus + portfolio analytics |
| Persistence | Pickle files | JSON with Pydantic serialization (provider/model in filenames) |
| Credentials | Hardcoded API keys | Environment variables via dotenv |
| UI | Desktop-only Tkinter | Web-based Streamlit with Plotly charts, term sheet download |
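The move from fixed 4000-char splits to section-aware chunking with configurable overlap can be sketched as follows. This is a hedged illustration — the function name, defaults, and windowing strategy are assumptions, not the project's actual code — but it shows why overlap matters: a clause that straddles a window boundary still appears whole in at least one chunk.

```python
def chunk_section(text: str, max_chars: int = 4000, overlap: int = 200) -> list[str]:
    """Split one section into overlapping windows.

    Chunking per section (rather than over the whole document) keeps
    each chunk inside a single article of the agreement, so retrieval
    never returns a chunk that mixes, say, covenants with definitions.
    """
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    if len(text) <= max_chars:
        return [text]
    chunks, start, step = [], 0, max_chars - overlap
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += step
    return chunks

section = "x" * 5000
parts = chunk_section(section)
assert len(parts) == 2
# The tail of chunk 0 is repeated at the head of chunk 1 (the overlap).
assert parts[0][-200:] == parts[1][:200]
```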
References — Self-Verification Design (9 papers)

The staged self-verification loop is informed by findings from the LLM self-correction, anchoring bias, and information extraction literature. Below are the key papers that shaped the design and how each influenced specific engineering decisions.

Why naive self-correction fails (motivating the staged design)

  • Kamoi et al. (2024). "When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs." TACL. Key finding: self-correction reliably improves only when there is reliable external feedback, not when the model merely critiques its own output. This motivated our deterministic citation/numeric checks as the primary decision mechanism. Paper

  • Huang et al. (2024). "Large Language Models Cannot Self-Correct Reasoning Yet." ICLR. LLMs often cannot self-correct without external feedback, and performance can degrade. This is why Stage A reframes correction as an evidence verification task. Paper

  • Tyen et al. (2024). "LLMs Cannot Find Reasoning Errors, But Can Correct Them!" Models are poor at detecting errors but good at correcting them once located. This validates using deterministic validators to locate problems, then asking the LLM to fix flagged fields. Paper

Why evidence-constrained verification works (motivating Stage A)

  • Dhuliawala et al. (2023). "Chain-of-Verification (CoVe) Reduces Hallucination in Large Language Models." Verification questions must be answered independently to avoid anchoring — directly inspired Stage A's evidence-first design. Paper

  • Gero et al. (2023). "Self-Verification Improves Few-Shot Clinical Information Extraction." Reports F1 gain of +0.056 with evidence-grounded verification in information extraction. Paper

Why blind retry uses reversed chunks (motivating Stage B)

  • Liu et al. (2024). "Lost in the Middle: How Language Models Use Long Contexts." LLMs have degraded recall for middle-positioned information. Stage B reverses chunk ordering to break positional bias. Paper

Why we minimize LLM judge reliance (motivating deterministic adjudication)

  • Zheng et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." Documents position, verbosity, and self-enhancement bias in LLM judges. Paper

Why sampling diversity helps (motivating temperature variation)

  • Wang et al. (2023). "Self-Consistency Improves Chain of Thought Reasoning in Language Models." ICLR. Multiple reasoning paths improve accuracy — supports temperature variation across stages. Paper

  • Manakul et al. (2023). "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection." Independent re-sampling reveals instability correlated with errors. Paper


Disclaimer

This is a personal project built for educational and research purposes. It is not affiliated with, endorsed by, or representative of any employer, past or present. No proprietary data, confidential information, or employer IP was used in this project.

  • Accuracy: LLM-based extraction is inherently imperfect. Outputs may contain errors, hallucinations, or misinterpretations. Always verify results with a qualified professional before relying on them for any financial, legal, or business decision.
  • Benchmark data: All credit agreements used for testing are publicly available SEC EDGAR filings.
  • Not financial advice: Nothing in this project constitutes financial, legal, or investment advice.

License

MIT License — see LICENSE for details.
