spirituslab/private-credit-intel

Private Credit Intelligence Engine

An agentic document intelligence system that ingests credit agreement PDFs, extracts normalized covenant and pricing terms into structured data, self-verifies uncertain fields through an LLM-as-judge correction loop, grounds every field in source evidence, compares documents across deals, and generates portfolio-level risk analytics.

Origin story: This project started as an internship project in 2024, where I built an early LLM-based system for analyzing private credit agreements. Context windows were smaller, structured extraction was unreliable, and the prototype relied on prompt-heavy parsing, brittle chunking, and loosely structured outputs — but it proved the core workflow had real value.

Now, with more capable AI models and AI-assisted coding tools, I've rebuilt the entire system from the ground up — designed for reliability, traceability, and quantitative analytics. I'm curious to see how far this can go as both the underlying AI models and development tooling continue to improve.

Disclaimer: This is a personal project and is not affiliated with, endorsed by, or representative of any employer, past or present. The system may produce errors, hallucinations, or inaccurate extractions — outputs should always be verified by a qualified professional before use in any decision-making. All benchmark documents used are publicly available SEC EDGAR filings. This project is provided as-is for educational and research purposes.


Architecture

ASCII version
┌─────────────────────────────────────────────────────────────┐
│                     Credit Agreement PDF                     │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│  LAYER A — INGESTION                                        │
│  ┌──────────┐  ┌───────────────┐  ┌──────────────────────┐  │
│  │ PDF Load │→ │ TOC Detection │→ │ Section/Heading Parse │  │
│  │ (PyMuPDF)│  │ (regex)       │  │ (regex + LLM fallback)│  │
│  └──────────┘  └───────────────┘  └──────────────────────┘  │
└─────────────────────┬───────────────────────────────────────┘
                      │ pages + sections
                      ▼
┌─────────────────────────────────────────────────────────────┐
│  LAYER B — HYBRID RETRIEVAL                                  │
│  ┌────────────────┐  ┌────────────────┐  ┌──────────────┐   │
│  │ Section-Aware  │→ │ OpenAI         │→ │ Semantic     │   │
│  │ Chunking       │  │ Embeddings     │  │ Search       │   │
│  │ (overlap)      │  │ (text-embed-3) │  │ (cosine sim) │   │
│  └────────────────┘  └────────────────┘  └──────┬───────┘   │
│                                                  │ + merge   │
│                                          ┌──────┴───────┐   │
│                                          │ Keyword      │   │
│                                          │ Search       │   │
│                                          │ (exact term) │   │
│                                          └──────────────┘   │
└─────────────────────┬───────────────────────────────────────┘
                      │ relevant chunks per query
                      ▼
┌─────────────────────────────────────────────────────────────┐
│  LAYER C — EXTRACTION  (6 schema families)                  │
│                                                             │
│  ┌──────────┐ ┌───────────┐ ┌─────────┐ ┌───────────────┐  │
│  │ Pricing  │ │ Leverage  │ │  Debt   │ │    EBITDA     │  │
│  │ Terms    │ │ Covenants │ │ Baskets │ │ Definitions   │  │
│  └──────────┘ └───────────┘ └─────────┘ └───────────────┘  │
│  ┌──────────────────┐ ┌─────────────────────────────────┐   │
│  │ Collateral /     │ │ Events of Default               │   │
│  │ Guarantor Package│ │                                 │   │
│  └──────────────────┘ └─────────────────────────────────┘   │
│                                                             │
│  Each extractor uses OpenAI or Anthropic Structured Outputs │
│  to return typed JSON with value + evidence + page citations│
│  All 6 families run concurrently via ThreadPoolExecutor     │
└─────────────────────┬───────────────────────────────────────┘
                      │ ExtractedField objects
                      ▼
┌─────────────────────────────────────────────────────────────┐
│  LAYER D — VALIDATION                                       │
│  ┌──────────────────┐  ┌────────────────────────────────┐   │
│  │ Numeric Range    │  │ Citation Verification          │   │
│  │ Checks (%, x,   │  │ (evidence text ↔ source chunks)│   │
│  │ USD, bps, dates) │  │                                │   │
│  └──────────────────┘  └────────────────────────────────┘   │
└─────────────────────┬───────────────────────────────────────┘
                      │ validated results + warnings
                      ▼
┌─────────────────────────────────────────────────────────────┐
│  LAYER D.5 — STAGED SELF-VERIFICATION (Agentic Loop)        │
│                                                             │
│  ┌──────────────────┐                                       │
│  │ Issue Detection  │  3 triggers: validation warnings,     │
│  │                  │  low confidence, ambiguous status      │
│  └────────┬─────────┘                                       │
│           ▼                                                 │
│  ┌──────────────────┐  Deterministic                        │
│  │ Stage A: Informed│  evidence check   ──→ ACCEPT          │
│  │ (temp=0.2)       │  score >= 3?          (1 LLM call)    │
│  │ "X is hypothesis,│                                       │
│  │  prove from text"│  score < 3 ↓                          │
│  └──────────────────┘                                       │
│           ▼                                                 │
│  ┌──────────────────┐  Deterministic                        │
│  │ Stage B: Blind   │  evidence check   ──→ ACCEPT          │
│  │ (temp=0.7)       │  score >= 3?          (2 LLM calls)   │
│  │ No prior answer, │                                       │
│  │ reversed chunks  │  score < 3 ↓                          │
│  └──────────────────┘                                       │
│           ▼                                                 │
│  ┌──────────────────┐                                       │
│  │ Stage C: Judge   │  Sees all candidates                  │
│  │ (temp=0.0)       │  + actual source text ──→ ACCEPT      │
│  │ Verify quotes    │  Can extract corrected    (3 calls)   │
│  │ against source   │  answer itself                        │
│  └──────────────────┘                                       │
│                                                             │
│  Repeats up to N rounds; permanently uncertain fields       │
│  are not retried to avoid wasting cost                      │
└─────────────────────┬───────────────────────────────────────┘
                      │ corrected results
                      ▼
┌─────────────────────────────────────────────────────────────┐
│  LAYER E — ANALYTICS                                        │
│  ┌──────────────────┐ ┌─────────────┐ ┌─────────────────┐  │
│  │ Risk Scoring     │ │ Cross-Doc   │ │ Portfolio-Level │  │
│  │ • Covenant       │ │ Comparison  │ │ • Distributions │  │
│  │   Tightness      │ │ • Field Diff│ │ • Rankings      │  │
│  │ • Add-back       │ │ • Direction │ │ • Summary Table │  │
│  │   Aggressiveness │ │   (tighter/ │ │                 │  │
│  │ • Lender         │ │    looser)  │ │                 │  │
│  │   Protection     │ │             │ │                 │  │
│  └──────────────────┘ └─────────────┘ └─────────────────┘  │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│  OUTPUT — Streamlit Demo UI                                  │
│  • Single document extraction view with evidence panes      │
│  • Auto-generated term sheet with download                  │
│  • Gauge charts and radar profile for risk scores           │
│  • Side-by-side document comparison with direction badges   │
│  • Cross-model consensus view (OpenAI vs Claude)            │
│  • Portfolio analytics dashboard with charts and rankings   │
└─────────────────────────────────────────────────────────────┘

Extraction Schema

Every extracted field returns a structured record:

{
  "field_name": "max_first_lien_leverage",
  "value": "First Lien Leverage Ratio does not exceed 2.75:1.00 at incurrence",
  "numeric_value": 2.75,
  "operator": "<=",
  "units": "x",
  "scope": "incurrence test, pro forma",
  "page_numbers": [100, 101],
  "evidence_text": "Incremental ... may be incurred only if the First Lien Leverage Ratio ...",
  "status": "found",
  "confidence": 0.85,
  "exceptions": ["ratio tested on pro forma basis without netting cash proceeds"]
}
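
The record above maps naturally onto a typed model. A minimal sketch of the shape using standard-library dataclasses (the project itself uses Pydantic v2 in src/pci/models.py; this version just keeps the example self-contained):

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative dataclass mirroring the JSON record above.
# The real ExtractedField is a Pydantic v2 model; names follow the JSON keys.
@dataclass
class ExtractedField:
    field_name: str
    value: Optional[str] = None
    numeric_value: Optional[float] = None
    operator: Optional[str] = None        # e.g. "<=", ">="
    units: Optional[str] = None           # "x", "%", "USD", "bps"
    scope: Optional[str] = None
    page_numbers: list = field(default_factory=list)
    evidence_text: Optional[str] = None
    status: str = "not_found"             # "found" | "not_found" | "ambiguous"
    confidence: float = 0.0
    exceptions: list = field(default_factory=list)

f = ExtractedField(
    field_name="max_first_lien_leverage",
    numeric_value=2.75, operator="<=", units="x",
    status="found", confidence=0.85,
)
```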

The six extraction families cover 44 fields total:

| Family | Fields | Examples |
| --- | --- | --- |
| Pricing Terms | 6 | Base rate, applicable margin, default rate, commitment fee, rate floor |
| Leverage Covenants | 7 | Max total/secured/first-lien leverage, min interest/fixed-charge coverage, testing frequency |
| Debt Baskets | 8 | Incremental capacity, ratio-based basket, general debt, permitted liens, RP basket, builder basket |
| EBITDA Definitions | 8 | Definition, cost savings/restructuring/transaction/stock-comp add-backs, pro forma, cap, run-rate period |
| Collateral Package | 7 | Collateral description, pledge percentages, excluded assets, guarantor coverage |
| Events of Default | 8 | Payment/covenant/cross-default, bankruptcy, judgment, change of control, cure period, MAE |

Benchmark Results

Tested on three publicly filed SEC EDGAR credit agreements spanning different deal types, with both OpenAI and Anthropic models:

| Document | Pages | Model | Fields Found | Time | Tokens | Cost | Verified | Corrected |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Daseke Term Loan (leveraged) | 182 | GPT-5.1 | 38/44 | 80s | 246,606 | $0.74 | 6 flagged | 3 fixed |
| Daseke Term Loan (leveraged) | 182 | Claude Opus 4.6 | 30/44 | 270s | 164,192 | $3.18 | | |
| Daseke Term Loan (leveraged) | 182 | Claude Sonnet 4 | 30/44 | 224s | 159,354 | $0.58 | | |
| Royal Wolf A$125M Syndicated | 184 | GPT-5.1 | 23/44 | 74s | 107,931 | $0.32 | | |
| Pepco Holdings $200M Term Loan | 85 | GPT-5.1 | 15/44 | 55s | 100,443 | $0.29 | | |

Key observations:

  • Hybrid retrieval (semantic + keyword search) improved Daseke from 29/44 → 38/44 fields found (+31%). Pure embedding similarity misses exact defined terms (e.g., "Applicable Rate") buried in dense, multi-topic Definition section chunks. Keyword search catches these with zero additional API cost.
  • Domain-aware prompts teach the LLM about covenant-lite structures (incurrence tests vs. maintenance covenants), standard pledge conventions (domestic = 100% when only foreign is excluded at 65%), and terminology variants ("Applicable Rate" / "Applicable Margin" / "Applicable Spread"). These are general credit agreement patterns, not document-specific hacks.
  • Concurrent extraction cut Daseke processing from 135s → 48s (2.8x speedup) by running all 6 extractors in parallel.
  • Staged self-verification uses a three-stage escalation (informed → blind → source-grounded judge). On the Daseke deal: 5 fields resolved at Stage A, 1 required Stage C. Only 1 out of 6 flagged fields needed a judge call — 83% resolved by deterministic checks alone.
  • Claude produces fewer validation warnings (1 vs 3-5 for GPT) at higher cost, suggesting more careful citation grounding.
  • The investment-grade deal (Pepco) correctly returns not_found for EBITDA add-backs, debt baskets, and collateral — exactly what you'd expect from an unsecured utility term loan.
  • The 6 remaining not_found on Daseke are genuinely absent: no coverage covenants or cure rights (covenant-lite), no explicit guarantor threshold, no springing lien.
  • Cross-model consensus on the Daseke deal shows 39.5% strict text agreement (expected — models phrase the same facts differently) with 15/38 fields in exact agreement and 16 needing review for wording differences.

Verification & Evaluation

Extraction correctness is enforced through a five-layer verification system:

| Layer | What | How |
| --- | --- | --- |
| 1. Schema Enforcement | No malformed JSON | OpenAI Structured Outputs / Anthropic Tool Use |
| 2. Domain Validation | No out-of-range values or hallucinated citations | Numeric range checks + evidence substring matching |
| 3. Self-Verification | Agentic self-correction of flagged fields | 3-stage escalation: informed → blind → judge |
| 4. Evidence Traceability | Every field is auditable | page_numbers + evidence_text in every extraction |
| 5. Labeled Evaluation | Systematic accuracy measurement | 132 ground-truth fields, precision/recall/F1 |

Eval benchmark (3 documents, 132 fields): Precision 100% | Recall 94.7% | F1 97.3% | Numeric accuracy 96.5%

Design philosophy: precision over recall. A credit analyst would rather see "not found" than a wrong value.

Layer 2 — Domain Validation details
  • Numeric range checks (validate/numeric_rules.py): market-calibrated bounds per field type — leverage ratios must be 0–20x, margins 0–20%, pledge percentages 0–100%, run-rate periods 0–60 months. These catch gross errors like a dollar amount misread as a leverage ratio.
  • Citation verification (validate/citation_checks.py): for every found field, verifies that the evidence_text actually exists in the source chunks at the claimed page numbers. Uses exact substring matching first, then a 60% word-overlap fallback for minor LLM paraphrasing. This catches hallucinated citations — the model cannot claim evidence from text that doesn't exist in the document.
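
The citation check can be sketched in a few lines — exact substring first, word-overlap fallback second. Function names here are illustrative, not the actual validate/citation_checks.py API:

```python
def word_overlap(evidence: str, source: str) -> float:
    """Fraction of evidence words that also appear in the source text."""
    ev_words = evidence.lower().split()
    src_words = set(source.lower().split())
    if not ev_words:
        return 0.0
    return sum(w in src_words for w in ev_words) / len(ev_words)

def citation_ok(evidence: str, chunks: list[str], threshold: float = 0.6) -> bool:
    """Exact substring match first, then the 60% word-overlap fallback."""
    if any(evidence in chunk for chunk in chunks):   # quoted real text verbatim
        return True
    return any(word_overlap(evidence, c) >= threshold for c in chunks)

chunks = ["The Applicable Rate means 4.00% per annum for Term Benchmark Loans."]
citation_ok("Applicable Rate means 4.00%", chunks)      # exact substring → True
citation_ok("rate of 4.00% applies annually", chunks)   # overlap 0.4 < 0.6 → False
```

Because the fallback requires most of the quote's words to exist in the source, a fully fabricated citation scores near zero even if it sounds plausible.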
Layer 3 — Staged Self-Verification Loop details

This is the system's key agentic capability: after validation, the pipeline inspects its own output and decides what to re-examine. The design is informed by self-correction literature (surveyed in Kamoi et al. 2024, Huang et al. 2024, Dhuliawala et al. 2023) which shows that naive "please reconsider" prompts suffer from anchoring bias, while structured evidence-based verification improves accuracy reliably.

Three triggers identify fields needing review: validation warnings (failed numeric/citation checks), low confidence (< 0.7 on found fields), and ambiguous status.

For each flagged field, the system escalates through three stages:

Stage A — Informed verification (temperature 0.2, precise): The prior answer is treated as a falsifiable hypothesis, not a fact to confirm. The prompt asks the LLM to find the exact verbatim quote from the source text that proves or disproves the hypothesis. The response is then validated with deterministic evidence checks — the same citation verification and numeric range checks from Layer 2. If the returned quote is an exact substring of the source chunks (evidence score ≥ 3), the field is resolved without further LLM calls.

Stage B — Blind re-extraction (temperature 0.7, exploratory): Only reached when Stage A's evidence score is too low. The field is extracted from scratch with no prior answer shown and reversed chunk ordering to mitigate the "lost in the middle" positional bias documented in long-context retrieval research. Higher temperature encourages the model to explore different readings rather than repeating the same systematic error. The result goes through the same deterministic evidence checks.

Stage C — Source-grounded judge (temperature 0.0, deterministic): Only reached when both A and B produce weak evidence. This is not a generic "pick A or B" comparison — it is an informed re-extraction that sees all prior candidates AND the actual source text. The judge verifies each candidate's claimed evidence quote against the source, identifies which (if any) is correct, and can extract a corrected answer itself. This avoids the LLM judge reliability problems documented in the literature (position bias, verbosity bias) by grounding the decision in verifiable source text.
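
The escalation reduces to a short control loop. A sketch with the three stage prompts stubbed out as callables and the deterministic evidence score passed in as `score_fn` (all names illustrative):

```python
ACCEPT_SCORE = 3  # deterministic evidence threshold for a fast-pass accept

def verify_field(field, chunks, stage_a, stage_b, stage_c, score_fn):
    """Escalate one flagged field through informed -> blind -> judge stages.

    stage_a/b/c stand in for the real LLM calls at temps 0.2 / 0.7 / 0.0;
    score_fn is the deterministic evidence-scoring rubric.
    """
    candidate_a = stage_a(field, chunks)                 # informed verification
    if score_fn(candidate_a, chunks) >= ACCEPT_SCORE:
        return candidate_a, "stage_a"                    # 1 LLM call total

    candidate_b = stage_b(field, list(reversed(chunks)))  # blind, reversed order
    if score_fn(candidate_b, chunks) >= ACCEPT_SCORE:
        return candidate_b, "stage_b"                    # 2 LLM calls total

    judged = stage_c([candidate_a, candidate_b], chunks)  # source-grounded judge
    return judged, "stage_c"                             # 3 LLM calls total
```

Easy cases exit at the first `return`; only fields that fail two deterministic checks ever pay for the judge call.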

Evidence scoring rubric (deterministic, no LLM involved):

| Check | Points | Rationale |
| --- | --- | --- |
| Evidence quote is exact substring of source chunk | +3 | Strongest signal — the model quoted real text |
| Evidence quote has ≥ 60% word overlap with source | +2 | Allows minor paraphrasing while catching fabrications |
| Extracted value is non-null | +1 | Field has content |
| Page numbers match source chunk pages | +1 | Citation points to correct location |
| Evidence quote contains the numeric value | +1 | Internal consistency between value and quote |
| Status "found" but no evidence provided | -1 | Suspicious — claims to find something but shows no proof |

A score of ≥ 3 means the evidence is grounded in the actual document text — safe to accept without an LLM judge.
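
A direct transcription of the rubric (assuming the +3 and +2 checks are mutually exclusive with +3 taking precedence, and that the page check is a set intersection — both are this sketch's assumptions, not confirmed implementation details):

```python
def _overlap(quote: str, source: str) -> float:
    words = quote.lower().split()
    src = set(source.lower().split())
    return sum(w in src for w in words) / len(words) if words else 0.0

def evidence_score(result: dict, chunks: list[str], chunk_pages: set) -> int:
    """Deterministic evidence score per the rubric above — no LLM involved."""
    quote = result.get("evidence_text") or ""
    score = 0
    if quote and any(quote in c for c in chunks):
        score += 3                                    # exact substring of source
    elif quote and any(_overlap(quote, c) >= 0.6 for c in chunks):
        score += 2                                    # >= 60% word overlap
    if result.get("value") is not None:
        score += 1                                    # field has content
    if set(result.get("page_numbers", [])) & chunk_pages:
        score += 1                                    # citation points to right pages
    nv = result.get("numeric_value")
    if nv is not None and str(nv) in quote:
        score += 1                                    # quote contains the value
    if result.get("status") == "found" and not quote:
        score -= 1                                    # "found" with no proof
    return score
```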

Why three stages with different temperatures?

  • Low temperature (0.2) for Stage A: be precise about evidence, stick to what the text says.
  • High temperature (0.7) for Stage B: explore different interpretations, break out of self-consistent errors.
  • Zero temperature (0.0) for Stage C: make the most deterministic and consistent final judgment possible.

Why deterministic fast-pass filtering? The literature (Kamoi et al. 2024) shows that self-correction works reliably when feedback is external and concrete rather than the model's own belief. Our citation checks and numeric validators are exactly this: deterministic external signals that don't depend on LLM judgment. By using them as fast-pass filters, most fields are resolved cheaply (1 LLM call, 0 judge calls), and the expensive judge only runs for genuinely hard cases.

Benchmark on Daseke term loan (GPT-5.1):

  • 5 fields flagged → 3 resolved at Stage A, 1 at Stage B, 1 at Stage C
  • Only 1 out of 5 fields required a judge call (80% resolved by deterministic checks)
  • 97K extra tokens ($0.26) — significantly cheaper than a naive retry-everything approach
  • Fields that failed both A and B are marked permanently uncertain to avoid wasting cost on genuinely ambiguous provisions

The --no-verify flag disables verification when speed or cost matters more than accuracy. The verification_summary in the output JSON records exactly what was flagged, which stage resolved it, and the evidence scores, making the self-correction process fully auditable.

Layer 5 — Labeled Evaluation Framework details
  • Ground-truth annotations (eval/ground_truth.py): human-annotated expected values for all 44 fields across 3 documents (132 total), with expected status, numeric value, page numbers, and tolerance thresholds.
  • Per-field graders (eval/graders.py): status match (found/not_found), numeric accuracy (within configurable tolerance), and page overlap (F1-style harmonic mean with +/-3 page tolerance).
  • Evaluation reporter (eval/run_eval.py): computes precision, recall, F1 per extraction family and overall, with detailed mismatch listings.
python eval/run_eval.py

| Metric | Score | Notes |
| --- | --- | --- |
| Status accuracy | 89.4% | 118/132 fields correct |
| Precision | 100.0% | Zero false positives — when the system says "found", it's always right |
| Recall | 94.7% | Misses are mostly "ambiguous" instead of "found" |
| F1 | 97.3% | |
| Numeric accuracy | 96.5% | Extracted values within tolerance of ground truth |
| Page overlap | 88.7% | Page citations reliably point to correct source text |
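
The page-overlap grade described for eval/graders.py can be sketched as an F1-style harmonic mean over tolerant page matches (the ±3 tolerance is from the text; treating two empty lists as a perfect match is this sketch's convention):

```python
def page_overlap_f1(predicted: list, expected: list, tol: int = 3) -> float:
    """F1-style page-citation grade: a page 'hits' if within +/-tol of a match."""
    if not predicted and not expected:
        return 1.0          # nothing expected, nothing cited (sketch convention)
    if not predicted or not expected:
        return 0.0
    hit_pred = sum(any(abs(p - e) <= tol for e in expected) for p in predicted)
    hit_exp = sum(any(abs(e - p) <= tol for p in predicted) for e in expected)
    precision = hit_pred / len(predicted)
    recall = hit_exp / len(expected)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

page_overlap_f1([100, 101], [101])   # both citations within tolerance → 1.0
```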

Risk Scoring Methodology

The scoring module converts extracted terms into three quantitative 0–100 scores — Covenant Tightness, EBITDA Add-back Aggressiveness, and Lender Protection — with deal-type-aware weighting (leveraged vs. IG unsecured vs. syndicated, auto-detected from extraction patterns).

Overall = W_cov x Covenant + W_add x (100 - Addback) + W_prot x Protection

| Document | Type | Covenant | Add-back | Protection | Overall |
| --- | --- | --- | --- | --- | --- |
| Pepco Holdings (IG utility) | ig_unsecured | 45 | 20 | 25 | 41 |
| Royal Wolf (syndicated) | syndicated | 29 | 43 | 54 | 46 |
| Daseke (leveraged) | leveraged | 5 | 71 | 50 | 29 |

Scoring component breakdown

Covenant Tightness (0 = very loose, 100 = very tight)

| Component | Max Points | Logic |
| --- | --- | --- |
| Max total leverage ratio | 25 | Linear: 2x (tight) = 25, 8x (loose) = 0 |
| Min interest coverage | 20 | Linear: 4x (tight) = 20, 1x (loose) = 0 |
| Min fixed charge coverage | 15 | Linear: 3x (tight) = 15, 1x (loose) = 0 |
| Testing frequency | 15 | Maintenance (quarterly) = 15, incurrence-only = 5 |
| Equity cure rights | 10 | No cure = 10 (tighter), cure available = 3 |
| Margin signal | 15 | Higher spread implies riskier credit → tighter package |

EBITDA Add-back Aggressiveness (0 = conservative, 100 = aggressive)

| Component | Max Points | Logic |
| --- | --- | --- |
| Add-back cap | 30 | 0% cap = 0, 35% cap = 30; no cap = 20 (aggressive default) |
| Run-rate period | 20 | 12 months = 0, 36 months = 20 |
| Add-back breadth | 30 | Counts 4 add-back types (cost savings, restructuring, transaction, stock comp) |
| Pro-forma adjustments | 20 | Present = 15, absent = 0 |

Lender Protection (0 = weak, 100 = very strong)

| Component | Max Points | Logic |
| --- | --- | --- |
| Collateral coverage | 20 | "Substantially all assets" = 20, partial = 10, none = 0 |
| Domestic pledge % | 15 | Linear 0–100% |
| Events-of-default breadth | 25 | Counts 6 EoD trigger types |
| Cross-default strictness | 15 | $10M threshold (strict) = 15, $100M (loose) = 0 |
| Basket restrictions | 25 | Counts 4 basket types (defined baskets = restrictions exist) |

Deal-Type-Aware Weighting

| Deal Type | Covenant | Add-back | Protection | Rationale |
| --- | --- | --- | --- | --- |
| Leveraged | 30% | 35% | 35% | Add-back aggressiveness is the primary risk |
| IG Unsecured | 50% | 10% | 40% | Covenants dominate; add-backs are irrelevant |
| Syndicated | 35% | 30% | 35% | Balanced (default) |

Deal type is auto-detected from extraction patterns: has EBITDA add-backs + collateral = leveraged; no collateral + no EBITDA = IG unsecured. Missing fields receive contextual annotations — e.g., "IG unsecured: no collateral expected (not a protection gap)" — so scores reflect deal economics rather than extraction gaps.
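
Combining the formula with the deal-type weights reproduces the published Overall scores to within rounding. A sketch, with weights and inputs taken from the tables in this section:

```python
# Deal-type weights (covenant, add-back, protection) from the weighting table.
WEIGHTS = {
    "leveraged":    (0.30, 0.35, 0.35),
    "ig_unsecured": (0.50, 0.10, 0.40),
    "syndicated":   (0.35, 0.30, 0.35),
}

def overall_score(deal_type, covenant, addback, protection):
    w_cov, w_add, w_prot = WEIGHTS[deal_type]
    # Add-back aggressiveness is inverted: aggressive add-backs reduce Overall.
    return w_cov * covenant + w_add * (100 - addback) + w_prot * protection

overall_score("leveraged", 5, 71, 50)       # ≈ 29.15 (table: 29)
overall_score("ig_unsecured", 45, 20, 25)   # = 40.5  (table: 41)
overall_score("syndicated", 29, 43, 54)     # ≈ 46.15 (table: 46)
```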


Project Structure
private-credit-intel/
├── app/
│   └── demo_ui.py              # Streamlit demo (single doc, compare, portfolio)
│
├── src/pci/
│   ├── models.py               # Pydantic domain models (44 extraction fields)
│   ├── config.py               # Settings (API keys, model params, chunking)
│   ├── pipeline.py             # Main orchestrator: ingest → retrieve → extract → validate → verify
│   │
│   ├── ingest/
│   │   ├── loaders.py          # PDF text extraction (PyMuPDF)
│   │   └── section_parser.py   # TOC detection, regex heading parse, LLM fallback
│   │
│   ├── retrieve/
│   │   ├── chunking.py         # Section-aware text chunking with overlap
│   │   └── search.py           # Hybrid search index (semantic + keyword)
│   │
│   ├── extract/
│   │   ├── base.py             # Dual-backend extraction (OpenAI + Anthropic)
│   │   ├── schemas.py          # JSON schemas + system prompts for all 6 families
│   │   ├── pricing.py          # Pricing terms extractor
│   │   ├── covenants.py        # Leverage / coverage covenants extractor
│   │   ├── baskets.py          # Debt / lien / investment / RP baskets extractor
│   │   ├── ebitda.py           # EBITDA definitions & add-backs extractor
│   │   ├── collateral.py       # Collateral & guarantor package extractor
│   │   ├── defaults.py         # Events of default extractor
│   │   └── verifier.py         # Self-verification loop (agentic re-extraction + LLM-as-judge)
│   │
│   ├── validate/
│   │   ├── numeric_rules.py    # Range checks per family (%, x, USD, bps)
│   │   └── citation_checks.py  # Evidence text ↔ source chunk verification
│   │
│   ├── analytics/
│   │   ├── scoring.py          # Covenant tightness, add-back aggressiveness, lender protection
│   │   ├── compare.py          # Cross-document field diff with direction analysis
│   │   ├── consensus.py        # Cross-model consensus (OpenAI vs Anthropic agreement)
│   │   └── portfolio.py        # Portfolio-level distributions, rankings, summary table
│   │
│   ├── termsheet/
│   │   └── generator.py        # Auto-generate markdown term sheet from extractions
│   │
│   └── storage/
│       └── repo.py             # JSON persistence (provider + model in filenames)
│
├── eval/
│   ├── ground_truth.py        # Human-annotated labels for 3 documents (132 fields)
│   ├── graders.py             # Status, numeric, and page-overlap graders
│   └── run_eval.py            # Evaluation runner with precision/recall/F1 reporting
│
├── scripts/
│   ├── run_single_doc_openai.py    # OpenAI extraction runner
│   ├── run_single_doc_anthropic.py # Anthropic extraction runner
│   └── run_consensus.py        # Cross-model consensus runner
│
├── tests/
├── input/                      # Source PDFs
├── output/                     # Extraction results (JSON + term sheets)
└── pyproject.toml

Quick Start

1. Install dependencies

pip install -e .

2. Set your API keys

cp .env.example .env
# Edit .env — set at minimum OPENAI_API_KEY (required for embeddings)
# Optionally set ANTHROPIC_API_KEY for Claude-based extraction
# Set LLM_PROVIDER=openai or LLM_PROVIDER=anthropic

3. Run extraction on a credit agreement

# With OpenAI (default) — includes self-verification loop
python scripts/run_single_doc_openai.py path/to/agreement.pdf --score --termsheet

# With Anthropic Claude
python scripts/run_single_doc_anthropic.py path/to/agreement.pdf --score

# Skip self-verification for faster/cheaper runs
python scripts/run_single_doc_openai.py path/to/agreement.pdf --no-verify

# Cross-model consensus (runs both, compares field-by-field)
python scripts/run_consensus.py path/to/agreement.pdf

4. Launch the demo UI

streamlit run app/demo_ui.py

Key Design Decisions

Why Structured Outputs over prompt-only JSON?

The 2024 prototype used prompt-based extraction with create_extraction_chain, which frequently produced malformed JSON, hallucinated fields, and required brittle parser workarounds. OpenAI's Structured Outputs guarantee schema adherence — the response is always valid, typed, and parseable. This eliminated an entire class of runtime errors.

Why hybrid retrieval (semantic + keyword)?

Pure embedding-based semantic search has a known blind spot: when a short, high-value definition (e.g., "Applicable Rate" means 4.00% for Term Benchmark Loans) is embedded in a 4000-character chunk containing 20+ other unrelated definitions, the embedding becomes a diluted average of all topics. The target definition's semantic signal is overwhelmed. In testing, the chunk containing "Applicable Rate" ranked 19th out of 475 chunks for our best semantic query — well below the top-10 cutoff.

Keyword search solves this with zero additional API cost: it scans chunk text for exact term matches (e.g., "Applicable Rate", "Material Adverse Effect", "Incremental Facility") and returns chunks ranked by keyword density. Each extractor defines a set of domain-standard keywords that cover the terminology variants used across credit agreements. The semantic and keyword results are merged and deduplicated before being sent to the LLM.

This is a lightweight implementation of the hybrid retrieval pattern (dense + sparse) widely used in production RAG systems (e.g., BM25 + dense retrieval with reciprocal rank fusion). The keyword component adds ~0ms latency (simple string matching over hundreds of chunks) and typically supplements the semantic results with 2-5 additional chunks containing exact defined terms that embeddings miss.
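
A minimal sketch of the merge step. Keyword-density scoring and append-after-semantic merging are illustrative simplifications of retrieve/search.py; the real index also carries embedding vectors:

```python
def keyword_score(chunk: str, keywords: list) -> int:
    """Count exact (case-insensitive) keyword hits in a chunk."""
    text = chunk.lower()
    return sum(text.count(k.lower()) for k in keywords)

def hybrid_retrieve(chunks: list, semantic_top: list, keywords: list, k: int = 10):
    """Merge semantic results with keyword hits, dedup, cap at k.

    semantic_top: chunk indices already ranked by cosine similarity.
    """
    kw_ranked = sorted(
        (i for i, c in enumerate(chunks) if keyword_score(c, keywords) > 0),
        key=lambda i: keyword_score(chunks[i], keywords),
        reverse=True,
    )
    merged = list(semantic_top)
    for i in kw_ranked:                  # supplement with exact-term chunks
        if i not in merged:
            merged.append(i)
    return merged[:k]

docs = ["General provisions ...", '"Applicable Rate" means 4.00% ...', "Notices ..."]
hybrid_retrieve(docs, semantic_top=[0], keywords=["Applicable Rate"])  # → [0, 1]
```

The keyword pass recovers the defined-term chunk that pure embedding ranking buried, at the cost of a string scan.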

Why section-aware chunking?

Fixed-character chunking (the 2024 approach) splits text at arbitrary positions, often breaking a covenant definition across two chunks. Section-aware chunking respects document structure: chunks never cross Article or Section boundaries, so retrieval returns topically coherent context.
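
The idea can be sketched as: split within each section only, carrying overlap between consecutive chunks but never across a section boundary (sizes and function names are illustrative):

```python
def chunk_section(text: str, max_chars: int = 4000, overlap: int = 200) -> list:
    """Split one section into overlapping chunks; never crosses the section."""
    if len(text) <= max_chars:
        return [text]
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break
        start += max_chars - overlap     # overlap with the previous chunk
    return chunks

def chunk_document(sections: dict) -> list:
    """Chunk each {heading: body} section independently, tagging by section."""
    return [(name, c) for name, body in sections.items()
            for c in chunk_section(body)]
```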

Why evidence grounding?

Every ExtractedField carries evidence_text and page_numbers. This serves two purposes:

  1. Validation: the citation check layer verifies that the evidence text actually appears in the source chunks.
  2. Trust: users can click "View Evidence" in the UI to see exactly what the model read.
Why risk scoring?

Raw extracted terms are useful for individual deal review, but portfolio-level analysis requires quantitative comparison. The scoring module converts extracted terms into normalized 0–100 scores across three dimensions (covenant tightness, add-back aggressiveness, lender protection), enabling cross-deal ranking and distribution analysis.

Why dual-provider (OpenAI + Anthropic)?

Both providers achieve structured extraction through different mechanisms: OpenAI uses JSON Schema via text.format, while Claude uses tool use with input_schema. The same JSON schema works for both — no duplication needed. Running both providers on the same document enables cross-model consensus: fields where both models agree have high confidence, while disagreements flag fields for human review. This provides a confidence signal without requiring ground-truth labels.
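
To illustrate the "one schema, two providers" point, both request payloads can be built from the same JSON Schema dict. The payload shapes below are simplified approximations of each provider's structured-output mechanism, not verbatim SDK calls:

```python
FIELD_SCHEMA = {
    "type": "object",
    "properties": {
        "field_name": {"type": "string"},
        "numeric_value": {"type": ["number", "null"]},
        "evidence_text": {"type": ["string", "null"]},
    },
    "required": ["field_name", "numeric_value", "evidence_text"],
    "additionalProperties": False,
}

# OpenAI: the schema goes under the response text format (Structured Outputs).
openai_format = {"type": "json_schema", "name": "extracted_field",
                 "schema": FIELD_SCHEMA, "strict": True}

# Anthropic: the same schema object becomes a tool's input_schema (Tool Use).
anthropic_tool = {"name": "extracted_field",
                  "description": "Return one extracted field",
                  "input_schema": FIELD_SCHEMA}

# One schema object, referenced twice — no duplication to keep in sync.
assert openai_format["schema"] is anthropic_tool["input_schema"]
```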

Why concurrent extraction?

The 6 extraction families are independent — each retrieves its own chunks and makes its own LLM call. Running them concurrently via ThreadPoolExecutor cuts extraction time by ~2.5x (135s → 48s for a 182-page document). The ChunkIndex is made thread-safe with a threading.Lock on lazy embedding initialization, and embeddings are pre-warmed before concurrent access.
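
The fan-out is plain ThreadPoolExecutor usage. A sketch with the extractor functions stubbed out (the real ones each do retrieval plus an LLM call):

```python
from concurrent.futures import ThreadPoolExecutor

def run_extractors(extractors: dict, chunks: list) -> dict:
    """Run all extraction families concurrently; each makes its own LLM call."""
    with ThreadPoolExecutor(max_workers=len(extractors)) as pool:
        futures = {name: pool.submit(fn, chunks) for name, fn in extractors.items()}
        return {name: fut.result() for name, fut in futures.items()}

# Stubs standing in for two of the six real families:
extractors = {
    "pricing":   lambda chunks: {"family": "pricing", "n_chunks": len(chunks)},
    "covenants": lambda chunks: {"family": "covenants", "n_chunks": len(chunks)},
}
run_extractors(extractors, ["chunk-1", "chunk-2"])
```

Since each family is I/O-bound (waiting on the API), threads are sufficient; no multiprocessing needed.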

Why a staged verification loop instead of just running twice?

A naive approach to improving accuracy would be to run extraction twice and take the consensus. But this doubles cost without targeting the actual problems. The self-correction literature also shows that simply asking a model to "reconsider" its answer often reinforces mistakes through anchoring bias (Huang et al. 2024), and that blind retries can repeat systematic errors when retrieval and context are unchanged.

Our staged design addresses both problems:

  1. Selective: only re-extracts fields that failed validation, have low confidence, or are ambiguous — typically 5-15% of all fields, not 100%.
  2. Evidence-constrained: Stage A treats the prior answer as a hypothesis to be proven from the text, not a fact to confirm. This converts self-correction from an introspective task ("am I right?") into a verification task ("can I find the evidence?"), which the literature shows is more reliable.
  3. Independent second opinion: Stage B uses blind re-extraction with reversed chunk ordering and higher temperature — maximally independent from the first attempt.
  4. Deterministic fast-pass: most fields are resolved by checking whether the evidence quote actually exists in the source text. No LLM judge needed for clear-cut cases.
  5. Source-grounded judge as fallback: when deterministic checks can't resolve, the judge sees all candidates AND the actual source text, so it can verify quotes rather than just comparing "which sounds better."

This is the core agentic behavior in the system: the pipeline inspects its own output, identifies weaknesses, and takes corrective action — the hallmark of an agent versus a static pipeline. The escalation design means easy cases are cheap (1 LLM call) and only hard cases pay for the full 3-call pipeline.

Why not a multi-agent framework?

Agent orchestration adds complexity without proportional value for this pipeline. The architecture is a clean sequential pipeline (ingest → retrieve → extract → validate → verify → score) with no branching or dynamic routing needed. The self-verification loop adds targeted agentic self-correction where it matters most, without requiring a full agent orchestration framework. Keeping it simple makes the code auditable and the cost predictable.


Technical Stack

| Component | Technology | Why |
| --- | --- | --- |
| PDF parsing | PyMuPDF (fitz) | Fast, handles scanned PDFs, preserves page structure |
| Embeddings | OpenAI text-embedding-3-small | Cost-effective, good retrieval quality |
| Extraction | OpenAI Structured Outputs + Anthropic Tool Use | Dual-provider, same JSON schema, schema-guaranteed |
| Models | Pydantic v2 | Type safety, JSON serialization, schema generation |
| Search | NumPy cosine similarity + keyword matching | Hybrid retrieval, no external vector DB dependency |
| Validation | Custom rule engine | Domain-specific numeric and citation checks |
| UI | Streamlit + Plotly | Interactive demo with charts, no frontend build step |
| Config | pydantic-settings + dotenv | Type-safe settings, no hardcoded credentials |
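The hybrid retrieval row can be illustrated with a minimal sketch: cosine similarity over embedding vectors blended with a keyword hit-rate. The function names, `alpha` weight, and toy 2-d vectors are illustrative assumptions, not the project's actual API:

```python
import numpy as np

def cosine_scores(query_vec: np.ndarray, chunk_vecs: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query embedding and each row of chunk_vecs."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    return c @ q

def hybrid_scores(query_vec, chunk_vecs, chunks, keywords, alpha=0.7):
    """Blend semantic similarity with keyword hit-rate (alpha weights the semantic side)."""
    sem = cosine_scores(query_vec, chunk_vecs)
    kw = np.array([
        sum(k.lower() in chunk.lower() for k in keywords) / max(len(keywords), 1)
        for chunk in chunks
    ])
    return alpha * sem + (1 - alpha) * kw

chunks = ["The Total Net Leverage Ratio covenant shall...", "Events of Default include..."]
keywords = ["leverage", "covenant"]  # domain-specific terms for this field
query_vec = np.array([1.0, 0.0])    # toy 2-d stand-ins for real embeddings
chunk_vecs = np.array([[0.9, 0.1], [0.1, 0.9]])
scores = hybrid_scores(query_vec, chunk_vecs, chunks, keywords)
assert scores[0] > scores[1]  # the covenant chunk wins on both signals
```

Keeping retrieval in plain NumPy avoids an external vector-database dependency for document-scale corpora.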
What Changed from the 2024 Prototype

| Aspect | 2024 Prototype | 2026 Rebuild |
| --- | --- | --- |
| Section detection | LLM-only TOC parsing (5+ API calls) | Regex-first with TOC skip + LLM fallback |
| Chunking | Fixed 4000-char splits | Section-aware with configurable overlap |
| Retrieval | Semantic-only embedding search | Hybrid (semantic + keyword) with domain-specific term lists |
| Extraction | create_extraction_chain + prompt batching | Structured Outputs (schema-guaranteed), concurrent |
| LLM Provider | OpenAI only | Dual-provider (OpenAI + Anthropic) with consensus |
| Output | Summary paragraphs, no source links | 44 normalized fields with evidence + auto-generated term sheets |
| Validation | None | Numeric range checks + citation verification + agentic self-correction |
| Evaluation | None | Labeled benchmark (132 fields), precision/recall/F1 reporting |
| Analytics | None | Risk scoring + cross-doc comparison + cross-model consensus + portfolio analytics |
| Persistence | Pickle files | JSON with Pydantic serialization (provider/model in filenames) |
| Credentials | Hardcoded API keys | Environment variables via dotenv |
| UI | Desktop-only Tkinter | Web-based Streamlit with Plotly charts, term sheet download |
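The move from fixed 4000-char splits to section-aware chunking with configurable overlap can be sketched as follows. This is a hedged illustration — the function name, defaults, and windowing strategy are assumptions, not the project's actual code — but it shows why overlap matters: a clause that straddles a window boundary still appears whole in at least one chunk.

```python
def chunk_section(text: str, max_chars: int = 4000, overlap: int = 200) -> list[str]:
    """Split one section into overlapping windows.

    Chunking per section (rather than over the whole document) keeps
    each chunk inside a single article of the agreement, so retrieval
    never returns a chunk that mixes, say, covenants with definitions.
    """
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    if len(text) <= max_chars:
        return [text]
    chunks, start, step = [], 0, max_chars - overlap
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += step
    return chunks

section = "x" * 5000
parts = chunk_section(section)
assert len(parts) == 2
# The tail of chunk 0 is repeated at the head of chunk 1 (the overlap).
assert parts[0][-200:] == parts[1][:200]
```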
References — Self-Verification Design (9 papers)

The staged self-verification loop is informed by findings from the LLM self-correction, anchoring bias, and information extraction literature. Below are the key papers that shaped the design and how each influenced specific engineering decisions.

Why naive self-correction fails (motivating the staged design)

  • Kamoi et al. (2024). "When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs." TACL. Key finding: self-correction reliably improves only when there is reliable external feedback, not when the model merely critiques its own output. This motivated our deterministic citation/numeric checks as the primary decision mechanism. Paper

  • Huang et al. (2024). "Large Language Models Cannot Self-Correct Reasoning Yet." ICLR. LLMs often cannot self-correct without external feedback, and performance can degrade. This is why Stage A reframes correction as an evidence verification task. Paper

  • Tyen et al. (2024). "LLMs Cannot Find Reasoning Errors, But Can Correct Them!" Models are poor at detecting errors but good at correcting them once located. This validates using deterministic validators to locate problems, then asking the LLM to fix flagged fields. Paper

Why evidence-constrained verification works (motivating Stage A)

  • Dhuliawala et al. (2023). "Chain-of-Verification (CoVe) Reduces Hallucination in Large Language Models." Verification questions must be answered independently to avoid anchoring — directly inspired Stage A's evidence-first design. Paper

  • Gero et al. (2023). "Self-Verification Improves Few-Shot Clinical Information Extraction." Reports F1 gain of +0.056 with evidence-grounded verification in information extraction. Paper

Why blind retry uses reversed chunks (motivating Stage B)

  • Liu et al. (2024). "Lost in the Middle: How Language Models Use Long Contexts." LLMs have degraded recall for middle-positioned information. Stage B reverses chunk ordering to break positional bias. Paper

Why we minimize LLM judge reliance (motivating deterministic adjudication)

  • Zheng et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." Documents position, verbosity, and self-enhancement bias in LLM judges. Paper

Why sampling diversity helps (motivating temperature variation)

  • Wang et al. (2023). "Self-Consistency Improves Chain of Thought Reasoning in Language Models." ICLR. Multiple reasoning paths improve accuracy — supports temperature variation across stages. Paper

  • Manakul et al. (2023). "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection." Independent re-sampling reveals instability correlated with errors. Paper


Disclaimer

This is a personal project built for educational and research purposes. It is not affiliated with, endorsed by, or representative of any employer, past or present. No proprietary data, confidential information, or employer IP was used in this project.

  • Accuracy: LLM-based extraction is inherently imperfect. Outputs may contain errors, hallucinations, or misinterpretations. Always verify results with a qualified professional before relying on them for any financial, legal, or business decision.
  • Benchmark data: All credit agreements used for testing are publicly available SEC EDGAR filings.
  • Not financial advice: Nothing in this project constitutes financial, legal, or investment advice.

License

MIT License — see LICENSE for details.
