🛡️ llm-eval-framework

LLM Evaluation & Validation Framework for Financial Services

An opinionated, local-first auditing suite for SR 11-7 and EU AI Act Compliance.

⚡ No-Install Quick Start

Don't want to clone the repo? Use one of these:

Option	Best for	Link
Google Colab	Running with a free GPU, easy sharing
HuggingFace Space	Zero-setup browser demo, HR / executive review
Local CLI	Production use, offline, Apple Silicon / CUDA	See Getting Started below

📖 Overview

"In banking, 'it works' is not a valid test result."

This framework automates the Model Risk Management artifacts that regulators actually require — turning raw LLM outputs into SR 11-7 compliant evidence. Built by a practitioner who has validated LLMs at a Fortune 50 financial institution.

Sample Report

(The glass-morphic, dynamic HTML report generated after validation)

This framework is local-first, optimized for on-premise hardware environments, ensuring that sensitive financial data never leaves your infrastructure during the validation process.

The Problem it Solves

You must prove why a model works, how it fails, and where it sits in the regulatory landscape (SR 11-7, OCC 2011-12, EU AI Act). This framework automates the generation of that statistical and visual proof.

🏗️ System Architecture

The framework follows a modular "Auditor-in-the-Loop" design:

Inference Wrapper: Standardized local inference using hardware-accelerated libraries.
The Registry: A YAML source of truth mapping metrics to regulatory clauses.
Evaluator Modules:
- Accuracy: Financial-F1 (Entity extraction integrity for tickers/amounts).
- Adversarial: 50+ Red-teaming templates (Jailbreaks/PII leaks).
- Explainability: SHAP/LIME token-level attribution.
Reporting Engine: Jinja2-based generator for "Committee-Ready" HTML/PDF reports.

📂 Project Structure

llm-eval-framework/
├── src/
│   ├── core/
│   │   └── interfaces.py             # Abstract base classes (SaliencyProvider)
│   ├── providers/
│   │   ├── mlx_provider.py           # MLX (Apple Silicon) saliency backend
│   │   └── torch_provider.py         # PyTorch / CUDA saliency backend
│   ├── services/
│   │   ├── accuracy_service.py       # Financial-F1 & entity extraction logic
│   │   ├── adversarial_service.py    # Prompt injection & red-teaming suite
│   │   ├── explainability_service.py # Attention-based token saliency
│   │   ├── report_service.py         # Jinja2 HTML report compiler
│   │   └── conflict_service.py       # Regulatory paradox detection
│   └── utils/
│       └── mapper.py                 # YAML metric-to-regulation mapper
├── configs/
│   ├── regulatory_mapping.yaml       # Bridges Metrics -> SR 11-7 / EU AI Act
│   └── system_prompts.yaml           # Hardened guardrails for local models
├── data/
│   ├── adversarial_library/          # 50+ JSON-based jailbreak templates
│   └── gold_standard/                # Reference datasets for finance
├── hf_space/
│   ├── app.py                        # Gradio web UI (HuggingFace Spaces)
│   └── requirements.txt              # Space-specific dependencies
├── reports/
│   ├── plots/                        # Generated saliency visualizations
│   └── templates/                    # Jinja2 HTML report template
├── scripts/
│   ├── download_model.py             # Hardware-aware model downloader
│   └── push_to_hf.sh                 # HuggingFace Space deploy script
├── tests/
│   ├── unit/                         # One file per service/provider
│   ├── integration/                  # End-to-end structural tests
│   └── test_data/                    # Fixture files (JSON test data)
├── notebooks/                        # Google Colab demo notebooks
├── main.py                           # CLI Entry Point
├── pyproject.toml                    # Ruff lint config + pytest settings
├── requirements.txt                  # Local-first dependencies
├── requirements-dev.txt              # Test-only dependencies (incl. ruff)
├── VERSION                           # Current version string
└── CHANGELOG.md                      # Keep a Changelog format

🚀 Getting Started

1. Prerequisites

Python 3.11+
Hardware Acceleration (e.g., CUDA/Metal - Optional)
Any Modern IDE

2. Installation

# Clone the repository
git clone https://github.com/mauryasameer/llm_eval.git
cd llm-eval-framework

# Setup Virtual Environment
python -m venv venv
source venv/bin/activate

# Install Core Dependencies (Auto-detects MLX for Mac or Transformers/Torch for CUDA)
pip install -r requirements.txt

3. Download a Local Model

Use the built-in CLI to download a model from HuggingFace and cache it locally:

# See recommended models for your platform
python scripts/download_model.py --list

# Download the default recommended model (auto-detects your hardware)
python scripts/download_model.py --model mlx-community/Llama-3.2-3B-Instruct-4bit

# Force a specific backend
python scripts/download_model.py --model Qwen/Qwen2.5-0.5B-Instruct --backend transformers

Models are cached in ~/.cache/huggingface/ and work fully offline after the first download.

4. Running a Validation Audit

# Run the full validation suite (Accuracy + Adversarial)
python main.py --eval all --model <local_model_path>

🛠️ The 5 Core Modules

1. Accuracy Evaluator (Financial-F1)

Standard NLP metrics ignore "Precision Severity." This module extracts Tickers, ISINs, and Monetary values, penalizing a $10B vs $10M error significantly higher than a grammatical one.

2. Adversarial Tester (Red-Teaming)

A library of 50+ adversarial templates including:

Fiduciary Bypass: Attempts to force unauthorized investment advice.
Data Leak Persona: Trick the model into revealing mock PII.
System Override: Testing resistance to instructions like "Ignore all previous safety rules."

3. Explainability Auditor

Generates SHAP plots showing which input tokens (e.g., "Interest Rate," "Default") most heavily influenced the model's decision. Required for Interpretability under SR 11-7 Section 3.3.

4. Regulatory Mapping

A YAML bridge that tags every technical test with a legal requirement:

financial_f1 ➡️ SR 11-7 Section 3.2
injection_pass_rate ➡️ EU AI Act Article 15

5. HTML Validation Report

A professional, glass-morphic report featuring:

Executive Summary: Pass/Fail badges.
Risk Heatmap: Visualizing Accuracy vs. Safety.
Audit Trail: A table linking model responses directly to regulatory controls.

🛡️ Compliance Standards

Regulation	Module	Focus Area
SR 11-7	Accuracy / Explainability	Model Soundness & Interpretability
EU AI Act	Adversarial / Security	Robustness & Cybersecurity
OCC 2011-12	Reporting	Audit Trail & Documentation

⚖️ License

Distributed under the MIT License. See LICENSE for more information.

📈 Social Support

If you're using this to harden your local LLMs, tag us! #GenerativeAI #ModelRisk #SR117 #OpenSource #LocalLLM

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🛡️ llm-eval-framework

LLM Evaluation & Validation Framework for Financial Services

⚡ No-Install Quick Start

📖 Overview

Sample Report

The Problem it Solves

🏗️ System Architecture

📂 Project Structure

🚀 Getting Started

1. Prerequisites

2. Installation

3. Download a Local Model

4. Running a Validation Audit

🛠️ The 5 Core Modules

1. Accuracy Evaluator (Financial-F1)

2. Adversarial Tester (Red-Teaming)

3. Explainability Auditor

4. Regulatory Mapping

5. HTML Validation Report

🛡️ Compliance Standards

⚖️ License

📈 Social Support

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 66 Commits
.github/workflows		.github/workflows
assets		assets
configs		configs
data		data
hf_space		hf_space
notebooks		notebooks
reports		reports
scripts		scripts
src		src
tests		tests
var/folders/m7/8kryggrn1g9ck47td9jbzn4h0000gn/T		var/folders/m7/8kryggrn1g9ck47td9jbzn4h0000gn/T
.gitignore		.gitignore
.mailmap		.mailmap
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
VERSION		VERSION
check_gpu.py		check_gpu.py
conftest.py		conftest.py
main.py		main.py
pyproject.toml		pyproject.toml
requirements-dev.txt		requirements-dev.txt
requirements.txt		requirements.txt
task.md		task.md

Folders and files

Latest commit

History

Repository files navigation

🛡️ llm-eval-framework

LLM Evaluation & Validation Framework for Financial Services

⚡ No-Install Quick Start

📖 Overview

Sample Report

The Problem it Solves

🏗️ System Architecture

📂 Project Structure

🚀 Getting Started

1. Prerequisites

2. Installation

3. Download a Local Model

4. Running a Validation Audit

🛠️ The 5 Core Modules

1. Accuracy Evaluator (Financial-F1)

2. Adversarial Tester (Red-Teaming)

3. Explainability Auditor

4. Regulatory Mapping

5. HTML Validation Report

🛡️ Compliance Standards

⚖️ License

📈 Social Support

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages