An opinionated, local-first auditing suite for SR 11-7 and EU AI Act Compliance.
Don't want to clone the repo? Use one of these:
| Option | Best for | Link |
|---|---|---|
| Google Colab | Running with a free GPU, easy sharing | |
| HuggingFace Space | Zero-setup browser demo, HR / executive review | |
| Local CLI | Production use, offline, Apple Silicon / CUDA | See Getting Started below |
"In banking, 'it works' is not a valid test result."
This framework automates the Model Risk Management artifacts that regulators actually require — turning raw LLM outputs into SR 11-7 compliant evidence. Built by a practitioner who has validated LLMs at a Fortune 50 financial institution.
(The glass-morphic, dynamic HTML report generated after validation)
This framework is local-first, optimized for on-premise hardware environments, ensuring that sensitive financial data never leaves your infrastructure during the validation process.
You must prove why a model works, how it fails, and where it sits in the regulatory landscape (SR 11-7, OCC 2011-12, EU AI Act). This framework automates the generation of that statistical and visual proof.
The framework follows a modular "Auditor-in-the-Loop" design:
- Inference Wrapper: Standardized local inference using hardware-accelerated libraries.
- The Registry: A
YAMLsource of truth mapping metrics to regulatory clauses. - Evaluator Modules:
- Accuracy: Financial-F1 (Entity extraction integrity for tickers/amounts).
- Adversarial: 50+ Red-teaming templates (Jailbreaks/PII leaks).
- Explainability: SHAP/LIME token-level attribution.
- Reporting Engine: Jinja2-based generator for "Committee-Ready" HTML/PDF reports.
llm-eval-framework/
├── src/
│ ├── core/
│ │ └── interfaces.py # Abstract base classes (SaliencyProvider)
│ ├── providers/
│ │ ├── mlx_provider.py # MLX (Apple Silicon) saliency backend
│ │ └── torch_provider.py # PyTorch / CUDA saliency backend
│ ├── services/
│ │ ├── accuracy_service.py # Financial-F1 & entity extraction logic
│ │ ├── adversarial_service.py # Prompt injection & red-teaming suite
│ │ ├── explainability_service.py # Attention-based token saliency
│ │ ├── report_service.py # Jinja2 HTML report compiler
│ │ └── conflict_service.py # Regulatory paradox detection
│ └── utils/
│ └── mapper.py # YAML metric-to-regulation mapper
├── configs/
│ ├── regulatory_mapping.yaml # Bridges Metrics -> SR 11-7 / EU AI Act
│ └── system_prompts.yaml # Hardened guardrails for local models
├── data/
│ ├── adversarial_library/ # 50+ JSON-based jailbreak templates
│ └── gold_standard/ # Reference datasets for finance
├── hf_space/
│ ├── app.py # Gradio web UI (HuggingFace Spaces)
│ └── requirements.txt # Space-specific dependencies
├── reports/
│ ├── plots/ # Generated saliency visualizations
│ └── templates/ # Jinja2 HTML report template
├── scripts/
│ ├── download_model.py # Hardware-aware model downloader
│ └── push_to_hf.sh # HuggingFace Space deploy script
├── tests/
│ ├── unit/ # One file per service/provider
│ ├── integration/ # End-to-end structural tests
│ └── test_data/ # Fixture files (JSON test data)
├── notebooks/ # Google Colab demo notebooks
├── main.py # CLI Entry Point
├── pyproject.toml # Ruff lint config + pytest settings
├── requirements.txt # Local-first dependencies
├── requirements-dev.txt # Test-only dependencies (incl. ruff)
├── VERSION # Current version string
└── CHANGELOG.md # Keep a Changelog format
- Python 3.11+
- Hardware Acceleration (e.g., CUDA/Metal - Optional)
- Any Modern IDE
# Clone the repository
git clone https://github.com/mauryasameer/llm_eval.git
cd llm-eval-framework
# Setup Virtual Environment
python -m venv venv
source venv/bin/activate
# Install Core Dependencies (Auto-detects MLX for Mac or Transformers/Torch for CUDA)
pip install -r requirements.txtUse the built-in CLI to download a model from HuggingFace and cache it locally:
# See recommended models for your platform
python scripts/download_model.py --list
# Download the default recommended model (auto-detects your hardware)
python scripts/download_model.py --model mlx-community/Llama-3.2-3B-Instruct-4bit
# Force a specific backend
python scripts/download_model.py --model Qwen/Qwen2.5-0.5B-Instruct --backend transformersModels are cached in
~/.cache/huggingface/and work fully offline after the first download.
# Run the full validation suite (Accuracy + Adversarial)
python main.py --eval all --model <local_model_path>Standard NLP metrics ignore "Precision Severity." This module extracts Tickers, ISINs, and Monetary values, penalizing a $10B vs $10M error significantly higher than a grammatical one.
A library of 50+ adversarial templates including:
- Fiduciary Bypass: Attempts to force unauthorized investment advice.
- Data Leak Persona: Trick the model into revealing mock PII.
- System Override: Testing resistance to instructions like "Ignore all previous safety rules."
Generates SHAP plots showing which input tokens (e.g., "Interest Rate," "Default") most heavily influenced the model's decision. Required for Interpretability under SR 11-7 Section 3.3.
A YAML bridge that tags every technical test with a legal requirement:
financial_f1➡️ SR 11-7 Section 3.2injection_pass_rate➡️ EU AI Act Article 15
A professional, glass-morphic report featuring:
- Executive Summary: Pass/Fail badges.
- Risk Heatmap: Visualizing Accuracy vs. Safety.
- Audit Trail: A table linking model responses directly to regulatory controls.
| Regulation | Module | Focus Area |
|---|---|---|
| SR 11-7 | Accuracy / Explainability | Model Soundness & Interpretability |
| EU AI Act | Adversarial / Security | Robustness & Cybersecurity |
| OCC 2011-12 | Reporting | Audit Trail & Documentation |
Distributed under the MIT License. See LICENSE for more information.
If you're using this to harden your local LLMs, tag us! #GenerativeAI #ModelRisk #SR117 #OpenSource #LocalLLM
