Automated financial anomaly detection for Paperless-ngx. Validates bank statements, invoices, and financial documents for arithmetic inconsistencies, formatting issues, and suspicious patterns. Features a web dashboard for monitoring and optional LLM enhancement for advanced analysis.
- 🧮 Balance Validation: Automatic verification of bank statement arithmetic
- 📐 Layout Analysis: Detects formatting irregularities and structural issues
- 🔍 Pattern Detection: Identifies duplicates, reversed columns, truncated totals
- 🤖 Optional LLM Enhancement: Claude/GPT integration for advanced reasoning
- 📊 Web Dashboard: Real-time monitoring with filters and statistics
- 🔄 Auto-Processing: Background polling for new documents
- 🏷️ Smart Tagging: Automatically adds anomaly tags to Paperless
- 📈 Custom Fields: Writes detection results to Paperless custom fields
- Features
- Architecture
- Use Cases
- Quick Start
- Configuration
- How It Works
- Web Dashboard
- API Reference
- Integration
- Troubleshooting
- Development
- FAQ
- License
Validates financial document math automatically:
- Balance Verification: Beginning Balance + Credits - Debits = Ending Balance
- Running Totals: Validates line-by-line balance progression
- Page Totals: Verifies subtotals match sum of transactions
- Configurable Tolerance: Customize acceptable variance (default: $0.01)
Tags Generated: anomaly:balance_mismatch
Custom Fields: balance_check_status, balance_diff_amount
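The balance rule above can be sketched in a few lines. This is a minimal illustration, not the detector's actual code: the function name `check_balance` and its signature are assumptions, and `Decimal` is used here only to avoid floating-point rounding when comparing against the $0.01 default tolerance.

```python
from decimal import Decimal

def check_balance(beginning, credits, debits, ending, tolerance=Decimal("0.01")):
    """Check Beginning Balance + Credits - Debits = Ending Balance.

    Hypothetical helper: returns (status, diff), loosely mirroring the
    balance_check_status and balance_diff_amount custom fields.
    """
    expected = beginning + sum(credits, Decimal("0")) - sum(debits, Decimal("0"))
    diff = abs(expected - ending)
    # A difference equal to the tolerance still counts as a pass.
    status = "PASS" if diff <= tolerance else "FAIL"
    return status, diff

status, diff = check_balance(
    beginning=Decimal("1000.00"),
    credits=[Decimal("250.00"), Decimal("100.00")],
    debits=[Decimal("75.50")],
    ending=Decimal("1274.50"),
)
print(status, diff)
```

A statement whose ending balance is off by more than the tolerance would come back as `FAIL` with the size of the mismatch, which is the kind of value the `balance_diff_amount` field is described as holding.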
Identifies formatting and structural issues:
- Column Alignment: Detects misaligned data columns
- Font Consistency: Identifies suspicious font variations
- Spacing Anomalies: Finds unusual spacing patterns
- Page Structure: Validates consistent page layouts
- Score-Based: Produces 0-1 layout quality score
Tags Generated: anomaly:layout_irregularity
Custom Fields: layout_score
Regex-based detection for common issues:
- Reversed Columns: Debits and credits swapped
- Duplicate Transactions: Repeated lines (copy/paste errors)
- Truncated Totals: Missing or incomplete totals
- Page Numbering Issues: Out of order or missing pages
- Date Sequence Problems: Non-chronological transactions
Tags Generated: anomaly:duplicate_lines, anomaly:reversed_columns, anomaly:truncated_total
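As a rough sketch of how regex-based duplicate detection might work: the pattern below assumes each transaction line looks like `MM/DD/YYYY description amount`, which is an illustrative format and not necessarily what the rules in `app/detector.py` match. `TXN_RE` and `find_duplicate_lines` are hypothetical names.

```python
import re
from collections import Counter

# Assumed line shape: a date, a description, and an amount with two decimals.
TXN_RE = re.compile(r"^(\d{2}/\d{2}/\d{4})\s+(.+?)\s+(-?\$?[\d,]+\.\d{2})$")

def find_duplicate_lines(ocr_text):
    """Return transaction lines that occur more than once, with their counts."""
    normalized = []
    for line in ocr_text.splitlines():
        line = " ".join(line.split())  # collapse runs of whitespace from OCR
        if TXN_RE.match(line):
            normalized.append(line)
    counts = Counter(normalized)
    return {line: n for line, n in counts.items() if n > 1}
```

Normalizing whitespace before counting matters because OCR output often varies the spacing between columns for otherwise identical lines.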
Advanced analysis using Claude or GPT:
- Narrative Summaries: Human-readable anomaly explanations
- Context-Aware Analysis: Considers document type and patterns
- Confidence Scoring: Provides confidence levels for findings
- Evidence-Based: Only analyzes extracted data, never invents
Requirements: An Anthropic or OpenAI API key, supplied through LLM_PROVIDER and LLM_API_KEY (see Configuration)
Real-time monitoring interface:
- Document List: View all processed documents with anomaly indicators
- Filters: By anomaly type, date range, amount threshold
- Statistics: Overall detection rates and trends
- Quick Links: Direct links to Paperless documents
- Search: Find specific documents by ID or content
Access: http://localhost:8050
Automated polling system:
- Configurable Interval: Default 5 minutes, customize as needed
- State Persistence: Remembers last processed document
- Graceful Shutdown: Finishes current document before stopping
- Error Handling: Continues processing after transient failures
┌──────────────────┐
│  Paperless-ngx   │
│       API        │
└────────┬─────────┘
         │ Poll for new documents
         ▼
┌──────────────────┐
│     Anomaly      │
│     Detector     │
│                  │
│ 1. Fetch OCR     │
│ 2. Infer Type    │──► Bank Statement
│ 3. Extract Data  │    Invoice
│ 4. Validate      │    Receipt
└────────┬─────────┘
         │
         ├──────────────────────────────┐
         ▼                              ▼
┌─────────────────┐          ┌─────────────────────┐
│ Deterministic   │          │ Optional LLM        │
│ Checks          │          │ Analysis            │
│                 │          │                     │
│ - Balance math  │          │ - Narrative         │
│ - Layout score  │          │ - Context           │
│ - Patterns      │          │ - Confidence        │
└────────┬────────┘          └──────────┬──────────┘
         │                              │
         └────────────┬─────────────────┘
                      ▼
           ┌────────────────────┐
           │  Results Storage   │
           │ (SQLite/Postgres)  │
           └──────────┬─────────┘
                      ▼
           ┌────────────────────┐
           │ Write to Paperless │
           │ - Tags             │
           │ - Custom Fields    │
           └────────────────────┘
- Scenario: Managing properties in litigation/receivership
- Benefit: Automatically flag suspicious bank statements and rent rolls
- Tags: Perfect for legal discovery and audit preparation
- Scenario: Reviewing monthly bank and credit card statements
- Benefit: Catch bank errors, duplicate charges, unauthorized transactions
- Scenario: Processing vendor invoices and customer payments
- Benefit: Detect duplicate invoices, math errors, fraudulent documents
- Scenario: Reviewing documents for tampering or manipulation
- Benefit: Layout irregularities often indicate modified PDFs
- Scenario: M&A document review, loan applications
- Benefit: Automated validation of financial statements
- Docker and Docker Compose
- Running Paperless-ngx instance (v1.10.0+)
- Paperless API token (generated from your Paperless-ngx user profile or admin interface)
services:
  paperless-anomaly-detector:
    image: dblagbro/paperless-anomaly-detector:latest
    container_name: paperless-anomaly-detector
    restart: unless-stopped
    environment:
      PAPERLESS_API_BASE_URL: http://paperless-web:8000
      PAPERLESS_API_TOKEN: your_token_here
      POLLING_INTERVAL: 300
      BALANCE_TOLERANCE: 0.01
    volumes:
      - ./anomaly-detector/data:/app/data
    ports:
      - "8050:8050"

Alternatively, build the image from source:

git clone https://github.com/dblagbro/paperless-anomaly-detector.git
cd paperless-anomaly-detector
docker build -t paperless-anomaly-detector .

1. Create environment file:
   cp .env.example .env
2. Edit .env with your settings:
   PAPERLESS_API_TOKEN=your_actual_token_here
   PAPERLESS_API_BASE_URL=http://paperless-web:8000
3. Start the service:
   docker compose up -d
4. Verify it's running:
   docker compose logs -f paperless-anomaly-detector
5. Access the dashboard:
   http://localhost:8050
6. Trigger initial scan (optional):
   curl -X POST http://localhost:8050/api/trigger-scan
| Variable | Default | Description |
|---|---|---|
| PAPERLESS_API_BASE_URL | http://paperless-web:8000 | Paperless API endpoint |
| PAPERLESS_API_TOKEN | (required) | API authentication token |
| POLLING_INTERVAL | 300 | Seconds between polling cycles |
| BALANCE_TOLERANCE | 0.01 | Dollar tolerance for balance checks |
| LAYOUT_VARIANCE_THRESHOLD | 0.3 | Layout score threshold (0-1) |
| LLM_PROVIDER | None | anthropic or openai |
| LLM_API_KEY | None | LLM API key (if enabled) |
| LLM_MODEL | (auto) | Override model name |
| BATCH_SIZE | 10 | Documents per polling batch |
| DATABASE_URL | sqlite:///data/anomalies.db | Database connection string |
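To make the defaults concrete, here is a minimal sketch of how these variables could be read into typed settings. The `load_config` function and the keys of the dictionary it returns are assumptions for illustration; the service's own loader may be structured differently.

```python
import os

# Defaults copied from the configuration table; PAPERLESS_API_TOKEN has none.
DEFAULTS = {
    "PAPERLESS_API_BASE_URL": "http://paperless-web:8000",
    "POLLING_INTERVAL": "300",
    "BALANCE_TOLERANCE": "0.01",
    "LAYOUT_VARIANCE_THRESHOLD": "0.3",
    "BATCH_SIZE": "10",
    "DATABASE_URL": "sqlite:///data/anomalies.db",
}

def load_config(env=None):
    """Build a typed settings dict from environment-style variables."""
    env = os.environ if env is None else env
    token = env.get("PAPERLESS_API_TOKEN")
    if not token:
        raise ValueError("PAPERLESS_API_TOKEN is required")

    def get(name):
        return env.get(name, DEFAULTS[name])

    return {
        "base_url": get("PAPERLESS_API_BASE_URL").rstrip("/"),
        "token": token,
        "polling_interval": int(get("POLLING_INTERVAL")),
        "balance_tolerance": float(get("BALANCE_TOLERANCE")),
        "layout_threshold": float(get("LAYOUT_VARIANCE_THRESHOLD")),
        "batch_size": int(get("BATCH_SIZE")),
        "database_url": get("DATABASE_URL"),
        # An empty string is treated the same as unset.
        "llm_provider": env.get("LLM_PROVIDER") or None,
    }
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests, for example `load_config({"PAPERLESS_API_TOKEN": "abc"})`.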
To enable LLM enhancement with Anthropic, add to your environment:

environment:
  LLM_PROVIDER: anthropic
  LLM_API_KEY: sk-ant-api03-xxx

Or for OpenAI:

environment:
  LLM_PROVIDER: openai
  LLM_API_KEY: sk-proj-xxx
  LLM_MODEL: gpt-4-turbo-preview

The detector automatically creates these custom fields in Paperless:
- balance_check_status (Text): PASS / FAIL / NOT_APPLICABLE
- balance_diff_amount (Number): Dollar amount of mismatch
- layout_score (Number): 0-1 quality score
These are created on first run. No manual setup needed.
1. Polling Phase:
   - Queries Paperless API every POLLING_INTERVAL seconds
   - Fetches documents with modified > last_seen
   - Processes in batches of BATCH_SIZE
2. Content Extraction:
   - Retrieves OCR text via Paperless API
   - Extracts document metadata (title, date, tags)
   - Identifies document type (bank statement, invoice, etc.)
3. Type Inference:
   - Keyword matching for common document types
   - Pattern recognition in content
   - Falls back to generic analysis if unrecognized
4. Anomaly Detection:
   - Balance Validation: Extracts beginning/ending balances, credits, debits
   - Layout Analysis: Computes structural consistency score
   - Pattern Matching: Applies regex rules for common issues
   - LLM Enhancement (optional): Sends findings for analysis
5. Results Storage:
   - Saves to internal database (processed_documents, anomaly_logs)
   - Includes severity, description, amounts, timestamps
6. Paperless Integration:
   - Adds tags: anomaly:detected, anomaly:balance_mismatch, etc.
   - Updates custom fields with results
   - Never modifies original documents
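The polling phase described above (select documents with `modified > last_seen`, then work through them in batches) might look roughly like the following. This is a sketch under assumptions: documents are represented as dictionaries with a `modified` datetime, and `select_batch` and `advance_checkpoint` are hypothetical helper names.

```python
from datetime import datetime

def select_batch(documents, last_seen, batch_size=10):
    """Pick the next batch: documents modified after last_seen, oldest first."""
    pending = [d for d in documents if d["modified"] > last_seen]
    pending.sort(key=lambda d: d["modified"])
    return pending[:batch_size]

def advance_checkpoint(batch, last_seen):
    """Return the new last_seen value after a batch has been processed."""
    return max([d["modified"] for d in batch], default=last_seen)

# Example with made-up documents.
docs = [
    {"id": 1, "modified": datetime(2024, 1, 1)},
    {"id": 2, "modified": datetime(2024, 1, 3)},
    {"id": 3, "modified": datetime(2024, 1, 2)},
]
batch = select_batch(docs, last_seen=datetime(2024, 1, 1))
last_seen = advance_checkpoint(batch, datetime(2024, 1, 1))
```

Processing oldest first means the checkpoint only ever moves forward. One caveat of this simple scheme: if several documents share the exact same `modified` timestamp and the batch boundary falls between them, the strict `>` comparison would skip the rest, so a real implementation would likely also track document IDs.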
| Tag | Meaning |
|---|---|
| anomaly:detected | At least one anomaly found |
| anomaly:balance_mismatch | Arithmetic inconsistency detected |
| anomaly:layout_irregularity | Formatting/structure issues |
| anomaly:duplicate_lines | Repeated transaction entries |
| anomaly:truncated_total | Missing or incomplete totals |
| anomaly:reversed_columns | Debit/credit columns swapped |
| anomaly:page_numbering | Page order issues |
Manual Tags (Recommended):
- property:<id> - Property identifier
- role:referee or role:receiver - Your capacity
- doc_type:bank_statement, doc_type:rent_roll - Document type
- period:YYYY-MM - Time period
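For the automatically generated `anomaly:*` tags, the mapping from detected anomaly types to tag names can be sketched as follows. The type names come from the tag table; the function `tags_for` and the choice to silently ignore unknown types are assumptions made for this example.

```python
# Anomaly types taken from the tag table (without the "anomaly:" prefix).
KNOWN_TYPES = (
    "balance_mismatch",
    "layout_irregularity",
    "duplicate_lines",
    "truncated_total",
    "reversed_columns",
    "page_numbering",
)

def tags_for(anomaly_types):
    """Return the Paperless tag names for a list of detected anomaly types."""
    tags = []
    for anomaly_type in anomaly_types:
        tag = f"anomaly:{anomaly_type}"
        if anomaly_type in KNOWN_TYPES and tag not in tags:
            tags.append(tag)
    # anomaly:detected is added only when at least one specific tag applies.
    return (["anomaly:detected"] + tags) if tags else []
```

For example, `tags_for(["balance_mismatch", "duplicate_lines"])` yields the umbrella tag followed by the two specific tags, while an empty list of findings yields no tags at all.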
- Document Cards: Visual cards for each processed document
- Anomaly Indicators: Red badges for detected issues
- Quick Stats: Total documents, anomalies found, success rate
- Filters: Type, date range, amount threshold
JSON response with:
{
  "total_documents": 1234,
  "documents_with_anomalies": 45,
  "anomaly_rate": 3.6,
  "by_type": {
    "balance_mismatch": 20,
    "layout_irregularity": 15,
    "duplicate_lines": 10
  }
}

Query parameters:
- anomaly_type: Filter by specific anomaly
- min_amount: Minimum balance discrepancy
- max_amount: Maximum balance discrepancy
- start_date: ISO format (2024-01-01)
- end_date: ISO format (2024-12-31)
- limit: Results per page (default: 50)
- offset: Pagination offset

Example:
curl "http://localhost:8050/api/documents?anomaly_type=balance_mismatch&min_amount=100"

Health check endpoint.
Response: {"status": "healthy"}
Overall statistics.
Response:
{
  "total_documents": 1234,
  "documents_with_anomalies": 45,
  "anomaly_rate": 3.6,
  "by_type": {...}
}

List processed documents.
Query Params: See Document List section
Response:
{
  "documents": [...],
  "total": 1234,
  "limit": 50,
  "offset": 0
}

List all anomaly logs.
Query Params: document_id, severity, resolved
Response:
{
  "anomalies": [
    {
      "id": 123,
      "document_id": 456,
      "anomaly_type": "balance_mismatch",
      "severity": "high",
      "description": "Beginning + Credits - Debits != Ending",
      "amount": 150.00,
      "detected_at": "2024-01-15T10:30:00Z"
    }
  ]
}

Manually trigger document polling.
Response: {"status": "scan_initiated"}
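For scripted access, the document list endpoint and its query parameters can be wrapped in a small client. This sketch uses only the Python standard library; `documents_url` and `fetch_documents` are hypothetical helpers, and the base URL assumes the default dashboard port.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Query parameters documented for the document list endpoint.
ALLOWED = {
    "anomaly_type", "min_amount", "max_amount",
    "start_date", "end_date", "limit", "offset",
}

def documents_url(base_url="http://localhost:8050", **filters):
    """Build the /api/documents URL, rejecting undocumented parameters."""
    unknown = set(filters) - ALLOWED
    if unknown:
        raise ValueError(f"unsupported filters: {sorted(unknown)}")
    params = {k: v for k, v in filters.items() if v is not None}
    query = urlencode(params)
    url = f"{base_url.rstrip('/')}/api/documents"
    return f"{url}?{query}" if query else url

def fetch_documents(**filters):
    """Call the endpoint and decode the JSON body (requires a running service)."""
    with urlopen(documents_url(**filters), timeout=10) as resp:
        return json.load(resp)
```

Calling `documents_url(anomaly_type="balance_mismatch", min_amount=100)` reproduces the URL from the curl example; `fetch_documents` with the same arguments would return the decoded response when the service is reachable.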
Add to your nginx.conf:
location /paperless-anomaly-detector/ {
    proxy_pass http://localhost:8050/;
    proxy_http_version 1.1;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;

    # URL rewriting for subpath
    sub_filter_once off;
    sub_filter 'href="/' 'href="/paperless-anomaly-detector/';
    sub_filter 'src="/' 'src="/paperless-anomaly-detector/';
    sub_filter 'action="/' 'action="/paperless-anomaly-detector/';
}

Then access at: https://yourdomain.com/paperless-anomaly-detector/
Create saved searches in Paperless:
1. High-Priority Anomalies:
   tags:anomaly:balance_mismatch AND balance_diff_amount:>100
2. Recent Anomalies:
   tags:anomaly:detected AND created:[now-7d TO now]
3. Unresolved Issues:
   tags:anomaly:detected AND NOT tags:reviewed
Symptoms: Dashboard shows 0 documents
Solutions:
1. Verify API connectivity:
   docker exec paperless-anomaly-detector curl -H "Authorization: Token YOUR_TOKEN" \
     "http://paperless-web:8000/api/documents/?page_size=1"
2. Check logs for errors:
   docker compose logs -f paperless-anomaly-detector
3. Manually trigger scan:
   curl -X POST http://localhost:8050/api/trigger-scan
4. Check API token permissions in Paperless
Symptoms: Too many anomalies detected
Solutions:
1. Increase BALANCE_TOLERANCE:
   environment:
     BALANCE_TOLERANCE: 0.05  # $0.05 instead of $0.01
2. Adjust LAYOUT_VARIANCE_THRESHOLD:
   environment:
     LAYOUT_VARIANCE_THRESHOLD: 0.5  # More lenient
3. Review pattern detection rules in app/detector.py
4. Use LLM enhancement for better context understanding
Symptoms: Slow processing, high CPU usage
Solutions:
1. Reduce batch size:
   environment:
     BATCH_SIZE: 5
2. Increase polling interval:
   environment:
     POLLING_INTERVAL: 600  # 10 minutes
3. Disable LLM if not needed:
   environment:
     LLM_PROVIDER: ""
4. Use PostgreSQL instead of SQLite:
   environment:
     DATABASE_URL: postgresql://user:pass@postgres:5432/anomalies
Symptoms: No LLM-enhanced analysis, errors in logs
Solutions:
1. Verify API key:
   docker exec paperless-anomaly-detector printenv | grep LLM
2. Test API key manually:
   curl -H "x-api-key: YOUR_KEY" https://api.anthropic.com/v1/messages
3. Check rate limits in provider dashboard
4. Ensure LLM_PROVIDER is set correctly
# Install dependencies
pip install -r requirements.txt

# Run locally
cd app
python main.py

# Install test dependencies
pip install pytest pytest-cov

# Run tests
pytest tests/

# With coverage
pytest --cov=app tests/

1. Edit app/detector.py
2. Add method to AnomalyDetector class:

       def detect_my_anomaly(self, content, metadata):
           """Detect my custom anomaly."""
           findings = []
           # Your logic here
           return findings

3. Update detect_all_anomalies() to call your method
4. Add corresponding tag handling
5. Test thoroughly
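As a worked example of a custom detector, the sketch below flags non-chronological transaction dates. It is written as a standalone function so it can run on its own; adapting it into an `AnomalyDetector` method would mean adding `self`. The `MM/DD/YYYY` line format, the `date_sequence` type name, and the shape of each finding dictionary are all assumptions, since the project's real finding structure is not shown here.

```python
import re
from datetime import datetime

# Assumed: transaction lines start with a MM/DD/YYYY date.
DATE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4})\b")

def detect_date_sequence(content, metadata=None):
    """Flag transaction lines whose date is earlier than the previous one."""
    findings = []
    previous = None
    for line_no, line in enumerate(content.splitlines(), start=1):
        match = DATE_RE.match(line.strip())
        if not match:
            continue
        try:
            current = datetime.strptime(match.group(1), "%m/%d/%Y")
        except ValueError:
            continue  # skip impossible dates produced by OCR noise
        if previous is not None and current < previous:
            findings.append({
                "anomaly_type": "date_sequence",  # hypothetical type name
                "severity": "low",
                "description": f"Line {line_no}: date {match.group(1)} is out of order",
            })
        previous = current
    return findings
```

Comparing each date only with its immediate predecessor keeps the check simple; a statement that legitimately lists transactions newest-first would need the comparison reversed.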
processed_documents:

CREATE TABLE processed_documents (
    id INTEGER PRIMARY KEY,
    paperless_doc_id INTEGER UNIQUE,
    title TEXT,
    processed_at TIMESTAMP,
    has_anomalies BOOLEAN,
    balance_status TEXT,
    balance_diff REAL,
    layout_score REAL
);

anomaly_logs:

CREATE TABLE anomaly_logs (
    id INTEGER PRIMARY KEY,
    document_id INTEGER REFERENCES processed_documents(id),
    anomaly_type TEXT,
    severity TEXT,
    description TEXT,
    amount REAL,
    detected_at TIMESTAMP,
    resolved BOOLEAN DEFAULT 0
);

- Memory: 200-500MB depending on document volume
- CPU: Low during polling, spikes during processing
- Disk: SQLite database grows ~10KB per document
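For ad-hoc reporting against the `processed_documents` and `anomaly_logs` tables defined above, the standard `sqlite3` module is enough. The sketch below recreates the schema in an in-memory database with made-up rows so it can run anywhere; against a real deployment you would presumably connect to the SQLite file under the data volume instead. The `unresolved_by_type` helper is a hypothetical name.

```python
import sqlite3

# Schema copied from the table definitions above.
SCHEMA = """
CREATE TABLE processed_documents (
    id INTEGER PRIMARY KEY,
    paperless_doc_id INTEGER UNIQUE,
    title TEXT,
    processed_at TIMESTAMP,
    has_anomalies BOOLEAN,
    balance_status TEXT,
    balance_diff REAL,
    layout_score REAL
);
CREATE TABLE anomaly_logs (
    id INTEGER PRIMARY KEY,
    document_id INTEGER REFERENCES processed_documents(id),
    anomaly_type TEXT,
    severity TEXT,
    description TEXT,
    amount REAL,
    detected_at TIMESTAMP,
    resolved BOOLEAN DEFAULT 0
);
"""

def unresolved_by_type(conn):
    """Count unresolved anomalies per anomaly_type."""
    rows = conn.execute(
        "SELECT anomaly_type, COUNT(*) FROM anomaly_logs "
        "WHERE resolved = 0 GROUP BY anomaly_type ORDER BY anomaly_type"
    ).fetchall()
    return dict(rows)

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO processed_documents (id, paperless_doc_id, title, has_anomalies) "
    "VALUES (1, 456, 'January statement', 1)"
)
conn.executemany(
    "INSERT INTO anomaly_logs (document_id, anomaly_type, severity, amount, resolved) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (1, "balance_mismatch", "high", 150.0, 0),
        (1, "duplicate_lines", "low", None, 0),
        (1, "balance_mismatch", "high", 20.0, 1),  # already resolved
    ],
)
print(unresolved_by_type(conn))
```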
Typical processing times (Intel i7, 16GB RAM):
| Document Type | Pages | Processing Time |
|---|---|---|
| Bank Statement | 2 | 2-4 seconds |
| Invoice | 1 | 1-2 seconds |
| Credit Card | 5 | 5-8 seconds |
With LLM enabled, add 1-3 seconds per document
For high-volume deployments:
- Use PostgreSQL instead of SQLite
- Increase BATCH_SIZE for better throughput
- Run multiple instances with partitioned document sets
- Consider async processing with message queue
1. API Token: Never logged or exposed in responses. Store in environment variable.
2. Database: SQLite by default. Use PostgreSQL with encrypted connections for production.
3. HTTPS: Always use NGINX reverse proxy with TLS in production.
4. Access Control: Add HTTP basic auth via NGINX for additional security.
5. Read-Only: Service only reads documents and writes tags/fields. Never modifies originals.
6. Audit Trail: All actions logged with timestamps in application logs.
Q: Can I reprocess documents?
A: Yes, clear the database and restart: docker exec paperless-anomaly-detector rm /app/data/anomalies.db
Q: Does this work with scanned documents?
A: Yes, as long as Paperless has performed OCR. Detection quality depends on scan quality.
Q: Can I customize which anomalies are detected?
A: Yes, edit app/detector.py to add or remove detection rules.
Q: What document types are supported?
A: Bank statements, credit card statements, invoices, and receipts. Other types can be added by extending the detector.
Q: How accurate is the balance validation?
A: Very accurate for properly formatted statements. Configure BALANCE_TOLERANCE for edge cases.
Q: Can I use this without LLM?
A: Yes, the deterministic checks work without an LLM. The LLM is an optional enhancement.
Q: Does this modify my documents?
A: No, it only adds tags and custom fields. Original PDFs are never modified.
Q: Can I run this on multiple Paperless instances?
A: Yes, run a separate container per instance, each with its own PAPERLESS_API_BASE_URL and PAPERLESS_API_TOKEN values.
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes with tests
- Submit a Pull Request
MIT License - see LICENSE file for details.
- Built for property management and financial auditing use cases
- Integrates with Paperless-ngx
- Optional LLM support via Anthropic Claude or OpenAI
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: See CHANGELOG.md for version history
Perfect for property managers, accountants, auditors, and anyone who needs automated financial document validation.