Paperless Anomaly Detector

Docker Hub · License: MIT · Python 3.11+

Automated financial anomaly detection for Paperless-ngx. Validates bank statements, invoices, and financial documents for arithmetic inconsistencies, formatting issues, and suspicious patterns. Features a web dashboard for monitoring and optional LLM enhancement for advanced analysis.

🌟 Key Features

  • 🧮 Balance Validation: Automatic verification of bank statement arithmetic
  • 📐 Layout Analysis: Detects formatting irregularities and structural issues
  • 🔍 Pattern Detection: Identifies duplicates, reversed columns, truncated totals
  • 🤖 Optional LLM Enhancement: Claude/GPT integration for advanced reasoning
  • 📊 Web Dashboard: Real-time monitoring with filters and statistics
  • 🔄 Auto-Processing: Background polling for new documents
  • 🏷️ Smart Tagging: Automatically adds anomaly tags to Paperless
  • 📈 Custom Fields: Writes detection results to Paperless custom fields

Features

🧮 Arithmetic Consistency Checking

Validates financial document math automatically:

  • Balance Verification: Beginning Balance + Credits - Debits = Ending Balance
  • Running Totals: Validates line-by-line balance progression
  • Page Totals: Verifies subtotals match sum of transactions
  • Configurable Tolerance: Customize acceptable variance (default: $0.01)

Tags Generated: anomaly:balance_mismatch

Custom Fields: balance_check_status, balance_diff_amount
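The balance rule above can be sketched in a few lines of Python. This is a minimal illustration, not the detector's actual code: the function name check_balance and its signature are assumptions, and Decimal is used here only to avoid floating-point rounding on currency values.

```python
from decimal import Decimal

def check_balance(beginning, credits, debits, ending, tolerance="0.01"):
    """Check Beginning Balance + Credits - Debits = Ending Balance.

    Hypothetical helper: returns (status, difference), where status is
    "PASS" when the absolute difference is within the tolerance.
    """
    expected = Decimal(beginning) + Decimal(credits) - Decimal(debits)
    diff = abs(expected - Decimal(ending))
    status = "PASS" if diff <= Decimal(tolerance) else "FAIL"
    return status, diff

print(check_balance("1000.00", "250.00", "100.00", "1150.00"))  # ('PASS', Decimal('0.00'))
print(check_balance("1000.00", "250.00", "100.00", "1000.00"))  # ('FAIL', Decimal('150.00'))
```

The second call corresponds to a statement that would receive the anomaly:balance_mismatch tag with a difference of 150.00.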

📐 Layout Irregularity Detection

Identifies formatting and structural issues:

  • Column Alignment: Detects misaligned data columns
  • Font Consistency: Identifies suspicious font variations
  • Spacing Anomalies: Finds unusual spacing patterns
  • Page Structure: Validates consistent page layouts
  • Score-Based: Produces 0-1 layout quality score

Tags Generated: anomaly:layout_irregularity

Custom Fields: layout_score

🔍 Pattern Detection

Regex-based detection for common issues:

  • Reversed Columns: Debits and credits swapped
  • Duplicate Transactions: Repeated lines (copy/paste errors)
  • Truncated Totals: Missing or incomplete totals
  • Page Numbering Issues: Out of order or missing pages
  • Date Sequence Problems: Non-chronological transactions

Tags Generated: anomaly:duplicate_lines, anomaly:reversed_columns, anomaly:truncated_total
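As an illustration of regex-based detection, the sketch below flags repeated transaction lines. The line format (MM/DD, description, amount), the TXN_LINE pattern, and the function name are assumptions chosen for the example; the real rules live in app/detector.py and may differ.

```python
import re
from collections import Counter

# Hypothetical transaction line format: "MM/DD description amount"
TXN_LINE = re.compile(r"^(\d{2}/\d{2})\s+(.+?)\s+(-?\d[\d,]*\.\d{2})$")

def find_duplicate_lines(ocr_text):
    """Return transaction lines that occur more than once, with their counts."""
    counts = Counter()
    for raw in ocr_text.splitlines():
        line = " ".join(raw.split())  # collapse runs of whitespace left by OCR
        if TXN_LINE.match(line):
            counts[line] += 1
    return {line: n for line, n in counts.items() if n > 1}

sample = """01/03 COFFEE SHOP 4.50
01/04 RENT PAYMENT 1,200.00
01/04  RENT PAYMENT   1,200.00
Ending Balance 2,345.67"""
print(find_duplicate_lines(sample))  # {'01/04 RENT PAYMENT 1,200.00': 2}
```

Two genuine same-day purchases of the same amount would also match, so a finding like this is best treated as a prompt for review rather than proof of an error.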

🤖 LLM Enhancement (Optional)

Advanced analysis using Claude or GPT:

  • Narrative Summaries: Human-readable anomaly explanations
  • Context-Aware Analysis: Considers document type and patterns
  • Confidence Scoring: Provides confidence levels for findings
  • Evidence-Based: Only analyzes extracted data, never invents

Requirements: LLM_PROVIDER set to anthropic or openai, plus a matching LLM_API_KEY (see Configuration)

📊 Web Dashboard

Real-time monitoring interface:

  • Document List: View all processed documents with anomaly indicators
  • Filters: By anomaly type, date range, amount threshold
  • Statistics: Overall detection rates and trends
  • Quick Links: Direct links to Paperless documents
  • Search: Find specific documents by ID or content

Access: http://localhost:8050

🔄 Background Processing

Automated polling system:

  • Configurable Interval: Default 5 minutes, customize as needed
  • State Persistence: Remembers last processed document
  • Graceful Shutdown: Finishes current document before stopping
  • Error Handling: Continues processing after transient failures
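A polling loop with these four properties might look like the sketch below. Everything here is illustrative: run_poller, the JSON state file, and the fetch_new/process callables are hypothetical names, not the project's actual internals.

```python
import json
import signal
import threading
from pathlib import Path

stop_requested = threading.Event()

def request_stop(signum, frame):
    """Signal handler: ask the loop to stop after the current document."""
    stop_requested.set()

def load_last_seen(path):
    """Read the persisted marker, or None on first run."""
    p = Path(path)
    return json.loads(p.read_text())["last_seen"] if p.exists() else None

def save_last_seen(path, value):
    Path(path).write_text(json.dumps({"last_seen": value}))

def run_poller(fetch_new, process, state_path, interval=300):
    """Poll until stopped; fetch_new and process are caller-supplied callables."""
    last_seen = load_last_seen(state_path)
    while not stop_requested.is_set():
        try:
            for doc in fetch_new(last_seen):
                process(doc)  # the document in progress always completes
                last_seen = doc["modified"]
                save_last_seen(state_path, last_seen)  # state persistence
                if stop_requested.is_set():
                    break  # graceful shutdown between documents
        except OSError as exc:  # network errors: log and retry next cycle
            print(f"transient failure, retrying next cycle: {exc}")
        stop_requested.wait(interval)  # sleeps, but wakes early on shutdown

if __name__ == "__main__":
    signal.signal(signal.SIGTERM, request_stop)
    signal.signal(signal.SIGINT, request_stop)
```

Waiting on the event instead of calling time.sleep lets a shutdown signal interrupt the idle period immediately, so the container stops without waiting out the full interval.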

Architecture

┌──────────────────┐
│  Paperless-ngx   │
│      API         │
└────────┬─────────┘
         │ Poll for new documents
         ▼
┌──────────────────┐
│  Anomaly         │
│  Detector        │
│                  │
│  1. Fetch OCR    │
│  2. Infer Type   │──► Bank Statement
│  3. Extract Data │   Invoice
│  4. Validate     │   Receipt
└────────┬─────────┘
         │
         ├──────────────────────────────┐
         ▼                              ▼
┌─────────────────┐        ┌─────────────────────┐
│ Deterministic   │        │ Optional LLM        │
│ Checks          │        │ Analysis            │
│                 │        │                     │
│ - Balance math  │        │ - Narrative         │
│ - Layout score  │        │ - Context           │
│ - Patterns      │        │ - Confidence        │
└────────┬────────┘        └──────────┬──────────┘
         │                            │
         └────────────┬───────────────┘
                      ▼
           ┌────────────────────┐
           │ Results Storage    │
           │ (SQLite/Postgres)  │
           └──────────┬─────────┘
                      ▼
           ┌────────────────────┐
           │ Write to Paperless │
           │ - Tags             │
           │ - Custom Fields    │
           └────────────────────┘

Use Cases

💰 Property Management Accounting

  • Scenario: Managing properties in litigation/receivership
  • Benefit: Automatically flag suspicious bank statements and rent rolls
  • Tags: Perfect for legal discovery and audit preparation

🏦 Personal Finance Auditing

  • Scenario: Reviewing monthly bank and credit card statements
  • Benefit: Catch bank errors, duplicate charges, unauthorized transactions

📋 Accounts Payable/Receivable

  • Scenario: Processing vendor invoices and customer payments
  • Benefit: Detect duplicate invoices, math errors, fraudulent documents

🔍 Fraud Detection

  • Scenario: Reviewing documents for tampering or manipulation
  • Benefit: Layout irregularities often indicate modified PDFs

📊 Financial Due Diligence

  • Scenario: M&A document review, loan applications
  • Benefit: Automated validation of financial statements

🚀 Quick Start

Prerequisites

  • Docker and Docker Compose
  • Running Paperless-ngx instance (v1.10.0+; writing custom fields requires a release that supports them, v2.0 or later)
  • Paperless API token

Installation

Using Docker Hub

services:
  paperless-anomaly-detector:
    image: dblagbro/paperless-anomaly-detector:latest
    container_name: paperless-anomaly-detector
    restart: unless-stopped
    environment:
      PAPERLESS_API_BASE_URL: http://paperless-web:8000
      PAPERLESS_API_TOKEN: your_token_here
      POLLING_INTERVAL: 300
      BALANCE_TOLERANCE: 0.01
    volumes:
      - ./anomaly-detector/data:/app/data
    ports:
      - "8050:8050"

From Source

git clone https://github.com/dblagbro/paperless-anomaly-detector.git
cd paperless-anomaly-detector
docker build -t paperless-anomaly-detector .

Initial Setup

  1. Create environment file:

    cp .env.example .env
  2. Edit .env with your settings:

    PAPERLESS_API_TOKEN=your_actual_token_here
    PAPERLESS_API_BASE_URL=http://paperless-web:8000
  3. Start the service:

    docker compose up -d
  4. Verify it's running:

    docker compose logs -f paperless-anomaly-detector
  5. Access the dashboard:

    http://localhost:8050
    
  6. Trigger initial scan (optional):

    curl -X POST http://localhost:8050/api/trigger-scan

⚙️ Configuration

Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| PAPERLESS_API_BASE_URL | http://paperless-web:8000 | Paperless API endpoint |
| PAPERLESS_API_TOKEN | (required) | API authentication token |
| POLLING_INTERVAL | 300 | Seconds between polling cycles |
| BALANCE_TOLERANCE | 0.01 | Dollar tolerance for balance checks |
| LAYOUT_VARIANCE_THRESHOLD | 0.3 | Layout score threshold (0-1) |
| LLM_PROVIDER | None | LLM provider: anthropic or openai |
| LLM_API_KEY | None | LLM API key (if enabled) |
| LLM_MODEL | (auto) | Override model name |
| BATCH_SIZE | 10 | Documents per polling batch |
| DATABASE_URL | sqlite:///data/anomalies.db | Database connection string |
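For readers extending the service, the sketch below shows one way these variables could be read with the documented defaults. The Settings dataclass and load_settings function are hypothetical names for illustration; the variable names and defaults come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

@dataclass(frozen=True)
class Settings:
    api_base_url: str
    api_token: str
    polling_interval: int
    balance_tolerance: float
    layout_variance_threshold: float
    batch_size: int
    database_url: str
    llm_provider: Optional[str]

def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, applying documented defaults."""
    token = env.get("PAPERLESS_API_TOKEN")
    if not token:
        raise ValueError("PAPERLESS_API_TOKEN is required")
    return Settings(
        api_base_url=env.get("PAPERLESS_API_BASE_URL", "http://paperless-web:8000"),
        api_token=token,
        polling_interval=int(env.get("POLLING_INTERVAL", "300")),
        balance_tolerance=float(env.get("BALANCE_TOLERANCE", "0.01")),
        layout_variance_threshold=float(env.get("LAYOUT_VARIANCE_THRESHOLD", "0.3")),
        batch_size=int(env.get("BATCH_SIZE", "10")),
        database_url=env.get("DATABASE_URL", "sqlite:///data/anomalies.db"),
        llm_provider=env.get("LLM_PROVIDER") or None,  # empty string disables the LLM
    )

settings = load_settings({"PAPERLESS_API_TOKEN": "example", "POLLING_INTERVAL": "600"})
print(settings.polling_interval, settings.batch_size)  # 600 10
```

Failing fast on a missing token surfaces a misconfigured container at startup instead of on the first polling cycle.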

Enabling LLM Analysis

Add to your environment:

environment:
  LLM_PROVIDER: anthropic
  LLM_API_KEY: sk-ant-api03-xxx

Or for OpenAI:

environment:
  LLM_PROVIDER: openai
  LLM_API_KEY: sk-proj-xxx
  LLM_MODEL: gpt-4-turbo-preview

Custom Fields Setup

The detector automatically creates these custom fields in Paperless:

  1. balance_check_status (Text): PASS / FAIL / NOT_APPLICABLE
  2. balance_diff_amount (Number): Dollar amount of mismatch
  3. layout_score (Number): 0-1 quality score

These are created on first run. No manual setup needed.

🔍 How It Works

Document Processing Flow

  1. Polling Phase:

    • Queries Paperless API every POLLING_INTERVAL seconds
    • Fetches documents with modified > last_seen
    • Processes in batches of BATCH_SIZE
  2. Content Extraction:

    • Retrieves OCR text via Paperless API
    • Extracts document metadata (title, date, tags)
    • Identifies document type (bank statement, invoice, etc.)
  3. Type Inference:

    • Keyword matching for common document types
    • Pattern recognition in content
    • Falls back to generic analysis if unrecognized
  4. Anomaly Detection:

    • Balance Validation: Extracts beginning/ending balances, credits, debits
    • Layout Analysis: Computes structural consistency score
    • Pattern Matching: Applies regex rules for common issues
    • LLM Enhancement (optional): Sends findings for analysis
  5. Results Storage:

    • Saves to internal database (processed_documents, anomaly_logs)
    • Includes severity, description, amounts, timestamps
  6. Paperless Integration:

    • Adds tags: anomaly:detected, anomaly:balance_mismatch, etc.
    • Updates custom fields with results
    • Never modifies original documents
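The type inference step (step 3) can be pictured as simple keyword scoring with a generic fallback. The keyword lists, the min_hits threshold, and the function name below are assumptions for illustration and are not taken from the project's source.

```python
# Hypothetical keyword lists; the real detector may use different terms.
DOC_TYPE_KEYWORDS = {
    "bank_statement": ("beginning balance", "ending balance", "statement period"),
    "invoice": ("invoice number", "bill to", "amount due"),
    "receipt": ("receipt", "change due", "cashier"),
}

def infer_document_type(ocr_text, min_hits=2):
    """Pick the type with the most keyword hits, or "generic" if too few match."""
    text = ocr_text.lower()
    hits = {
        doc_type: sum(keyword in text for keyword in keywords)
        for doc_type, keywords in DOC_TYPE_KEYWORDS.items()
    }
    best = max(hits, key=hits.get)
    return best if hits[best] >= min_hits else "generic"

statement = "Statement Period: Jan 2024\nBeginning Balance 100.00\nEnding Balance 90.00"
print(infer_document_type(statement))          # bank_statement
print(infer_document_type("Meeting notes"))    # generic
```

Requiring at least two hits keeps a single stray word such as "receipt" from deciding which checks run on a document.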

Tag Naming Scheme

| Tag | Meaning |
|-----|---------|
| anomaly:detected | At least one anomaly found |
| anomaly:balance_mismatch | Arithmetic inconsistency detected |
| anomaly:layout_irregularity | Formatting/structure issues |
| anomaly:duplicate_lines | Repeated transaction entries |
| anomaly:truncated_total | Missing or incomplete totals |
| anomaly:reversed_columns | Debit/credit columns swapped |
| anomaly:page_numbering | Page order issues |

Manual Tags (Recommended):

  • property:<id> - Property identifier
  • role:referee or role:receiver - Your capacity
  • doc_type:bank_statement, doc_type:rent_roll - Document type
  • period:YYYY-MM - Time period

🖥️ Web Dashboard

Main Dashboard (/)

  • Document Cards: Visual cards for each processed document
  • Anomaly Indicators: Red badges for detected issues
  • Quick Stats: Total documents, anomalies found, success rate
  • Filters: Type, date range, amount threshold

Statistics (/api/stats)

JSON response with:

{
  "total_documents": 1234,
  "documents_with_anomalies": 45,
  "anomaly_rate": 3.6,
  "by_type": {
    "balance_mismatch": 20,
    "layout_irregularity": 15,
    "duplicate_lines": 10
  }
}
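The sample values are consistent with anomaly_rate being a percentage rounded to one decimal place (45 of 1234 documents). That reading is inferred from the example rather than confirmed by the source; under that assumption the calculation would be:

```python
def anomaly_rate(total_documents, documents_with_anomalies):
    """Percentage of processed documents with at least one anomaly (assumed formula)."""
    if total_documents == 0:
        return 0.0  # avoid division by zero before anything is processed
    return round(100 * documents_with_anomalies / total_documents, 1)

print(anomaly_rate(1234, 45))  # 3.6
```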

Document List (/api/documents)

Query parameters:

  • anomaly_type: Filter by specific anomaly
  • min_amount: Minimum balance discrepancy
  • max_amount: Maximum balance discrepancy
  • start_date: ISO format (2024-01-01)
  • end_date: ISO format (2024-12-31)
  • limit: Results per page (default: 50)
  • offset: Pagination offset

Example:

curl "http://localhost:8050/api/documents?anomaly_type=balance_mismatch&min_amount=100"
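The same query can be built from Python using only the standard library. The helper name documents_url is hypothetical; the endpoint and parameter names are the ones listed above.

```python
from urllib.parse import urlencode

def documents_url(base_url="http://localhost:8050", **filters):
    """Build a /api/documents URL, skipping filters that are None."""
    params = {key: value for key, value in filters.items() if value is not None}
    query = urlencode(params)
    return f"{base_url}/api/documents" + (f"?{query}" if query else "")

print(documents_url(anomaly_type="balance_mismatch", min_amount=100))
# http://localhost:8050/api/documents?anomaly_type=balance_mismatch&min_amount=100
```

The resulting URL can be passed to urllib.request.urlopen or any HTTP client; urlencode takes care of escaping values such as dates.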

🔌 API Reference

GET /health

Health check endpoint.

Response: {"status": "healthy"}

GET /api/stats

Overall statistics.

Response:

{
  "total_documents": 1234,
  "documents_with_anomalies": 45,
  "anomaly_rate": 3.6,
  "by_type": {...}
}

GET /api/documents

List processed documents.

Query Params: See Document List section

Response:

{
  "documents": [...],
  "total": 1234,
  "limit": 50,
  "offset": 0
}

GET /api/anomalies

List all anomaly logs.

Query Params: document_id, severity, resolved

Response:

{
  "anomalies": [
    {
      "id": 123,
      "document_id": 456,
      "anomaly_type": "balance_mismatch",
      "severity": "high",
      "description": "Beginning + Credits - Debits != Ending",
      "amount": 150.00,
      "detected_at": "2024-01-15T10:30:00Z"
    }
  ]
}
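A response in this shape is straightforward to aggregate on the client side. The sketch below groups findings by severity; summarize_anomalies is a hypothetical helper, and it assumes only the fields shown in the example response.

```python
import json
from collections import defaultdict

def summarize_anomalies(payload):
    """Count anomalies and sum amounts per severity from a response body."""
    totals = defaultdict(lambda: {"count": 0, "amount": 0.0})
    for item in json.loads(payload)["anomalies"]:
        bucket = totals[item["severity"]]
        bucket["count"] += 1
        bucket["amount"] += item.get("amount") or 0.0  # amount may be absent or null
    return dict(totals)

body = json.dumps({"anomalies": [
    {"id": 123, "document_id": 456, "anomaly_type": "balance_mismatch",
     "severity": "high", "amount": 150.00},
    {"id": 124, "document_id": 457, "anomaly_type": "duplicate_lines",
     "severity": "low", "amount": None},
]})
print(summarize_anomalies(body))
# {'high': {'count': 1, 'amount': 150.0}, 'low': {'count': 1, 'amount': 0.0}}
```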

POST /api/trigger-scan

Manually trigger document polling.

Response: {"status": "scan_initiated"}

🔗 Integration

NGINX Reverse Proxy

Add to your nginx.conf:

location /paperless-anomaly-detector/ {
    proxy_pass http://localhost:8050/;
    proxy_http_version 1.1;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;

    # URL rewriting for subpath
    sub_filter_once off;
    sub_filter 'href="/' 'href="/paperless-anomaly-detector/';
    sub_filter 'src="/' 'src="/paperless-anomaly-detector/';
    sub_filter 'action="/' 'action="/paperless-anomaly-detector/';
}

Then access at: https://yourdomain.com/paperless-anomaly-detector/

Paperless Workflow Integration

Create saved searches in Paperless:

  1. High-Priority Anomalies:

    tags:anomaly:balance_mismatch AND balance_diff_amount:>100
    
  2. Recent Anomalies:

    tags:anomaly:detected AND created:[now-7d TO now]
    
  3. Unresolved Issues:

    tags:anomaly:detected AND NOT tags:reviewed
    

🔧 Troubleshooting

No Documents Being Processed

Symptoms: Dashboard shows 0 documents

Solutions:

  1. Verify API connectivity:

    docker exec paperless-anomaly-detector curl -H "Authorization: Token YOUR_TOKEN" \
      http://paperless-web:8000/api/documents/?page_size=1
  2. Check logs for errors:

    docker compose logs -f paperless-anomaly-detector
  3. Manually trigger scan:

    curl -X POST http://localhost:8050/api/trigger-scan
  4. Check API token permissions in Paperless

False Positives

Symptoms: Too many anomalies detected

Solutions:

  1. Increase BALANCE_TOLERANCE:

    environment:
      BALANCE_TOLERANCE: 0.05  # $0.05 instead of $0.01
  2. Adjust LAYOUT_VARIANCE_THRESHOLD:

    environment:
      LAYOUT_VARIANCE_THRESHOLD: 0.5  # More lenient
  3. Review pattern detection rules in app/detector.py

  4. Use LLM enhancement for better context understanding

Performance Issues

Symptoms: Slow processing, high CPU usage

Solutions:

  1. Reduce batch size:

    environment:
      BATCH_SIZE: 5
  2. Increase polling interval:

    environment:
      POLLING_INTERVAL: 600  # 10 minutes
  3. Disable LLM if not needed:

    environment:
      LLM_PROVIDER: ""
  4. Use PostgreSQL instead of SQLite:

    environment:
      DATABASE_URL: postgresql://user:pass@postgres:5432/anomalies

LLM Not Working

Symptoms: No LLM-enhanced analysis, errors in logs

Solutions:

  1. Verify API key:

    docker exec paperless-anomaly-detector printenv | grep LLM
  2. Test API key manually:

    curl https://api.anthropic.com/v1/models \
      -H "x-api-key: YOUR_KEY" -H "anthropic-version: 2023-06-01"
  3. Check rate limits in provider dashboard

  4. Ensure LLM_PROVIDER is set correctly

💻 Development

Local Testing

# Install dependencies
pip install -r requirements.txt

# Run locally
cd app
python main.py

Running Tests

# Install test dependencies
pip install pytest pytest-cov

# Run tests
pytest tests/

# With coverage
pytest --cov=app tests/

Adding New Detection Algorithms

  1. Edit app/detector.py
  2. Add method to AnomalyDetector class:
    def detect_my_anomaly(self, content, metadata):
        """Detect my custom anomaly."""
        findings = []
        # Your logic here
        return findings
  3. Update detect_all_anomalies() to call your method
  4. Add corresponding tag handling
  5. Test thoroughly
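As a fuller example of step 2, the sketch below implements a date sequence check. The AnomalyDetector class here is a stand-in so the example runs on its own, and the finding dictionary keys and the date_sequence type name are assumptions; match them to whatever detect_all_anomalies() expects in app/detector.py.

```python
import re
from datetime import datetime

class AnomalyDetector:  # stand-in for the class in app/detector.py
    # Assumes transaction lines start with a MM/DD/YYYY date
    DATE_AT_LINE_START = re.compile(r"^(\d{2}/\d{2}/\d{4})\b")

    def detect_date_sequence(self, content, metadata):
        """Flag dated lines whose date is earlier than the previous dated line."""
        findings = []
        previous = None
        for line_no, line in enumerate(content.splitlines(), start=1):
            match = self.DATE_AT_LINE_START.match(line.strip())
            if not match:
                continue
            try:
                current = datetime.strptime(match.group(1), "%m/%d/%Y")
            except ValueError:
                continue  # OCR produced an impossible date such as 13/45/2024
            if previous is not None and current < previous:
                findings.append({
                    "anomaly_type": "date_sequence",  # hypothetical type name
                    "severity": "low",
                    "description": f"Line {line_no} is dated before the previous transaction",
                })
            previous = current
        return findings

text = "01/05/2024 DEPOSIT 100.00\n01/03/2024 ATM 20.00\n01/06/2024 FEE 2.00"
print(AnomalyDetector().detect_date_sequence(text, metadata={}))
```

Statements listed newest-first would trip this check on every line, so a production version would likely need to detect the sort direction before flagging anything.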

Database Schema

processed_documents:

CREATE TABLE processed_documents (
    id INTEGER PRIMARY KEY,
    paperless_doc_id INTEGER UNIQUE,
    title TEXT,
    processed_at TIMESTAMP,
    has_anomalies BOOLEAN,
    balance_status TEXT,
    balance_diff REAL,
    layout_score REAL
);

anomaly_logs:

CREATE TABLE anomaly_logs (
    id INTEGER PRIMARY KEY,
    document_id INTEGER REFERENCES processed_documents(id),
    anomaly_type TEXT,
    severity TEXT,
    description TEXT,
    amount REAL,
    detected_at TIMESTAMP,
    resolved BOOLEAN DEFAULT 0
);
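With this schema, unresolved findings can be listed by joining the two tables. The sketch below recreates the tables in an in-memory SQLite database with made-up rows so it runs standalone; the unresolved_anomalies helper is hypothetical, and against a real deployment you would open /app/data/anomalies.db instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database for the example
conn.executescript("""
CREATE TABLE processed_documents (
    id INTEGER PRIMARY KEY, paperless_doc_id INTEGER UNIQUE, title TEXT,
    processed_at TIMESTAMP, has_anomalies BOOLEAN, balance_status TEXT,
    balance_diff REAL, layout_score REAL
);
CREATE TABLE anomaly_logs (
    id INTEGER PRIMARY KEY,
    document_id INTEGER REFERENCES processed_documents(id),
    anomaly_type TEXT, severity TEXT, description TEXT, amount REAL,
    detected_at TIMESTAMP, resolved BOOLEAN DEFAULT 0
);
INSERT INTO processed_documents (id, paperless_doc_id, title, has_anomalies)
    VALUES (1, 456, 'January statement', 1);
INSERT INTO anomaly_logs (document_id, anomaly_type, severity, amount, resolved)
    VALUES (1, 'balance_mismatch', 'high', 150.0, 0),
           (1, 'duplicate_lines', 'low', 4.5, 1);
""")

def unresolved_anomalies(conn, min_amount=0):
    """Unresolved findings at or above min_amount, largest first."""
    return conn.execute(
        """SELECT d.paperless_doc_id, a.anomaly_type, a.severity, a.amount
           FROM anomaly_logs AS a
           JOIN processed_documents AS d ON d.id = a.document_id
           WHERE a.resolved = 0 AND a.amount >= ?
           ORDER BY a.amount DESC""",
        (min_amount,),
    ).fetchall()

print(unresolved_anomalies(conn, min_amount=100))
# [(456, 'balance_mismatch', 'high', 150.0)]
```

Note that anomaly_logs.document_id refers to the internal processed_documents.id, so the join is needed to recover the Paperless document ID.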

📈 Performance

Resource Usage

  • Memory: 200-500MB depending on document volume
  • CPU: Low during polling, spikes during processing
  • Disk: SQLite database grows ~10KB per document

Benchmarks

Typical processing times (Intel i7, 16GB RAM):

| Document Type | Pages | Processing Time |
|---------------|-------|-----------------|
| Bank Statement | 2 | 2-4 seconds |
| Invoice | 1 | 1-2 seconds |
| Credit Card | 5 | 5-8 seconds |

With LLM enabled, add 1-3 seconds per document.

Optimization

For high-volume deployments:

  • Use PostgreSQL instead of SQLite
  • Increase BATCH_SIZE for better throughput
  • Run multiple instances with partitioned document sets
  • Consider async processing with message queue

🔒 Security Notes

  1. API Token: Never logged or exposed in responses. Store in environment variable.

  2. Database: SQLite by default. Use PostgreSQL with encrypted connections for production.

  3. HTTPS: Always use NGINX reverse proxy with TLS in production.

  4. Access Control: Add HTTP basic auth via NGINX for additional security.

  5. Non-Destructive: The service reads document content and writes only tags and custom fields. It never modifies the original files.

  6. Audit Trail: All actions logged with timestamps in application logs.

❓ FAQ

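Q: Can I reprocess documents? A: Yes. With the default SQLite database, delete the database file and then restart the container: docker exec paperless-anomaly-detector rm /app/data/anomalies.db followed by docker compose restart paperless-anomaly-detector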

Q: Does this work with scanned documents? A: Yes, as long as Paperless has performed OCR. Quality depends on scan quality.

Q: Can I customize which anomalies are detected? A: Yes, edit app/detector.py to add/remove detection rules.

Q: What document types are supported? A: Bank statements, credit cards, invoices, receipts. Easily extensible.

Q: How accurate is the balance validation? A: Very accurate for properly formatted statements. Configure tolerance for edge cases.

Q: Can I use this without LLM? A: Yes, deterministic checks work fine without LLM. LLM is optional enhancement.

Q: Does this modify my documents? A: No, it only adds tags and custom fields. Original PDFs are never modified.

Q: Can I run this on multiple Paperless instances? A: Yes. Run one container per instance, each with its own PAPERLESS_API_BASE_URL, PAPERLESS_API_TOKEN, data volume, and host port.

🤝 Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes with tests
  4. Submit a Pull Request

📜 License

MIT License - see LICENSE file for details.


Perfect for property managers, accountants, auditors, and anyone who needs automated financial document validation.
