A comprehensive Python application that extracts transactions from bank statement PDFs, categorizes business expenses, and generates filled IRS Schedule C tax forms automatically. Features a modular parser system supporting multiple major banks with intelligent transaction detection and AI-powered categorization.
- Navy Federal Credit Union: Both checking/savings (MM-DD format) and credit card (MM/DD/YY format) statements
- Capital One: Credit card statements with transaction/post date format
- Citibank: Complex multi-line transaction formats with orphaned amount matching
- Chase: Standard and no-date transaction patterns with PayPal support
- Bank of America: Traditional bank statement formats
- Generic Parser: AI-powered fallback parser using K-means clustering for unknown statement formats
- Automatic Detection: Intelligent bank identification from PDF content with fallback support
- PDF Transaction Extraction: Automatically extract transactions from bank statement PDFs
- Modular Parser System: Plugin-based architecture for easy bank format extension
- Machine Learning Parser: K-means clustering algorithm automatically detects transaction patterns in unknown statement formats
- AI-Powered Categorization: Optional OpenAI integration for intelligent expense categorization
- Transaction Management: Delete, edit, and manage transactions with real-time updates
- Multiprocessing Support: Parallel processing for faster PDF analysis
- Schedule C Generation: Generate filled IRS Schedule C PDFs with correct field mappings
- Category Management: Configurable business expense categories with learning capabilities
- Excel Export: Detailed transaction reports with category summaries
- Search & Filter: Advanced transaction search and filtering capabilities
- Interactive GUI: PyQt6-based graphical interface for easy use
- Visual Field Mapper: Interactive PyQt6 tool to create and modify PDF field mappings
- Real-time Status: Live processing updates and bank detection feedback
- Context Menus: Right-click transaction management and bulk operations
- Installation
- Quick Start
- Core Components
- Usage Guide
- Configuration
- Advanced Features
- Troubleshooting
The easiest way to install Statement Organizer is using the automated installer:
-
Download the project:
git clone <repository-url> cd statement_organizer
-
Run the installer:
python3 install.py
The installer will:
- ✅ Detect your operating system (Windows, macOS, Linux, BSD)
- ✅ Check for Python 3.12+ (download if needed)
- ✅ Install all required dependencies automatically
- ✅ Create executable scripts for easy application launching
- ✅ Set up the config directory structure
If you prefer manual installation:
-
Prerequisites:
- Python 3.12+
- Virtual environment (recommended)
-
Setup:
git clone <repository-url> cd statement_organizer python3 -m venv venv source venv/bin/activate # On Windows: venv\Scripts\activate pip install -r requirements.txt
After installation, use the generated scripts:
Windows:
bank_statement_gui.bat # Main GUI application
pdf_field_mapper.bat # PDF field mapping tool
final_schedule_c_filler.bat # Schedule C PDF processorUnix (Linux/macOS/BSD):
./bank_statement_gui.sh # Main GUI application
./pdf_field_mapper.sh # PDF field mapping tool
./final_schedule_c_filler.sh # Schedule C PDF processorWindows:
bank_statement_gui.batUnix (Linux/macOS/BSD):
./bank_statement_gui.sh-
Load Bank Statements: Click "Add PDFs" to select your bank statement files
- Supports: Navy Federal, Capital One, Citibank, Chase, Bank of America
- Automatically detects bank format from PDF content
- Can process multiple banks simultaneously
-
Configure Categories: Choose your business categories file (or use default)
- Optional: Enable AI categorization for intelligent expense classification
-
Process Transactions: Click "Process PDFs" to analyze transactions
- Real-time status updates show bank detection and extraction progress
- Multiprocessing for faster analysis of multiple files
-
Manage Transactions: Review and manage extracted transactions
- Edit categories with dropdown menus
- Delete unwanted transactions with right-click or delete button
- Search and filter transactions by description
-
Export Results: Generate reports and tax forms
- Export to Excel with category summaries
- Generate filled Schedule C PDF forms
source venv/bin/activate
python final_schedule_c_filler.pyThis will:
- Process all PDFs in the
Statements/folder - Categorize transactions automatically
- Generate
schedule_c_final_filled.pdf
statement_organizer/
├── bank_parsers/ # Core parser implementations
│ ├── base_parser.py # Base parser interface
│ ├── navy_federal.py # Navy Federal parser
│ ├── capital_one.py # Capital One parser
│ ├── citibank.py # Citibank parser
│ ├── chase.py # Chase parser
│ ├── bank_of_america.py # Bank of America parser
│ ├── generic_parser.py # AI-powered generic parser
│ └── ml_parser.py # ML-based parser (99.7% accuracy)
├── ml_models/ # Trained ML models
├── config/ # Configuration files
├── Statements/ # Input PDF statements
├── utils/ # Development and analysis tools
├── bank_statement_analyzer.py # Core analyzer engine
├── bank_statement_gui.py # GUI application
└── requirements.txt # Dependencies
Plugin-based architecture for parsing different bank statement formats.
Key Features:
- Auto-Detection: Automatically identifies bank format from PDF content
- Extensible Design: Easy to add new bank parsers by implementing the interface
- Standardized Output: All parsers return consistent transaction format
- Fallback System: Uses generic parser if bank not recognized
Supported Banks:
navy_federal.py: Navy Federal Credit Union (checking & credit card formats)capital_one.py: Capital One credit card statementscitibank.py: Citibank statements with complex multi-line transactionschase.py: Chase statements including PayPal and no-date transactionsbank_of_america.py: Bank of America traditional formats
The core engine that orchestrates transaction extraction and categorization.
Key Features:
- PDF text extraction using pdfplumber
- Modular parser integration with automatic bank detection
- Multiprocessing support for parallel PDF processing
- Pattern-based and AI-powered categorization
- Schedule C data generation with field mapping
PyQt6-based graphical interface for comprehensive transaction management.
Features:
- Multi-bank PDF loading with automatic detection
- Real-time processing progress with bank identification
- Interactive transaction review and editing
- Transaction deletion with confirmation dialogs
- Category management with learning capabilities
- Search and filter functionality
- Export capabilities (Excel and Schedule C PDF)
- AI-Powered Categorization (optional)
The GUI includes an intelligent "Use AI Categorization" checkbox that enhances transaction categorization accuracy:
🤖 How It Works:
- Auto-Detection: Checkbox automatically enables if
openai.txtfile exists and OpenAI package is installed - Smart Processing: Uses GPT-3.5-turbo to categorize ambiguous transactions
- Fallback System: If AI fails, automatically falls back to pattern matching
- Real-Time Feedback: Shows AI categorization progress in the status display area
📋 Categorization Priority:
- Learned Categories (from previous manual corrections)
- AI Categorization (if enabled and available)
- Pattern Matching (keyword-based)
- Default: "Other Business Expenses"
⚙️ Setup Requirements:
- Install OpenAI package:
pip install openai - Create
openai.txtfile with your API key - Checkbox will auto-enable when requirements are met
💡 Benefits:
- Higher Accuracy: AI understands context better than simple keyword matching
- Learns Context: Considers transaction amounts, descriptions, and business categories
- Cost Effective: Uses GPT-3.5-turbo for affordable processing
- Transparent: Status area shows which transactions are AI-categorized
- Optional: Can be disabled to use only pattern matching
Purpose: Alternative PDF processing tool
Key Features:
- PDF form field analysis
- Direct form filling capabilities
- Schedule C specific processing
- Configurable field mappings export/import
Generates filled IRS Schedule C PDFs with accurate field mappings.
Capabilities:
- JSON-based field mapping configuration
- Automatic PDF form filling
- Multiple mapping strategies
- Error handling and validation
Interactive PyQt6 tool for creating and modifying PDF field mappings.
Features:
- Visual PDF display with clickable field overlays
- Interactive field selection and mapping
- Business category dropdown selection
- Tree view of current mappings
- Save/load JSON mapping configurations
- Page navigation for multi-page PDFs
- Real-time field highlighting
-
Launch the application:
python bank_statement_gui.py
-
Load Bank Statement PDFs:
- Click "Add PDFs" or drag-and-drop files
- Select multiple bank statement PDFs from any supported bank
- Supported Banks: Navy Federal, Capital One, Citibank, Chase, Bank of America
- Application automatically detects bank format from PDF content
-
Configure Categories & AI:
- Use default
business_categories.jsonor load custom configuration - Optional: Enable "Use AI Categorization" checkbox for intelligent expense classification
- Categories determine how expenses are classified for tax purposes
- Use default
-
Process Transactions:
- Click "Process PDFs" to start extraction
- Real-time Status: Monitor bank detection and extraction progress
- Multiprocessing: Multiple PDFs processed in parallel for speed
- Review extracted transactions in the interactive table
-
Manage Transactions:
- Edit Categories: Use dropdown menus to change transaction categories
- Delete Transactions: Right-click or use delete button to remove unwanted entries
- Search & Filter: Find specific transactions by description
- Bulk Operations: Apply categories to multiple similar transactions
-
Export Results:
- Excel Export: Generate detailed transaction reports with category summaries
- Schedule C PDF: Create filled IRS tax forms with expense totals
- Category Learning: System remembers manual corrections for future processing
-
Place PDFs in Statements folder:
mkdir -p Statements cp your_bank_statements.pdf Statements/
-
Run the processor:
python final_schedule_c_filler.py
-
Check output:
- Generated PDF:
schedule_c_final_filled.pdf - Processing logs show categorization details
- Generated PDF:
Use the interactive field mapper to create custom PDF mappings:
-
Launch the field mapper:
python pdf_field_mapper.py
-
Open your PDF form:
- Click "Open PDF"
- Select the PDF form you want to map
- Navigate through pages if needed
-
Map fields visually:
- Red overlays show available form fields
- Click on a field to select it
- Choose expense category from dropdown
- Click "Map Selected Field"
-
Save configuration:
- Click "Save Mapping"
- Export as JSON file
- Use with Schedule C processor
Use the interactive GUI to create accurate PDF field mappings:
-
Launch the field mapper:
python pdf_field_mapper.py
-
Open your PDF form:
- Click "Open PDF"
- Select the PDF form you want to map
- Navigate through pages if needed
-
Map fields visually:
- Red overlays show available form fields
- Click on a field to select it
- Choose expense category from dropdown
- Click "Map Selected Field"
-
Save configuration:
- Click "Save Mapping"
- Export as JSON file
- Use with Schedule C processor
The modular parser system enables support for multiple bank statement formats through a plugin-based architecture:
All bank parsers implement the BankStatementParser interface:
class BankStatementParser:
def can_parse(self, text: str) -> bool:
"""Detect if this parser can handle the PDF content"""
def extract_transactions(self, text: str) -> List[Dict]:
"""Extract transactions with bank-specific logic"""
def get_account_info(self, text: str) -> Dict:
"""Extract account metadata (number, dates, etc.)"""The system automatically identifies bank formats using unique identifiers:
- Navy Federal: "Navy Federal", "NFCU", "Navy Federal Credit Union"
- Capital One: "Capital One", "capitalone.com", "CAPITAL ONE"
- Citibank: "Citibank", "CITI", "citicards.com"
- Chase: "Chase", "JPMorgan Chase", "CHASE CARD SERVICES"
- Bank of America: "Bank of America", "BofA", "bankofamerica.com"
The BankParserRegistry manages parser detection and priority:
# Priority order (highest to lowest)
1. Navy Federal Credit Union
2. Capital One
3. Citibank
4. Chase
5. Bank of America
6. Generic fallback parserTo add support for a new bank:
- Create parser file:
bank_parsers/new_bank.py - Implement interface: Extend
BankStatementParser - Add detection logic: Unique bank identifiers
- Register parser: Add to
bank_parsers/registry.py - Test thoroughly: Create test cases for various statement formats
All parsers return transactions in this standardized format:
{
'date': datetime.date,
'description': str,
'amount': float,
'category': str,
'transaction_type': str # 'debit' or 'credit'
}The system uses JSON configuration files stored in the config/ folder:
-
Business Categories (
config/business_categories.json):- Define expense categorization rules
- Add keywords for automatic matching
- Customize categories for your business
-
Field Mappings (
config/schedule_c_field_mappings.json):- Maps business categories to PDF form fields
- Uses precise field patterns (f1_35, f1_27, etc.)
- Ensures accurate form filling
-
Schedule C Form (
config/schedule_c.pdf):- Official IRS Schedule C form
- Used as template for filling
- Must be a fillable PDF form
Define how transactions are categorized:
{
"Meals & Entertainment": [
"restaurant",
"cafe",
"doordash"
],
"Software & Subscriptions": [
"github",
"aws",
"google cloud"
],
"Marketing": [
"facebook ads",
"google ads"
]
}Structure:
- Keys: Business expense categories
- Values: Arrays of keywords/patterns to match
Maps expense categories to PDF form fields:
{
"schedule_c_mappings": {
"Car and truck expenses": {
"line": "9",
"field_pattern": "f1_36",
"description": "Schedule C Line 9"
}
}
}Fields:
- line: IRS Schedule C line number
- field_pattern: PDF form field identifier
- description: Human-readable description
The system includes an advanced Generic Parser that uses machine learning to automatically detect transaction patterns in unknown statement formats:
- K-means Clustering: Groups PDF lines by layout features (position, money presence, dates, text characteristics)
- Pattern Detection: Automatically generates regex patterns for transaction extraction
- Smart Filtering: Removes summary/header content using keyword detection
- Fallback Support: Activates when bank-specific parsers fail
- Layout Analysis: Analyzes PDF character positioning and line structure
- Feature Extraction: Uses 13+ features including money patterns, date presence, and spatial positioning
- Automatic Regex Generation: Creates custom regex patterns based on detected transaction clusters
- Multi-format Support: Handles various date formats, currency symbols, and layout styles
The Generic Parser automatically activates as a fallback when:
- No specific bank parser can handle the PDF
- Statement format is unknown or unsupported
- Bank-specific parser fails to extract transactions
# The Generic Parser is automatically registered and used
from bank_parsers.generic_regex import GenericRegexParser
# Test if a PDF can be parsed
parser = GenericRegexParser()
if parser.can_parse(pdf_text):
transactions = parser.extract_transactions(pdf_text)Requires additional machine learning packages:
pip install scikit-learn numpy pandasThe regex_builder.py is a standalone analysis tool for discovering transaction patterns in bank statements that aren't picked up by existing parsers. It uses the same K-means clustering approach as the Generic Parser but provides detailed analysis and visual output.
- Pattern Discovery: Analyze unsupported bank statement formats to understand their structure
- Regex Generation: Automatically generate regex patterns for new bank statement types
- Visual Analysis: Create visual representations of PDF layout and detected patterns
- Development Aid: Help developers create new bank-specific parsers
Basic Analysis:
# Activate virtual environment with ML dependencies
source venv_new/bin/activate
# Analyze a single PDF statement
python regex_builder.py path/to/statement.pdfVisual Analysis with --draw flag:
# Generate visual analysis images (requires Pillow)
python regex_builder.py path/to/statement.pdf --drawAutomatic Pattern Detection:
- K-means clustering of PDF lines by layout features
- Automatic identification of transaction-like content
- Smart filtering of headers, summaries, and non-transaction content
- Generation of flexible regex patterns without hardcoded literals
Visual Analysis (--draw flag):
- Page Layout Visualization: Shows PDF structure with detected lines
- Cluster Analysis: Color-coded visualization of different line clusters
- Transaction Highlighting: Visual identification of detected transaction patterns
- Pattern Guides: Visual guides showing regex pattern matching areas
Console Output:
Processing PDF with 4 pages...
--- Page 1 ---
Extracted 66 lines from page 1
Clustering produced 4 clusters
Chosen cluster for transactions: 1 (score: 102.61)
Found 12 transaction-like lines
Generated regex pattern: ^\s*(?:\d{1,2}[/-]\d{1,2}...)
Matched 12 out of 12 transaction lines
Generated Files:
transactions_extracted.csv: Extracted transaction datapage_N_analysis.png: Visual analysis images (with --draw)- Console output with learned regex patterns
Use regex_builder.py when:
- Bank statements aren't supported by existing parsers
- Generic Parser fails to extract transactions properly
- You need to understand the structure of a new statement format
- Developing a new bank-specific parser
- Debugging transaction extraction issues
Example Workflow:
- Run basic analysis to see if transactions are detected
- Use --draw flag to visualize the PDF structure and clustering
- Examine generated regex patterns for new parser development
- Review extracted CSV to validate transaction accuracy
- Iterate and refine patterns based on results
The regex_builder.py tool serves as the foundation for the Generic Parser:
- Same K-means clustering algorithm
- Same pattern detection logic
- Provides detailed analysis that the Generic Parser uses automatically
- Useful for debugging when Generic Parser performance is suboptimal
Extend transaction recognition by modifying extract_transactions() in bank_statement_analyzer.py:
# Add custom patterns for your bank's format
transaction_patterns = [
r'(\d{2}/\d{2}/\d{4})\s+(.+?)\s+(-?\$?[\d,]+\.?\d*)',
# Add your bank's specific pattern here
]The system supports various bank statement formats:
- Standard date-description-amount layouts
- Multi-column formats
- Different date formats (MM/DD/YYYY, DD/MM/YYYY)
Process multiple months/years of statements:
from bank_statement_analyzer import BankStatementAnalyzer
analyzer = BankStatementAnalyzer()
pdf_files = glob.glob("Statements/*.pdf")
analyzer.extract_from_multiple_pdfs(pdf_files)
analyzer.categorize_transactions()
data = analyzer.generate_schedule_c_data()Export data in multiple formats:
# Export to Excel
analyzer.export_to_excel("transactions.xlsx")
# Export to CSV
df = analyzer.get_transactions_dataframe()
df.to_csv("transactions.csv", index=False)
# Export Schedule C data
schedule_data = analyzer.generate_schedule_c_data()
with open("schedule_c_data.json", "w") as f:
json.dump(schedule_data, f, indent=2)The system includes a comprehensive test suite for evaluating the Generic Parser's performance:
# Activate virtual environment
source venv_new/bin/activate
# Run comprehensive tests on all PDFs
python test_generic_parser.pyThe test script evaluates:
- Success Rate: Percentage of PDFs successfully parsed
- Transaction Count: Number of transactions extracted per file
- Parser Comparison: Generic parser vs bank-specific parsers
- Performance Metrics: Processing time and accuracy
GENERIC PARSER TEST RESULTS
============================================================
Total files tested: 20
Successful extractions: 5
Success rate: 25.0%
Top performing files:
1. 2024-09-09_VISASTMT.pdf: 16 transactions
2. Statement_012025_9746.pdf: 24 transactions
3. Statement_022025_9746.pdf: 20 transactions
- High transaction counts: Indicates good pattern detection
- Low success rates: May indicate need for additional training data
- Zero transactions: Could indicate PDF format incompatibility
To enhance Generic Parser accuracy:
- Add training data: Include more diverse PDF formats in K_cluster_test/
- Adjust clustering parameters: Modify n_clusters in
cluster_transactions() - Update filtering keywords: Enhance
summary_keywordslist - Refine regex patterns: Improve date and money detection patterns
The test_all_parsers.py script provides comprehensive testing of all parsers against your entire PDF collection to evaluate system-wide performance and identify areas for improvement.
- System-wide Evaluation: Test all parsers against all PDFs in the Statements directory
- Performance Metrics: Generate detailed statistics on parser success rates and transaction extraction
- Comparative Analysis: Compare performance between different parsers on the same files
- Quality Assurance: Identify failing files and parser detection issues
# Activate virtual environment
source venv_new/bin/activate
# Run comprehensive parser testing
python test_all_parsers.pyComprehensive Testing:
- Tests all PDFs in the Statements directory automatically
- Evaluates both automatic parser detection and individual parser performance
- Measures processing time and success rates
- Identifies files that fail to parse
Detailed Analysis:
- Parser Detection Results: Shows which parser was automatically selected for each file
- Individual Parser Performance: Tests each parser against all files to show capability
- Success Rate Metrics: Calculates success percentages and transaction counts
- Performance Statistics: Processing time analysis and optimization insights
Console Progress:
Comprehensive Parser Efficacy Test
==================================================
Found 180 PDF files to test
Progress: 95/180 (52.8%)
Testing: eStmt_2024-05-31.pdf
Detected parser: Bank of America
Detected parser result: 45 transactions
Navy Federal: ✗ (0 transactions)
Capital One: ✗ (0 transactions)
Citibank: ✗ (0 transactions)
Chase: ✗ (0 transactions)
Bank of America: ✓ (45 transactions)
Generic: ✓ (12 transactions)
Final Results:
============================================================
TEST COMPLETED
============================================================
Total files tested: 180
Total testing time: 32.38 seconds
Detailed statistics saved to: statistics.txt
Quick Summary:
Successfully parsed: 172/180 (95.6%)
Average processing time: 0.18s per file
statistics.txt - Comprehensive analysis including:
- Overall Statistics: Success rates and failure analysis
- Parser Detection Results: Which parsers were selected and their performance
- Individual Parser Performance: Detailed breakdown of each parser's capabilities
- Failed Files Analysis: Specific files that couldn't be parsed and why
- Top Performing Files: Best extraction results with transaction counts
- Performance Metrics: Processing time statistics
- Recommendations: Actionable insights for system improvement
Success Rate Analysis:
- 95%+ success rate: Excellent system performance
- 80-95% success rate: Good performance with room for optimization
- <80% success rate: May indicate parser detection or format compatibility issues
Parser Performance Indicators:
- High transaction counts: Indicates effective pattern recognition
- 100% success on detected files: Shows parser accuracy when properly matched
- Low detection rates: May indicate overly restrictive
can_parse()methods
Identify Improvement Areas:
- Failed Files: Use
regex_builder.pyto analyze unsupported formats - Low Detection Rates: Adjust parser
can_parse()methods for better coverage - Performance Issues: Optimize slow parsers or add more specific patterns
- Generic Parser Tuning: Use results to improve fallback parser accuracy
System Monitoring:
- Run periodically to ensure consistent performance
- Compare results after parser updates or new bank statement formats
- Track improvements in success rates over time
Regular Testing:
- Run after adding new parsers or modifying existing ones
- Test with new statement formats before production use
- Monitor performance after system updates
Result Analysis:
- Focus on files with zero transactions - may indicate parsing issues
- Compare individual parser results with detection results to identify mismatches
- Use performance metrics to optimize processing speed
Problem: "No transactions found in PDF" Solution:
- Verify PDF is text-based (not scanned image)
- Check if PDF format matches expected patterns
- Try different PDF extraction methods
Problem: "Numbers appear in wrong PDF fields" Solution:
- Use the visual field mapper tool
- Create custom field mapping configuration
- Verify PDF form field names
Problem: "Transactions not categorized correctly" Solutions:
- Update
business_categories.jsonwith better keywords - Add specific merchant names to categories
- Update business categories with more specific keywords
For large numbers of PDFs:
-
Process in batches:
# Process 10 PDFs at a time for batch in chunks(pdf_files, 10): analyzer.extract_from_multiple_pdfs(batch)
-
Use multiprocessing:
from multiprocessing import Pool with Pool() as pool: results = pool.map(process_pdf, pdf_files)
Enable detailed logging:
import logging
logging.basicConfig(level=logging.DEBUG)
analyzer = BankStatementAnalyzer()
# Detailed logs will show extraction and categorization stepsstatement_organizer/
├── config/ # Configuration files
│ ├── business_categories.json # Expense categorization rules
│ ├── schedule_c_field_mappings.json # PDF field mappings
│ ├── learned_categories.json # Learned categorization patterns
│ └── schedule_c.pdf # IRS Schedule C form template
├── bank_parsers/ # Modular bank parser system
│ ├── __init__.py # Parser base classes and registry
│ ├── registry.py # Parser registration and detection
│ ├── navy_federal.py # Navy Federal Credit Union parser
│ ├── capital_one.py # Capital One parser
│ ├── citibank.py # Citibank parser
│ ├── chase.py # Chase parser
│ ├── bank_of_america.py # Bank of America parser
│ └── generic_regex.py # ML-powered generic parser
├── bank_statement_analyzer.py # Core transaction extraction
├── bank_statement_gui.py # Main GUI interface
├── final_schedule_c_filler.py # Main PDF form filler
├── schedule_c_processor.py # Alternative Schedule C processor
├── pdf_field_mapper.py # Field mapping utilities
├── create_categories.py # Category creation tool
├── test_generic_parser.py # Generic parser test suite
├── requirements.txt # Python dependencies
├── .gitignore # Git ignore rules
├── venv/ # Virtual environment
└── README.md # This file
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is licensed under the GNU GPL3 License - see the LICENSE file for details.
For issues and questions:
- Check the troubleshooting section
- Review the configuration files
- Enable debug logging for detailed information
- Create an issue with detailed error information
Note: This tool is designed to assist with tax preparation but should not replace professional tax advice. Always review generated forms and consult with a tax professional for complex situations.