Automatic Requirements Extraction Tool for PDF Specifications
ReqBot is a desktop application that automatically extracts requirements from PDF specification documents using spaCy-based NLP. It generates compliance matrices in Excel format and BASIL-compatible SPDX exports, and creates annotated PDFs with highlighted requirements.
Version 2.2.0 focuses on quality of life improvements and performance:
- Search & Filter in Results (Planned) - Find and filter extracted requirements in GUI
- Parallel PDF Processing (Planned) - 2-3x faster batch operations
- Drag & Drop Support (Planned) - Drag PDFs directly into application
- Real-Time Preview (Planned) - Preview requirements during processing
- Enhanced Testing (Planned) - 80%+ test coverage, CI/CD pipeline
- Comprehensive Documentation (Planned) - Video tutorials, user manual, FAQ
See Full v2.2 Release Notes
Critical bug fix for threading and UX enhancements:
- Fixed Threading Issue - Users can now run multiple sequential extractions without restarting
- Recent Files/Projects - Quick access to last 5 used paths via dropdown menus
- Adjustable Confidence Threshold - Interactive slider control (0.0-1.0) with real-time filtering
- Confidence Display in Excel - Color-coded confidence scores (Green/Yellow/Red) with auto-filtering
- Excel Column Corrections - Fixed Priority column positioning after Confidence addition
- BASIL SPDX 3.0.1 Integration - Automatic export of requirements to industry-standard SPDX format
- HTML Processing Reports - Automatic generation of detailed processing reports with statistics
- Requirement Categorization - Auto-categorize requirements into 9 categories (Functional, Safety, Performance, etc.)
- Keyword Profiles - Predefined and custom keyword sets for different domains (Aerospace, Medical, Automotive, etc.)
- Enhanced Testing - 270+ tests organized in tests/ directory (including threading fix verification)
- NLP-Powered: Uses spaCy for accurate sentence segmentation and requirement identification
- Exact Word Matching: Precise keyword matching with regex word boundaries
- Pattern Recognition: Detects 6 categories of requirement patterns (modal verbs, compliance indicators, etc.)
- Quality Scoring: Confidence scores (0.0-1.0) for every extracted requirement
- Multi-Column Layout Support: Correctly processes technical specs with multiple columns
- Smart Text Preprocessing: Handles hyphenated words, Unicode normalization, page number removal
- Highlight Size Validation: Prevents oversized highlights (max 40% page coverage)
- Fallback Annotations: Text-only notes when highlighting fails
- Color-Coded Priorities: Visual priority indicators (High/Medium/Low/Security)
- Confidence Scoring: Color-coded confidence levels (Green ≥0.8, Yellow 0.6-0.8, Red <0.6)
- Requirement Categorization: Automatic classification into 9 categories (Functional, Safety, Performance, Security, Interface, Data, Compliance, Documentation, Testing)
- Data Validations: Dropdown lists for standardized input
- Formula Integration: Automated compliance score calculations
- Template-Based: Uses customizable Excel templates
- Modern GUI: PySide6-based desktop application
- Background Processing: Non-blocking UI with progress tracking
- Comprehensive Logging: Detailed logs for debugging and quality assurance
- Recent Projects: Quick access to last 5 used paths via dropdown menus
- Keyword Profiles: Predefined profiles for different domains (Aerospace, Medical, Automotive, Software, Safety, Generic)
- Processing Reports: Automatic HTML reports with statistics, quality metrics, and warnings
- Industry Standard Export: Automatic export to SPDX 3.0.1 JSON-LD format (BASIL-compatible)
- Requirement Interchange: Share requirements with other tools using standardized SPDX format
- Import Support: Import existing BASIL requirements with merge strategies (append, update, replace)
- Validation: Built-in SPDX format validation to ensure compliance
- Traceability: Maintains requirement IDs and metadata for traceability
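The three import merge strategies named above (append, update, replace) can be pictured with a small sketch. The function name `merge_requirements`, the dictionary shape, and the exact semantics of each strategy are assumptions made for illustration; the actual behavior lives in basil_integration.py and may differ.

```python
from typing import Dict, List

Requirement = Dict[str, str]  # hypothetical shape: {"id": ..., "text": ...}


def merge_requirements(existing: List[Requirement],
                       imported: List[Requirement],
                       strategy: str = "append") -> List[Requirement]:
    """Combine two requirement lists keyed by their "id" field.

    Assumed semantics (not taken from ReqBot's code):
      append  - keep existing items, add imported items that have new IDs
      update  - like append, but imported items overwrite matching IDs
      replace - discard existing items and use the imported list
    """
    if strategy == "replace":
        return list(imported)
    if strategy not in ("append", "update"):
        raise ValueError(f"unknown merge strategy: {strategy!r}")

    # Dicts preserve insertion order, so existing requirements keep their position.
    merged = {req["id"]: req for req in existing}
    for req in imported:
        if req["id"] not in merged or strategy == "update":
            merged[req["id"]] = req
    return list(merged.values())


existing = [{"id": "REQ-1", "text": "The system shall log errors."}]
imported = [
    {"id": "REQ-1", "text": "The system shall log all errors."},
    {"id": "REQ-2", "text": "The system must start within 5 s."},
]

for strategy in ("append", "update", "replace"):
    result = merge_requirements(existing, imported, strategy)
    print(strategy, [req["text"] for req in result])
```

Keying on the requirement ID is what keeps traceability intact in this sketch: an ID never appears twice in the merged list, whichever strategy is chosen.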
| Metric | Before v2.0 | After v2.0 | Improvement |
|---|---|---|---|
| Full-Page Highlight Errors | ~15% | <1% | 93% reduction |
| False Positive Extractions | ~30% | ~5-8% | 73-83% reduction |
| Processing Speed | Baseline | 3-5x | 3-5x faster |
| PDF Parsing Quality | Fair | Excellent | 40-50% better |
- Python 3.8+ (Required for PySide6 compatibility)
- Microsoft Build Tools 2022 (Windows only, for certain dependencies)
git clone https://github.com/francosax/ReqBot.git
cd ReqBot
python -m venv .venv
# Windows
.venv\Scripts\activate
# Linux/Mac
source .venv/bin/activate
pip install -r requirements.txt
# Install specific version compatible with Python 3.8
pip install spacy==3.4.4
# Download the English language model
python -m spacy download en_core_web_sm

Note: en_core_web_sm is not a standard pip package. It's a pre-trained spaCy model installed via the spacy download command.
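Once downloaded, the model is an importable Python package, so a quick check can confirm the environment before launching the application. The sketch below uses only the standard library. The helper name `missing_modules` and the list of module names are illustrative assumptions (`fitz` is the import name of PyMuPDF), not part of ReqBot.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed import names for the dependencies mentioned in this README.
required = ["spacy", "en_core_web_sm", "PySide6", "fitz", "openpyxl"]

missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All listed modules are importable.")
```

Run it inside the activated virtual environment; a missing `en_core_web_sm` entry means the spacy download step has not completed.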
python run_app.py

This presents an interactive menu:
- Run GUI Application
- Run Tests
- Exit
python main_app.py

- Select Input Folder: Choose folder containing PDF specification files (or use Recent dropdown)
- Select Output Folder: Choose where to save results (or use Recent dropdown)
- Load Compliance Matrix Template: Select Excel template file (or use Recent dropdown)
- Choose Keyword Profile (Optional): Select from Generic, Aerospace, Medical, Automotive, Software, or Safety profiles
- Adjust Confidence Threshold (Optional): Use slider to set minimum confidence (0.0-1.0, default 0.5)
- Start Processing: Click "Start" to begin extraction
- Review Results:
  - Excel compliance matrix with requirements, confidence scores, and categories
  - Annotated PDF with highlighted requirements
  - BASIL SPDX 3.0.1 JSON-LD export
  - HTML processing report with statistics and quality metrics
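The confidence threshold in the steps above acts as a cut-off on each requirement's score. A minimal sketch, assuming each requirement carries a `confidence` value and that a score equal to the threshold is kept (both are assumptions; `Requirement` and `filter_by_confidence` are hypothetical names, not ReqBot's API):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Requirement:
    text: str
    confidence: float  # score between 0.0 and 1.0


def filter_by_confidence(requirements: List[Requirement],
                         threshold: float = 0.5) -> List[Requirement]:
    """Keep requirements whose confidence is at or above the threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return [req for req in requirements if req.confidence >= threshold]


extracted = [
    Requirement("The pump shall stop within 2 s.", 0.92),
    Requirement("Operators should review logs weekly.", 0.61),
    Requirement("See section 4 for details.", 0.35),
]

kept = filter_by_confidence(extracted, threshold=0.5)
print(len(kept), "of", len(extracted), "requirements kept")
```

Raising the threshold trades recall for precision: fewer borderline sentences reach the matrix, at the risk of dropping a genuine requirement that was phrased unusually.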
ReqBot/
├── main_app.py  # Main GUI application (v2.1.1 threading fix)
├── run_app.py  # Interactive launcher menu
├── version.py  # Version information (single source of truth)
├── RB_coordinator.py  # Processing pipeline orchestrator
├── pdf_analyzer.py  # NLP extraction engine (v2.0 enhanced)
├── pdf_analyzer_multilingual.py  # Multilingual NLP extraction (v3.0 in progress)
├── highlight_requirements.py  # PDF annotation module (v2.0 enhanced)
├── excel_writer.py  # Excel matrix generator
├── basil_integration.py  # BASIL SPDX 3.0.1 export/import
├── report_generator.py  # HTML processing report generator
├── requirement_categorizer.py  # Automatic requirement categorization
├── keyword_profiles.py  # Keyword profile management
├── recent_projects.py  # Recent files/folders manager
├── language_detector.py  # Language detection (v3.0 in progress)
├── language_config.py  # Language configuration (v3.0 in progress)
├── multilingual_nlp.py  # Multilingual NLP support (v3.0 in progress)
├── config_RB.py  # Configuration manager
├── processing_worker.py  # Background thread worker
├── get_all_files.py  # File utilities
├── RBconfig.ini  # Keyword configuration
├── keyword_profiles.json  # Keyword profiles storage
├── recents_config.json  # Recent paths storage
├── template_compliance_matrix.xlsx  # Excel template
├── database/  # Database backend (v3.0 in progress)
│   ├── __init__.py
│   ├── database.py  # Database initialization
│   ├── models.py  # SQLAlchemy models
│   └── services/  # Database services
├── tests/  # Test suite (270+ tests)
│   ├── conftest.py  # Pytest configuration
│   ├── test_gui.py  # GUI tests (8 tests)
│   ├── test_excel_writer.py  # Excel tests (3 tests)
│   ├── test_highlight_requirements.py  # PDF highlighting tests (2 tests)
│   ├── test_basil_integration.py  # BASIL tests (25 tests)
│   ├── test_report_generator.py  # Report generator tests
│   ├── test_database_*.py  # Database tests (v3.0)
│   ├── test_language_*.py  # Language detection tests (v3.0)
│   ├── test_multilingual_nlp.py  # Multilingual NLP tests (v3.0)
│   └── test_integration*.py  # Integration tests
├── CLAUDE.md  # Developer documentation
├── README.md  # This file
├── TODO.md  # Project roadmap
└── requirements.txt  # Python dependencies
Edit RBconfig.ini to customize requirement keywords:
[RequirementKeywords]
keywords = shall, must, should, will, has to, require, required

Edit constants in pdf_analyzer.py:
MIN_REQUIREMENT_LENGTH_WORDS = 5  # Minimum sentence length
MAX_REQUIREMENT_LENGTH_WORDS = 100  # Maximum sentence length

Edit constants in highlight_requirements.py:
MAX_HIGHLIGHT_COVERAGE_PERCENT = 40  # Maximum page coverage

Sheet Name: MACHINE COMP. MATRIX
Columns Include:
- Label Number
- Description (requirement text)
- Page number
- Keyword that triggered extraction
- Confidence score with color coding (NEW in v2.0)
- Priority (High/Medium/Low/Security)
- Category (Functional, Safety, Performance, etc.) (NEW in v2.1)
- Compliance status fields
- Verification fields
- Notes and metadata
Starting Row: Data written from row 5 (rows 1-4 reserved for headers)
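The color coding and the row offset described above can be expressed in a few lines. The sketch assumes the thresholds stated earlier in this README (green at 0.8 and above, yellow from 0.6 up to 0.8, red below 0.6) and that the requirement at zero-based index n lands on row 5 + n. The function names are hypothetical; the real formatting is applied by excel_writer.py.

```python
FIRST_DATA_ROW = 5  # rows 1-4 are reserved for headers


def confidence_band(score: float) -> str:
    """Map a confidence score in [0.0, 1.0] to its color band."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("confidence score must be between 0.0 and 1.0")
    if score >= 0.8:
        return "Green"
    if score >= 0.6:
        return "Yellow"
    return "Red"


def excel_row(index: int) -> int:
    """Return the worksheet row for the requirement at a zero-based index."""
    if index < 0:
        raise ValueError("index must be non-negative")
    return FIRST_DATA_ROW + index


for i, score in enumerate([0.92, 0.8, 0.61, 0.35]):
    print(f"row {excel_row(i)}: {score:.2f} -> {confidence_band(score)}")
```

Treating 0.8 as green resolves the overlap in the "0.6-0.8" range in favor of the "≥0.8" rule; that boundary choice is an assumption of this sketch.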
- Yellow highlights on requirement text
- Text annotations with label and description
- Fallback annotations for oversized or unfound highlights
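The fallback rule for oversized highlights can be illustrated with plain rectangle arithmetic. This is a sketch under assumptions: rectangles are `(x0, y0, x1, y1)` tuples in page points, highlight rectangles do not overlap each other, and the limit is the 40% value of MAX_HIGHLIGHT_COVERAGE_PERCENT. The helper names are hypothetical, and the check in highlight_requirements.py may be implemented differently.

```python
from typing import Iterable, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

MAX_HIGHLIGHT_COVERAGE_PERCENT = 40


def rect_area(rect: Rect) -> float:
    """Area of a rectangle; degenerate rectangles count as zero."""
    x0, y0, x1, y1 = rect
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def coverage_percent(highlights: Iterable[Rect], page: Rect) -> float:
    """Share of the page area covered by the highlight rectangles, in percent."""
    page_area = rect_area(page)
    if page_area == 0:
        raise ValueError("page rectangle has zero area")
    return 100.0 * sum(rect_area(r) for r in highlights) / page_area


def should_highlight(highlights: Iterable[Rect], page: Rect) -> bool:
    """True if the highlight is small enough; False means use a text-only note."""
    return coverage_percent(highlights, page) <= MAX_HIGHLIGHT_COVERAGE_PERCENT


a4_page = (0.0, 0.0, 595.0, 842.0)
one_line = [(72.0, 100.0, 523.0, 114.0)]
whole_page = [(0.0, 0.0, 595.0, 842.0)]

print(should_highlight(one_line, a4_page))    # a single text line is well under the limit
print(should_highlight(whole_page, a4_page))  # a full-page match falls back to a note
```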
Filename: YYYY.MM.DD_HHMMSS_Processing_Report.html
Contents:
- Processing summary with total requirements extracted
- Average confidence score across all requirements
- File-by-file breakdown with statistics
- Warnings and errors encountered
- Quality metrics and visualizations
- Color-coded confidence indicators
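The filename pattern and the average confidence figure can be reproduced with the standard library. The sketch assumes the pattern corresponds to the `strftime` format `%Y.%m.%d_%H%M%S`; `report_filename` and `average_confidence` are hypothetical helpers rather than functions from report_generator.py.

```python
from datetime import datetime
from statistics import mean
from typing import Optional, Sequence


def report_filename(now: Optional[datetime] = None) -> str:
    """Build a report name following YYYY.MM.DD_HHMMSS_Processing_Report.html."""
    now = now or datetime.now()
    return now.strftime("%Y.%m.%d_%H%M%S") + "_Processing_Report.html"


def average_confidence(scores: Sequence[float]) -> float:
    """Mean confidence rounded to two decimals; 0.0 when nothing was extracted."""
    return round(mean(scores), 2) if scores else 0.0


print(report_filename(datetime(2025, 11, 18, 14, 30, 5)))
print(average_confidence([0.9, 0.7, 0.5]))
```

Putting the timestamp first means reports sort chronologically by filename in the output folder.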
Run the comprehensive test suite:
# All tests (from project root)
pytest -v
# Specific test file
pytest tests/test_gui.py -v
# Specific test
pytest tests/test_gui.py::test_threading_fix_prevents_double_start -v
# With coverage
pytest --cov=. --cov-report=html

Test Coverage (270+ tests organized in tests/ directory):
- GUI components (8 tests) - Threading, UI elements, user interactions
- Excel writer functionality (3 tests) - Matrix generation, formatting
- PDF highlighting (2 tests) - Annotation, highlight validation
- BASIL integration (25 tests) - Export, import, validation, merge strategies
- Report generator - HTML report generation with statistics
- Database models and services (v3.0 in progress) - SQLAlchemy ORM tests
- Language detection (v3.0 in progress) - Multilingual support tests
- Multilingual NLP (v3.0 in progress) - Multi-language extraction tests
- Integration tests - End-to-end workflow validation
- Thread safety tests - Concurrent operations validation
Current Status: 263 passing, 4 pre-existing failures (unrelated to core functionality), 3 pre-existing errors
- CLAUDE.md - Comprehensive developer guide for AI assistants
- NLP_IMPROVEMENT_ANALYSIS.md - Detailed technical analysis of improvements
- RELEASE_NOTES_v2.0.md - Version 2.0 release notes
- UI Framework: PySide6 (Qt for Python)
- PDF Processing: PyMuPDF (fitz)
- NLP Engine: spaCy 3.4.4 with en_core_web_sm model
- Data Handling: Pandas, openpyxl
- Export Formats: JSON-LD (SPDX 3.0.1), HTML reports
- Testing: pytest, pytest-qt (270+ tests)
- Threading: QThread for non-blocking operations
- Configuration: ConfigParser, JSON
- Database: SQLAlchemy (PostgreSQL/SQLite support)
- Multilingual NLP: spaCy models for FR, DE, IT, ES, PT
- Language Detection: langdetect, pycld2
- Concurrency: Threading locks for database operations
- Fixed substring keyword matching
- Added sentence length validation
- Added highlight size validation
- Result: 93% reduction in full-page highlights
- Implemented spaCy model caching
- Added advanced text preprocessing
- Introduced confidence scoring system
- Result: 3-5x faster processing + quality metrics
- Multi-column layout handling
- Requirement pattern matching
- Missed sequence fallback
- Result: Complex PDF support + transparency
- Regulatory Compliance: Extract requirements from standards and regulations
- Technical Specifications: Process engineering specification documents
- Contract Analysis: Identify contractual requirements and obligations
- Quality Assurance: Create compliance matrices for verification
- Requirements Engineering: Support requirements management workflows
- PDF Format Requirements: Works best with text-based PDFs (not scanned images). OCR support planned for v2.5+
- Language Support: Currently optimized for English documents. Note: Multilingual support (v3.0) is in active development with language detection and support for French, German, Italian, Spanish, and Portuguese
- Template Dependency: Excel template must contain "MACHINE COMP. MATRIX" sheet name exactly
- Pattern-Enhanced Detection: Uses combination of keywords and NLP patterns for requirement identification
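The keyword half of this detection can be sketched with the standard library alone. The example reads a keyword list in the RBconfig.ini format, builds one regular expression with word boundaries so that `will` does not match inside `willing`, and applies the documented limits of 5 to 100 words. The function names and sample sentences are hypothetical, sentence splitting (done by spaCy in ReqBot) is skipped, and the real logic in pdf_analyzer.py may differ.

```python
import configparser
import re
from typing import List, Optional

MIN_REQUIREMENT_LENGTH_WORDS = 5
MAX_REQUIREMENT_LENGTH_WORDS = 100

# Same section and key as the RBconfig.ini example in this README.
SAMPLE_CONFIG = """
[RequirementKeywords]
keywords = shall, must, should, will, has to, require, required
"""


def load_keywords(config_text: str) -> List[str]:
    """Parse the comma-separated keyword list from INI-formatted text."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    raw = parser.get("RequirementKeywords", "keywords")
    return [kw.strip() for kw in raw.split(",") if kw.strip()]


def build_pattern(keywords: List[str]) -> re.Pattern:
    """Compile a case-insensitive whole-word pattern for all keywords."""
    # Longest first so "required" is tried before "require".
    ordered = sorted(keywords, key=len, reverse=True)
    alternatives = "|".join(re.escape(kw) for kw in ordered)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)


def find_keyword(sentence: str, pattern: re.Pattern) -> Optional[str]:
    """Return the first matching keyword, or None if the sentence is rejected."""
    word_count = len(sentence.split())
    if not MIN_REQUIREMENT_LENGTH_WORDS <= word_count <= MAX_REQUIREMENT_LENGTH_WORDS:
        return None  # too short or too long to be treated as a requirement
    match = pattern.search(sentence)
    return match.group(0).lower() if match else None


pattern = build_pattern(load_keywords(SAMPLE_CONFIG))
for sentence in [
    "The controller shall report faults within 100 ms.",
    "Operators are willing to accept manual restarts here.",
    "It must work.",
]:
    print(find_keyword(sentence, pattern), "<-", sentence)
```

Only the first sentence yields a keyword: the second contains `will` only as part of a longer word, and the third is below the minimum length.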
The following features are actively being developed for v3.0:
- Multilingual Extraction: Language detection and NLP support for French, German, Italian, Spanish, Portuguese
  - Modules: language_detector.py, language_config.py, multilingual_nlp.py, pdf_analyzer_multilingual.py
  - Status: Core functionality implemented, integration testing in progress
- Database Backend: SQLAlchemy-based persistence layer with PostgreSQL/SQLite support
  - Modules: database/models.py, database/services/, database/database.py
  - Features: Project management, document tracking, requirement versioning, processing sessions, change history
  - Status: Models complete, services in development
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Testing: Ensure all tests pass before submitting PR
pytest -v

This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.
GPL-3.0: You are free to use, modify, and distribute this software under the terms of the GPL-3.0 license. Any modifications or derivative works must also be released under GPL-3.0.
- spaCy: For the excellent NLP library
- PyMuPDF: For powerful PDF processing capabilities
- PySide6/Qt: For the robust GUI framework
- Issues: GitHub Issues
- Documentation: See CLAUDE.md for developer guide
- Analysis: See NLP_IMPROVEMENT_ANALYSIS.md for technical details
- User-adjustable confidence thresholds in GUI
- Excel export with confidence score conditional formatting
- HTML processing reports with quality metrics
- Recent files/projects management
- Requirement categorization (9 categories)
- Keyword profile management
- BASIL SPDX 3.0.1 integration
- Thread cleanup fix for multiple sequential runs
- Search/filter functionality in GUI
- Dark mode theme
- Performance optimizations
- Multi-language support - Language detection and multilingual NLP (active development)
- Database backend - SQLAlchemy-based persistence with PostgreSQL/SQLite support (active development)
- Web-based version with FastAPI/React
- REST API for programmatic access
- Machine learning-based requirement classifier
- OCR support for scanned PDFs
- Requirements traceability across documents
- LLM integration (GPT-4/Claude) for smart extraction
- Collaborative workflows with multi-user support
- Platform integrations (Jira, Azure DevOps, Confluence)
- Cloud processing option
- Advanced analytics and reporting dashboard
- v2.2.0 (Q2 2026 - In Development) - Quality of life improvements, performance optimizations, enhanced testing
- v2.1.1 (2025-11-18) - Bug fix: Thread cleanup for multiple sequential extractions
- v2.1.0 (2025-11-17) - UX enhancements: Recent files/projects, adjustable confidence threshold, BASIL integration
- v2.0.0 (2025-11-15) - Major NLP improvements: accuracy, performance, quality scoring
- v1.2 (Previous) - Base functionality with GUI and Excel generation
- v1.x - Initial releases
Built with ❤️ for requirements engineers and compliance professionals
Making requirement extraction accurate, fast, and reliable.