
ReqBot 2.2.0

Automatic Requirements Extraction Tool for PDF Specifications

Python 3.8+ | License: GPL v3

ReqBot is a desktop application that automatically extracts requirements from PDF specification documents using spaCy-based NLP. It generates Excel compliance matrices, BASIL-compatible SPDX exports, and annotated PDFs with highlighted requirements.


✨ What's New

🚧 Version 2.2.0 (In Development - Q2 2026)

Version 2.2.0 focuses on quality-of-life improvements and performance:

  • 🔍 Search & Filter in Results (Planned) - Find and filter extracted requirements in GUI
  • 🚀 Parallel PDF Processing (Planned) - 2-3x faster batch operations
  • 🎯 Drag & Drop Support (Planned) - Drag PDFs directly into application
  • 📺 Real-Time Preview (Planned) - Preview requirements during processing
  • 🧪 Enhanced Testing (Planned) - 80%+ test coverage, CI/CD pipeline
  • 📚 Comprehensive Documentation (Planned) - Video tutorials, user manual, FAQ

📝 See Full v2.2 Release Notes

Version 2.1.1 (Released)

Critical bug fix for threading and UX enhancements:

  • 🐛 Fixed Threading Issue - Users can now run multiple sequential extractions without restarting
  • 📁 Recent Files/Projects - Quick access to last 5 used paths via dropdown menus
  • 🎚️ Adjustable Confidence Threshold - Interactive slider control (0.0-1.0)

Version 2.1.0 Features:

  • 📁 Recent Files/Projects - Quick access to last 5 used paths via dropdown menus
  • 🎚️ Adjustable Confidence Threshold - Interactive slider control (0.0-1.0) with real-time filtering
  • 📊 Confidence Display in Excel - Color-coded confidence scores (Green/Yellow/Red) with auto-filtering
  • 🔧 Excel Column Corrections - Fixed Priority column positioning after Confidence addition
  • 🔗 BASIL SPDX 3.0.1 Integration - Automatic export of requirements to industry-standard SPDX format
  • 📄 HTML Processing Reports - Automatic generation of detailed processing reports with statistics
  • 🏷️ Requirement Categorization - Auto-categorize requirements into 9 categories (Functional, Safety, Performance, etc.)
  • 🎯 Keyword Profiles - Predefined and custom keyword sets for different domains (Aerospace, Medical, Automotive, etc.)
  • ✅ Enhanced Testing - 270+ tests organized in tests/ directory (including threading fix verification)

📝 See v2.1 Changes


🚀 Key Features

Intelligent Extraction

  • NLP-Powered: Uses spaCy for accurate sentence segmentation and requirement identification
  • Exact Word Matching: Precise keyword matching with regex word boundaries
  • Pattern Recognition: Detects 6 categories of requirement patterns (modal verbs, compliance indicators, etc.)
  • Quality Scoring: Confidence scores (0.0-1.0) for every extracted requirement
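
The difference between exact word matching and plain substring matching can be sketched as follows. This is a minimal illustration, not ReqBot's actual code: the function names `build_keyword_pattern` and `find_keywords` are hypothetical, and the keyword list is copied from the RBconfig.ini example in this README.

```python
import re

# Keyword list copied from the RBconfig.ini example in this README.
KEYWORDS = ["shall", "must", "should", "will", "has to", "require", "required"]


def build_keyword_pattern(keywords):
    """Compile one case-insensitive regex that matches any keyword as a whole word."""
    # Hypothetical helper: sort longest first so longer keywords are tried first.
    ordered = sorted(keywords, key=len, reverse=True)
    alternatives = "|".join(re.escape(k) for k in ordered)
    # \b anchors each match at word boundaries, so "shall" does not match "marshall".
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def find_keywords(sentence, pattern):
    """Return the keywords found in the sentence, lower-cased, in order of appearance."""
    return [m.group(0).lower() for m in pattern.finditer(sentence)]


pattern = build_keyword_pattern(KEYWORDS)
print(find_keywords("The pump shall start within 2 seconds.", pattern))   # ['shall']
print(find_keywords("The marshall inspected the willow tree.", pattern))  # []
```

A plain `"shall" in sentence` check would flag the second sentence because of "marshall"; the word-boundary pattern does not.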

Advanced PDF Handling

  • Multi-Column Layout Support: Correctly processes technical specs with multiple columns
  • Smart Text Preprocessing: Handles hyphenated words, Unicode normalization, page number removal
  • Highlight Size Validation: Prevents oversized highlights (max 40% page coverage)
  • Fallback Annotations: Text-only notes when highlighting fails
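
The three preprocessing steps named above (hyphenated words, Unicode normalization, page number removal) could look roughly like this. It is a sketch under assumptions: `preprocess_page_text` is a hypothetical name, and the page-number pattern only covers bare numbers and "Page N of M" lines, which may not match what ReqBot actually strips.

```python
import re
import unicodedata


def preprocess_page_text(text):
    """Normalize raw page text before sentence segmentation (illustrative only)."""
    # Unicode normalization: NFKC folds compatibility characters such as the
    # "fi" ligature (U+FB01) into plain letters.
    text = unicodedata.normalize("NFKC", text)
    # Re-join words hyphenated across a line break: "require-\nment" -> "requirement".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop lines that hold only a page number, e.g. "12" or "Page 3 of 40".
    page_number = re.compile(r"\s*(?:page\s+)?\d+(?:\s+of\s+\d+)?\s*", re.IGNORECASE)
    kept = [line for line in text.split("\n") if not page_number.fullmatch(line)]
    return "\n".join(kept)


raw = "The system shall con\ufb01rm each require-\nment.\n12\nPage 3 of 40\nNext line."
print(preprocess_page_text(raw))
```

Note that the hyphen rule also joins genuine compounds that happen to break at the hyphen (for example "real-" / "time"), so a production version would need a dictionary check or similar safeguard.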

Excel Compliance Matrix

  • Color-Coded Priorities: Visual priority indicators (High/Medium/Low/Security)
  • Confidence Scoring: Color-coded confidence levels (Green ≥0.8, Yellow 0.6-0.8, Red <0.6)
  • Requirement Categorization: Automatic classification into 9 categories (Functional, Safety, Performance, Security, Interface, Data, Compliance, Documentation, Testing)
  • Data Validations: Dropdown lists for standardized input
  • Formula Integration: Automated compliance score calculations
  • Template-Based: Uses customizable Excel templates
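
The confidence color bands listed above map to a small lookup. The sketch below assumes that the boundary values belong to the higher band (0.8 is Green, 0.6 is Yellow), which follows from "Green ≥0.8" and "Red <0.6"; the function name `confidence_band` is hypothetical.

```python
def confidence_band(score):
    """Map a confidence score in [0.0, 1.0] to the color band used in the matrix."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence must be between 0.0 and 1.0, got {score}")
    if score >= 0.8:
        return "Green"
    if score >= 0.6:
        return "Yellow"
    return "Red"


for value in (0.95, 0.8, 0.7, 0.6, 0.59):
    print(value, confidence_band(value))
```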

User Experience

  • Modern GUI: PySide6-based desktop application
  • Background Processing: Non-blocking UI with progress tracking
  • Comprehensive Logging: Detailed logs for debugging and quality assurance
  • Recent Projects: Quick access to last 5 used paths via dropdown menus
  • Keyword Profiles: Predefined profiles for different domains (Aerospace, Medical, Automotive, Software, Safety, Generic)
  • Processing Reports: Automatic HTML reports with statistics, quality metrics, and warnings

BASIL SPDX 3.0.1 Integration

  • Industry Standard Export: Automatic export to SPDX 3.0.1 JSON-LD format (BASIL-compatible)
  • Requirement Interchange: Share requirements with other tools using standardized SPDX format
  • Import Support: Import existing BASIL requirements with merge strategies (append, update, replace)
  • Validation: Built-in SPDX format validation to ensure compliance
  • Traceability: Maintains requirement IDs and metadata for traceability
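
To make the three import merge strategies concrete, here is one possible reading of them. The semantics are an assumption, not taken from basil_integration.py: "append" adds only requirements whose ID is new, "update" additionally overwrites requirements with a matching ID, and "replace" discards the existing set. The function name `merge_requirements` and the dictionary keys are hypothetical.

```python
def merge_requirements(existing, imported, strategy="append"):
    """Merge two lists of requirement dicts keyed by "id" (assumed semantics)."""
    if strategy not in ("append", "update", "replace"):
        raise ValueError(f"unknown merge strategy: {strategy}")
    if strategy == "replace":
        # Assumed: the imported set fully replaces what was there.
        return [dict(req) for req in imported]
    # Dicts keep insertion order, so existing requirements stay in place.
    merged = {req["id"]: dict(req) for req in existing}
    for req in imported:
        if strategy == "append":
            # Assumed: keep the existing entry when the ID is already known.
            merged.setdefault(req["id"], dict(req))
        else:  # "update"
            # Assumed: the imported entry wins on an ID clash.
            merged[req["id"]] = dict(req)
    return list(merged.values())


existing = [{"id": "REQ-1", "text": "old"}, {"id": "REQ-2", "text": "keep"}]
imported = [{"id": "REQ-1", "text": "new"}, {"id": "REQ-3", "text": "added"}]
print(merge_requirements(existing, imported, "append"))
print(merge_requirements(existing, imported, "update"))
```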

📊 Performance Metrics

| Metric | Before v2.0 | After v2.0 | Improvement |
|---|---|---|---|
| Full-Page Highlight Errors | ~15% | <1% | 93% reduction |
| False Positive Extractions | ~30% | ~5-8% | 73-83% reduction |
| Processing Speed | Baseline | 3-5x | 3-5x faster |
| PDF Parsing Quality | Fair | Excellent | 40-50% better |

🛠️ Installation

Prerequisites

  • Python 3.8+ (Required for PySide6 compatibility)
  • Microsoft Build Tools 2022 (Windows only, for certain dependencies)

Step 1: Clone the Repository

git clone https://github.com/francosax/ReqBot.git
cd ReqBot

Step 2: Create Virtual Environment (Recommended)

python -m venv .venv

# Windows
.venv\Scripts\activate

# Linux/Mac
source .venv/bin/activate

Step 3: Install Dependencies

pip install -r requirements.txt

Step 4: Download spaCy Language Model

# Install specific version compatible with Python 3.8
pip install spacy==3.4.4

# Download the English language model
python -m spacy download en_core_web_sm

Note: en_core_web_sm is not a standard pip package. It's a pre-trained spaCy model installed via the spacy download command.


🎮 Usage

Option 1: Launch via Menu (Recommended)

python run_app.py

This presents an interactive menu:

  1. Run GUI Application
  2. Run Tests
  3. Exit

Option 2: Direct GUI Launch

python main_app.py

GUI Workflow

  1. Select Input Folder: Choose folder containing PDF specification files (or use Recent dropdown)
  2. Select Output Folder: Choose where to save results (or use Recent dropdown)
  3. Load Compliance Matrix Template: Select Excel template file (or use Recent dropdown)
  4. Choose Keyword Profile (Optional): Select from Generic, Aerospace, Medical, Automotive, Software, or Safety profiles
  5. Adjust Confidence Threshold (Optional): Use slider to set minimum confidence (0.0-1.0, default 0.5)
  6. Start Processing: Click "Start" to begin extraction
  7. Review Results:
    • Excel compliance matrix with requirements, confidence scores, and categories
    • Annotated PDF with highlighted requirements
    • BASIL SPDX 3.0.1 JSON-LD export
    • HTML processing report with statistics and quality metrics
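
The effect of the confidence threshold in step 5 can be pictured as a simple filter. This is a sketch only: it assumes a requirement is kept when its score is greater than or equal to the threshold, and both the function name `filter_by_confidence` and the `"confidence"` key are hypothetical.

```python
def filter_by_confidence(requirements, threshold=0.5):
    """Keep requirements whose confidence meets the threshold (default 0.5)."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"threshold must be between 0.0 and 1.0, got {threshold}")
    # Assumed: the comparison is inclusive, so a score equal to the threshold passes.
    return [req for req in requirements if req["confidence"] >= threshold]


extracted = [
    {"label": "REQ-1", "confidence": 0.92},
    {"label": "REQ-2", "confidence": 0.50},
    {"label": "REQ-3", "confidence": 0.31},
]
print([req["label"] for req in filter_by_confidence(extracted)])        # ['REQ-1', 'REQ-2']
print([req["label"] for req in filter_by_confidence(extracted, 0.8)])   # ['REQ-1']
```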

📁 Project Structure

ReqBot/
├── main_app.py                      # Main GUI application (v2.1.1 threading fix)
├── run_app.py                       # Interactive launcher menu
├── version.py                       # Version information (single source of truth)
├── RB_coordinator.py                # Processing pipeline orchestrator
├── pdf_analyzer.py                  # NLP extraction engine (v2.0 enhanced)
├── pdf_analyzer_multilingual.py     # Multilingual NLP extraction (v3.0 in progress)
├── highlight_requirements.py        # PDF annotation module (v2.0 enhanced)
├── excel_writer.py                  # Excel matrix generator
├── basil_integration.py             # BASIL SPDX 3.0.1 export/import
├── report_generator.py              # HTML processing report generator
├── requirement_categorizer.py       # Automatic requirement categorization
├── keyword_profiles.py              # Keyword profile management
├── recent_projects.py               # Recent files/folders manager
├── language_detector.py             # Language detection (v3.0 in progress)
├── language_config.py               # Language configuration (v3.0 in progress)
├── multilingual_nlp.py              # Multilingual NLP support (v3.0 in progress)
├── config_RB.py                     # Configuration manager
├── processing_worker.py             # Background thread worker
├── get_all_files.py                 # File utilities
├── RBconfig.ini                     # Keyword configuration
├── keyword_profiles.json            # Keyword profiles storage
├── recents_config.json              # Recent paths storage
├── template_compliance_matrix.xlsx  # Excel template
├── database/                        # Database backend (v3.0 in progress)
│   ├── __init__.py
│   ├── database.py                  # Database initialization
│   ├── models.py                    # SQLAlchemy models
│   └── services/                    # Database services
├── tests/                           # Test suite (270+ tests)
│   ├── conftest.py                  # Pytest configuration
│   ├── test_gui.py                  # GUI tests (8 tests)
│   ├── test_excel_writer.py         # Excel tests (3 tests)
│   ├── test_highlight_requirements.py # PDF highlighting tests (2 tests)
│   ├── test_basil_integration.py    # BASIL tests (25 tests)
│   ├── test_report_generator.py     # Report generator tests
│   ├── test_database_*.py           # Database tests (v3.0)
│   ├── test_language_*.py           # Language detection tests (v3.0)
│   ├── test_multilingual_nlp.py     # Multilingual NLP tests (v3.0)
│   └── test_integration*.py         # Integration tests
├── CLAUDE.md                        # Developer documentation
├── README.md                        # This file
├── TODO.md                          # Project roadmap
└── requirements.txt                 # Python dependencies

⚙️ Configuration

Customize Requirement Keywords

Edit RBconfig.ini to customize requirement keywords:

[RequirementKeywords]
keywords = shall, must, should, will, has to, require, required
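
For reference, a file in this format can be read with Python's standard `configparser`. The sketch below parses the sample shown above from a string; the comma-splitting and lower-casing are assumptions about how the values are consumed, and `load_keywords` is a hypothetical name rather than a function from config_RB.py.

```python
import configparser

# Same content as the RBconfig.ini example above.
SAMPLE_INI = """
[RequirementKeywords]
keywords = shall, must, should, will, has to, require, required
"""


def load_keywords(ini_text):
    """Parse the [RequirementKeywords] section into a clean list of keywords."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    raw = parser.get("RequirementKeywords", "keywords")
    # Assumed: entries are comma-separated; blanks are dropped and case is folded.
    return [item.strip().lower() for item in raw.split(",") if item.strip()]


print(load_keywords(SAMPLE_INI))
# ['shall', 'must', 'should', 'will', 'has to', 'require', 'required']
```

To read the real file instead of a string, `parser.read("RBconfig.ini")` takes the place of `read_string`. Multi-word entries such as "has to" survive because only commas are treated as separators.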

Adjust Extraction Parameters

Edit constants in pdf_analyzer.py:

MIN_REQUIREMENT_LENGTH_WORDS = 5    # Minimum sentence length
MAX_REQUIREMENT_LENGTH_WORDS = 100  # Maximum sentence length
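
How these two limits might be applied to a candidate sentence is sketched below. The word count here is a plain whitespace split and both bounds are treated as inclusive; ReqBot may count tokens differently (for example via spaCy), so treat `has_valid_length` as a hypothetical illustration.

```python
MIN_REQUIREMENT_LENGTH_WORDS = 5    # Minimum sentence length
MAX_REQUIREMENT_LENGTH_WORDS = 100  # Maximum sentence length


def has_valid_length(sentence):
    """Return True when the sentence word count lies within the configured bounds."""
    # Assumed: words are whitespace-separated and both bounds are inclusive.
    word_count = len(sentence.split())
    return MIN_REQUIREMENT_LENGTH_WORDS <= word_count <= MAX_REQUIREMENT_LENGTH_WORDS


print(has_valid_length("The pump shall start."))                # False (4 words)
print(has_valid_length("The pump shall start automatically."))  # True (5 words)
```

The lower bound filters out headings and fragments that merely contain a keyword; the upper bound guards against whole paragraphs being captured as one requirement.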

Edit constants in highlight_requirements.py:

MAX_HIGHLIGHT_COVERAGE_PERCENT = 40  # Maximum page coverage
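
The coverage check behind this constant can be approximated as an area ratio. The sketch assumes a highlight is a single rectangle given as `(x0, y0, x1, y1)` in page coordinates and that a highlight at exactly the limit is still allowed; the function names are hypothetical and the real module may combine several rectangles per highlight.

```python
MAX_HIGHLIGHT_COVERAGE_PERCENT = 40  # Maximum page coverage


def highlight_coverage_percent(highlight_rect, page_rect):
    """Return the share of the page area covered by the highlight, in percent."""
    hx0, hy0, hx1, hy1 = highlight_rect
    px0, py0, px1, py1 = page_rect
    page_area = (px1 - px0) * (py1 - py0)
    if page_area <= 0:
        raise ValueError("page rectangle must have a positive area")
    # Clamp negative extents so a malformed rectangle counts as zero area.
    highlight_area = max(0, hx1 - hx0) * max(0, hy1 - hy0)
    return 100.0 * highlight_area / page_area


def is_highlight_allowed(highlight_rect, page_rect):
    """Assumed rule: reject highlights covering more than the configured share."""
    return highlight_coverage_percent(highlight_rect, page_rect) <= MAX_HIGHLIGHT_COVERAGE_PERCENT


letter_page = (0, 0, 612, 792)  # US Letter in PDF points
print(is_highlight_allowed((72, 100, 540, 115), letter_page))  # True: one text line
print(is_highlight_allowed((0, 0, 612, 400), letter_page))     # False: about half the page
```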

📊 Output Format

Excel Compliance Matrix

Sheet Name: MACHINE COMP. MATRIX

Columns Include:

  • Label Number
  • Description (requirement text)
  • Page number
  • Keyword that triggered extraction
  • Confidence score with color coding (scoring introduced in v2.0; color-coded Excel display added in v2.1)
  • Priority (High/Medium/Low/Security)
  • Category (Functional, Safety, Performance, etc.) (NEW in v2.1)
  • Compliance status fields
  • Verification fields
  • Notes and metadata

Starting Row: Data written from row 5 (rows 1-4 reserved for headers)

Annotated PDF

  • Yellow highlights on requirement text
  • Text annotations with label and description
  • Fallback annotations for oversized or unfound highlights

Processing Report (HTML)

Filename: YYYY.MM.DD_HHMMSS_Processing_Report.html
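
A timestamp in this pattern corresponds to the `strftime` format `%Y.%m.%d_%H%M%S`. The helper below is a hypothetical illustration of building such a name, not the code in report_generator.py.

```python
from datetime import datetime


def report_filename(now=None):
    """Build a report name such as 2025.11.18_143005_Processing_Report.html."""
    if now is None:
        now = datetime.now()
    return now.strftime("%Y.%m.%d_%H%M%S") + "_Processing_Report.html"


print(report_filename(datetime(2025, 11, 18, 14, 30, 5)))
# 2025.11.18_143005_Processing_Report.html
```

Because the fields run from year down to second with zero padding, sorting the files by name also sorts them chronologically.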

Contents:

  • Processing summary with total requirements extracted
  • Average confidence score across all requirements
  • File-by-file breakdown with statistics
  • Warnings and errors encountered
  • Quality metrics and visualizations
  • Color-coded confidence indicators

🧪 Testing

Run the comprehensive test suite:

# All tests (from project root)
pytest -v

# Specific test file
pytest tests/test_gui.py -v

# Specific test
pytest tests/test_gui.py::test_threading_fix_prevents_double_start -v

# With coverage
pytest --cov=. --cov-report=html

Test Coverage (270+ tests organized in tests/ directory):

  • ✅ GUI components (8 tests) - Threading, UI elements, user interactions
  • ✅ Excel writer functionality (3 tests) - Matrix generation, formatting
  • ✅ PDF highlighting (2 tests) - Annotation, highlight validation
  • ✅ BASIL integration (25 tests) - Export, import, validation, merge strategies
  • ✅ Report generator - HTML report generation with statistics
  • ✅ Database models and services (v3.0 in progress) - SQLAlchemy ORM tests
  • ✅ Language detection (v3.0 in progress) - Multilingual support tests
  • ✅ Multilingual NLP (v3.0 in progress) - Multi-language extraction tests
  • ✅ Integration tests - End-to-end workflow validation
  • ✅ Thread safety tests - Concurrent operations validation

Current Status: 263 passing, 4 pre-existing failures (unrelated to core functionality), 3 pre-existing errors


📚 Documentation


🔧 Tech Stack

Core Dependencies (v2.1.1)

  • UI Framework: PySide6 (Qt for Python)
  • PDF Processing: PyMuPDF (fitz)
  • NLP Engine: spaCy 3.4.4 with en_core_web_sm model
  • Data Handling: Pandas, openpyxl
  • Export Formats: JSON-LD (SPDX 3.0.1), HTML reports
  • Testing: pytest, pytest-qt (270+ tests)
  • Threading: QThread for non-blocking operations
  • Configuration: ConfigParser, JSON

v3.0 Additional Dependencies (In Development)

  • Database: SQLAlchemy (PostgreSQL/SQLite support)
  • Multilingual NLP: spaCy models for FR, DE, IT, ES, PT
  • Language Detection: langdetect, pycld2
  • Concurrency: Threading locks for database operations

📈 NLP Improvements Timeline

Phase 1: Critical Fixes (2025-11)

  • Fixed substring keyword matching
  • Added sentence length validation
  • Added highlight size validation
  • Result: 93% reduction in full-page highlights

Phase 2: Performance & Quality (2025-11)

  • Implemented spaCy model caching
  • Added advanced text preprocessing
  • Introduced confidence scoring system
  • Result: 3-5x faster processing + quality metrics

Phase 3: Advanced Features (2025-11)

  • Multi-column layout handling
  • Requirement pattern matching
  • Missed sequence fallback
  • Result: Complex PDF support + transparency

🎯 Use Cases

  • Regulatory Compliance: Extract requirements from standards and regulations
  • Technical Specifications: Process engineering specification documents
  • Contract Analysis: Identify contractual requirements and obligations
  • Quality Assurance: Create compliance matrices for verification
  • Requirements Engineering: Support requirements management workflows

🐛 Known Limitations

  • PDF Format Requirements: Works best with text-based PDFs (not scanned images). OCR support planned for v2.5+
  • Language Support: Currently optimized for English documents. Note: Multilingual support (v3.0) is in active development with language detection and support for French, German, Italian, Spanish, and Portuguese
  • Template Dependency: Excel template must contain "MACHINE COMP. MATRIX" sheet name exactly
  • Pattern-Enhanced Detection: Uses combination of keywords and NLP patterns for requirement identification

🚧 Features in Development

The following features are actively being developed for v3.0:

  • Multilingual Extraction: Language detection and NLP support for French, German, Italian, Spanish, Portuguese

    • Modules: language_detector.py, language_config.py, multilingual_nlp.py, pdf_analyzer_multilingual.py
    • Status: Core functionality implemented, integration testing in progress
  • Database Backend: SQLAlchemy-based persistence layer with PostgreSQL/SQLite support

    • Modules: database/models.py, database/services/, database/database.py
    • Features: Project management, document tracking, requirement versioning, processing sessions, change history
    • Status: Models complete, services in development

🤝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Testing: Ensure all tests pass before submitting PR

pytest -v

📝 License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.

GPL-3.0: You are free to use, modify, and distribute this software under the terms of the GPL-3.0 license. Any modifications or derivative works must also be released under GPL-3.0.


🙏 Acknowledgments

  • spaCy: For the excellent NLP library
  • PyMuPDF: For powerful PDF processing capabilities
  • PySide6/Qt: For the robust GUI framework

📞 Support


🔮 Roadmap

✅ Completed in v2.1:

  • ✅ User-adjustable confidence thresholds in GUI
  • ✅ Excel export with confidence score conditional formatting
  • ✅ HTML processing reports with quality metrics
  • ✅ Recent files/projects management
  • ✅ Requirement categorization (9 categories)
  • ✅ Keyword profile management
  • ✅ BASIL SPDX 3.0.1 integration
  • ✅ Thread cleanup fix for multiple sequential runs

Planned for v2.2 (Q2 2026):

  • Search/filter functionality in GUI
  • Dark mode theme
  • Performance optimizations

In Development (v3.0 - 2027):

  • 🚧 Multi-language support - Language detection and multilingual NLP (active development)
  • 🚧 Database backend - SQLAlchemy-based persistence with PostgreSQL/SQLite support (active development)
  • Web-based version with FastAPI/React
  • REST API for programmatic access
  • Machine learning-based requirement classifier
  • OCR support for scanned PDFs
  • Requirements traceability across documents

Long-Term Vision:

  • LLM integration (GPT-4/Claude) for smart extraction
  • Collaborative workflows with multi-user support
  • Platform integrations (Jira, Azure DevOps, Confluence)
  • Cloud processing option
  • Advanced analytics and reporting dashboard

📊 Version History

  • v2.2.0 (Q2 2026 - In Development) - Quality of life improvements, performance optimizations, enhanced testing
  • v2.1.1 (2025-11-18) - Bug fix: Thread cleanup for multiple sequential extractions
  • v2.1.0 (2025-11-17) - UX enhancements: Recent files/projects, adjustable confidence threshold, BASIL integration
  • v2.0.0 (2025-11-15) - Major NLP improvements: accuracy, performance, quality scoring
  • v1.2 (Previous) - Base functionality with GUI and Excel generation
  • v1.x - Initial releases

Built with ❤️ for requirements engineers and compliance professionals

Making requirement extraction accurate, fast, and reliable.
