A toolkit for evaluating Qwen model variants and LoRA fine-tuned models.
## Quick Start

```bash
# Install dependencies
pip3 install -r requirements-eval.txt

# List test suites
python3 qwen-eval-v2.py --list-suites

# Run an evaluation
python3 qwen-eval-v2.py --models qwen2.5:8b qwen-8b-dialog-v1 --verbose
```

## Documentation

- QWEN-EVAL-V2-README.md - Comprehensive guide (v2 framework)
- MIGRATION-V1-TO-V2.md - Migration guide from v1
- QWEN-EVAL-README.md - v1 documentation
## Directory Structure

```
evaluation/
├── README.md                   # This file
├── qwen-eval-v2.py             # Main CLI (v2 - recommended)
├── qwen-eval.py                # Legacy CLI (v1 - simple)
├── qwen_eval/                  # Core package
│   ├── config.py               # Configuration & logging
│   ├── core.py                 # Evaluation engine
│   ├── test_suites.py          # Built-in test suites
│   ├── metrics.py              # Metric registry (15+ metrics)
│   └── reporters.py            # Output formatters
├── eval-config-example.yaml    # Example configuration
├── requirements-eval.txt       # Python dependencies
├── QWEN-EVAL-V2-README.md      # Full documentation
└── MIGRATION-V1-TO-V2.md       # Upgrade guide
```
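YAML configs make runs reproducible. The fragment below is a hypothetical example: the key names (`models`, `suites`, `workers`, `cache`, `reporters`) mirror the CLI flags and the suite/reporter names in this README and are assumptions — see eval-config-example.yaml for the actual schema.

```yaml
# Hypothetical config -- key names are assumptions mirroring the CLI flags;
# eval-config-example.yaml is the authoritative reference.
models:
  - qwen2.5:8b
  - qwen-8b-dialog-v1
suites:
  - pedagogical
  - dialogue
workers: 8
cache: true
reporters:
  - json
  - markdown
```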
## Features

- Parallel execution - run evaluations concurrently across worker threads
- Result caching - avoid redundant inference on repeated prompts
- YAML configs - reproducible evaluations
- 15+ metrics - pedagogical quality, dialogue, complexity
- 4 test suites - pedagogical, dialogue, baseline, stress
- 3 reporters - JSON, Markdown, side-by-side comparison
- Extensible - add custom metrics and test suites
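As a sketch of the extensibility point, here is one way a custom metric could be written and registered. The `register` decorator and `METRICS` dict are illustrative assumptions; the real registry lives in `qwen_eval/metrics.py` and its API may differ.

```python
# Illustrative sketch only -- the register()/METRICS API is an assumption,
# not the actual qwen_eval.metrics interface.
from typing import Callable, Dict

METRICS: Dict[str, Callable[[str], float]] = {}

def register(name: str):
    """Register a metric function under a name (assumed API)."""
    def wrap(fn: Callable[[str], float]):
        METRICS[name] = fn
        return fn
    return wrap

@register("question_density")
def question_density(response: str) -> float:
    """Fraction of sentences that are questions -- a rough proxy
    for dialogic/pedagogical style."""
    # Split on sentence-ending punctuation, keeping the terminator.
    marked = response.replace("?", "?|").replace(".", ".|")
    sentences = [s.strip() for s in marked.split("|") if s.strip()]
    if not sentences:
        return 0.0
    return sum(1 for s in sentences if s.endswith("?")) / len(sentences)

print(METRICS["question_density"]("What is recursion? It calls itself."))  # 0.5
```

A metric defined this way could then be picked up by name in a test suite, keeping suite definitions declarative.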
## Usage Examples

```bash
# Compare a base model against a fine-tuned variant
python3 qwen-eval-v2.py --models base-model fine-tuned-v1

# Run from a YAML configuration file
python3 qwen-eval-v2.py --config eval-config-example.yaml

# Parallel execution with result caching
python3 qwen-eval-v2.py --workers 8 --cache --verbose
```

## Related

- Main project: ../README.md
- Fine-tuning workflows: ../fine-tuning/
- Training logs: ../DEVLOG.md