A comprehensive toolkit for Hebrew coreference resolution that integrates multiple components into a unified pipeline. This system provides mention detection, web-based annotation, neural coreference models, and LLM evaluation capabilities.
- Mention Detection: Advanced NP chunking with multiple parser backends (Stanza, Trankit, Gold)
- Web Annotation: Interactive web interface for coreference and NP-relation annotation
- Neural Models: State-of-the-art neural coreference resolution (LingMess-Coref, WL-Coref)
- LLM Evaluation: Comprehensive evaluation of Large Language Models for coreference tasks
- ✅ Multi-Parser Support: Stanza, Trankit, and gold-standard parsing
- ✅ Interactive Annotation: Web-based interface for manual annotation
- ✅ Neural Training: End-to-end neural model training and evaluation
- ✅ LLM Integration: Zero-shot and few-shot evaluation of LLMs
- ✅ Comprehensive Evaluation: MUC, B³, CEAF metrics
- ✅ Production Ready: Clean CLI interface and modular architecture
- ✅ Extensible: Easy to add new models and components
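To make the evaluation metrics above concrete, here is a minimal, self-contained sketch of the link-based MUC score. This is an illustration of the metric's idea only, not the toolkit's own evaluator; the function name `muc_score` is hypothetical.

```python
def muc_score(gold, pred):
    """Link-based MUC metric over clusters of mention ids.

    Recall asks how many links in each gold cluster survive when the
    cluster is partitioned by the predicted clusters; precision is the
    same computation with the roles swapped.
    """
    def partial(clusters, other):
        num, den = 0, 0
        for c in clusters:
            c = set(c)
            covered, parts = set(), 0
            for o in other:
                overlap = c & set(o)
                if overlap:
                    parts += 1
                    covered |= overlap
            parts += len(c - covered)   # unmatched mentions are singletons
            num += len(c) - parts       # links preserved
            den += len(c) - 1           # links in the cluster
        return num / den if den else 0.0

    recall = partial(gold, pred)
    precision = partial(pred, gold)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, splitting the gold cluster `[1, 2, 3]` into `[1, 2]` and `[3]` breaks one of its two links, giving recall 0.5 at perfect precision.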
- Python 3.8+
- Git
```bash
# Clone the repository
git clone https://github.com/shaked571/hebrew_coreference.git
cd hebrew_coreference

# Clone the data submodule
git submodule update --init --recursive

# Install dependencies
pip install -r requirements.txt
```

```bash
# Run NP chunking with Stanza parser
python src/mention_detection/stanza_parser/stanza_chunker.py

# Run NP chunking with Trankit parser
python src/mention_detection/trankit_parser/trankit2spacy.py
```

```bash
# Train LingMess-Coref model
cd src/neural_models/neural_coref/src/lingmess-coref
python training.py

# Evaluate model
python eval.py
```

```bash
# Compare LLM outputs
cd error_analysis/llm_comparison
python compare_multi_system_outputs.py --approaches raw gold_tokenized sota_tokenized
```

```bash
# Start annotation server
cd src/annotation/tne_ui
python annotationServer.py
```

```
hebrew_coreference/
├── src/
│   ├── mention_detection/   # NP chunking and parsing
│   ├── neural_models/       # Neural coreference models
│   ├── llm_evaluation/      # LLM evaluation framework
│   └── annotation/          # Web-based annotation tools
├── error_analysis/          # Error analysis and comparison tools
├── statistics/              # Statistical analysis tools
├── tests/                   # Test suite
├── data/                    # Data submodule (separate repo)
└── scripts/                 # Utility scripts
```
The system supports multiple parsing backends for robust mention detection:
- Stanza: Stanford's NLP toolkit with Hebrew support
- Trankit: Lightweight, transformer-based multilingual NLP toolkit
- Gold: Manual annotation for high-quality reference
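One way a multi-backend setup like this can be wired together is a small registry that dispatches on the backend name. The sketch below is hypothetical (the registry, `register`, and `parse` are illustrative, not the repository's actual API), and the backend bodies are stubs standing in for real Stanza/Trankit calls:

```python
from typing import Callable, Dict, List

# Registry mapping backend names to tokenizer functions.
# Names and signatures are illustrative, not the toolkit's real API.
PARSERS: Dict[str, Callable[[str], List[str]]] = {}

def register(name: str):
    def wrap(fn):
        PARSERS[name] = fn
        return fn
    return wrap

@register("stanza")
def stanza_parse(text: str) -> List[str]:
    # Stub: a real backend would run a stanza Hebrew pipeline here
    return text.split()

@register("gold")
def gold_parse(text: str) -> List[str]:
    # Stub: a gold backend would read pre-tokenized reference data
    return text.split()

def parse(text: str, backend: str = "stanza") -> List[str]:
    """Tokenize `text` with the chosen backend, failing loudly on typos."""
    if backend not in PARSERS:
        raise ValueError(f"unknown backend {backend!r}; choose from {sorted(PARSERS)}")
    return PARSERS[backend](text)
```

The registry keeps backends interchangeable, so adding a new parser is a single decorated function.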
Two state-of-the-art neural architectures:
- LingMess-Coref: End-to-end coreference resolution with linguistically-informed mention-pair scoring
- WL-Coref: Word-level coreference resolution that scores single words and then recovers full spans
Comprehensive evaluation framework for Large Language Models:
- Raw Tokenization: Direct LLM output evaluation
- Gold Tokenization: Aligned with reference tokenization
- SOTA Tokenization: State-of-the-art tokenization alignment
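Comparing systems that tokenize differently requires projecting one tokenization onto another. A minimal sketch of character-offset alignment (hypothetical helper names; the real framework may align differently):

```python
from typing import List, Tuple

def char_spans(tokens: List[str], text: str) -> List[Tuple[int, int]]:
    """Map each token to its (start, end) character span in `text`,
    assuming tokens appear in order."""
    spans, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)
        spans.append((start, start + len(tok)))
        pos = start + len(tok)
    return spans

def align(pred_tokens: List[str], gold_tokens: List[str], text: str) -> List[List[int]]:
    """For each predicted token, list the indices of overlapping gold tokens."""
    pred_spans = char_spans(pred_tokens, text)
    gold_spans = char_spans(gold_tokens, text)
    return [[i for i, (gs, ge) in enumerate(gold_spans)
             if ps < ge and gs < pe]          # character spans overlap
            for ps, pe in pred_spans]
```

For instance, if an LLM emits `"cd"` as one token where the gold tokenization splits it into `"c"` and `"d"`, the predicted token maps to both gold indices.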
Interactive annotation interface for:
- Coreference annotation
- NP-relation annotation
- Mention boundary correction
- Quality control and validation
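A possible shape for the records such an annotation interface exchanges is sketched below. The schema (`Mention`, `Document`, token-indexed spans) is an assumption for illustration, not the server's actual data model:

```python
from dataclasses import dataclass, field, asdict
from typing import List, Tuple

@dataclass
class Mention:
    start: int   # index of the first token in the span
    end: int     # index one past the last token
    text: str    # surface form, kept for display in the UI

@dataclass
class Document:
    doc_id: str
    tokens: List[str]
    clusters: List[List[Mention]] = field(default_factory=list)

    def add_cluster(self, spans: List[Tuple[int, int]]) -> None:
        """Record a coreference cluster from (start, end) token spans."""
        self.clusters.append(
            [Mention(s, e, " ".join(self.tokens[s:e])) for s, e in spans])

doc = Document("d1", ["The", "cat", "saw", "it"])
doc.add_cluster([(0, 2), (3, 4)])   # "The cat" corefers with "it"
```

Using half-open token spans keeps boundary correction a matter of adjusting two integers, and `asdict` turns the record into JSON-ready data for the web client.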
```bash
export PYTHONPATH="${PYTHONPATH}:$(pwd)/src"
export CUDA_VISIBLE_DEVICES=0  # For GPU training
```

Models can be configured through YAML/TOML files in their respective directories:

- `src/neural_models/neural_coref/src/lingmess-coref/config.yaml`
- `src/neural_models/neural_coref/src/wl-coref/config.toml`
```python
from src.mention_detection.np_chunker.chunker import NPChunker

chunker = NPChunker()
chunks = chunker.chunk_text("הטקסט בעברית עם שמות עצם")
print(chunks)
```

```python
from src.neural_models.neural_coref.src.lingmess_coref.metrics import CorefEvaluator

evaluator = CorefEvaluator()
scores = evaluator.evaluate(predictions, gold)
print(f"MUC: {scores['muc']}, B³: {scores['b3']}, CEAF: {scores['ceaf']}")
```

```python
from error_analysis.llm_comparison.compare_multi_system_outputs import MultiSystemComparisonRunner

runner = MultiSystemComparisonRunner()
results = runner.run_comprehensive_analysis()
```

Run the comprehensive test suite:
```bash
# Run all tests
python -m pytest tests/

# Run specific test categories
python -m pytest tests/test_mention_detection.py
python -m pytest tests/test_neural_models.py
python -m pytest tests/test_llm_evaluation.py
```

The system includes comprehensive error analysis tools:
- Cluster-level Analysis: Detailed breakdown of correct/missed/extra clusters
- Multi-system Comparison: Compare different approaches side-by-side
- Visualization: Generate charts and graphs for analysis
- Statistical Testing: Significance testing for performance differences
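Significance testing for paired system comparisons is often done with a paired bootstrap over per-document scores. The sketch below illustrates that idea; it is a minimal stand-in, not the toolkit's actual statistical-testing code:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Paired bootstrap test on per-document scores of two systems.

    Returns the observed mean difference (A minus B) and the fraction
    of resamples in which A fails to beat B, a one-sided p-value estimate.
    """
    assert len(scores_a) == len(scores_b), "scores must be paired"
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / len(diffs)
    losses = 0
    for _ in range(n_resamples):
        sample = [rng.choice(diffs) for _ in diffs]
        if sum(sample) / len(sample) <= 0:
            losses += 1   # this resample does not favor system A
    return observed, losses / n_resamples
```

Resampling document-level differences (rather than raw scores) keeps the pairing intact, which matters because the two systems are evaluated on the same documents.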
We welcome contributions! Please see our contributing guidelines:
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests for new functionality
5. Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- Hebrew NLP community for language resources
- Stanford NLP Group for Stanza
- University of Oregon NLP group for Trankit
- Contributors to LingMess-Coref and WL-Coref
For questions and support:
- Open an issue on GitHub
- Check the documentation in each component directory
- Review the error analysis outputs for troubleshooting
Note: This system is designed specifically for Hebrew text and includes Hebrew-specific optimizations and linguistic features.