Arabic Dialect Sentiment Analysis is a research project on sentiment classification for Gulf Arabic dialects. It fine-tunes a domain-adapted transformer (MARBERT) that reaches 88% accuracy and 86% macro F1 on the ASTD dataset, with a focus on the challenges of low-resource language variants.
Live Demo: https://arabic-dialect-sentiment.vercel.app
This project addresses sentiment analysis in Arabic dialects, particularly Gulf Arabic, which has been historically underrepresented in NLP research. The approach combines a pretrained transformer with domain-adaptive pretraining and data augmentation.
- 88% Overall Accuracy on the ASTD dataset
- 86% Macro F1 Score, averaged equally over the four classes
- All classes at or above 83% F1
- 29-percentage-point improvement in Macro F1 over the baseline model (57% to 86%)
- Low-Resource Language: Arabic dialects lack sufficient labeled data for sentiment analysis
- Dialectal Variation: Gulf Arabic differs significantly from Modern Standard Arabic
- Class Imbalance: Traditional models perform poorly on minority sentiment classes
- Limited Research: Few specialized models for Arabic dialect sentiment analysis
- Performance Gap: Existing models show bias toward majority classes
Our Enhanced MARBERT Model addresses these challenges through innovative class balancing, data augmentation, and domain-adaptive training techniques.
- Base Model: UBC-NLP/MARBERT pre-trained on Arabic text
- Classification Head: 4-class sentiment classification (NEG, POS, NEUTRAL, OBJ)
- Optimized Training: Custom hyperparameters for Arabic dialect processing
- Device Optimization: CUDA support with mixed precision training
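The architecture described above can be sketched with the Hugging Face `transformers` API. This is a minimal sketch, not the project's actual loading code: the label order follows the class list above, but the integer ids and the helper name `load_marbert_classifier` are assumptions made for illustration.

```python
# Label order follows the README's class list; the integer ids are an assumption.
LABELS = ["NEG", "POS", "NEUTRAL", "OBJ"]
ID2LABEL = {i: name for i, name in enumerate(LABELS)}
LABEL2ID = {name: i for i, name in ID2LABEL.items()}


def load_marbert_classifier(checkpoint: str = "UBC-NLP/MARBERT"):
    """Load MARBERT with a newly initialised 4-way classification head.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    # Imported lazily so the label maps can be used without the heavy dependency.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint,
        num_labels=len(LABELS),
        id2label=ID2LABEL,
        label2id=LABEL2ID,
    )
    return tokenizer, model
```

Passing `id2label` and `label2id` stores the mapping in the model config, so a saved checkpoint reports class names rather than bare indices at inference time.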
- Class Balancing: Undersampling majority classes, oversampling minority classes
- Arabic Text Augmentation: Synonym replacement, diacritic variations, character-level augmentation
- Text Preprocessing: URL removal, mention/hashtag cleaning, Arabic character preservation
- Smart Filtering: Length-based filtering and quality assessment
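The preprocessing and filtering steps listed above could look roughly like the following. This is a sketch using only the standard library; the regular expressions, the function names, and the length thresholds are assumptions, not the project's actual rules.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Keep the Unicode Arabic block (U+0600-U+06FF: letters, diacritics, and
# Arabic punctuation), ASCII letters and digits, and whitespace.
NON_TEXT_RE = re.compile(r"[^\u0600-\u06FF0-9A-Za-z\s]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Remove URLs and mentions, unwrap hashtags, and keep Arabic characters."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # Keep the words of a hashtag but drop the marker and underscores.
    text = text.replace("#", " ").replace("_", " ")
    text = NON_TEXT_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


def keep_by_length(text: str, min_tokens: int = 2, max_tokens: int = 64) -> bool:
    """Length-based filter; the thresholds are illustrative assumptions."""
    n_tokens = len(text.split())
    return min_tokens <= n_tokens <= max_tokens
```

For example, `clean_tweet("@user الخدمة ممتازة جدا! https://t.co/abc #دبي")` drops the mention, the URL, and the exclamation mark, and keeps the hashtag word as ordinary text.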
- Weighted Loss Functions: Custom loss computation for class imbalance
- Hyperparameter Tuning: Optimized learning rates, batch sizes, and training epochs
- Gradient Accumulation: Effective batch size optimization
- Early Stopping: Prevents overfitting with validation monitoring
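To make the weighted-loss idea above concrete, here is a dependency-free sketch. The class counts are the original ASTD counts reported in this README; the weighting formula (`n_samples / (n_classes * count)`, the "balanced" heuristic also used by scikit-learn) is an assumption, since the README does not state how the weights were derived.

```python
import math

# Original class counts as reported in this README.
CLASS_COUNTS = {"NEG": 251, "POS": 117, "NEUTRAL": 123, "OBJ": 982}


def balanced_class_weights(counts):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


def weighted_cross_entropy(logits, targets, weights):
    """Weighted cross-entropy over a batch.

    logits:  list of per-example score lists
    targets: list of integer class indices
    weights: list of per-class weights, indexed by class index
    The result is normalised by the sum of the target weights.
    """
    total_loss = 0.0
    total_weight = 0.0
    for row, target in zip(logits, targets):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        nll = log_z - row[target]
        total_loss += weights[target] * nll
        total_weight += weights[target]
    return total_loss / total_weight
```

With these counts the majority class (OBJ) receives a weight of 0.375 while POS receives roughly 3.15, so errors on rare classes cost more. In PyTorch, the same weights could be passed as the `weight` tensor of `torch.nn.CrossEntropyLoss`, which applies the same normalisation under its default `reduction="mean"`.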
| Metric | Original | Enhanced | Improvement (percentage points) |
|---|---|---|---|
| Overall Accuracy | 73% | 88% | +15 |
| Macro F1 | 57% | 86% | +29 |
| NEG F1 | 57% | 87% | +30 |
| POS F1 | 52% | 85% | +33 |
| NEUTRAL F1 | 35% | 83% | +48 |
| OBJ F1 | 83% | 90% | +7 |
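As a consistency check on the table above, macro F1 is the unweighted mean of the per-class F1 scores. The sketch below recomputes it from the rounded per-class values in the table, so results can differ from the reported figures by rounding; the variable and function names are illustrative.

```python
# Per-class F1 scores (in percent) copied from the results table.
ORIGINAL_F1 = {"NEG": 57, "POS": 52, "NEUTRAL": 35, "OBJ": 83}
ENHANCED_F1 = {"NEG": 87, "POS": 85, "NEUTRAL": 83, "OBJ": 90}


def macro_f1(per_class):
    """Macro F1: every class counts equally, regardless of its size."""
    return sum(per_class.values()) / len(per_class)


original = macro_f1(ORIGINAL_F1)        # 56.75, reported as 57%
enhanced = macro_f1(ENHANCED_F1)        # 86.25, reported as 86%
gain_points = enhanced - original       # absolute gain in percentage points
relative_gain = gain_points / original  # relative gain as a fraction of the baseline
```

The absolute gain is about 29 percentage points, which corresponds to a relative gain of roughly 52% over the baseline; the two figures measure different things and are easy to confuse.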
- Original Dataset: Highly imbalanced (OBJ: 982, NEG: 251, POS: 117, NEUTRAL: 123)
- Balanced Dataset: 500 samples per class for fair training
- Augmentation Strategy: 3x augmentation for minority classes
- Validation Split: 80/20 train-validation split with stratification
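The balancing and splitting steps above can be sketched with the standard library alone. This is an illustration under stated assumptions: plain random oversampling stands in for the project's text augmentation, and the function names and the seed are hypothetical.

```python
import random
from collections import defaultdict


def balance_classes(examples, target_per_class=500, seed=13):
    """Return exactly target_per_class (text, label) pairs for every label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append(text)

    balanced = []
    for label, texts in by_label.items():
        if len(texts) >= target_per_class:
            # Majority class: undersample without replacement.
            chosen = rng.sample(texts, target_per_class)
        else:
            # Minority class: keep everything, then oversample with replacement.
            chosen = texts + rng.choices(texts, k=target_per_class - len(texts))
        balanced.extend((text, label) for text in chosen)
    rng.shuffle(balanced)
    return balanced


def stratified_split(examples, val_fraction=0.2, seed=13):
    """Split (text, label) pairs so each label keeps the same train/val ratio."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)

    train, val = [], []
    for items in by_label.values():
        items = items[:]  # copy so the caller's data is not reordered
        rng.shuffle(items)
        n_val = round(len(items) * val_fraction)
        val.extend(items[:n_val])
        train.extend(items[n_val:])
    return train, val
```

One design point worth noting: if duplicates or augmented copies are created before the split, copies of the same tweet can land in both training and validation sets and inflate validation scores. Splitting first and balancing only the training portion avoids that.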
| Component | Technologies Used |
|---|---|
| Core ML Framework | PyTorch, Transformers, Scikit-learn |
| Language Model | UBC-NLP/MARBERT, AutoModelForSequenceClassification |
| Data Processing | Pandas, NumPy, Arabic text preprocessing |
| Training & Evaluation | Hugging Face Trainer, Custom metrics |
| Web Application | FastAPI (Backend), React (Frontend) |
| Deployment | Docker, Docker Compose |
| Development | Google Colab, Jupyter Notebooks |
- src/ - Core Python modules for data processing, model training, and evaluation
- models/ - Trained model files and configurations
- data/ - Dataset files and preprocessing scripts
- webapp/ - FastAPI backend and React frontend
- notebooks/ - Jupyter notebooks for training and experimentation
- scripts/ - Utility scripts for setup and testing
- configs/ - YAML configuration files for different components
- ASTD Data Loader: Specialized loader for Arabic Sentiment Tweets Dataset
- Enhanced Preprocessor: Arabic-specific text cleaning and augmentation
- MARBERT Fine-tuner: Custom training pipeline with weighted loss
- Web Interface: User-friendly sentiment analysis application
- Python 3.8+
- PyTorch 1.9+
- Transformers 4.20+
- CUDA-compatible GPU (recommended)
- Google Colab account (for training)
1. Clone the repository:

       git clone https://github.com/Johannes613/arabic-dialect-sentiment.git
       cd arabic-dialect-sentiment

2. Install dependencies:

       pip install -r requirements.txt

3. Download the trained model:

       # Model files are in the models/ directory
       # Load with: AutoModelForSequenceClassification.from_pretrained("./models")
1. Upload the training notebook to Google Colab:

       # Use notebooks/marbert_finetuning_enhanced.ipynb
       # Follow the step-by-step training process

2. Prepare your dataset:

       # Place ASTD dataset in data/raw/
       # Run preprocessing scripts for class balancing

3. Start training:

       # Execute training cells in Colab
       # Monitor performance metrics
- POST /analyze - Single text sentiment analysis
- POST /analyze/batch - Batch sentiment analysis
- POST /preprocess - Arabic text preprocessing
- GET /health - API health check
- GET /model/info - Model performance metrics
- POST /model/retrain - Trigger model retraining
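A client call to the `POST /analyze` endpoint listed above might look like this. The README does not document the request or response schema, so the `text` field name, the JSON response, and the `http://localhost:8000` base URL are all assumptions; check the FastAPI app in `webapp/` for the real contract.

```python
import json
import urllib.request


def build_analyze_request(base_url: str, text: str) -> urllib.request.Request:
    """Build a POST /analyze request. The "text" field name is an assumption."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/analyze",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(base_url: str, text: str, timeout: float = 10.0):
    """Send the request and decode the body, assuming the API returns JSON."""
    request = build_analyze_request(base_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the backend running locally, `analyze("http://localhost:8000", "الخدمة ممتازة")` would return whatever JSON the endpoint produces.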
- Synonym Replacement: Arabic word synonyms for minority classes
- Diacritic Variations: Character-level augmentation (ا → أ, إ, آ)
- Smart Augmentation: Class-specific augmentation strategies
- Quality Control: Augmentation validation and filtering
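The character-level variation listed above (ا → أ, إ, آ) can be sketched as follows. The replacement probability, the number of copies, and the function names are assumptions; the only quality control shown is dropping copies that are unchanged or duplicated.

```python
import random

ALEF = "\u0627"                                 # ا
ALEF_VARIANTS = ["\u0623", "\u0625", "\u0622"]  # أ, إ, آ


def vary_alef(text, prob=0.3, rng=None):
    """Replace each bare alef with a random variant with probability prob."""
    if rng is None:
        rng = random.Random()
    out = []
    for ch in text:
        if ch == ALEF and rng.random() < prob:
            out.append(rng.choice(ALEF_VARIANTS))
        else:
            out.append(ch)
    return "".join(out)


def augment(text, n_copies=3, prob=0.3, seed=0):
    """Return up to n_copies distinct variants that differ from the input."""
    rng = random.Random(seed)
    variants = set()
    for _ in range(n_copies * 5):  # bounded number of attempts
        candidate = vary_alef(text, prob, rng)
        if candidate != text:      # drop unchanged copies
            variants.add(candidate)
        if len(variants) == n_copies:
            break
    return sorted(variants)
```

This kind of substitution imitates the inconsistent alef spelling common in informal Arabic writing. It changes only orthography, so the sentiment label of the original text carries over to each copy.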
- Learning Rate Scheduling: Cosine annealing with warmup
- Gradient Accumulation: Effective batch size optimization
- Mixed Precision: FP16 training for faster training steps and lower memory use
- Early Stopping: Validation-based training termination
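Two of the techniques above, cosine annealing with warmup and validation-based early stopping, are easy to sketch without any framework. The step counts, the learning rate, and the patience value in the usage note are illustrative assumptions, not the project's settings.

```python
import math


def cosine_with_warmup(step, warmup_steps, total_steps, base_lr):
    """Linear warmup to base_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # hold at zero after the final step
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


class EarlyStopping:
    """Stop when a higher-is-better metric fails to improve for `patience` evaluations."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = None
        self.bad_evals = 0

    def should_stop(self, metric):
        if self.best is None or metric > self.best + self.min_delta:
            self.best = metric
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience
```

For instance, with `warmup_steps=100`, `total_steps=1000`, and `base_lr=2e-5`, the rate rises linearly to `2e-5` at step 100 and decays to zero at step 1000. Gradient accumulation fits alongside this: the effective batch size is the per-device batch size multiplied by the number of accumulation steps, and the scheduler advances once per optimizer update rather than once per mini-batch.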
- Multi-Dialect Support: Extend to other Arabic dialects (Egyptian, Levantine)
- Domain Adaptation: Specialized models for social media, news, reviews
- Real-time Processing: Streaming sentiment analysis capabilities
- Ensemble Methods: Combine multiple model architectures
- Active Learning: Interactive model improvement with user feedback
- Mobile Deployment: Optimized models for mobile applications
This project contributes to the field of Arabic NLP by:
- Addressing Class Imbalance: Novel approaches for Arabic sentiment analysis
- Dialectal Adaptation: Specialized models for Gulf Arabic
- Performance Benchmarking: New baseline for Arabic sentiment analysis
- Open Source Release: Complete training pipeline and models
We welcome contributions to improve Arabic dialect sentiment analysis! Areas for contribution include:
- Additional Arabic dialect support
- Enhanced data augmentation techniques
- Model architecture improvements
- Performance optimization
- Documentation and tutorials
If you use this work in your research, please cite:
@misc{arabic_dialect_sentiment_2024,
title={Enhanced MARBERT for Arabic Dialect Sentiment Analysis},
author={Yohannis Adamu},
year={2024},
url={https://github.com/Johannes613/arabic-dialect-sentiment}
}
Arabic Dialect Sentiment Analysis - Advancing Arabic NLP Through Domain-Adapted Transformers