
Arabic Dialect Sentiment Analysis: Domain-Adapted Transformer for Gulf Arabic

Python · PyTorch · Transformers · FastAPI · React · Docker · Scikit-learn

Arabic Dialect Sentiment Analysis is a comprehensive AI research project focused on developing advanced sentiment analysis capabilities for Gulf Arabic dialects. This project implements a Domain-Adapted Transformer (MARBERT) that achieves outstanding performance on Arabic sentiment classification, specifically addressing the challenges of low-resource language variants.

Live Demo: https://arabic-dialect-sentiment.vercel.app


Project Overview

This project addresses the critical challenge of sentiment analysis in Arabic dialects, particularly Gulf Arabic, which has been historically underrepresented in NLP research. The solution combines advanced transformer architecture with domain-adaptive pretraining and sophisticated data augmentation techniques.

Key Achievements

  • 88% Overall Accuracy on the ASTD dataset
  • 86% Macro F1 Score - exceptional balanced performance
  • All classes above 83% F1 - remarkable consistency
  • 29-point gain in Macro F1 (57% → 86%) over the baseline model

The Problem

  • Low-Resource Language: Arabic dialects lack sufficient labeled data for sentiment analysis
  • Dialectal Variation: Gulf Arabic differs significantly from Modern Standard Arabic
  • Class Imbalance: Traditional models perform poorly on minority sentiment classes
  • Limited Research: Few specialized models for Arabic dialect sentiment analysis
  • Performance Gap: Existing models show bias toward majority classes

Our Enhanced MARBERT Model addresses these challenges through innovative class balancing, data augmentation, and domain-adaptive training techniques.


Core Features

1. Enhanced MARBERT Architecture

  • Base Model: UBC-NLP/MARBERT pre-trained on Arabic text
  • Classification Head: 4-class sentiment classification (NEG, POS, NEUTRAL, OBJ)
  • Optimized Training: Custom hyperparameters for Arabic dialect processing
  • Device Optimization: CUDA support with mixed precision training

2. Advanced Data Processing

  • Class Balancing: Undersampling majority classes, oversampling minority classes
  • Arabic Text Augmentation: Synonym replacement, diacritic variations, character-level augmentation
  • Text Preprocessing: URL removal, mention/hashtag cleaning, Arabic character preservation
  • Smart Filtering: Length-based filtering and quality assessment
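
The cleaning and filtering steps above can be sketched as a single function. This is a minimal illustration, not the repository's actual preprocessor: the function name, the exact regexes, and the 3-word minimum length are assumptions.

```python
import re
from typing import Optional

# Unicode ranges covering the core Arabic block plus Arabic Supplement.
ARABIC_CHARS = r"\u0600-\u06FF\u0750-\u077F"

def preprocess_arabic(text: str, min_len: int = 3) -> Optional[str]:
    """Clean a tweet: drop URLs and mentions/hashtags, keep Arabic text."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URL removal
    text = re.sub(r"[@#]\w+", " ", text)                # mention/hashtag cleaning
    # Arabic character preservation: keep Arabic letters, digits, whitespace.
    text = re.sub(rf"[^{ARABIC_CHARS}0-9\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Length-based filtering: discard texts too short to classify reliably.
    return text if len(text.split()) >= min_len else None
```

In practice the quality-assessment step would apply further checks (e.g. ratio of Arabic characters), but the shape of the pipeline is the same.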

3. Performance Optimization

  • Weighted Loss Functions: Custom loss computation for class imbalance
  • Hyperparameter Tuning: Optimized learning rates, batch sizes, and training epochs
  • Gradient Accumulation: Effective batch size optimization
  • Early Stopping: Prevents overfitting with validation monitoring
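
The weighted-loss idea can be illustrated without the full training loop. One common scheme (the repository may use a different formula) sets each class weight inversely proportional to its frequency, which down-weights the majority OBJ class:

```python
import math

def inverse_freq_weights(counts):
    """Weight each class by total / (n_classes * count): rarer classes weigh more."""
    total = sum(counts.values())
    n = len(counts)
    return {c: total / (n * cnt) for c, cnt in counts.items()}

def weighted_cross_entropy(logits, target, weights, labels):
    """Weighted CE for one example: -w[target] * log softmax(logits)[target]."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    log_prob = (logits[labels.index(target)] - m) - math.log(sum(exps))
    return -weights[target] * log_prob

# Class counts from the ASTD distribution reported below.
counts = {"OBJ": 982, "NEG": 251, "POS": 117, "NEUTRAL": 123}
w = inverse_freq_weights(counts)
```

In the actual pipeline these weights would be passed to the loss inside a custom `Trainer.compute_loss`; the pure-Python version here just makes the arithmetic explicit.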

Model Performance & Results

Performance Metrics

| Metric           | Original | Enhanced | Improvement (pts) |
|------------------|----------|----------|-------------------|
| Overall Accuracy | 73%      | 88%      | +15               |
| Macro F1         | 57%      | 86%      | +29               |
| NEG F1           | 57%      | 87%      | +30               |
| POS F1           | 52%      | 85%      | +33               |
| NEUTRAL F1       | 35%      | 83%      | +48               |
| OBJ F1           | 83%      | 90%      | +7                |

Class Distribution Analysis

  • Original Dataset: Highly imbalanced (OBJ: 982, NEG: 251, POS: 117, NEUTRAL: 123)
  • Balanced Dataset: 500 samples per class for fair training
  • Augmentation Strategy: 3x augmentation for minority classes
  • Validation Split: 80/20 train-validation split with stratification
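
The balancing step above can be sketched with plain resampling. This is illustrative only: the real pipeline adds augmented texts for minority classes rather than duplicating rows, and the 500-per-class target comes from the figures above.

```python
import random

def balance_classes(samples, target=500, seed=42):
    """Undersample majority classes; oversample minority classes with replacement."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append((text, label))
    balanced = []
    for label, items in by_label.items():
        if len(items) >= target:
            balanced.extend(rng.sample(items, target))      # undersample
        else:
            balanced.extend(rng.choices(items, k=target))   # oversample
    rng.shuffle(balanced)
    return balanced
```

A stratified 80/20 split would then be taken per label from the balanced list, so each sentiment class appears in the same proportion in train and validation.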

Tech Stack

| Component             | Technologies Used                              |
|-----------------------|------------------------------------------------|
| Core ML Framework     | PyTorch, Transformers, Scikit-learn            |
| Language Model        | UBC-NLP/MARBERT, AutoModelForSequenceClassification |
| Data Processing       | Pandas, NumPy, Arabic text preprocessing       |
| Training & Evaluation | Hugging Face Trainer, custom metrics           |
| Web Application       | FastAPI (backend), React (frontend)            |
| Deployment            | Docker, Docker Compose                         |
| Development           | Google Colab, Jupyter Notebooks                |

Project Structure

Repository Organization

  • src/ - Core Python modules for data processing, model training, and evaluation
  • models/ - Trained model files and configurations
  • data/ - Dataset files and preprocessing scripts
  • webapp/ - FastAPI backend and React frontend
  • notebooks/ - Jupyter notebooks for training and experimentation
  • scripts/ - Utility scripts for setup and testing
  • configs/ - YAML configuration files for different components

Key Components

  • ASTD Data Loader: Specialized loader for Arabic Sentiment Tweets Dataset
  • Enhanced Preprocessor: Arabic-specific text cleaning and augmentation
  • MARBERT Fine-tuner: Custom training pipeline with weighted loss
  • Web Interface: User-friendly sentiment analysis application

Getting Started

Prerequisites

  • Python 3.8+
  • PyTorch 1.9+
  • Transformers 4.20+
  • CUDA-compatible GPU (recommended)
  • Google Colab account (for training)

Quick Start

  1. Clone the repository:

     ```bash
     git clone https://github.com/Johannes613/arabic-dialect-sentiment.git
     cd arabic-dialect-sentiment
     ```

  2. Install dependencies:

     ```bash
     pip install -r requirements.txt
     ```

  3. Load the trained model (the files live in the models/ directory):

     ```python
     from transformers import AutoModelForSequenceClassification

     model = AutoModelForSequenceClassification.from_pretrained("./models")
     ```

Training Your Own Model

  1. Upload the training notebook to Google Colab:
    # Use notebooks/marbert_finetuning_enhanced.ipynb
    # Follow the step-by-step training process
  2. Prepare your dataset:
    # Place ASTD dataset in data/raw/
    # Run preprocessing scripts for class balancing
  3. Start training:
    # Execute training cells in Colab
    # Monitor performance metrics

API Endpoints

Sentiment Analysis

  • POST /analyze - Single text sentiment analysis
  • POST /analyze/batch - Batch sentiment analysis
  • POST /preprocess - Arabic text preprocessing
  • GET /health - API health check
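
A client for the single-text endpoint can be sketched with the standard library. The base URL and the `text` field name in the JSON body are assumptions; check the FastAPI docs page (`/docs`) of the running backend for the actual schema.

```python
import json
from urllib import request

API_BASE = "http://localhost:8000"  # assumed local address of the FastAPI backend

def build_analyze_request(text):
    """Build a POST /analyze request; the `text` payload field is an assumption."""
    body = json.dumps({"text": text}).encode("utf-8")
    return request.Request(
        f"{API_BASE}/analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To send against a running server:
#   with request.urlopen(build_analyze_request("النص هنا")) as resp:
#       print(json.load(resp))
```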

Model Management

  • GET /model/info - Model performance metrics
  • POST /model/retrain - Trigger model retraining

Advanced Features

Data Augmentation Techniques

  • Synonym Replacement: Arabic word synonyms for minority classes
  • Diacritic Variations: Character-level augmentation (ا → أ, إ, آ)
  • Smart Augmentation: Class-specific augmentation strategies
  • Quality Control: Augmentation validation and filtering
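
The character-level variation idea (ا → أ, إ, آ) can be sketched as a simple substitution augmenter. This is a toy version: the function name and replacement probability are illustrative, and the project's augmenter layers several such strategies with quality filtering.

```python
import random

ALIF_VARIANTS = ["أ", "إ", "آ"]  # hamza-carrying variants of plain alif "ا"

def alif_augment(text, p=0.3, seed=None):
    """Randomly swap plain alifs for a hamza variant to create a new sample."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch == "ا" and rng.random() < p:
            out.append(rng.choice(ALIF_VARIANTS))
        else:
            out.append(ch)
    return "".join(out)
```

Because these variants are near-interchangeable in informal dialect writing, the augmented sample keeps its sentiment label while giving the model orthographic robustness.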

Training Optimizations

  • Learning Rate Scheduling: Cosine annealing with warmup
  • Gradient Accumulation: Effective batch size optimization
  • Mixed Precision: FP16 training for faster convergence
  • Early Stopping: Validation-based training termination
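
The schedule named above can be written as a standalone learning-rate multiplier. This mirrors the shape of Hugging Face's cosine-with-warmup schedule; the step counts in the usage note are illustrative, not the project's actual hyperparameters.

```python
import math

def cosine_warmup_lr(step, warmup_steps, total_steps):
    """LR multiplier: linear warmup from 0 to 1, then cosine decay to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

# e.g. with 100 warmup steps out of 1000 total, the multiplier ramps up,
# peaks at step 100, and decays smoothly to 0 by step 1000.
```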

Future Enhancements

  • Multi-Dialect Support: Extend to other Arabic dialects (Egyptian, Levantine)
  • Domain Adaptation: Specialized models for social media, news, reviews
  • Real-time Processing: Streaming sentiment analysis capabilities
  • Ensemble Methods: Combine multiple model architectures
  • Active Learning: Interactive model improvement with user feedback
  • Mobile Deployment: Optimized models for mobile applications

Research Contributions

This project contributes to the field of Arabic NLP by:

  • Addressing Class Imbalance: Novel approaches for Arabic sentiment analysis
  • Dialectal Adaptation: Specialized models for Gulf Arabic
  • Performance Benchmarking: New baseline for Arabic sentiment analysis
  • Open Source Release: Complete training pipeline and models

Contributing

We welcome contributions to improve Arabic dialect sentiment analysis! Areas for contribution include:

  • Additional Arabic dialect support
  • Enhanced data augmentation techniques
  • Model architecture improvements
  • Performance optimization
  • Documentation and tutorials

Citation

If you use this work in your research, please cite:

```bibtex
@misc{arabic_dialect_sentiment_2024,
  title={Enhanced MARBERT for Arabic Dialect Sentiment Analysis},
  author={Yohannis Adamu},
  year={2024},
  url={https://github.com/Johannes613/arabic-dialect-sentiment}
}
```

Arabic Dialect Sentiment Analysis - Advancing Arabic NLP Through Domain-Adapted Transformers
