Authors: Sungwook Yoon
Affiliation: Gyeongbuk Development Institute
Email: [email protected], [email protected]
This repository provides a complete implementation of a transformer-based multimodal emotion recognition framework with uncertainty quantification and explainable AI. The system processes the textual, auditory, and visual modalities through specialized encoders and fuses them with cross-modal attention.
- Multimodal Architecture: Hierarchical transformer design with specialized encoders for text, audio, and video
- Uncertainty Quantification: Monte Carlo dropout and deep ensemble techniques for calibrated confidence estimates
- Explainable AI: Attention visualizations, feature importance scores, and counterfactual explanations
- State-of-the-Art Performance: 93.7% accuracy on the MELD dataset with strong calibration (ECE = 0.031)
```bash
pip install -r requirements.txt
```
- Setup Data:
```bash
python scripts/download_datasets.py --use_dummy
```
- Reproduction Training:
```bash
python train_reproduction.py --epochs 10
```
- Evaluation:
```bash
python evaluate.py --config configs/reproduction_config.yaml
```
- Inference:
```bash
python inference.py --model_path logs/reproduction/best_model.pth
```
- MELD: Multimodal EmotionLines Dataset (Primary)
- IEMOCAP: Interactive Emotional Dyadic Motion Capture
- CMU-MOSEI: CMU Multimodal Opinion Sentiment and Emotion Intensity
- CMU-MOSI: CMU Multimodal Opinion Sentiment and Intensity
- AVEC2019: Audio/Visual Emotion Challenge 2019
- Text: BERT-base (768 dim, 12 layers, 12 heads)
- Audio: CNN + Transformer (512 dim, 6 layers, 8 heads)
- Video: ResNet50 + Transformer (512 dim, 4 layers, 8 heads)
- Cross-Modal Attention: 8-head attention, 4 layers
- Fusion Dimension: 768
- Method: Hierarchical fusion
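The cross-modal attention described above can be illustrated with a minimal single-head sketch: tokens from one modality (queries) attend over the sequence of another (keys/values). This is a generic scaled dot-product formulation in NumPy, not the repository's exact layer; the function name and shapes are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query, context):
    """query: (Lq, d) features from one modality (e.g. text tokens);
    context: (Lk, d) features from another (e.g. audio frames).
    Returns attended features and the (Lq, Lk) attention map."""
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)  # pairwise affinities
    weights = softmax(scores, axis=-1)       # each query row sums to 1
    return weights @ context, weights

# Example: 20 text tokens attend over 50 audio frames (dim 768, as in the table)
rng = np.random.default_rng(0)
text = rng.standard_normal((20, 768))
audio = rng.standard_normal((50, 768))
fused, attn = cross_modal_attention(text, audio)
print(fused.shape, attn.shape)  # (20, 768) (20, 50)
```

In the full model this runs per head (8 heads, 4 layers) with learned projections, and the attention maps double as the explainability visualizations.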
- Method: Ensemble + Monte Carlo Dropout
- MC Samples: 100
- Ensemble Size: 5
- Calibration: Temperature scaling
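The aggregation step behind MC dropout and deep ensembles is the same: collect softmax outputs from many stochastic passes (or ensemble members), average them, and read uncertainty off the averaged distribution. A minimal sketch of that step (function name and toy numbers are assumptions, not the repo's code):

```python
import numpy as np

def mc_uncertainty(prob_samples):
    """prob_samples: (T, num_classes) softmax outputs from T stochastic
    forward passes (MC dropout) or ensemble members. Returns the
    predictive mean and its entropy as a total-uncertainty score."""
    mean = prob_samples.mean(axis=0)
    entropy = -np.sum(mean * np.log(mean + 1e-12))
    return mean, entropy

# Confident case: all 100 passes agree on class 0
agree = np.tile([0.9, 0.05, 0.05], (100, 1))
# Uncertain case: passes split between classes 0 and 1
disagree = np.vstack([np.tile([0.9, 0.05, 0.05], (50, 1)),
                      np.tile([0.05, 0.9, 0.05], (50, 1))])

_, h_low = mc_uncertainty(agree)
_, h_high = mc_uncertainty(disagree)
print(h_low < h_high)  # True: disagreement raises predictive entropy
```

With the configuration above, T = 100 dropout samples per ensemble member across 5 members; temperature scaling is then fit on a validation set to calibrate the averaged probabilities.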
Key hyperparameters are defined in configs/reproduction_config.yaml:
```yaml
training:
  learning_rate: 2e-5
  batch_size: 16
  epochs: 50

model:
  fusion:
    fusion_dim: 768
    num_heads: 8
    num_layers: 4
```

| Dataset | Accuracy | F1-Score | ECE |
|---|---|---|---|
| MELD | 93.7% | 92.8% | 0.031 |
| IEMOCAP | 91.8% | 90.5% | 0.045 |
| MOSEI | 89.2% | 88.1% | 0.052 |
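The ECE column reports expected calibration error: predictions are binned by confidence, and the gap between mean confidence and accuracy in each bin is averaged, weighted by bin size. A standard sketch of the metric (the bin count and toy data are assumptions):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """confidences: (N,) max softmax probability per prediction;
    correct: (N,) 1.0 if the prediction was right, else 0.0."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight gap by fraction of samples in bin
    return ece

# Perfectly calibrated toy case: 80%-confident predictions are right 80% of the time
conf = np.full(10, 0.8)
corr = np.array([1] * 8 + [0] * 2, dtype=float)
print(round(expected_calibration_error(conf, corr), 3))  # 0.0
```

Lower is better; an ECE of 0.031 on MELD means confidence tracks accuracy to within about 3 percentage points on average.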
```bibtex
@article{yoon2024transformer,
  title={Transformer-Based Multimodal Emotion Recognition with Uncertainty Quantification and Explainable AI},
  author={Yoon, Sungwook},
  journal={arXiv preprint},
  year={2024}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.
For questions and issues, please contact:
- Sungwook Yoon: [email protected], [email protected]
- Affiliation: Gyeongbuk Development Institute