Skip to content

SungwookYoon/4tbmer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Transformer-Based Multimodal Emotion Recognition with Uncertainty Quantification and Explainable AI

Authors: Sungwook Yoon
Affiliation: Gyeongbuk Development Institute
Email: [email protected], [email protected]

Abstract

This repository provides a complete implementation of a transformer-based multimodal emotion recognition framework that integrates uncertainty quantification and explainable AI capabilities. The system processes textual, auditory, and visual modalities through specialized encoders and fuses them via sophisticated cross-modal attention mechanisms.

Features

  • Multimodal Architecture: Hierarchical transformer design with specialized encoders for text, audio, and video
  • Uncertainty Quantification: Monte Carlo dropout and deep ensemble techniques for calibrated confidence estimates
  • Explainable AI: Attention visualizations, feature importance scores, and counterfactual explanations
  • State-of-the-art Performance: 93.7% accuracy on MELD dataset with excellent calibration (ECE=0.031)

Installation

pip install -r requirements.txt

Quick Start

  1. Setup Data:
python scripts/download_datasets.py --use_dummy
  1. Reproduction Training:
python train_reproduction.py --epochs 10
  1. Evaluation:
python evaluate.py --config configs/reproduction_config.yaml
  1. Inference:
python inference.py --model_path logs/reproduction/best_model.pth

Dataset Support

  • MELD: Multimodal EmotionLines Dataset (Primary)
  • IEMOCAP: Interactive Emotional Dyadic Motion Capture
  • CMU-MOSEI: CMU Multimodal Opinion Sentiment and Emotion Intensity
  • CMU-MOSI: CMU Multimodal Opinion Sentiment and Intensity
  • AVEC2019: Audio/Visual Emotion Challenge 2019

Model Architecture

Encoders

  • Text: BERT-base (768 dim, 12 layers, 12 heads)
  • Audio: CNN + Transformer (512 dim, 6 layers, 8 heads)
  • Video: ResNet50 + Transformer (512 dim, 4 layers, 8 heads)

Fusion

  • Cross-Modal Attention: 8-head attention, 4 layers
  • Fusion Dimension: 768
  • Method: Hierarchical fusion

Uncertainty

  • Method: Ensemble + Monte Carlo Dropout
  • MC Samples: 100
  • Ensemble Size: 5
  • Calibration: Temperature scaling

Configuration

Key hyperparameters are defined in configs/reproduction_config.yaml:

training:
  learning_rate: 2e-5
  batch_size: 16
  epochs: 50
  
model:
  fusion:
    fusion_dim: 768
    num_heads: 8
    num_layers: 4

Results

Dataset Accuracy F1-Score ECE
MELD 93.7% 92.8% 0.031
IEMOCAP 91.8% 90.5% 0.045
MOSEI 89.2% 88.1% 0.052

Citation

@article{yoon2024transformer,
  title={Transformer-Based Multimodal Emotion Recognition with Uncertainty Quantification and Explainable AI},
  author={Yoon, Sungwook},
  journal={arXiv preprint},
  year={2024}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For questions and issues, please contact:

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages