RohanMukka/Multilingual-Polarization-Detection
SemEval 2026 Task 9 — Multilingual Polarization Detection

A Comprehensive Study Using Transformer Baselines, LoRA, and LLM Prompting

This repository contains the full experimental pipeline, models, and analysis for SemEval 2026 Task 9 – Subtask 1: Multilingual Polarization Detection, a binary classification task across 22 languages.

Our study evaluates traditional multilingual transformer baselines, generative encoder–decoder models, parameter-efficient LoRA variants, and both zero-shot and few-shot LLM prompting.

Authors: Kamal Poshala, Kushi Reddy Kankar, and Rohan Mukka.


📌 Overview

Polarization involves ideological hostility, adversarial rhetoric, and divisive language. The task requires building models that generalize across heterogeneous linguistic groups, including:

  • High-resource languages
  • Medium-resource languages
  • Low-resource languages
  • Noisy / morphologically complex languages

Our project analyzes these challenges and proposes a unified multilingual modeling pipeline.


🔬 Methods Summary

Our study evaluates four major modeling families:

1. mBERT Baseline

  • Encoder-only model
  • WordPiece tokenizer
  • Performance: Strong on high-resource languages, weaker on noisy text.

2. mT5 Seq2Seq

  • Reformulates classification as text generation
  • Performance: Better on structured languages; unstable on noisy or morphologically complex languages.
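The seq2seq reformulation above amounts to a label verbalizer plus an output parser: labels are mapped to target strings the decoder is trained to emit, and generated text is mapped back to a label. A minimal sketch follows; the exact verbalizer strings `"polarized"` / `"not polarized"` are illustrative assumptions, not taken from the task data:

```python
# Sketch: casting binary polarization detection as text generation (mT5-style).
# The verbalizer strings are assumptions for illustration.

LABEL_TO_TEXT = {1: "polarized", 0: "not polarized"}
TEXT_TO_LABEL = {v: k for k, v in LABEL_TO_TEXT.items()}

def build_target(label: int) -> str:
    """Map a binary label to the text the decoder is trained to emit."""
    return LABEL_TO_TEXT[label]

def parse_prediction(generated: str) -> int:
    """Map generated text back to a label. Outputs that match no verbalizer
    default to 0 -- one source of the instability seen on noisy languages,
    where the decoder drifts off the expected label vocabulary."""
    return TEXT_TO_LABEL.get(generated.strip().lower(), 0)
```

The fallback in `parse_prediction` is one reasonable choice; any fixed default biases errors toward that class on malformed generations.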

3. LoRA Variants

  • Variants: Multilingual LoRA, Per-language LoRA, MixLoRA (expert routing)
  • Performance: Most efficient & best overall generalization. Top performer across resource tiers.
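The LoRA update underlying all three variants can be sketched in a few lines: the frozen weight `W` is augmented by a trainable low-rank product `B·A`, scaled by `alpha / r`. This is a pure-Python illustration with a tiny rank, not the project's implementation (which uses rank 8, alpha 16; see Hyperparameters below):

```python
# Minimal LoRA forward pass for one linear layer: y = W x + (alpha / r) * B (A x).
# W is frozen; only A (r x d_in) and B (d_out x r) are trained.

def matvec(M, x):
    """Plain matrix-vector product over nested lists."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    base = matvec(W, x)                  # frozen pretrained path
    low_rank = matvec(B, matvec(A, x))   # trainable low-rank correction
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]
```

Because only `A` and `B` receive gradients, per-language adapters stay cheap to train and swap, which is what makes the Per-language and MixLoRA variants practical.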

4. LLM Prompting (Zero-shot & Few-shot)

  • Performance: Few-shot produces the highest accuracy for English. Useful for languages with strong LLM pretraining signals.
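A few-shot prompt for this task is just an instruction, a handful of labeled demonstrations, and the target text with the label left blank. The sketch below shows the shape; the instruction wording and label names are assumptions, not the exact prompts used in the study:

```python
# Sketch of a few-shot prompt builder for polarization detection.
# Instruction text and label strings are illustrative assumptions.

def build_fewshot_prompt(examples, text):
    """examples: list of (sentence, label_str) demonstration pairs."""
    lines = ["Decide whether each text is polarized or not polarized.", ""]
    for sent, label in examples:
        lines.append(f"Text: {sent}\nLabel: {label}\n")
    lines.append(f"Text: {text}\nLabel:")
    return "\n".join(lines)
```

Zero-shot is the degenerate case with an empty `examples` list; the few-shot gains reported above come from the demonstrations anchoring the label space.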

📊 Key Results

Overall performance across languages (Macro F1):

| Difficulty Tier | mBERT | mT5 | Few-shot LLM | Multilingual LoRA |
|---|---|---|---|---|
| High-resource | 0.70–0.75 | 0.75–0.85 | 0.80 (EN) | 0.82–0.87 |
| Medium-resource | 0.60–0.70 | 0.65–0.78 | 0.70–0.76 | 0.78–0.84 |
| Low-resource | 0.55–0.65 | 0.55–0.70 | 0.60–0.72 | 0.74–0.82 |
| Noisy/Complex | 0.50–0.60 | 0.50–0.68 | 0.65–0.70 | 0.72–0.80 |

🏆 Best Overall Method: Multilingual LoRA

  • Most stable
  • Best cross-lingual transfer
  • Strongest performance on low-resource and noisy/complex languages

⚙️ Training & Implementation

Preprocessing

  • Unicode normalization (NFKC)
  • URLs & mentions masked
  • Emojis, punctuation, emphasis markers preserved
  • mBERT: WordPiece tokenizer
  • mT5: SentencePiece tokenizer
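The normalization and masking steps above can be sketched with the standard library; the mask tokens `[URL]` and `[USER]` are assumptions for illustration:

```python
import re
import unicodedata

# Simple patterns for URLs and @mentions; mask tokens are illustrative.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

def preprocess(text: str) -> str:
    """NFKC-normalize, then mask URLs and @mentions.
    Emojis, punctuation, and emphasis markers are left untouched,
    since they carry polarization signal."""
    text = unicodedata.normalize("NFKC", text)
    text = URL_RE.sub("[URL]", text)
    text = MENTION_RE.sub("[USER]", text)
    return text
```

Masking runs after normalization so that NFKC cannot reintroduce maskable content; tokenization (WordPiece or SentencePiece) is applied downstream by the respective model's tokenizer.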

Hyperparameters

  • mBERT: lr 2e-5–5e-5, batch 16–32, epochs 3–5
  • mT5: lr 1e-4 (Adafactor), batch 8–16
  • LoRA: rank 8, α = 16, lr 3e-4

Evaluation Metric

  • Macro F1, chosen for robustness to class imbalance and for fairness across languages.
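Macro F1 is the unweighted mean of per-class F1 scores, so each class contributes equally regardless of its frequency. A minimal reference implementation:

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of per-class F1. Each class counts equally,
    so minority-class errors are not masked by the majority class."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1s.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1s) / len(f1s)
```

A majority-class predictor illustrates why this matters: always predicting the dominant class can score well on accuracy while its macro F1 collapses, since the minority class contributes an F1 of zero.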

🧠 Key Insights

  • LoRA adapters preserve multilingual knowledge while specializing efficiently.
  • LLM few-shot prompting is strongest when English dominates pretraining.
  • mT5 behaves well on structured languages but is unstable on noisy text.
  • mBERT saturates early, unable to capture complex ideological cues.

⚠️ Ethical Considerations

Polarization detection intersects with sociopolitical sensitivity. Important risks include:

  • Misclassification bias across cultures
  • Possible misuse for censorship
  • Dialect and orthographic bias
  • Privacy concerns in social media data

📌 Citation

If you use this work, please cite the project authors:

Kamal Poshala, Kushi Reddy Kankar, Rohan Mukka
SemEval 2026 – Multilingual Polarization Detection (Task 9)
