This repository contains the full experimental pipeline, models, and analysis for SemEval 2026 Task 9 – Subtask 1: Multilingual Polarization Detection, a binary classification task across 22 languages.
Our study evaluates traditional multilingual transformer baselines, generative encoder–decoder models, parameter-efficient LoRA variants, and both zero-shot and few-shot LLM prompting.
Authors: Kamal Poshala, Kushi Reddy Kankar, and Rohan Mukka.
Polarization involves ideological hostility, adversarial rhetoric, and divisive language. The task requires building models that generalize across heterogeneous linguistic groups, including:
- High-resource languages
- Medium-resource languages
- Low-resource languages
- Noisy / morphologically complex languages
Our project analyzes these challenges and proposes a unified multilingual modeling pipeline.
Our study evaluates four major modeling families:

1. **mBERT (encoder-only baseline)**
   - Encoder-only model with a WordPiece tokenizer
   - Performance: strong on high-resource languages, weaker on noisy text.
2. **mT5 (generative encoder–decoder)**
   - Reformulates classification as text generation
   - Performance: better on structured languages; unstable on noisy or morphologically complex languages.
3. **LoRA (parameter-efficient fine-tuning)**
   - Variants: Multilingual LoRA, Per-language LoRA, MixLoRA (expert routing)
   - Performance: most efficient, with the best overall generalization; top performer across resource tiers.
4. **LLM prompting (zero-shot and few-shot)**
   - Performance: few-shot produces the highest accuracy for English; useful for languages with strong LLM pretraining signals.
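For the prompting family, a minimal sketch of how a few-shot prompt for binary polarization detection might be assembled. The instruction wording, label names, and the two demonstration texts are illustrative assumptions, not the project's actual prompt.

```python
# Sketch: few-shot prompt assembly for binary polarization detection.
# The instruction, labels, and examples are illustrative assumptions.

FEW_SHOT_EXAMPLES = [
    ("They are destroying everything we stand for!", "polarized"),
    ("The city council meets on Tuesday to discuss the budget.", "not polarized"),
]

def build_prompt(text: str) -> str:
    """Concatenate an instruction, the demonstrations, and the query."""
    lines = ["Classify the text as 'polarized' or 'not polarized'.", ""]
    for example_text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Text: {example_text}")
        lines.append(f"Label: {label}")
        lines.append("")
    # The query text, with the label left for the model to complete.
    lines.append(f"Text: {text}")
    lines.append("Label:")
    return "\n".join(lines)
```

In zero-shot mode the same template would simply omit the demonstrations.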
Overall performance across languages (Macro F1):
| Difficulty Tier | mBERT | mT5 | Few-shot LLM | Multilingual LoRA |
|---|---|---|---|---|
| High-resource | 0.70–0.75 | 0.75–0.85 | 0.80 (EN) | 0.82–0.87 |
| Medium-resource | 0.60–0.70 | 0.65–0.78 | 0.70–0.76 | 0.78–0.84 |
| Low-resource | 0.55–0.65 | 0.55–0.70 | 0.60–0.72 | 0.74–0.82 |
| Noisy/Complex | 0.50–0.60 | 0.50–0.68 | 0.65–0.70 | 0.72–0.80 |
🏆 Best Overall Method: Multilingual LoRA
- Most stable
- Best cross-lingual transfer
- Strongest performance in LR/NC languages
- Unicode normalization (NFKC)
- URLs & mentions masked
- Emojis, punctuation, emphasis markers preserved
- mBERT: WordPiece tokenizer
- mT5: SentencePiece tokenizer
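The preprocessing steps above can be sketched as follows; the `[URL]` and `[USER]` mask tokens are assumptions, not necessarily the exact tokens used in the pipeline.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")

def preprocess(text: str) -> str:
    """Sketch of the pipeline's text normalization: NFKC Unicode
    normalization, masking of URLs and @-mentions, while leaving
    emojis, punctuation, and emphasis markers untouched."""
    text = unicodedata.normalize("NFKC", text)  # e.g. fullwidth "！" -> "!"
    text = URL_RE.sub("[URL]", text)
    text = MENTION_RE.sub("[USER]", text)
    return text
```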
- mBERT: lr 2e-5–5e-5, batch 16–32, epochs 3–5
- mT5: lr 1e-4 (Adafactor), batch 8–16
- LoRA: rank 8, α = 16, lr 3e-4
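A minimal numeric sketch of what a LoRA-adapted linear layer computes with the hyperparameters above (rank r = 8, α = 16). The layer width and random initialization are illustrative; the scaling convention h = Wx + (α/r)·BAx follows the original LoRA formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 8, 16  # d is an illustrative hidden size

W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-initialized

def lora_forward(x: np.ndarray) -> np.ndarray:
    # Adapted output: W x + (alpha / r) * B A x
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d)
# With B zero-initialized, the adapter starts as an exact no-op,
# so fine-tuning begins from the frozen model's behavior.
assert np.allclose(lora_forward(x), W @ x)
```

Only A and B (2·r·d parameters per adapted matrix) are trained, which is why the per-language and mixture variants stay cheap.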
- Primary metric: Macro F1, chosen for robustness to class imbalance and for fair averaging across languages.
- LoRA adapters preserve multilingual knowledge while specializing efficiently.
- LLM few-shot prompting is strongest when English dominates pretraining.
- mT5 behaves well on structured languages but is unstable on noisy text.
- mBERT saturates early, unable to capture complex ideological cues.
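The Macro F1 metric used above can be sketched in a few lines: F1 is computed per class and then averaged without weighting, so the minority class counts equally. This mirrors scikit-learn's `f1_score(average="macro")` behavior.

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Macro-averaged F1: per-class F1, then the unweighted mean."""
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```

A classifier that always predicts the majority class gets a high accuracy on an imbalanced set but a low macro F1, since the minority class contributes an F1 of 0 to the average.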
Polarization detection intersects with sociopolitical sensitivity. Important risks include:
- Misclassification bias across cultures
- Possible misuse for censorship
- Dialect and orthographic bias
- Privacy concerns in social media data
If you use this work, please cite the project authors:
Kamal Poshala, Kushi Reddy Kankar, Rohan Mukka
SemEval 2026 – Multilingual Polarization Detection (Task 9)