This project focuses on training a robust toxicity classification model that can withstand adversarial attacks. The training pipeline includes:
- Baseline Classifier: Fine-tuned transformer-based model.
- Adversarial Perturbations: Applying adversarial attacks on text inputs.
- Reinforcement Learning (RL) Fine-Tuning: Using RL to improve robustness.
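To make the second bullet concrete: an adversarial perturbation on text can be as simple as a character-level edit that leaves the comment readable to a human but changes the model's tokenization. The sketch below is purely illustrative (the project's actual attack lives in the `src/` scripts, and this helper is not part of the repo):

```python
import random

def perturb(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Swap two adjacent characters inside randomly chosen words to
    simulate a simple character-level adversarial perturbation
    (illustrative only -- not the project's attack)."""
    rng = random.Random(seed)
    out = []
    for w in text.split():
        if len(w) > 3 and rng.random() < rate:
            i = rng.randrange(1, len(w) - 2)
            w = w[:i] + w[i + 1] + w[i] + w[i + 2:]  # swap chars i and i+1
        out.append(w)
    return " ".join(out)

print(perturb("this comment is completely harmless", rate=1.0))
```

A robust classifier should assign roughly the same toxicity score to the original and the perturbed text.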
First, ensure you have Python 3.8+ installed. Then manually install the required libraries:

```bash
pip install transformers datasets torch torchmetrics accelerate tqdm kaggle spacy
```

The project relies on SpaCy's word embeddings. Run the following commands to download the required models:
```bash
python -m spacy download en_core_web_lg
python -m spacy download en_core_web_md
```

To download the dataset from Kaggle, you need to configure your Kaggle API key:
- Go to Kaggle API Tokens and download `kaggle.json`.
- Move it to the correct location:

```bash
mkdir -p ~/.kaggle
mv path/to/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json
```
Download the dataset using:
```bash
kaggle competitions download -c jigsaw-toxic-comment-classification-challenge --force
unzip -o jigsaw-toxic-comment-classification-challenge.zip -d ./jigsaw_toxicity_data
unzip -o ./jigsaw_toxicity_data/train.csv.zip -d ./jigsaw_toxicity_data
unzip -o ./jigsaw_toxicity_data/test.csv.zip -d ./jigsaw_toxicity_data
unzip -o ./jigsaw_toxicity_data/test_labels.csv.zip -d ./jigsaw_toxicity_data
```

Train the baseline classifier:

```bash
python ./src/train.py --data_directory ./jigsaw_toxicity_data --model_path ./models/classifier.pt
```

Evaluate the model, including under adversarial perturbations:

```bash
python ./src/evaluate.py --data_directory ./jigsaw_toxicity_data --model_path ./models/classifier.pt --adversarial
```

Run RL fine-tuning for robustness:

```bash
python ./src/rl_policy.py --data_directory ./jigsaw_toxicity_data --classifier_model_path ./models/classifier.pt --policy_model_path ./models/policy.pt
```

📂 Safety-Alignment-Classifier
```
│── 📂 jigsaw_toxicity_data   # Contains dataset files
│── 📂 models                 # Saved models
│── 📂 src                    # Training & inference scripts
│   │── dataset.py            # Dataset creation
│   │── model.py              # Model creation
│   │── train.py              # Baseline classifier training
│   │── evaluate.py           # Model evaluation
│   │── rl_policy.py          # RL fine-tuning and evaluation for robustness
```
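Once unzipped, `train.csv` holds an `id`, a `comment_text`, and six binary label columns (`toxic`, `severe_toxic`, `obscene`, `threat`, `insult`, `identity_hate`). A minimal sketch of reading those multi-label targets with the standard library (illustrative only; `src/dataset.py` is the project's actual loader):

```python
import csv
import io

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def load_rows(fp):
    """Read (comment_text, multi-hot label vector) pairs from a
    Jigsaw-style CSV file object."""
    reader = csv.DictReader(fp)
    return [(row["comment_text"], [int(row[l]) for l in LABELS]) for row in reader]

# Tiny in-memory sample with the same header as the real train.csv.
sample = io.StringIO(
    "id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate\n"
    '"0001","a perfectly fine comment",0,0,0,0,0,0\n'
    '"0002","a rude comment",1,0,1,0,1,0\n'
)
rows = load_rows(sample)
print(rows[1])  # → ('a rude comment', [1, 0, 1, 0, 1, 0])
```

Each label vector is a multi-hot target, so the baseline classifier is trained as a multi-label (not multi-class) problem.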