SynthDetectives@ALTA2023 Stacking the Odds: Transformer-based Ensemble for AI-generated Text Detection
Repository for the paper "Stacking the Odds: Transformer-Based Ensemble for AI-Generated Text Detection"
Stacking ensemble of Transformers trained to detect AI-generated text for the ALTA Shared Task 2023.
Abstract: This paper reports our submission under the team name 'SynthDetectives' to the ALTA 2023 Shared Task. We use a stacking ensemble of Transformers for the task of AI-generated text detection. Our approach is novel in terms of its choice of models in that we use accessible and lightweight models in the ensemble. We show that ensembling the models results in an improved accuracy in comparison with using them individually. Our approach achieves an accuracy score of 0.9555 on the official test data provided by the shared task organisers.
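The core idea of stacking can be sketched in a few lines: each weak learner emits a probability that a document is machine-generated, and a logistic-regression meta-learner is fit on those stacked outputs. The snippet below is an illustrative toy with simulated probabilities, not the paper's actual pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy illustration of stacking: each row holds the P(machine-generated)
# scores from four hypothetical weak learners (standing in for ALBERT,
# ELECTRA, RoBERTa, XLNet) for one document.
rng = np.random.default_rng(0)
n = 200
labels = rng.integers(0, 2, size=n)  # 0 = human, 1 = machine
# Simulate weak-learner probabilities noisily correlated with the label.
weak_probs = np.clip(
    labels[:, None] * 0.6 + rng.normal(0.2, 0.2, size=(n, 4)), 0.0, 1.0
)

# The meta-learner is a logistic regression over the stacked outputs.
meta = LogisticRegression().fit(weak_probs, labels)
print(meta.score(weak_probs, labels))
```

Because the meta-learner sees all four scores at once, it can weight the more reliable weak learners higher than any single model's output.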
Dataset (`dataset` folder)

The dataset is provided by the ALTA 2023 Shared Task on CodaLab:
- `training.json` - 18k labelled training examples, evenly split between human- and machine-generated text
- `validation_data.json` - 2k validation examples without labels
- `validation_sample_output.json` - dummy output for the 2k validation examples, for reference on the output format
- `test_data.json` - 2k test examples used for leaderboard scoring on CodaLab
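For a quick first look at the data, the files can be loaded with the standard `json` module. The field names used below (`id`, `text`, `label`) are assumptions for illustration only — check them against the actual shared-task files; the snippet writes a tiny sample file so it is self-contained:

```python
import json
from collections import Counter

# Hypothetical records mimicking the shared-task JSON layout; the field
# names ("id", "text", "label") are assumptions, not the confirmed schema.
sample = [
    {"id": 0, "text": "An example sentence.", "label": "human"},
    {"id": 1, "text": "Another example sentence.", "label": "machine"},
]
with open("sample_training.json", "w") as f:
    json.dump(sample, f)

# Load it back and count labels, as one would to verify the even split.
with open("sample_training.json") as f:
    records = json.load(f)
label_counts = Counter(r["label"] for r in records)
print(len(records), dict(label_counts))
```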
Source code (`src` folder)

- `helper.py` - helper functions for EDA and the model files
- `model.py` - model architecture and data loading
- `eda.ipynb` - EDA notebook (all cells are pre-run)
- `build_embeddings.py` - builds and saves embeddings for each Transformer model on the training set
- `train_weak_learners.py` - trains the weak learners on the embeddings
- `train_meta_learner.py` - trains the meta-learner on the weak learners' predictions over the dataset embeddings
- `inference.py` - performs inference using the ensemble
Training was done with Python >= 3.8.10 on Google Cloud Platform (Vertex AI) with an NVIDIA A100 (40 GB) GPU; it was also previously tested with a GeForce RTX 3060 on WSL2 Ubuntu. The configurations detailed below work out of the box on an NVIDIA A100 (40 GB); on less performant GPUs, decrease `BATCH_SIZE`. All adjustable parameters are defined as constants at the top of the model files; in particular, you can change `BATCH_SIZE` and `NUM_EPOCH` in `train_weak_learners.py` and `train_meta_learner.py`.
Training

- Run `pip install -r requirements.txt`.
- Ensure `training.json` exists in the `dataset` folder.
- Run `python build_embeddings.py` to build the `[CLS]` embedding of the last hidden layer for the dataset using all Transformers (ALBERT, ELECTRA, RoBERTa, XLNet). If your GPU has limited memory, reduce the `load_batch_size` in `src/model.py:TransformerModel.dataset`. This produces an embeddings `.pt` file, `pretrained--dev=False--model=MODEL.pt`, for each of the Transformer `MODEL` variants above.
- Run `python train_weak_learners.py` to train each Transformer weak learner using the previously produced embeddings. This saves the best weights for each weak learner at `lightning_logs/version_VERSION/checkpoints/model=MODEL--dev=False--epoch=EPOCH-step=STEP--val_loss=VAL_LOSS.ckpt`.
- Update the `checkpoints` array in `train_meta_learner.py` with the best-weight path of each weak learner produced in the previous step. Note that the checkpoints must follow this order: ALBERT, ELECTRA, RoBERTa, XLNet.
- Run `python train_meta_learner.py` to train the meta-learner logistic regression classifier on top of the weak learners' best weights. This saves the best weight of the meta-learner.
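The `[CLS]`-embedding step above can be sketched with the Hugging Face `transformers` API. This is not the repository's code: a tiny, randomly initialised ALBERT stands in for the pretrained ALBERT/ELECTRA/RoBERTa/XLNet checkpoints so the example runs offline:

```python
import torch
from transformers import AlbertConfig, AlbertModel

# Tiny randomly initialised ALBERT as a stand-in for the real pretrained
# checkpoints; sizes are arbitrary and chosen only to keep the sketch fast.
config = AlbertConfig(
    hidden_size=64, num_hidden_layers=2, num_attention_heads=4,
    intermediate_size=128, embedding_size=32,
)
model = AlbertModel(config).eval()

# Dummy token ids in place of real tokenised documents.
input_ids = torch.randint(0, config.vocab_size, (2, 16))
attention_mask = torch.ones_like(input_ids)
with torch.no_grad():
    out = model(input_ids=input_ids, attention_mask=attention_mask)

# The first token of the last hidden layer is the [CLS] embedding.
cls_embeddings = out.last_hidden_state[:, 0, :]
torch.save(cls_embeddings, "embeddings.pt")
print(cls_embeddings.shape)
```

The saved tensor plays the role of the per-model embeddings file that the weak learners are then trained on.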
Inference

- Ensure `test_data.json` exists in the `dataset` folder.
- Ensure you have the weights for each weak learner and for the meta-learner from the training step.
- Update the `checkpoints` array (for the weak learners) and `lr_checkpoint_path` (for the meta-learner) in `inference.py`.
- Run `python inference.py`. This produces `answer.json`, which contains the inference output.
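The final artefact is a JSON file of predictions. The sketch below only illustrates the output-writing step; the exact schema (`id`, `label`) is an assumption — `validation_sample_output.json` in the `dataset` folder shows the required format:

```python
import json

# Hypothetical predictions; the field names and label encoding
# (1 = machine-generated, 0 = human-written) are assumptions — follow
# validation_sample_output.json for the real required format.
predictions = [
    {"id": 0, "label": 1},
    {"id": 1, "label": 0},
]
with open("answer.json", "w") as f:
    json.dump(predictions, f)

# Reload to confirm the file is valid JSON.
with open("answer.json") as f:
    print(json.load(f))
```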
If our work is useful for yours, you can cite us with the following BibTeX entry:
@misc{nguyen2023stacking,
title={Stacking the Odds: Transformer-Based Ensemble for AI-Generated Text Detection},
author={Duke Nguyen and Khaing Myat Noe Naing and Aditya Joshi},
year={2023},
eprint={2310.18906},
archivePrefix={arXiv},
primaryClass={cs.CL}
}