🎉 Accepted at EMNLP 2025 (Main)! 🎉 [ACL Anthology](https://aclanthology.org/2025.emnlp-main.1506)
This repository contains the official implementation of the paper "Machine-generated text detection prevents language model collapse" by George Drayson, Emine Yilmaz, and Vasileios Lampos.
As Large Language Models (LLMs) become increasingly prevalent, their generated outputs are proliferating across the web, risking a future where machine-generated content dilutes human-authored text. This project investigates the impact of decoding strategy on model collapse and proposes an importance sampling approach to alleviate model collapse.
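The core idea of detector-driven importance sampling can be sketched as follows. This is a minimal illustration, not the paper's implementation: the detector scores here are toy values standing in for a trained machine-generated-text detector, and the specific weighting scheme is an assumption.

```python
import random

def importance_resample(samples, scores, k, seed=0):
    """Resample k texts, weighting each text by its detector score.

    `scores` are assumed to be the detector's probability that a text is
    human-written; texts that look machine-generated are down-weighted,
    so the resampled corpus drifts back toward human-like text.
    """
    rng = random.Random(seed)
    return rng.choices(samples, weights=scores, k=k)

# Toy corpus: (text, hypothetical detector score = P(human-written))
corpus = [("human-like text", 0.9), ("machine-like text", 0.1)]
texts = [t for t, _ in corpus]
scores = [s for _, s in corpus]

resampled = importance_resample(texts, scores, k=1000)
```

With these weights, roughly 90% of the resampled corpus is the "human-like" text, which is the mechanism by which the training data is steered away from accumulating machine-generated content.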
```
.
├── config/             # Configuration files
├── data/               # Dataset directory
├── src/                # Source code
│   ├── train.py        # Training script
│   ├── generate.py     # Generation script
│   ├── load_data.py    # Load all dataset files
│   └── utils/          # Utility functions
├── requirements.txt    # Python dependencies
└── main.py             # Main training loop
```
- Create and activate a virtual environment:

```shell
python -m venv .venv
source .venv/bin/activate  # On Windows, use `.venv\Scripts\activate`
```

- Install transformers from source:

```shell
pip install git+https://github.com/huggingface/transformers
```

- Install the remaining requirements:

```shell
pip install -r requirements.txt
```

- Load and prepare the dataset:

```shell
python src/load_data.py
```

To start the recursive training process:

```shell
python main.py
```

The training process can be customised using different configuration files (see the Configuration section below).
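At a high level, the recursive training process can be sketched as the loop below. The function bodies are hypothetical stubs for illustration only; the real pipeline is driven by `main.py`, which calls `src/train.py`, `src/generate.py`, and the detector according to the Hydra configuration.

```python
# Sketch of the recursive training loop. All three functions are
# hypothetical stand-ins, not the repository's actual API.

def train_model(corpus):
    # Stub: fine-tune the language model on `corpus`.
    return {"trained_on": len(corpus)}

def generate_corpus(model, n):
    # Stub: sample n texts from the trained model.
    return [f"generated-{i}" for i in range(n)]

def detector_filter(texts):
    # Stub: keep only texts the detector judges human-like.
    return texts[: len(texts) // 2]

corpus = [f"human-{i}" for i in range(100)]
for generation in range(3):
    model = train_model(corpus)
    synthetic = generate_corpus(model, n=100)
    # Each new generation is trained on detector-filtered synthetic
    # data, which is what counteracts model collapse.
    corpus = detector_filter(synthetic)
```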
The project uses Hydra for configuration management. Key configuration files are located in the config/ directory:
- `config.yaml`: Main configuration file
- `decoding/`: Decoding-specific configurations
- `model/`: Model-specific configurations
- `detector/`: Detector-specific configurations
- `train/`: Training hyperparameters
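Given this layout, the top-level `config.yaml` might look roughly like the following Hydra defaults list. The group values and keys shown here are assumptions for illustration, not copied from the repository:

```yaml
# Hypothetical sketch of config/config.yaml; names are assumptions.
defaults:
  - model: base
  - decoding: top_p
  - detector: modernbert
  - train: default

train:
  batch_size: 16
  num_train_epochs: 5
```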
You can override any configuration parameter from the command line:
```shell
# Change training parameters
python main.py train.batch_size=16 train.num_train_epochs=5

# Modify decoding strategy
python main.py decoding=beam_search

# Use a different detector
python main.py detector.model_path=detectors/custom_detector
```

For more details on configuration options, see the Hydra documentation.
The trained detector is available on the Hugging Face Hub at `GeorgeDrayson/modernbert-ai-detection`. It is a fine-tuned version of ModernBERT-base trained on the MAGE dataset for machine-generated text detection.
- Model Size: 150M parameters
- Base Model: answerdotai/ModernBERT-base
- Dataset: yaful/MAGE
- Task: Text Classification
The project uses Weights & Biases for experiment tracking. Results, metrics, and artifacts are automatically logged during training. To view your results, set your API key:

```shell
export WANDB_API_KEY=your_key_here
```
This project is licensed under the Apache License 2.0; see the LICENSE file for details.
For questions about the code or paper, please open an issue in this repository.
If you use this code in your research, please cite our paper:
```bibtex
@inproceedings{drayson2025machine,
  title     = {Machine-generated text detection prevents language model collapse},
  author    = {Drayson, George and Yilmaz, Emine and Lampos, Vasileios},
  booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  year      = {2025},
  url       = {https://aclanthology.org/2025.emnlp-main.1506}
}
```