🎉 Accepted at EMNLP 2025 (Main)! 🎉 [ACL Anthology](https://aclanthology.org/2025.emnlp-main.1506)
This repository contains the official implementation of the paper "Machine-generated text detection prevents language model collapse" by George Drayson, Emine Yilmaz, and Vasileios Lampos.
As Large Language Models (LLMs) become increasingly prevalent, their generated outputs are proliferating across the web, risking a future where machine-generated content dilutes human-authored text. This project investigates the impact of decoding strategy on model collapse and proposes an importance sampling approach to alleviate model collapse.
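The core idea of detector-driven importance sampling can be sketched as follows. This is a minimal illustration, not the paper's implementation: the detector scores here are toy values standing in for a trained machine-generated-text detector, and the specific weighting scheme is an assumption.

```python
import random

def importance_resample(samples, scores, k, seed=0):
    """Resample k texts, weighting each text by its detector score.

    `scores` are assumed to be the detector's probability that a text is
    human-written; texts that look machine-generated are down-weighted,
    so the resampled corpus drifts back toward human-like text.
    """
    rng = random.Random(seed)
    return rng.choices(samples, weights=scores, k=k)

# Toy corpus: (text, hypothetical detector score = P(human-written))
corpus = [("human-like text", 0.9), ("machine-like text", 0.1)]
texts = [t for t, _ in corpus]
scores = [s for _, s in corpus]

resampled = importance_resample(texts, scores, k=1000)
```

With these weights, roughly 90% of the resampled corpus is the "human-like" text, which is the mechanism by which the training data is steered away from accumulating machine-generated content.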
```
.
├── config/             # Configuration files
├── data/               # Dataset directory
├── src/                # Source code
│   ├── train.py        # Training script
│   ├── generate.py     # Generation script
│   ├── load_data.py    # Load all dataset files
│   └── utils/          # Utility functions
├── requirements.txt    # Python dependencies
└── main.py             # Main training loop
```
- Create and activate a virtual environment:

```shell
python -m venv .venv
source .venv/bin/activate  # On Windows, use `.venv\Scripts\activate`
```

- Install transformers from source:

```shell
pip install git+https://github.com/huggingface/transformers
```

- Install the remaining requirements:

```shell
pip install -r requirements.txt
```

- Load and prepare the dataset:

```shell
python src/load_data.py
```

To start the recursive training process:

```shell
python main.py
```

The training process can be customised using different configuration files (see the Configuration section below).
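At a high level, the recursive training process can be sketched as the loop below. The function bodies are hypothetical stubs for illustration only; the real pipeline is driven by `main.py`, which calls `src/train.py`, `src/generate.py`, and the detector according to the Hydra configuration.

```python
# Sketch of the recursive training loop. All three functions are
# hypothetical stand-ins, not the repository's actual API.

def train_model(corpus):
    # Stub: fine-tune the language model on `corpus`.
    return {"trained_on": len(corpus)}

def generate_corpus(model, n):
    # Stub: sample n texts from the trained model.
    return [f"generated-{i}" for i in range(n)]

def detector_filter(texts):
    # Stub: keep only texts the detector judges human-like.
    return texts[: len(texts) // 2]

corpus = [f"human-{i}" for i in range(100)]
for generation in range(3):
    model = train_model(corpus)
    synthetic = generate_corpus(model, n=100)
    # Each new generation is trained on detector-filtered synthetic
    # data, which is what counteracts model collapse.
    corpus = detector_filter(synthetic)
```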
The project uses Hydra for configuration management. Key configuration files are located in the config/ directory:
- `config.yaml`: Main configuration file
- `decoding/`: Decoding-specific configurations
- `model/`: Model-specific configurations
- `detector/`: Detector-specific configurations
- `train/`: Training hyperparameters
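Given this layout, the top-level `config.yaml` might look roughly like the following Hydra defaults list. The group values and keys shown here are assumptions for illustration, not copied from the repository:

```yaml
# Hypothetical sketch of config/config.yaml; names are assumptions.
defaults:
  - model: base
  - decoding: top_p
  - detector: modernbert
  - train: default

train:
  batch_size: 16
  num_train_epochs: 5
```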
You can override any configuration parameter from the command line:
```shell
# Change training parameters
python main.py train.batch_size=16 train.num_train_epochs=5

# Modify decoding strategy
python main.py decoding=beam_search

# Use a different detector
python main.py detector.model_path=detectors/custom_detector
```

For more details on configuration options, see the Hydra documentation.
The trained detector is available on the Hugging Face Hub at `GeorgeDrayson/modernbert-ai-detection`. It is a fine-tuned version of ModernBERT-base trained on the MAGE dataset for machine-generated text detection.
- Model Size: 150M parameters
- Base Model: answerdotai/ModernBERT-base
- Dataset: yaful/MAGE
- Task: Text Classification
The project uses Weights & Biases for experiment tracking. Results, metrics, and artifacts are automatically logged during training. To view your results, set your API key:

```shell
export WANDB_API_KEY=your_key_here
```
This project is licensed under the Apache License 2.0; see the LICENSE file for details.
For questions about the code or paper, please open an issue in this repository.
If you use this code in your research, please cite our paper:
```bibtex
@inproceedings{drayson2025machine,
  title     = {Machine-generated text detection prevents language model collapse},
  author    = {Drayson, George and Yilmaz, Emine and Lampos, Vasileios},
  booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  year      = {2025},
  url       = {https://aclanthology.org/2025.emnlp-main.1506}
}
```