
Machine-Generated Text Detection Prevents Language Model Collapse


🎉 Accepted at EMNLP 2025 (Main)! 🎉 Paper available in the ACL Anthology.

This repository contains the official implementation of the paper "Machine-generated text detection prevents language model collapse" by George Drayson, Emine Yilmaz, and Vasileios Lampos.

Overview

As Large Language Models (LLMs) become increasingly prevalent, their generated outputs are proliferating across the web, risking a future where machine-generated content dilutes human-authored text. This project investigates the impact of the decoding strategy on model collapse and proposes an importance sampling approach, guided by machine-generated text detection, to alleviate it.
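To make the resampling idea concrete, below is a minimal sketch of detector-guided importance sampling, assuming the detector returns a probability that each text is human-written. This is an illustration of the general technique, not the paper's exact weighting scheme:

import random

def resample_by_detector(texts, human_probs, k):
    """Draw k texts, weighting each candidate by the detector's
    estimated probability that it is human-written."""
    # random.choices normalises the weights internally.
    return random.choices(texts, weights=human_probs, k=k)

# e.g. training_texts = resample_by_detector(generated_texts, detector_scores, k=10_000)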

Repository Structure

.
├── config/             # Configuration files
├── data/               # Dataset directory
├── src/                # Source code
│   ├── train.py        # Training script
│   ├── generate.py     # Generation script
│   ├── load_data.py    # Load all dataset files
│   └── utils/          # Utility functions
├── requirements.txt    # Python dependencies
└── main.py             # Main training loop

Installation

  1. Create and activate a virtual environment:
python -m venv .venv
source .venv/bin/activate  # On Windows, use `.venv\Scripts\activate`
  2. Install transformers from source:
pip install git+https://github.com/huggingface/transformers
  3. Install the other requirements:
pip install -r requirements.txt
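To confirm that the source install of transformers is picked up, you can check the installed version:

python -c "import transformers; print(transformers.__version__)"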

Quickstart

  1. Load and prepare the dataset:
python src/load_data.py
  2. Start the recursive training process:
python main.py

The training process can be customised using different configuration files (see Configuration section below).

Configuration

The project uses Hydra for configuration management. Key configuration files are located in the config/ directory (a minimal entry-point sketch follows this list):

  • config.yaml: Main configuration file
  • decoding/: Decoding-specific configurations
  • model/: Model-specific configurations
  • detector/: Detector-specific configurations
  • train/: Training hyperparameters
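As a reference for how these files fit together, here is a hypothetical sketch of a Hydra entry point in the style of main.py; the actual function name and configuration keys in this repository may differ:

import hydra
from omegaconf import DictConfig, OmegaConf

# Compose config/config.yaml with any command-line overrides into one object.
@hydra.main(config_path="config", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    print(OmegaConf.to_yaml(cfg))  # inspect the fully resolved configuration

if __name__ == "__main__":
    main()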

Running with Different Configurations

You can override any configuration parameter from the command line:

# Change training parameters
python main.py train.batch_size=16 train.num_train_epochs=5

# Modify decoding strategy
python main.py decoding=beam_search

# Use a different detector
python main.py detector.model_path=detectors/custom_detector

For more details on configuration options, see the Hydra documentation.
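Hydra also supports sweeping over several values in a single call with --multirun. The option values below are illustrative; use names that actually exist under config/:

# Sweep over two decoding strategies and two batch sizes (illustrative values)
python main.py --multirun decoding=beam_search,top_p train.batch_size=8,16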

Model

The trained model is available on the Hugging Face Hub at GeorgeDrayson/modernbert-ai-detection. It is a fine-tuned version of ModernBERT-base trained on the MAGE dataset for machine-generated text detection; a short usage sketch follows the details below.

Model Details

  • Model Size: 150M parameters
  • Base Model: answerdotai/ModernBERT-base
  • Dataset: yaful/MAGE
  • Task: Text Classification
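For reference, the detector can be loaded with the standard transformers pipeline API. This is a minimal sketch; the label names in the output are model-specific, so check the model card:

from transformers import pipeline

# Load the released detector from the Hugging Face Hub.
detector = pipeline("text-classification", model="GeorgeDrayson/modernbert-ai-detection")

print(detector("The quick brown fox jumps over the lazy dog."))
# e.g. [{'label': '...', 'score': 0.99}] -- label names vary by model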

Experiment Tracking

The project uses Weights & Biases for experiment tracking. Results, metrics, and artifacts are automatically logged during training. To enable logging, set your API key first:

export WANDB_API_KEY=your_key_here

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Contact

For questions about the code or paper, please open an issue in this repository.

Citation

If you use this code in your research, please cite our paper:

@inproceedings{drayson2025machine,
  title={Machine-generated text detection prevents language model collapse},
  author={Drayson, George and Yilmaz, Emine and Lampos, Vasileios},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  year={2025},
  url={https://aclanthology.org/2025.emnlp-main.1506}
}
