Mislabeled or ambiguously labeled samples in the training set can degrade the performance of deep models, so diagnosing the dataset and identifying mislabeled samples helps to improve generalization. Training dynamics, i.e., the traces left by iterations of optimization algorithms, have recently proven effective for localizing mislabeled samples with hand-crafted features. In this paper, going beyond manually designed features, we introduce a novel learning-based solution that leverages a noise detector, instantiated as an LSTM network, which learns to predict whether a sample was mislabeled using the raw training dynamics as input. Specifically, the proposed method trains the noise detector in a supervised manner on a dataset with synthesized label noise, and the detector can then adapt to various datasets (with either natural or synthesized label noise) without retraining. We conduct extensive experiments to evaluate the proposed method. We train the noise detector on label-noised CIFAR datasets and test it on Tiny ImageNet, CUB-200, Caltech-256, WebVision, and Clothing1M. Results show that the proposed method precisely detects mislabeled samples on various datasets without further adaptation and outperforms state-of-the-art methods. Further experiments demonstrate that mislabel identification can guide label correction, namely data debugging, providing improvements orthogonal to algorithm-centric state-of-the-art techniques from the data perspective.
- pytorch >= 1.7.1
- torchvision >= 0.4
- scikit-learn
- numpy
- pandas
```
mislabel-detection/
├── runner.py                    # Core pipeline: dataset creation, training, evaluation, training dynamics generation
├── train_detector.py            # Train LSTM noise detector on training dynamics
├── losses.py                    # Loss functions (cross-entropy, Reed soft/hard)
├── util.py                      # Utilities: AverageMeter, Welford statistics, training dynamics I/O
├── models/                      # Classification model architectures
│   ├── resnet.py                # ResNet (various depths)
│   ├── wide_resnet.py           # WideResNet
│   ├── densenet.py              # DenseNet
│   ├── PreResNet.py             # Pre-activation ResNet
│   ├── vgg.py                   # VGG
│   ├── conv4.py                 # Conv4
│   └── lenet.py                 # LeNet / LeNetMNIST
├── detector_model/              # Noise detector
│   ├── lstm.py                  # LSTM binary classifier
│   └── predict.py               # Ranking & evaluation (get_order)
├── data_debug_dm/               # Data debugging experiments (Chapter 4.3)
│   ├── Train_cub.py             # CUB-200 co-training with MixMatch
│   ├── Train_webvision.py       # WebVision co-training with MixMatch
│   ├── dataloader_cub.py        # CUB dataloader with noise injection
│   ├── dataloader_webvision.py  # WebVision dataloader
│   ├── cub_200_201/             # CUB-200 noise/repair label files
│   └── mini_webvision/          # WebVision repair label files
└── run.sh                       # Unified experiment runner (all commands)
```
The project follows a 3-phase pipeline. All experiments are launched via `./run.sh <command>`:

- **Generate Training Dynamics** (`./run.sh generate_td`): Train a classification model and record per-sample prediction probabilities across all epochs.
- **Train Noise Detector** (`python train_detector.py`): Train an LSTM to classify samples as clean or mislabeled, using the training dynamics as input sequences.
- **Denoise & Retrain** (`./run.sh denoise_small` / `./run.sh denoise_large`): Use the trained detector to rank and remove suspected mislabeled samples, then retrain on the cleaned data.
- **Data Debugging** (`./run.sh train_cub` / `./run.sh train_webvision`): Label correction experiments using co-training with MixMatch (Chapter 4.3).

Run `./run.sh help` for full usage information. Set `TESTRUN=1` to print commands without executing.
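To make the data flow between the phases concrete, the sketch below shows how recorded dynamics of shape `[n_samples, n_classes, n_epochs]` can be rearranged into per-sample input sequences for a sequence model such as the LSTM detector. This is an illustrative numpy snippet, not the repo's actual preprocessing code:

```python
import numpy as np

# Illustrative shapes: 6 samples, 3 classes, 5 epochs of recorded probabilities
n_samples, n_classes, n_epochs = 6, 3, 5
rng = np.random.default_rng(0)
td = rng.random((n_samples, n_classes, n_epochs))
td /= td.sum(axis=1, keepdims=True)  # normalize each epoch's vector to a distribution

# A sequence model (e.g. an LSTM) expects [batch, time, features]: each sample
# becomes an n_epochs-long sequence of n_classes-dimensional probability vectors.
sequences = np.transpose(td, (0, 2, 1))
print(sequences.shape)  # (6, 5, 3)
```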
We run experiments on 5 small datasets:
- CIFAR-10
- CIFAR-100
- Tiny ImageNet
- CUB-200-2011
- Caltech-256
... and 2 large datasets:
- WebVision50 (subset of WebVision)
- Clothing100K (subset of Clothing1M)
We use the same subsets as AUM for these two large datasets. Click Here to download the archive, then untar it to access CUB-200-2011 and Caltech-256.
- Acquire metadata and training dynamics for manually corrupted or real-world datasets.
- Train an LSTM model as a noise detector.
- Retrain a new model on cleaned data, evaluating:
  - Metrics of label noise detection on synthesized datasets (CIFAR-10/100, Tiny ImageNet) and retraining on clean data
  - Less overfitting to noisy labels on real-world datasets (WebVision50 and Clothing100K)
```shell
./run.sh generate_td <datadir> <dataset> <seed> <noise_ratio> <noise_type> <net_type> <depth>

# Generate td for small datasets [no manual corruption]
CUDA_VISIBLE_DEVICES=0 ./run.sh generate_td "/root/codespace/datasets" "cifar10" 1 0. "uniform" "resnet" 32

# Generate td for small datasets [uniform 0.2 noise]
CUDA_VISIBLE_DEVICES=0 ./run.sh generate_td "/root/codespace/datasets" "tiny_imagenet" 1 0.2 "uniform" "resnet" 32

# Generate td for large datasets [noise_ratio and noise_type are ignored]
CUDA_VISIBLE_DEVICES=0 ./run.sh generate_td "/root/codespace/datasets" "webvision50" 1 0. "uniform" "resnet" 50
```

Arguments:
- `<datadir>` - Path to datasets folder, structured as:
  ```
  datasets/
  ├── cifar10/
  │   └── cifar-10-batches-py/
  │       ├── data_batch_1
  │       └── ...
  └── cifar100/
      └── cifar-100-python/
          ├── meta
          └── ...
  ```
- `<dataset>` - Which dataset to use (default: `cifar10`)
- `<seed>` - Random seed (default: `0`)
- `<noise_type>` - Noise type: `uniform` (symmetric) or `flip` (asymmetric) (default: `uniform`)
- `<noise_ratio>` - Fraction of labels to corrupt (default: `0.2`)
- `<net_type>` - Model architecture, see `models/` (default: `resnet`)
- `<depth>` - Model depth, e.g. 32 for ResNet-32 (default: `32`)
The script calls the class `_Dataset` (in `runner.py`) to corrupt the dataset with the given seed, noise type, and noise ratio, then calls `Runner.train_for_td_computation` to save metadata and record training dynamics.
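For intuition, uniform (symmetric) corruption can be sketched as follows. This is a minimal illustration of the noise model, not the repo's `_Dataset` implementation; the function name `corrupt_uniform` and the convention that a flipped label always changes class are assumptions for this sketch:

```python
import numpy as np

def corrupt_uniform(labels, noise_ratio, n_classes, seed=0):
    """Flip a noise_ratio fraction of labels to a different, uniformly chosen class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    n = len(labels)
    flip_idx = rng.choice(n, size=int(noise_ratio * n), replace=False)
    for i in flip_idx:
        # Draw from the other n_classes - 1 classes so the label actually changes
        choices = np.delete(np.arange(n_classes), labels[i])
        labels[i] = rng.choice(choices)
    flipped = np.zeros(n, dtype=bool)
    flipped[flip_idx] = True
    return labels, flipped  # `flipped` plays the role of metadata's label_flipped

true_labels = np.arange(10) % 5
noisy_labels, flipped = corrupt_uniform(true_labels, 0.4, n_classes=5, seed=1)
print(flipped.sum())  # 4 of 10 labels corrupted
```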
Output (saved to `computation4td_seed{seed}/`):

| File | Description |
|---|---|
| `model.pth` | Best model (early-stopped) |
| `model.pth.last` | Last-epoch model |
| `train_log.csv` | Training log (epoch, train_error, train_loss, valid_error, valid_top5_error, valid_loss) |
| `results_valid.csv` | Per-sample validation results (index, Loss, Prediction, Confidence, Label) |
| `metadata.pth` | Corruption info: train_indices, valid_indices, true_targets, assigned_targets, label_flipped |
| `training_dynamics.npz` | Training dynamics: `td` (shape: `[n_samples, n_classes, n_epochs]`) and `labels` (shape: `[n_samples, n_classes]`) |
```shell
# Train a 2-layer LSTM with noisy 0.2 CIFAR-10
CUDA_VISIBLE_DEVICES=0 python train_detector.py --r 0.2 --dataset cifar10 \
    --files_path "./replication/cifar10_resnet32_percmislabeled0.2_uniform/computation4td_seed1"

# Fine-tune a 2-layer LSTM with noisy 0.2 CUB based on an existing detector
CUDA_VISIBLE_DEVICES=0 python train_detector.py --r 0.2 --dataset cub_200_2011 \
    --files_path "./replication/cifar10_resnet34_percmislabeled0.2_uniform/computation4td_seed1" \
    --resume "cifar100_0.3_lstm_detector.pth.tar"
```

Two pre-trained LSTM detectors are provided as defaults:

- `cifar10_0.2_lstm_detector.pth.tar` - better for CIFAR-10 tasks
- `cifar100_0.3_lstm_detector.pth.tar` - better for CIFAR-100, Clothing100K, and WebVision50
`./run.sh denoise_small` calls `Runner.train`, which first invokes `Runner.subset` to detect and remove mislabeled samples. Detection metrics (AUC and mAP) are reported via `get_order()` from `detector_model/predict.py`. The model is then retrained on the clean subset.
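The two reported metrics treat detection as a binary ranking problem: the detector's scores are compared against the ground-truth `label_flipped` mask from `metadata.pth`. A sketch with scikit-learn (the toy scores and variable names are illustrative, not the output of `get_order()`):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
label_flipped = rng.random(1000) < 0.2             # ground truth: ~20% mislabeled
# A toy detector: noisy scores that correlate with the true mask
scores = label_flipped + rng.normal(0, 0.5, 1000)  # higher = more suspicious

auc = roc_auc_score(label_flipped, scores)            # ranking quality over all thresholds
m_ap = average_precision_score(label_flipped, scores)  # mAP of the suspiciousness ranking
print(f"AUC={auc:.3f}  mAP={m_ap:.3f}")
```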
```shell
./run.sh denoise_small <datadir> <dataset> <seed> <noise_ratio> <noise_type> <detector_file> <remove_ratio>

# Denoise symmetric CIFAR-10
detector_file='cifar10_0.2_lstm_detector.pth.tar'
for remove_ratio in 0.15 0.2 0.25; do
    CUDA_VISIBLE_DEVICES=0 ./run.sh denoise_small "/root/codespace/datasets" "cifar10" 1 0.2 "uniform" ${detector_file} ${remove_ratio}
done

# Denoise asymmetric CIFAR-100
for remove_ratio in 0.35 0.4 0.45; do
    CUDA_VISIBLE_DEVICES=0 ./run.sh denoise_small "/root/codespace/datasets" "cifar100" 1 0.4 "asym" ${detector_file} ${remove_ratio}
done
```

After ranking all training samples, `Runner.train` selects a cleaner subset to retrain a new model. Requires `training_dynamics.npz` from Step 1 and `<detector_file>` from Step 2.
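The subset selection itself reduces to a rank-and-cut over the detector scores. A minimal sketch (the function name `clean_subset_indices` is illustrative, not the repo's API):

```python
import numpy as np

def clean_subset_indices(scores, remove_ratio):
    """Keep the samples with the lowest mislabel scores.

    scores: per-sample detector outputs, higher = more suspicious.
    Returns indices of the retained (presumed clean) samples.
    """
    n_remove = int(round(remove_ratio * len(scores)))
    order = np.argsort(scores)  # ascending: cleanest first
    return np.sort(order[: len(scores) - n_remove])

scores = np.array([0.1, 0.9, 0.3, 0.8, 0.2])
print(clean_subset_indices(scores, 0.4))  # drops the two most suspicious samples
```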
```shell
./run.sh denoise_large <datadir> <dataset> <seed> <detector_file> <remove_ratio>

detector_file='cifar100_0.3_lstm_detector.pth.tar'
remove_ratio=0.2

# Denoise WebVision50
CUDA_VISIBLE_DEVICES=0 ./run.sh denoise_large "/root/codespace/datasets" "webvision50" 1 ${detector_file} ${remove_ratio}

# Denoise Clothing100K
CUDA_VISIBLE_DEVICES=0 ./run.sh denoise_large "/root/codespace/datasets" "clothing100k" 1 ${detector_file} ${remove_ratio}
```

Arguments:

- `<detector_file>` - Path to the trained LSTM noise detector from Step 2
- `<remove_ratio>` - Fraction of samples to remove (suspected mislabeled)
Output (saved to `prune4retrain_seed{seed}/`):

| File | Description |
|---|---|
| `model.pth` | Best model trained on clean subset |
| `model.pth.last` | Last-epoch model |
| `train_log.csv` | Training log |
| `results_valid.csv` | Per-sample validation results |
In Chapter 4.3, we apply a data debugging strategy to further boost SOTA performance. Using a detector trained on noisy CIFAR-100, we first select the most suspicious samples as label noise and train a new model on the clean part of the dataset. The labels of the suspicious samples are then replaced with error-free ones (ground-truth labels for CUB and model predictions for WebVision), which we call data debugging.
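The repair step can be sketched as replacing the labels of the top-ranked fraction of suspicious samples. This is an illustration of the idea, not the repo's implementation; `repair_labels` and its arguments are names assumed for this sketch:

```python
import numpy as np

def repair_labels(noisy_labels, scores, repair_ratio, replacement):
    """Replace the labels of the most suspicious repair_ratio fraction of samples.

    replacement: per-sample substitute labels, e.g. ground-truth labels
    for CUB or model predictions for WebVision.
    """
    labels = np.asarray(noisy_labels).copy()
    n_repair = int(round(repair_ratio * len(labels)))
    suspicious = np.argsort(scores)[::-1][:n_repair]  # highest scores first
    labels[suspicious] = np.asarray(replacement)[suspicious]
    return labels

noisy = np.array([0, 1, 2, 3, 4])
scores = np.array([0.9, 0.1, 0.8, 0.2, 0.3])
preds = np.array([5, 5, 5, 5, 5])
print(repair_labels(noisy, scores, 0.4, preds))  # repairs samples 0 and 2
```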
Implementation details are in `data_debug_dm/cub_200_201/` and `data_debug_dm/mini_webvision/`. Based on the source code of DivideMix and AugDesc, we mainly modify the label-reading part of the datasets. We provide the modified dataloaders and trainers for experiments on CUB-200-2011 and mini WebVision.
```shell
# CUB-200 baseline (20% symmetric noise)
CUDA_VISIBLE_DEVICES=0 ./run.sh train_cub --r 0.2 --noise_mode sym --num_epochs 300

# CUB-200 with 5% label repair
CUDA_VISIBLE_DEVICES=0 ./run.sh train_cub --r 0.2 --noise_mode sym --repair_ratio 0.05 --num_epochs 300

# WebVision with 5% label repair
CUDA_VISIBLE_DEVICES=0 ./run.sh train_webvision --repair_ratio 0.05 --num_epochs 100

# WebVision with 10% label repair
CUDA_VISIBLE_DEVICES=0 ./run.sh train_webvision --repair_ratio 0.10 --num_epochs 100
```

If you make use of our work, please cite our paper:
```bibtex
@article{jia2022learning,
  title={Learning from Training Dynamics: Identifying Mislabeled Data Beyond Manually Designed Features},
  author={Jia, Qingrui and Li, Xuhong and Yu, Lei and Bian, Jiang and Zhao, Penghao and Li, Shupeng and Xiong, Haoyi and Dou, Dejing},
  journal={arXiv preprint arXiv:2212.09321},
  year={2022}
}
```

The implementation is based on the AUM code. Part of the experiments is based on DivideMix and AugDesc. Thanks for their brilliant works!