The base version of this repo is a clone of Soft-Label Dataset Distillation and Text Dataset Distillation.
- Python 3
- NVIDIA GPU + CUDA
faiss==1.7.3
matplotlib==3.7.1
numpy==1.24.3
pandas==2.0.2
Pillow==9.5.0
PyYAML==5.4.1
scikit_learn==1.2.2
six==1.16.0
skimage==0.0
torch==1.13.1
torchtext==0.6.0
torchvision==0.14.1
tqdm==4.65.0
transformers==4.29.2
The experiments in our project can be reproduced using the commands in run.sh.
Example:
python main.py --mode distill_basic --dataset umsab --arch TextConvNet_BERT \
    --batch_size 1024 --distill_steps 1 --static_labels 0 --random_init_labels hard --textdata True --visualize '' \
    --distilled_images_per_class_per_step 1 --distill_epochs 5 --distill_lr 0.01 --decay_epochs 10 --log_interval 5 \
    --epochs 25 --lr 0.01 --ntoken 10000 --ninp 768 --maxlen 75 --results_dir text_results/umsab_20by1_unkinit_repl1 \
    --device_id 0 --phase train
This logs everything related to the experiment, including the distilled data, in text_results/.
To test the distilled data generated by the script above (this trains the same model on the distilled data and also generates the nearest word embeddings):
python main.py --mode distill_basic --dataset umsab --arch TextConvNet_BERT \
    --batch_size 1024 --distill_steps 1 --static_labels 0 --random_init_labels hard --textdata True --visualize '' \
    --distilled_images_per_class_per_step 1 --distill_epochs 5 --distill_lr 0.01 --decay_epochs 10 --log_interval 5 \
    --epochs 25 --lr 0.01 --ntoken 10000 --ninp 768 --maxlen 75 --results_dir text_results/umsab_20by1_unkinit_repl1 \
    --device_id 0 --phase test
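Since the train and test invocations differ only in the --phase flag, they can be driven from one wrapper. The sketch below is not the authors' actual run.sh, just a minimal stand-in that reuses the exact flag values from the example above; with DRY_RUN=1 (the default here) it only prints the commands instead of launching them.

```shell
#!/usr/bin/env bash
# Minimal sketch of a run.sh-style wrapper (assumed, not from the repo).
# Flag values are copied verbatim from the README example.
DRY_RUN="${DRY_RUN:-1}"   # set DRY_RUN=0 to actually launch the runs

common_args=(
  --mode distill_basic --dataset umsab --arch TextConvNet_BERT
  --batch_size 1024 --distill_steps 1 --static_labels 0
  --random_init_labels hard --textdata True --visualize ''
  --distilled_images_per_class_per_step 1 --distill_epochs 5
  --distill_lr 0.01 --decay_epochs 10 --log_interval 5
  --epochs 25 --lr 0.01 --ntoken 10000 --ninp 768 --maxlen 75
  --results_dir text_results/umsab_20by1_unkinit_repl1
  --device_id 0
)

launched=""
for phase in train test; do
  launched="$launched --phase $phase;"
  if [ "$DRY_RUN" = "1" ]; then
    # Dry run: print the command that would be executed
    echo python main.py "${common_args[@]}" --phase "$phase"
  else
    python main.py "${common_args[@]}" --phase "$phase"
  fi
done
```

Running the distillation (train) phase first matters: the test phase reads the distilled data that the train phase writes under text_results/.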
The original authors' docs/advanced.md gives a detailed description of the useful parameters.
References:
- Soft-Label Dataset Distillation and Text Dataset Distillation paper
- Dataset Distillation: the code in the original repo was written by Tongzhou Wang, Jun-Yan Zhu, and Ilia Sucholutsky.