This folder contains the code for ANCE training with the implicit Distributionally Robust Optimization (iDRO) strategy.
The command to install the required packages is in commands/install.sh. (Note that there are some differences between the packages used for BM25 warmup and for COCO Pretraining.)
The command with the parameters we used to train from the warmup checkpoint is in commands/run_ance.sh.
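For example (a minimal sketch; both scripts are assumed to be invoked from the repository root):

```bash
# Install the required packages (see the note above on environment
# differences between BM25 warmup and COCO Pretraining).
bash commands/install.sh

# Train with the parameters used in our experiments.
bash commands/run_ance.sh
```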
An example of using the script to tokenize the dataset is shown below (note that we do not use title information):
```bash
python ../data/msmarco_data.py \
--data_dir ${training_data_dir} \
--out_data_dir ${preprocessed_data_dir} \
--model_type $model_type \
--model_name_or_path ${pretrained_checkpoint_dir} \
--max_seq_length 256 \
--data_type 1
```
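The shell variables above are placeholders. A hypothetical setup might look like the following (the paths and checkpoint location are assumptions; `model_type` mirrors the hard-negative generation command below):

```bash
# Hypothetical example values; adjust to your own setup.
training_data_dir=./marco                        # raw MS MARCO files
preprocessed_data_dir=./marco/preprocessed/      # tokenized output
model_type=rdot_nll_condenser                    # same as in run_ann_data_gen.py below
pretrained_checkpoint_dir=./checkpoints/warmup/  # BM25-warmup checkpoint
```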
`drivers/run_ann_data_gen.py` provides the code to generate hard negatives from the current (latest) checkpoint. An example usage of the script is given below:
```bash
python -m torch.distributed.launch --nproc_per_node=8 --master_port 12345 ../drivers/run_ann_data_gen.py \
--training_dir $saved_models_dir \
--init_model_dir $pretrained_checkpoint_dir \
--model_type rdot_nll_condenser \
--output_dir $model_ann_data_dir \
--cache_dir "${model_ann_data_dir}cache/" \
--data_dir $preprocessed_data_dir \
--max_seq_length 256 \
--per_gpu_eval_batch_size 512 \
--topk_training 200 \
--negative_sample 30 \
--end_output_num 0 \
--result_dir ${result_dir} \
--group 0 \
--public_ann_data_dir=${public_ann_data_dir} \
--cluster_query \
--cluster_centroids 50
```
Here,

- `training_dir` is the directory for saving model checkpoints during training. If this directory is empty, the checkpoint from `init_model_dir` will be used.
- `output_dir` is the directory for saving the embedding files and the calculated hard negative samples.
- `result_dir` is the directory for saving the evaluation results on MS MARCO. To evaluate on BEIR tasks, you will need to use the code in the `evaluate` folder with the trained checkpoint.
- `cluster_query` enables clustering the training data based on the query embeddings (used in iDRO); a minimal sketch of this step is shown after this list.
- `cluster_centroids` sets the number of clusters.
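For reference, the query-embedding clustering enabled by `cluster_query` can be pictured as a k-means step over the query embeddings. The sketch below is an illustration using FAISS, not the repository's implementation; the function name and defaults are hypothetical:

```python
# Illustrative sketch of clustering queries into groups for iDRO.
# Assumes faiss and numpy are installed; not the repository's exact code.
import faiss
import numpy as np

def cluster_queries(query_embeddings: np.ndarray, n_centroids: int = 50) -> np.ndarray:
    """Assign each training query to one of n_centroids clusters."""
    x = query_embeddings.astype(np.float32)
    kmeans = faiss.Kmeans(x.shape[1], n_centroids, niter=20, seed=42)
    kmeans.train(x)
    # A query's group id is the index of its nearest centroid.
    _, group_ids = kmeans.index.search(x, 1)
    return group_ids.ravel()
```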
The command to train the model on the generated hard negatives (using `drivers/run_ann.py`) is shown below:

```bash
python -m torch.distributed.launch --nproc_per_node=8 --master_port 21345 ../drivers/run_ann.py \
--model_type $model_type \
--model_name_or_path $pretrained_checkpoint_dir \
--task_name MSMarco \
--training_dir ${saved_models_dir} \
--init_model_dir ${pretrained_checkpoint_dir} \
--triplet \
--data_dir $preprocessed_data_dir \
--ann_dir $model_ann_data_dir \
--max_seq_length $seq_length \
--per_gpu_train_batch_size $per_gpu_train_batch_size \
--per_gpu_eval_batch_size 512 \
--gradient_accumulation_steps 1 \
--learning_rate $learning_rate \
--output_dir $saved_models_dir \
--warmup_steps $warmup_steps \
--logging_steps 1000 \
--save_steps 3000 \
--max_steps ${MAX_STEPS} \
--single_warmup \
--optimizer lamb \
--fp16 \
--log_dir $TSB_OUTPUT_DIR \
--model_size ${MODEL_SIZE} \
--result_dir ${result_dir} \
--group ${group} \
--n_groups ${CLUSTER_NUM} \
--dro_type ${DRO_TYPE} \
--alpha ${alpha} \
--eps ${eps} \
--ema ${ema} \
--rho ${rho} \
--round ${i}
```
Here,

- `saved_models_dir` is the folder for saving the checkpoints during training.
- `dro_type` is the type of DRO algorithm. We provide two implementations: `iDRO` (the main method of this work) and `dro-greedy` (the original DRO method).
- `model_size` is the size of the model (`base`/`large`).
- `round` is the current episode (0 stands for the first ANCE episode).
- `alpha`, `ema`, `rho`, and `eps` are hyperparameters for DRO; see the sketch and the table below.
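To make the roles of these hyperparameters concrete, here is a minimal, hypothetical sketch of a group-reweighting step in the spirit of group DRO. The variable roles assumed here (`ema` as a smoothing factor, `rho` as a temperature, `eps` as a weight floor, `alpha` as an interpolation coefficient) are illustrative assumptions only and do not reproduce the exact `iDRO` or `dro-greedy` updates:

```python
# Hypothetical sketch of an EMA-smoothed group-reweighting step.
# The roles assigned to alpha / ema / rho / eps are illustrative
# assumptions, not the exact iDRO or dro-greedy update rules.
import torch

def group_weights(step_losses: torch.Tensor,
                  ema_losses: torch.Tensor,
                  alpha: float = 0.25,
                  ema: float = 0.1,
                  rho: float = 0.05,
                  eps: float = 0.01):
    """Turn per-group (per-cluster) losses into group weights."""
    # Smooth the noisy per-step group losses with an exponential moving average.
    ema_losses = ema * step_losses.detach() + (1.0 - ema) * ema_losses
    # Upweight the high-loss groups; rho acts as a softmax temperature.
    w = torch.softmax(ema_losses / rho, dim=0)
    uniform = torch.full_like(w, 1.0 / w.numel())
    # eps keeps every group above a minimum weight (floor of eps / n_groups).
    w = (1.0 - eps) * w + eps * uniform
    # alpha interpolates between the DRO weights and uniform weighting.
    w = alpha * w + (1.0 - alpha) * uniform
    return w, ema_losses
```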
| Hyperparameter | Base | Large |
|---|---|---|
| Max Learning Rate | 5e-6 | 5e-6 |
| Warmup Steps | 3000 | 3000 |
| Max Training Steps for Each Episode | 45000 | 30000 |
| Batch Size per GPU | 64 | 32 |
| `n_groups` | 50 | 50 |
| `alpha` | 0.25 | 0.25 |
| `ema` | 0.1 | 0.1 |
| `rho` | 0.05 | 0.05 |
| `eps` | 0.01 | 0.01 |
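These values map onto the shell variables in the training command above. For the base model, for example (variable names taken from the command; the `DRO_TYPE` value is assumed from the description above):

```bash
# Example settings for the base model, mirroring the table above.
learning_rate=5e-6
warmup_steps=3000
MAX_STEPS=45000
per_gpu_train_batch_size=64
CLUSTER_NUM=50           # n_groups
alpha=0.25
ema=0.1
rho=0.05
eps=0.01
seq_length=256           # matches the tokenization step
MODEL_SIZE=base
DRO_TYPE=iDRO            # or dro-greedy (exact spelling per the description above)
```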