If you want to train your own language model on a Wikipedia in your chosen language,
run prepare_wiki.sh. The script will ask for a language and will then
download, extract, and prepare the latest version of Wikipedia for the chosen language.
Example command: bash prepare_wiki.sh
This will create a data folder in this directory, containing wiki_dumps, wiki_extr, and
wiki subfolders. In each subfolder, it will also create a LANG folder,
where LANG is the language of the chosen Wikipedia. The prepared files are stored in
wiki/LANG as train.csv and val.csv to match the format used for text
classification datasets. By default, train.csv contains around 100 million tokens
and val.csv is 10% the size of train.csv.
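If you want to sanity-check the prepared files before moving on, a quick count along the following lines can help. It assumes the CSVs are headerless with the label in the first column and the text in the second, matching the classification format mentioned above; adjust the column indices if your files differ.

```python
# Minimal sketch: count rows and approximate tokens in the prepared Wikipedia CSVs.
# Assumes headerless CSVs with the label in column 0 and the text in column 1.
import pandas as pd

for split in ("train", "val"):
    path = f"data/wiki/de/{split}.csv"  # replace "de" with your LANG
    n_rows, n_tokens = 0, 0
    # Read in chunks so a large dump does not have to fit into RAM at once.
    for chunk in pd.read_csv(path, header=None, chunksize=10_000):
        n_rows += len(chunk)
        n_tokens += chunk[1].astype(str).str.split().str.len().sum()
    print(f"{split}: {n_rows} rows, ~{n_tokens:,} whitespace tokens")
```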
Run create_toks.py to tokenize the input texts.
Example command: python create_toks.py data/imdb
Usage:
create_toks.py DIR_PATH [CHUNKSIZE] [N_LBLS] [LANG]
create_toks.py --dir-path DIR_PATH [--chunksize CHUNKSIZE] [--n-lbls N_LBLS] [--lang LANG]
- DIR_PATH: the directory where your data is located
- CHUNKSIZE: the size of the chunks when reading the files with pandas; use smaller sizes if you have less RAM
- N_LBLS: the number of label columns in train.csv and val.csv
- LANG: the language of your corpus
The train.csv and val.csv files should be in DIR_PATH. The script saves the
training and validation tokens and labels as arrays in binary NumPy format to a tmp
subfolder of DIR_PATH, in the following files:
tok_trn.npy, tok_val.npy, lbl_trn.npy, and lbl_val.npy.
In addition, a joined corpus containing whitespace-separated tokens is written to tmp/joined.txt.
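The CHUNKSIZE argument matters because the script streams the CSVs through pandas chunk by chunk instead of loading them whole. The sketch below only illustrates that data flow with a plain whitespace tokenizer; the actual script uses a proper tokenizer, so treat it as a picture of the pipeline rather than a replacement for it.

```python
# Illustrative sketch of the create_toks.py data flow (not the script itself):
# read the CSV in chunks, tokenize each text, and save tokens/labels as .npy files.
import os
import numpy as np
import pandas as pd

def tokenize_file(csv_path, chunksize=24_000):
    tokens, labels = [], []
    for chunk in pd.read_csv(csv_path, header=None, chunksize=chunksize):
        labels.extend(chunk[0].tolist())
        # Whitespace split stands in for the real tokenizer used by the script.
        tokens.extend(text.split() for text in chunk[1].astype(str))
    return np.array(tokens, dtype=object), np.array(labels)

os.makedirs("data/imdb/tmp", exist_ok=True)
tok_trn, lbl_trn = tokenize_file("data/imdb/train.csv")
np.save("data/imdb/tmp/tok_trn.npy", tok_trn)
np.save("data/imdb/tmp/lbl_trn.npy", lbl_trn)
```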
Run tok2id.py to map the tokens in the tok_trn.npy and tok_val.npy files to ids.
Example command: python tok2id.py data/imdb
Usage:
tok2id.py PREFIX [MAX_VOCAB] [MIN_FREQ]
tok2id.py --prefix PREFIX [--max-vocab MAX_VOCAB] [--min-freq MIN_FREQ]
- PREFIX: the file path prefix in data/nlp_clas/{prefix}
- MAX_VOCAB: the maximum vocabulary size
- MIN_FREQ: the minimum frequency of words that should be kept
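Conceptually, the mapping step keeps the MAX_VOCAB most frequent tokens that occur at least MIN_FREQ times and sends everything else to an unknown id. A minimal sketch of that idea (the special-token names here are illustrative) looks like this:

```python
# Minimal sketch of the token-to-id mapping built by tok2id.py:
# a frequency-filtered vocabulary plus unknown/padding tokens.
import collections
import pickle
import numpy as np

max_vocab, min_freq = 30000, 1
tok_trn = np.load("data/imdb/tmp/tok_trn.npy", allow_pickle=True)

freq = collections.Counter(tok for doc in tok_trn for tok in doc)
itos = [tok for tok, c in freq.most_common(max_vocab) if c >= min_freq]
itos.insert(0, "_pad_")  # special tokens; the exact names are illustrative
itos.insert(0, "_unk_")
stoi = collections.defaultdict(int, {tok: i for i, tok in enumerate(itos)})  # unknown -> 0

trn_ids = np.array([[stoi[tok] for tok in doc] for doc in tok_trn], dtype=object)
np.save("data/imdb/tmp/trn_ids.npy", trn_ids)
pickle.dump(itos, open("data/imdb/tmp/itos.pkl", "wb"))
```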
Before fine-tuning the language model, you can run pretrain_lm.py to create a
pre-trained language model using WikiText-103 (or whatever base corpus you prefer).
Example command: python pretrain_lm.py data/wiki/de/ 0 --lr 1e-3 --cl 12
Usage:
pretrain_lm.py DIR_PATH CUDA_ID [CL] [BS] [BACKWARDS] [LR] [SAMPLED] [PRETRAIN_ID]
pretrain_lm.py --dir-path DIR_PATH --cuda-id CUDA_ID [--cl CL] [--bs BS] [--backwards BACKWARDS] [--lr LR] [--sampled SAMPLED] [--pretrain-id PRETRAIN_ID]
- DIR_PATH: the directory that contains the Wikipedia files
- CUDA_ID: the id of the GPU that should be used
- CL: the number of epochs to train
- BS: the batch size
- BACKWARDS: whether a backwards LM should be trained
- LR: the learning rate
- SAMPLED: whether a sampled softmax should be used (default: True)
- PRETRAIN_ID: the id used for saving the trained LM
You might have to adapt the learning rate and the # of epochs to maximize performance.
Alternatively, you can download the pre-trained models here. Before doing so,
create a directory wt103. In wt103, create a models and a tmp folder. Save the model files
in the models folder and itos_wt103.pkl, the mapping from ids to tokens, in the tmp folder.
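For reference, the layout can be set up with a couple of lines; the data/wt103 path below simply matches the PRETRAIN_PATH used in the example command further down, so adjust it if you keep the folder elsewhere.

```python
# Create the folder layout expected for the downloaded WikiText-103 model.
import os

os.makedirs("data/wt103/models", exist_ok=True)
os.makedirs("data/wt103/tmp", exist_ok=True)
# Put the downloaded model weight files into data/wt103/models/
# and itos_wt103.pkl into data/wt103/tmp/.
```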
Then run finetune_lm.py to fine-tune a language model pretrained on WikiText-103 data on the target task data.
Example command: python finetune_lm.py data/imdb data/wt103 1 25 --lm-id pretrain_wt103
Usage:
finetune_lm.py DIR_PATH PRETRAIN_PATH [CUDA_ID] [CL] [PRETRAIN_ID] [LM_ID] [BS] [DROPMULT] [BACKWARDS] [LR] [PRELOAD] [BPE] [STARTAT] [USE_CLR] [USE_REGULAR_SCHEDULE] [USE_DISCRIMINATIVE] [NOTRAIN] [JOINED] [TRAIN_FILE_ID] [EARLY_STOPPING]
finetune_lm.py --dir-path DIR_PATH --pretrain-path PRETRAIN_PATH [--cuda-id CUDA_ID] [--cl CL] [--pretrain-id PRETRAIN_ID] [--lm-id LM_ID] [--bs BS] [--dropmult DROPMULT] [--backwards BACKWARDS] [--lr LR] [--preload PRELOAD] [--bpe BPE] [--startat STARTAT] [--use-clr USE_CLR] [--use-regular-schedule USE_REGULAR_SCHEDULE] [--use-discriminative USE_DISCRIMINATIVE] [--notrain NOTRAIN] [--joined JOINED] [--train-file-id TRAIN_FILE_ID] [--early-stopping EARLY_STOPPING]
- DIR_PATH: the directory where the tmp and models folders are located
- PRETRAIN_PATH: the path where the pretrained model is saved; if using the downloaded model, this is wt103
- CUDA_ID: the id of the GPU used for training the model
- CL: the number of epochs to train the model
- PRETRAIN_ID: the id of the pretrained model; set to wt103 by default
- LM_ID: the id used for saving the fine-tuned language model
- BS: the batch size used for training the model
- DROPMULT: the factor used to multiply the dropout parameters
- BACKWARDS: whether a backwards LM is trained
- LR: the learning rate
- PRELOAD: whether we load a pretrained LM (True by default)
- BPE: whether we use byte-pair encoding (BPE)
- STARTAT: can be used to continue fine-tuning a model; if >0, loads an already fine-tuned LM; can also be used to indicate the layer at which to start the gradual unfreezing (1 is the last hidden layer, etc.); in the final model, we only used this for training the classifier
- USE_CLR: whether to use slanted triangular learning rates (STLR) (True by default)
- USE_REGULAR_SCHEDULE: whether to use a regular schedule (instead of STLR)
- USE_DISCRIMINATIVE: whether to use discriminative fine-tuning (True by default)
- NOTRAIN: whether to skip fine-tuning
- JOINED: whether to fine-tune the LM on the concatenation of training and validation data
- TRAIN_FILE_ID: can be used to indicate different training files (e.g. to test training sizes)
- EARLY_STOPPING: whether to use early stopping
The language model is fine-tuned using warm-up reverse annealing and triangular learning rates. For IMDb,
we set --cl, the number of epochs, to 50 and used a learning rate --lr of 4e-3.
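For reference, the slanted triangular schedule controlled by --use-clr increases the learning rate linearly over a small fraction of the iterations and then decays it linearly for the rest. A small sketch of the formula from the ULMFiT paper (not the script's actual implementation) is:

```python
# Slanted triangular learning rates (STLR) as described in the ULMFiT paper:
# a short linear warm-up followed by a long linear decay.
def stlr(t, num_iters, lr_max=4e-3, cut_frac=0.1, ratio=32):
    cut = int(num_iters * cut_frac)          # iteration at which the peak is reached
    if t < cut:
        p = t / cut                          # linear increase up to lr_max
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # linear decay afterwards
    return lr_max * (1 + p * (ratio - 1)) / ratio

# Example: with 1000 iterations the peak learning rate is hit after 100 steps.
schedule = [stlr(t, num_iters=1000) for t in range(1000)]
```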
Run train_clas.py to train the classifier on top of the fine-tuned language model with gradual unfreezing,
discriminative fine-tuning, and slanted triangular learning rates.
Example command: python train_clas.py data/imdb 0 --lm-id pretrain_wt103 --clas-id pretrain_wt103 --cl 50
Usage:
train_clas.py DIR_PATH CUDA_ID [LM_ID] [CLAS_ID] [BS] [CL] [BACKWARDS] [STARTAT] [UNFREEZE] [LR] [DROPMULT] [BPE] [USE_CLR] [USE_REGULAR_SCHEDULE] [USE_DISCRIMINATIVE] [LAST] [CHAIN_THAW] [FROM_SCRATCH] [TRAIN_FILE_ID]
train_clas.py --dir-path DIR_PATH --cuda-id CUDA_ID [--lm-id LM_ID] [--clas-id CLAS_ID] [--bs BS] [--cl CL] [--backwards BACKWARDS] [--startat STARTAT] [--unfreeze UNFREEZE] [--lr LR] [--dropmult DROPMULT] [--bpe BPE] [--use-clr USE_CLR] [--use-regular-schedule USE_REGULAR_SCHEDULE] [--use-discriminative USE_DISCRIMINATIVE] [--last LAST] [--chain-thaw CHAIN_THAW] [--from-scratch FROM_SCRATCH] [--train-file-id TRAIN_FILE_ID]
- DIR_PATH: the directory where the tmp and models folders are located
- CUDA_ID: the id of the GPU used for training the model
- LM_ID: the id of the fine-tuned language model that should be loaded
- CLAS_ID: the id used for saving the classifier
- BS: the batch size used for training the model
- CL: the number of epochs to train the model with all layers unfrozen
- BACKWARDS: whether a backwards LM is trained
- STARTAT: whether to use gradual unfreezing (0) or load the pretrained model (1)
- UNFREEZE: whether to unfreeze the whole network (after optional gradual unfreezing) or train only the final classifier layer (True by default)
- LR: the learning rate
- DROPMULT: the factor used to multiply the dropout parameters
- BPE: whether we use byte-pair encoding (BPE)
- USE_CLR: whether to use slanted triangular learning rates (STLR) (True by default)
- USE_REGULAR_SCHEDULE: whether to use a regular schedule (instead of STLR)
- USE_DISCRIMINATIVE: whether to use discriminative fine-tuning (True by default)
- LAST: whether to fine-tune only the last layer of the model
- CHAIN_THAW: whether to use chain-thaw
- FROM_SCRATCH: whether to train the model from scratch (without loading a pretrained model)
- TRAIN_FILE_ID: can be used to indicate different training files (e.g. to test training sizes)
For fine-tuning the classifier on IMDb, we set --cl, the number of epochs, to 50.
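The gist of gradual unfreezing and discriminative fine-tuning can be sketched in plain PyTorch: layer groups are unfrozen one at a time from the top, and deeper groups get smaller learning rates. This is only an illustration of the two techniques with made-up helper names, not the code the script actually uses.

```python
# Illustrative PyTorch sketch of gradual unfreezing with discriminative
# learning rates: unfreeze one more layer group (from the top) each epoch,
# and give lower groups smaller learning rates (here divided by 2.6 per group).
import torch

def make_optimizer(layer_groups, base_lr=1e-3, decay=2.6):
    param_groups = []
    for depth, group in enumerate(reversed(layer_groups)):  # top group first
        param_groups.append({"params": group.parameters(), "lr": base_lr / decay**depth})
    return torch.optim.Adam(param_groups)

def train_with_gradual_unfreezing(layer_groups, run_one_epoch, n_epochs):
    # Start with everything frozen, then thaw one more group per epoch.
    for group in layer_groups:
        for p in group.parameters():
            p.requires_grad = False
    opt = make_optimizer(layer_groups)
    for epoch in range(n_epochs):
        first_unfrozen = max(0, len(layer_groups) - 1 - epoch)
        for group in layer_groups[first_unfrozen:]:
            for p in group.parameters():
                p.requires_grad = True
        run_one_epoch(opt)  # forward/backward/step over the training data
```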
Run eval_clas.py to get the classifier accuracy and confusion matrix.
This requires the files produced during the training process: itos.pkl and the classifier (named clas_1.h5 by default), as well as the npy files containing the evaluation samples and labels.
Example command: python eval_clas.py data/imdb 0 --lm-id pretrain_wt103 --clas-id pretrain_wt103
Usage:
eval_clas.py DIR_PATH CUDA_ID [LM_ID] [CLAS_ID] [BS] [BACKWARDS] [BPE]
eval_clas.py --dir-path DIR_PATH --cuda-id CUDA_ID [--lm-id LM_ID] [--clas-id CLAS_ID] [--bs BS] [--bpe BPE]
- DIR_PATH: the directory where the tmp and models folders are located
- CUDA_ID: the id of the GPU used for training the model
- LM_ID: the id of the fine-tuned language model that should be loaded
- CLAS_ID: the id used for saving the classifier
- BS: the batch size used for training the model
- BACKWARDS: whether a backwards LM is trained
- BPE: whether we use byte-pair encoding (BPE)
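If you prefer to compute the same metrics yourself from saved predictions, something like the following works; the file names are placeholders for whatever your evaluation run writes out.

```python
# Sketch: accuracy and confusion matrix from predicted and true class ids.
# "preds.npy" and "labels.npy" are placeholder names for your own outputs.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

preds = np.load("preds.npy")    # predicted class ids
labels = np.load("labels.npy")  # true class ids

print("accuracy:", accuracy_score(labels, preds))
print(confusion_matrix(labels, preds))
```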
Run predict_with_classifier.py to get predictions for free-form text input.
This requires two files produced during the training process: itos.pkl and the classifier (named clas_1.h5 by default).
Example command: python predict_with_classifier.py trained_models/itos.pkl trained_models/classifier_model.h5
It is suggested to customize this script to your needs.
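As a starting point for such customization, the numericalization step any prediction script needs, i.e. turning free text into the ids the classifier expects via itos.pkl, looks roughly like this; the whitespace split is a stand-in for the tokenizer used during training.

```python
# Sketch of turning free text into model input ids using itos.pkl.
# The whitespace split stands in for the tokenizer used during training.
import collections
import pickle
import numpy as np

itos = pickle.load(open("trained_models/itos.pkl", "rb"))
# Assumes the unknown token sits at index 0 of itos.
stoi = collections.defaultdict(int, {tok: i for i, tok in enumerate(itos)})

text = "this movie was surprisingly good"
ids = np.array([stoi[tok] for tok in text.lower().split()])
# ids can then be fed to the loaded classifier (adding a batch dimension as required).
```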