# Language Modeling

## Build Dictionary

fl_lm_dictionary_builder \
 --data_dir=/tmp \
 --data_train=test1.txt,test2.txt \
 --n_workers=40 \
 --dictionary=dictionary.txt \
 --dictionary_min_appearence=2 \
 --dictionary_max_size=200000 \
 --write_meta=true

The dictionary builder reads all the text files listed in `--data_train` from `--data_dir` and counts the total number of tokens and sentences in them, using `--n_workers` threads in parallel. It then filters out uncommon tokens according to `--dictionary_min_appearence`, limits the dictionary size to `--dictionary_max_size`, and saves the remaining tokens, together with their occurrence counts across all the text files, to `--dictionary`. If `--write_meta` is on, a metadata file with the suffix `.desc` is generated in `--data_dir` for each text file. The metadata records the starting position (in bytes) of each sentence and the number of tokens in it.
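
The counting, filtering, and metadata steps described above can be sketched in a few lines of Python. This is only an illustration, not the code behind `fl_lm_dictionary_builder`: the function names are hypothetical, and it assumes one sentence per line, whitespace tokenization, a "keep if count >= threshold" rule, and frequency-ordered output, none of which is confirmed by this README.

```python
from collections import Counter


def build_dictionary(lines, min_appearance=2, max_size=200000):
    """Count whitespace-separated tokens, drop rare ones, cap the size."""
    counts = Counter()
    n_sentences = 0
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip empty lines
        counts.update(tokens)
        n_sentences += 1
    n_tokens = sum(counts.values())
    # Assumption: keep tokens seen at least `min_appearance` times,
    # ordered from most to least frequent.
    kept = [(tok, c) for tok, c in counts.most_common() if c >= min_appearance]
    return kept[:max_size], n_tokens, n_sentences


def describe_sentences(raw: bytes):
    """Return (byte offset, token count) for each line of a text file."""
    meta, offset = [], 0
    for line in raw.splitlines(keepends=True):
        meta.append((offset, len(line.split())))
        offset += len(line)  # offsets are in bytes, not characters
    return meta
```

Working on raw bytes in `describe_sentences` keeps the offsets valid for seeking into the file even when the text contains multi-byte characters.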

The built dictionary also contains the following special tokens at the beginning, so you don't need to modify it before training.

- `</s>`: end of sentence
- `<unk>`: unknown token
- `<pad>`: pad token
- `<mask>`: mask token (needed for BERT training)
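
To illustrate why the special tokens sit at the beginning of the dictionary, here is a minimal Python sketch of turning such a dictionary into token ids. The helper names are hypothetical, and the id assignment (special tokens first, in the order listed above) and the appended `</s>` are assumptions for illustration, not a description of the actual flashlight loader.

```python
SPECIAL_TOKENS = ["</s>", "<unk>", "<pad>", "<mask>"]


def make_index(counted_tokens):
    """Map tokens to integer ids, with the special tokens placed first."""
    vocab = SPECIAL_TOKENS + [tok for tok, _ in counted_tokens]
    return {tok: i for i, tok in enumerate(vocab)}


def encode(sentence, index):
    """Convert a sentence to ids; unknown words map to <unk>."""
    unk = index["<unk>"]
    ids = [index.get(tok, unk) for tok in sentence.split()]
    ids.append(index["</s>"])  # assumption: each sentence ends with </s>
    return ids
```

With this layout, the ids of the special tokens stay fixed no matter how many regular tokens the dictionary holds.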


## Train

### Training modes
- `train`: Train a model from scratch, and save logs and checkpoints into `exp_rundir/exp_model_name`.
- `continue`: Continue training an existing model in `exp_rundir/exp_model_name`.
- `fork`: Train a new model with weights initialized from the model specified in `--exp_init_model_path`.

### Training tasks
- Auto-regressive training (`--train_task=autoreg`)

fl_lm_train \
 --exp_rundir=/tmp \
 --exp_model_name=my_lm \
 --data_dir=<...>/my_data \
 --data_valid=data1.txt,data2.txt \
 --data_sample_break_mode=none \
 --data_tokens_per_sample=4096 \
 --dictionary=<...>/my_dict.txt \
 --loss_type=ce \
 --train_arch_file=/path/to/compiled/myarch.so \
 --train_max_grad_norm=0.1 \
 --train_report_updates=1000 \
 --train_save_updates=13000 \
 --train_warmup_init_lr=1e-7 \
 --train_optimizer=nag \
 --train_lr=1 \
 --train_lr_schedule=invsqrt \
 --train_momentum=0.9 \
 --train_weight_decay=0
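
For readers unfamiliar with the `invsqrt` schedule named in the command above, the following Python sketch shows one common definition: a linear warmup from the initial rate to the peak rate, followed by decay proportional to the inverse square root of the update count. Whether flashlight uses exactly this formula is an assumption, and `warmup_updates` is a hypothetical parameter that does not appear in the command.

```python
import math


def invsqrt_lr(step, lr=1.0, warmup_init_lr=1e-7, warmup_updates=4000):
    """Linear warmup to `lr`, then decay proportional to 1/sqrt(step)."""
    if step < warmup_updates:
        # Interpolate linearly between the initial and the peak rate.
        return warmup_init_lr + (lr - warmup_init_lr) * step / warmup_updates
    # After warmup: the rate halves every time the step count quadruples.
    return lr * math.sqrt(warmup_updates / step)
```

Under these assumptions, a very small `--train_warmup_init_lr` combined with a large `--train_lr` means the first updates are taken with a near-zero step size that grows steadily until warmup ends.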


- BERT-style training (`--train_task=mask`)

fl_lm_train \
 --exp_rundir=/tmp \
 --exp_model_name=my_lm \
 --data_dir=<...>/my_data \
 --data_valid=data1.txt,data2.txt \
 --data_sample_break_mode=none \
 --data_tokens_per_sample=4096 \
 --dictionary=<...>/my_dict.txt \
 --loss_type=adsm \
 --loss_adsm_input_size=1024 \
 --loss_adsm_cutoffs=10000,50000,150000 \
 --train_task=mask \
 --train_arch_file=/path/to/compiled/myarch.so \
 --train_max_grad_norm=0.1 \
 --train_report_updates=1000 \
 --train_save_updates=13000 \
 --train_warmup_init_lr=1e-7 \
 --train_optimizer=nag \
 --train_lr=1 \
 --train_lr_schedule=invsqrt \
 --train_momentum=0.9 \
 --train_weight_decay=0 \
 --mask_rand_token_prob=0.1 \
 --mask_same_token_prob=0.1 \
 --mask_prob=0.15
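
The three `--mask_*` flags above most likely follow the standard BERT masking recipe; the Python sketch below shows that recipe under this assumption. The function and its `n_special` and `rng` parameters are hypothetical names for illustration, not part of flashlight. Each position is selected for prediction with probability `mask_prob`; a selected token is then replaced by a random token, left unchanged, or replaced by `<mask>`.

```python
import random


def mask_tokens(ids, mask_id, vocab_size, n_special=4, mask_prob=0.15,
                rand_token_prob=0.1, same_token_prob=0.1, rng=None):
    """Return (corrupted inputs, targets); targets are None where no loss applies."""
    if rng is None:
        rng = random.Random()
    inputs, targets = list(ids), [None] * len(ids)
    for i, tok in enumerate(ids):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        targets[i] = tok  # the model must recover the original token here
        r = rng.random()
        if r < rand_token_prob:
            # Replace with a random non-special token.
            inputs[i] = rng.randrange(n_special, vocab_size)
        elif r < rand_token_prob + same_token_prob:
            pass  # keep the original token in the input
        else:
            inputs[i] = mask_id  # the remaining cases use <mask>
    return inputs, targets
```

With the values from the command, about 15% of positions contribute to the loss, and of those roughly 80% are shown to the model as `<mask>`.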


A complete list of the flag definitions and short descriptions of their meaning can be found [here](https://github.com/facebookresearch/flashlight/blob/master/flashlight/app/lm/Trainer.cpp).


## Evaluation

Coming soon
