Directory Title

The University of Edinburgh's WMT19 systems

This directory contains the GU-EN and EN-GU NMT models built for the WMT19 shared news translation task.

Name	Last modified	Size

Parent Directory		-
en-gu/	2019-06-26 10:09	-
gu-en/	2019-06-26 10:10	-
preproc_models/	2019-06-25 09:52	-
scripts/	2019-10-15 15:01	-
zh-en/	2020-04-23 22:48	-

Requirements

The models use the following software:

moses decoder (scripts only; no compilation required) https://github.com/moses-smt/mosesdecoder
fastBPE https://github.com/glample/fastBPE.git
marian https://github.com/marian-nmt/marian
XLM (for gu-en and en-gu training of backtranslation models only)
IndicNLP library (for Gujarati pre- and post-processing) https://github.com/anoopkunchukuttan/indic_nlp_library.git

EN-GU and GU-EN models

The models for each language direction are found in the corresponding folders (e.g. en-gu/). Each folder also contains a script to run and train models.

Please note - due to a preprocessing error fixed after the shared task, these models are retrained versions of the ones described in our paper (see citation below). Consequently, the results of these new models are higher than our official results in the task. New BLEU scores (calculated using cased SacreBLEU) are as follows:

Dataset	en-gu		gu-en
Dataset	Single	Ensemble	Single	Ensemble
newsdev2019	16.4	17.3	27.6	28.6
newstest2019	15.3	16.3	21.1	22.3

All translations are produced with a beam size of 6 and ensembles are unweighted. See translation scripts for more details.

Data preprocessing and postprocessing

Data preprocessing and postprocessing are specific to the language pair. Preprocessing models (e.g. truecasing and BPE models) are found in preproc_models. Paths to preprocessing tools are specified in scripts/vars. Please changes these to your own installation paths.

preprocessing: scripts/preprocess-{en,gu}.sh INPUT OUTPUT
postprocessing: scripts/postprocess-{en,gu}.sh INPUT OUTPUT

Translating

Translate with the best single model: bash {en-gu,gu-en}/translate-single.sh INPUT GPU(S) > OUTPUT
Translate with an ensemble of 4 models: bash {en-gu,gu-en}/translate-ensemble.sh INPUT GPU(S) > OUTPUT

N.B the input should be preprocessed as specified above and the output should be manually postprocessed once translated.

Training

Training scripts are found in each folder en-gu/ and gu-en/

There are 2 training steps:

Training on all data (genuine parallel, backtranslated monolingual, translated Hindi-English): bash {en-gu,gu-en}/train.sh
Fine-training on genuine parallel data: bash {en-gu,gu-en}/fine-train.sh

The seed variable indicates the model number and should be changed to train new models.

License

The use of the models provided in this directory is permitted under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported license (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/

Attribution - You must give appropriate credit [please use the citation below], provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

NonCommercial - You may not use the material for commercial purposes.

ShareAlike - If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

Citation

The models are described in the following publication:

Rachel Bawden, Nikolay Bogoychev, Ulrich Germann, Roman Grundkiewicz, Faheem Kirefu, Antonio Valerio Miceli Barone and Alexandra Birch. The University of Edinburgh’s Submissions to the WMT19 News Translation Task. 2019. In Proceedings of the 4th Conference on Machine Translation. WMT'19. Florence, Italy.

@inproceedings{uedin-wmt19,
    title = {{The University of Edinburgh's Submissions to the WMT19 News Translation Task}},
    author = {Bawden, Rachel and Bogoychev, Nikolay and Germann, Ulrich and
             Grundkiewicz, Roman and Kirefu, Faheem and
         Valerio Miceli Barone, Antonio and Birch, Alexandra}, 
    booktitle = {{Proceedings of the 4th Conference on Machine Translation}},
    series = {{WMT'19}},
    address = {Florence, Italy},
    year = {2019}
}