This file contains the data used in "On the Impact of Various Types of Noise on Neural Machine Translation" by Huda Khayrallah and Philipp Koehn.
All data has been tokenized using the following scripts from Moses (https://github.com/moses-smt/mosesdecoder). You only need to download the repository; you do not need to install Moses to use the scripts:
moses-smt/mosesdecoder/tokenizer/normalize-punctuation.perl $lang | moses-smt/mosesdecoder/tokenizer/tokenizer.perl -a -l $lang
with $lang in {de,en}
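As a minimal sketch, the pipeline above can be wrapped in a small shell function that reads raw text on stdin and writes tokenized text to stdout. The function name tokenize, the variable MOSES_TOKENIZER_DIR, and the file names in the usage comments are our own placeholders, not part of the released data. The default directory mirrors the path written above; depending on your checkout, the scripts may sit in a different subdirectory of the repository, so adjust it as needed.

```shell
# Assumption: directory holding the two Moses scripts.
# The default mirrors the path in this README; override it for your checkout.
MOSES_TOKENIZER_DIR=${MOSES_TOKENIZER_DIR:-moses-smt/mosesdecoder/tokenizer}

# Hypothetical helper: tokenize stdin for one language (de or en) to stdout.
tokenize() {
  lang=$1
  "$MOSES_TOKENIZER_DIR/normalize-punctuation.perl" "$lang" \
    | "$MOSES_TOKENIZER_DIR/tokenizer.perl" -a -l "$lang"
}

# Usage (file names are placeholders):
#   tokenize de < corpus.de > corpus.tok.de
#   tokenize en < corpus.en > corpus.tok.en
```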
For our baseline we use Europarl, News Commentary and the Rapid EU Press Release parallel corpus, all from the WMT 2017 shared task.
To create the data sets used in Table 9, concatenate baseline.tok.$lang with the noisy file of the desired amount: $noise_type.$amount.tok.$lang.
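The concatenation could be scripted roughly as follows. This is a sketch under assumptions: the helper name make_noisy_set and the output file name pattern are hypothetical choices of ours, and all *.tok.* files are assumed to sit in the current directory.

```shell
# Hypothetical helper: build one training set from the baseline plus one noisy file.
# Both language sides are concatenated in the same order so that line i of the
# German file still corresponds to line i of the English file.
make_noisy_set() {
  noise_type=$1   # e.g. misaligned_sent
  amount=$2       # e.g. 20
  for lang in de en; do
    cat "baseline.tok.$lang" "$noise_type.$amount.tok.$lang" \
      > "baseline+$noise_type.$amount.tok.$lang" || return 1
  done
}
```

For example, make_noisy_set misaligned_sent 20 would write baseline+misaligned_sent.20.tok.de and baseline+misaligned_sent.20.tok.en.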
$noise_type is one of {misaligned_sent, misordered_words_src, misordered_words_trg, wrong_lang_fr_src, wrong_lang_fr_trg, untranslated_en_src, untranslated_de_trg, short_max2, short_max5, raw_paracrawl} and $amount is one of {05, 10, 20, 50, 100}.
If you use this data, please cite our paper and the original data sources as follows:
@inproceedings{khayrallah-koehn-2018-impact,
title = "On the Impact of Various Types of Noise on Neural Machine Translation",
author = "Khayrallah, Huda and Koehn, Philipp",
booktitle = "Proceedings of the 2nd Workshop on Neural Machine Translation and Generation",
month = jul,
year = "2018",
address = "Melbourne, Australia",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/W18-2709",
doi = "10.18653/v1/W18-2709",
pages = "74--83"}
Europarl (http://www.statmt.org/europarl/):
@InProceedings{Koehn:2005:MTS,
url = {http://mt-archive.info/MTS-2005-Koehn.pdf},
googlescholar = {6985235632472432229},
author = {Philipp Koehn},
title = {Europarl: A Parallel Corpus for Statistical Machine Translation},
booktitle = {Proceedings of the Tenth Machine Translation Summit (MT Summit X)},
month = {September},
year = {2005},
address = {Phuket, Thailand},
}
News Commentary (http://www.casmacat.eu/corpus/news-commentary.html)
WMT shared task (http://statmt.org/wmt17/):
@inproceedings{bojar-etal-2017-findings,
title = "Findings of the 2017 Conference on Machine Translation ({WMT}17)",
author = "Bojar, Ond{\v{r}}ej and
Chatterjee, Rajen and
Federmann, Christian and
Graham, Yvette and
Haddow, Barry and
Huang, Shujian and
Huck, Matthias and
Koehn, Philipp and
Liu, Qun and
Logacheva, Varvara and
Monz, Christof and
Negri, Matteo and
Post, Matt and
Rubino, Raphael and
Specia, Lucia and
Turchi, Marco",
booktitle = "Proceedings of the Second Conference on Machine Translation",
month = sep,
year = "2017",
address = "Copenhagen, Denmark",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/W17-4717",
doi = "10.18653/v1/W17-4717",
pages = "169--214",
}