Data for "On the Impact of Various Types of Noise on Neural Machine Translation"

This file contains the data used in "On the Impact of Various Types of Noise on Neural Machine Translation" by Huda Khayrallah and Philipp Koehn.

Data Format

All data has been tokenized using the following scripts from Moses (https://github.com/moses-smt/mosesdecoder). You only need to download the repository; you do not need to install Moses to use the scripts:

moses-smt/mosesdecoder/tokenizer/normalize-punctuation.perl $lang | moses-smt/mosesdecoder/tokenizer/tokenizer.perl -a -l $lang

with $lang in {de,en}

For our baseline we use Europarl, News Commentary and the Rapid EU Press Release parallel corpus, all from the WMT 2017 shared task.

To create the data sets used in Table 9, concatenate baseline.tok.$lang with the noisy file of the desired amount: $noise_type.$amount.tok.$lang.

$noise_type is one of {misaligned_sent, misordered_words_src, misordered_words_trg, wrong_lang_fr_src, wrong_lang_fr_trg, untranslated_en_src, untranslated_de_trg, short_max2, short_max5, raw_paracrawl}.
$amount is a percentage in {05, 10, 20, 50, 100}.
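
The concatenation described above can be sketched as a small shell function. This is only an illustration, not the script used for the paper: the function name build_noisy_set, the optional data directory argument, and the output name train.$noise_type.$amount.tok.$lang are assumptions. Both language sides are concatenated in the same order (baseline first, then noise) so that line i of the de file still corresponds to line i of the en file.

```shell
# Hypothetical helper: concatenate the baseline with one noisy file,
# once per language, keeping the two sides line-aligned.
build_noisy_set() {
    local noise_type="$1"    # e.g. misaligned_sent
    local amount="$2"        # one of 05 10 20 50 100
    local data_dir="${3:-.}" # directory holding the *.tok.* files (assumption)
    local lang
    for lang in de en; do
        # The output file name is an assumption, not part of the release.
        cat "$data_dir/baseline.tok.$lang" \
            "$data_dir/$noise_type.$amount.tok.$lang" \
            > "$data_dir/train.$noise_type.$amount.tok.$lang" || return 1
    done
}
```

For example, build_noisy_set misaligned_sent 20 would write train.misaligned_sent.20.tok.de and train.misaligned_sent.20.tok.en in the current directory.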

Citations

If you use this data, please cite our paper and the original data sources as follows:

@inproceedings{khayrallah-koehn-2018-impact,
    title = "On the Impact of Various Types of Noise on Neural Machine Translation",
    author = "Khayrallah, Huda  and Koehn, Philipp",
    booktitle = "Proceedings of the 2nd Workshop on Neural Machine Translation and Generation",
    month = jul,
    year = "2018",
    address = "Melbourne, Australia",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W18-2709",
    doi = "10.18653/v1/W18-2709",
    pages = "74--83"}

Europarl (http://www.statmt.org/europarl/):

@InProceedings{Koehn:2005:MTS,
  url = {http://mt-archive.info/MTS-2005-Koehn.pdf},
  googlescholar = {6985235632472432229},
  author    = {Philipp Koehn},
  title     = {Europarl: A Parallel Corpus for Statistical Machine Translation},
  booktitle = {Proceedings of the Tenth Machine Translation Summit (MT Summit X)},
  month     = {September},
  year      = {2005},
  address   = {Phuket, Thailand},
}

News Commentary (http://www.casmacat.eu/corpus/news-commentary.html)

WMT shared task (http://statmt.org/wmt17/):

@inproceedings{bojar-etal-2017-findings,
    title = "Findings of the 2017 Conference on Machine Translation ({WMT}17)",
    author = "Bojar, Ond{\v{r}}ej  and
      Chatterjee, Rajen  and
      Federmann, Christian  and
      Graham, Yvette  and
      Haddow, Barry  and
      Huang, Shujian  and
      Huck, Matthias  and
      Koehn, Philipp  and
      Liu, Qun  and
      Logacheva, Varvara  and
      Monz, Christof  and
      Negri, Matteo  and
      Post, Matt  and
      Rubino, Raphael  and
      Specia, Lucia  and
      Turchi, Marco",
    booktitle = "Proceedings of the Second Conference on Machine Translation",
    month = sep,
    year = "2017",
    address = "Copenhagen, Denmark",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W17-4717",
    doi = "10.18653/v1/W17-4717",
    pages = "169--214",
}