Translation Engine Evaluation Scripts

Machine provides tools for building a machine translation engine from a parallel corpus. The resulting engine can be used to generate predictive translation suggestions. The scripts in this folder can be used to automate the evaluation of the translation engine for a USX-based parallel corpus. The scripts are designed to run in a Linux environment. The evaluation process is divided into three steps: setup, training, and testing. The scripts were used to generate the data for the CAT 2017 Preliminary Engine Evaluation paper.

Setup

All training data and results are contained in a single root data directory. The directory contains multiple subdirectories:

corpora
engines
output

The corpora directory contains a subdirectory for each available corpus. Each corpus folder should contain USX files with the standard Scripture book number and id, e.g. 042LUK.usx. This is the only directory that needs to be created and populated prior to performing training and testing.

The engines directory contains the trained translation engines generated during the training step. Each subdirectory corresponds to a trained parallel corpus.

The output directory contains the results from the training and testing steps.
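As a minimal sketch of preparing the root data directory, the shell function below creates one corpora subdirectory per corpus. The function name init_corpora and the corpus names in the usage line are illustrative assumptions and are not part of the scripts. It deliberately does not create engines or output, since corpora is the only directory that has to exist beforehand.

```shell
# Hypothetical helper (not part of this repo): create the corpora layout
# under a root data directory, one subdirectory per corpus name.
init_corpora() {
    local root="$1" corpus
    shift
    for corpus in "$@"; do
        mkdir -p "$root/corpora/$corpus"
    done
}
```

For example, `init_corpora ../data spanish english` would create ../data/corpora/spanish and ../data/corpora/english, after which the USX files (such as 042LUK.usx) can be copied into each corpus folder.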

The scripts require the .NET Core 2.1 SDK and GNU Parallel. Instructions for downloading and installing the .NET Core SDK are available on the official .NET website. The latest version of GNU Parallel is needed and can be downloaded from the GNU Parallel website. There is no official package for Ubuntu, so the source must be downloaded, built, and installed manually. After downloading the package, extract the source into a directory and run the following commands from that directory:

./configure
make
sudo make install

Lastly, the scripts must be executed from this repo, so the repo should be cloned locally.
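Before running the scripts, it can help to confirm that both prerequisites are on the PATH. The following is a small sketch, not part of the repo: missing_tools is a hypothetical helper name, and it assumes the .NET Core SDK and GNU Parallel are invoked as dotnet and parallel, which are their usual command names.

```shell
# Hypothetical helper: print the name of each listed command
# that cannot be found on PATH. Prints nothing if all are present.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

Running `missing_tools dotnet parallel` prints nothing when both tools are installed, and otherwise prints the name of each missing tool on its own line.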

Training

The translation engine must be built from a parallel corpus using the train.sh script. The script trains multiple engines with increasing amounts of data, and the engines are generated in parallel. A source and a target corpus must be specified, and both corpora must already exist in the corpora directory of the root data directory. Some data should be held out of the training set so that it can be used during testing; this is done by specifying a book to exclude from training. To perform training, execute the following command from the scripts directory of this repo:

./train.sh -s <source_corpus> -t <target_corpus> -e <book_id_to_exclude> -i <size_interval> -r <root_data_dir>

Here is an example:

./train.sh -s spanish -t english -e JHN -i 300 -r ../data

The generated engines are placed in the engines/<source_corpus>-<target_corpus>/<training_size>/ directory of the root data directory, e.g. engines/spanish-english/300/. The output statistics are placed in the output/<source_corpus>-<target_corpus>/<training_size>/train/ directory, e.g. output/spanish-english/300/train/. Training generates the following statistics for each engine:

Number of segments trained
Language model perplexity
BLEU calculated during tuning
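Because each engine lives in its own <training_size> directory, the trained sizes for a corpus pair can be read back from the directory layout. The sketch below illustrates this under two assumptions: that the size directories have numeric names (as in the 300 example above), and that a helper like list_training_sizes, which is hypothetical and not part of the scripts, is acceptable for inspection.

```shell
# Hypothetical helper: print the training sizes available for a corpus pair
# in ascending numeric order, based on engines/<source>-<target>/<size>/.
list_training_sizes() {
    local root="$1" source="$2" target="$3" dir
    for dir in "$root/engines/$source-$target"/*/; do
        # Skip the unexpanded glob pattern when no engines exist yet.
        [ -d "$dir" ] || continue
        basename "$dir"
    done | sort -n
}
```

For example, `list_training_sizes ../data spanish english` lists one size per line, and prints nothing if no engines have been trained for that pair.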

Testing

Once the engines are generated from the training data, they can be tested using the test.sh script. The test.sh script simulates a translator who is trying to produce the target verses in the test dataset. As with the training step, a source and a target corpus must be specified. The testing data must also be specified by giving the ID of the book to include. To perform testing, execute the following command from the scripts directory of this repo:

./test.sh -s <source_corpus> -t <target_corpus> -i <book_id_to_include> -c <confidence> -r <root_data_dir>

Here is an example:

./test.sh -s spanish -t english -i JHN -c 0.2,0.4 -r ../data

The output statistics are placed in the output/<source_corpus>-<target_corpus>/<training_size>/test/ directory, e.g. output/spanish-english/300/test/. Testing generates the following statistics for each engine:

Number of segments translated
Number of suggestions
Number of accepted suggestions
Percentage of each accepted suggestion type (complete, initial, final, medial)
Keystroke and mouse-action ratio (KSMR)
Suggestion precision

For more information on these statistics, see the CAT 2017 Preliminary Engine Evaluation paper.
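Putting the training and testing steps together, the sketch below chains the two commands so that the book excluded from training is the same book used for testing. The function run_evaluation and the DRY_RUN variable are hypothetical conveniences, not part of the scripts; only the train.sh and test.sh flags documented above are used. It assumes it is called from the scripts directory of this repo.

```shell
# Hypothetical wrapper: run training and then testing for one corpus pair,
# holding out the same book in both steps. Testing runs only if training
# succeeds. Set DRY_RUN=1 to print the commands instead of executing them.
run_evaluation() {
    local source="$1" target="$2" book="$3" interval="$4" confidence="$5" root="$6"
    local run=""
    [ "${DRY_RUN:-0}" = "1" ] && run="echo"
    $run ./train.sh -s "$source" -t "$target" -e "$book" -i "$interval" -r "$root" &&
    $run ./test.sh -s "$source" -t "$target" -i "$book" -c "$confidence" -r "$root"
}
```

For example, `run_evaluation spanish english JHN 300 0.2,0.4 ../data` corresponds to the two example commands shown in the Training and Testing sections.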
