data

736f123 · Oct 31, 2019

Name	Last commit message	Last commit date
nonbreaking_prefixes	tokenizer from mosesdecoder	Jul 3, 2015
README.md	merge in multi-bleu-detok.perl from theano branch	Jun 27, 2018
build_dictionary.py	build_dictionary.py: assume input files are UTF-8	Jul 26, 2019
length.py	Convert code to Python 3	Dec 13, 2018
merge.sh	scripts added	Dec 5, 2015
multi-bleu-detok.perl	sync in fix from moses scripts	Dec 12, 2018
multi-bleu.perl	make executable	Apr 22, 2019
postprocess.sh	cleanup and documentation	Apr 26, 2016
preprocess.sh	update shebang and documentation to python3	Mar 14, 2019
shuffle.py	Make corpus shuffling less memory hungry	Oct 31, 2019
strip_sgml.py	Convert code to Python 3	Dec 13, 2018
tokenizer.perl	tokenizer from mosesdecoder	Jul 3, 2015

README.md

This directory contains small scripts for data processing and evaluation. Other useful scripts and sample data are provided at https://github.com/rsennrich/wmt16-scripts

This directory contains two evaluation scripts:

multi-bleu.perl (from the Moses decoder) computes tokenized, case-sensitive BLEU scores. This script is widely used in NMT research, but we discourage its use for publication: the score depends on how the input was tokenized, and different groups tokenize differently, which biases comparisons to previous work.

usage: ./multi-bleu.perl ref_file < test_file

multi-bleu-detok.perl expects that neither the reference file nor the output file is tokenized (untokenized reference; detokenized output). It performs tokenization internally, using the tokenization routine from the NIST BLEU scorer (mteval-v13a.pl). This script can be used as a plain-text alternative to mteval-v13a.pl, giving the same results.

usage: ./multi-bleu-detok.perl ref_file < test_file
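
To illustrate why tokenized BLEU scores are hard to compare across groups, here is a minimal Python sketch. It is not the logic of either Perl script: the two tokenizers (`whitespace_tokenize`, `punct_tokenize`) and the example sentences are hypothetical, and only the clipped unigram precision that BLEU builds on is computed, not a full BLEU score.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hyp_tokens, ref_tokens, n):
    """Clipped n-gram precision of a hypothesis against a single reference."""
    hyp = ngrams(hyp_tokens, n)
    ref = ngrams(ref_tokens, n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


def whitespace_tokenize(text):
    # Hypothetical tokenizer A: split on whitespace only.
    return text.split()


def punct_tokenize(text):
    # Hypothetical tokenizer B: additionally split punctuation off words.
    return re.findall(r"\w+|[^\w\s]", text)


# Made-up example sentences that differ only in the final punctuation mark.
reference = "The cat sat on the mat."
hypothesis = "The cat sat on the mat!"

for name, tokenize in [("whitespace", whitespace_tokenize),
                       ("punctuation-split", punct_tokenize)]:
    p1 = clipped_precision(tokenize(hypothesis), tokenize(reference), 1)
    print(f"{name}: unigram precision = {p1:.3f}")
# whitespace: unigram precision = 0.833
# punctuation-split: unigram precision = 0.857
```

The same hypothesis and reference yield different precision values depending only on the tokenizer. multi-bleu-detok.perl avoids this source of variation by taking untokenized input and applying one fixed tokenization internally.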