
About train corpus format #10

Open
Rooders opened this issue Aug 26, 2020 · 4 comments

Comments

@Rooders

Rooders commented Aug 26, 2020

Hello~
When I use this code to train a model, what format should the source corpus, the target corpus, and the context corpus be in? Should they be tokenized and BPE-encoded? Could you send me a demo?
Thank you very much.

@Glaceon31
Collaborator

Using the same preprocessing steps as standard NMT systems is fine (e.g., tokenization and BPE for English).

You may refer to the user manual at https://github.com/THUNLP-MT/THUMT for details.
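For concreteness, below is a minimal preprocessing sketch in Python. It assumes sacremoses for tokenization and subword-nmt for BPE, which are common choices rather than scripts confirmed for this repo, and all file names (train.en, bpe.codes, ...) are placeholders.

```python
# Sketch: tokenize with sacremoses, then learn and apply BPE with subword-nmt.
# File names are placeholders; the context corpus is processed the same way.
from sacremoses import MosesTokenizer
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# 1. Tokenize the raw English side line by line.
mt = MosesTokenizer(lang="en")
with open("train.en") as fin, open("train.tok.en", "w") as fout:
    for line in fin:
        fout.write(mt.tokenize(line.strip(), return_str=True) + "\n")

# 2. Learn a BPE model (e.g., 32k merge operations) on the tokenized text.
with open("train.tok.en") as fin, open("bpe.codes", "w") as fout:
    learn_bpe(fin, fout, num_symbols=32000)

# 3. Apply the learned BPE to every file fed to the model:
#    source, target, and the document-level context corpus alike.
with open("bpe.codes") as codes:
    bpe = BPE(codes)
for name in ["train.tok.en", "context.tok.en"]:
    with open(name) as fin, open(name + ".bpe", "w") as fout:
        for line in fin:
            fout.write(bpe.process_line(line))
```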

@Rooders
Author

Rooders commented Aug 30, 2020

> Using the same preprocessing steps as standard NMT systems is fine (e.g., tokenization and BPE for English). You may refer to the user manual at https://github.com/THUNLP-MT/THUMT for details.

Thank you very much.
I have run into another issue now. When I use the THUMT code to train a standard sentence-level Transformer model, it achieves only 29 BLEU on the validation set. My training set is the LDC zh-en corpus (2M sentence pairs) and the validation set is MT06, the same as in your paper, but I haven't reached the same BLEU score (48.09 in your paper).
So I suspect my parameter settings are not the same as yours. Mine are below; if possible, could you send me your parameter settings?
[screenshot: training hyperparameter settings]

@Rooders
Author

Rooders commented Aug 30, 2020


My e-mail: [email protected]
Thank you very much.

@Glaceon31
Collaborator

> When I use the THUMT code to train a standard sentence-level Transformer model, it achieves only 29 BLEU on the validation set ... the validation set is MT06, the same as in your paper, but I haven't reached the same BLEU score (48.09 in your paper).

It seems that you are using only one reference for validation, while the NIST test sets have four references. Using more references will result in higher BLEU scores.
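To illustrate the gap, here is a small sketch comparing single-reference and four-reference scoring with sacrebleu's Python API (an assumed scorer for illustration; the paper may use multi-bleu.perl or another tool, and the strings below are toy placeholders, not MT06 data).

```python
# Sketch: single- vs. multi-reference corpus BLEU with sacrebleu.
import sacrebleu

hyps = ["the cat sat on the mat"]

# One list per reference; NIST test sets provide 4 references per sentence.
refs = [
    ["the cat sat on the mat"],
    ["a cat was sitting on the mat"],
    ["the cat is on the mat"],
    ["there is a cat on the mat"],
]

# Scoring against a single reference ...
print(sacrebleu.corpus_bleu(hyps, refs[:1]).score)
# ... versus all four references; an n-gram matching any reference counts,
# so the four-reference score is never lower.
print(sacrebleu.corpus_bleu(hyps, refs).score)
```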
