
About train corpus format #10

Open
Rooders opened this issue Aug 26, 2020 · 4 comments

Comments

@Rooders

Rooders commented Aug 26, 2020

Hello~
When I use this code to train a model, what format should the source corpus, the target corpus, and the context corpus be in? Should they be tokenized and BPE-encoded? Could you send me a demo?
Thank you very much.

@Glaceon31
Collaborator

Using the same preprocessing steps as standard NMT systems is fine (e.g., tokenization and BPE for English).

You may refer to the user manual at https://github.com/THUNLP-MT/THUMT for details.
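For concreteness, below is a minimal preprocessing sketch in Python. It assumes sacremoses for tokenization and subword-nmt for BPE, which are common choices rather than scripts confirmed for this repo, and all file names (train.en, bpe.codes, ...) are placeholders.

```python
# Sketch: tokenize with sacremoses, then learn and apply BPE with subword-nmt.
# File names are placeholders; the context corpus is processed the same way.
from sacremoses import MosesTokenizer
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# 1. Tokenize the raw English side line by line.
mt = MosesTokenizer(lang="en")
with open("train.en") as fin, open("train.tok.en", "w") as fout:
    for line in fin:
        fout.write(mt.tokenize(line.strip(), return_str=True) + "\n")

# 2. Learn a BPE model (e.g., 32k merge operations) on the tokenized text.
with open("train.tok.en") as fin, open("bpe.codes", "w") as fout:
    learn_bpe(fin, fout, num_symbols=32000)

# 3. Apply the learned BPE to every file fed to the model:
#    source, target, and the document-level context corpus alike.
with open("bpe.codes") as codes:
    bpe = BPE(codes)
for name in ["train.tok.en", "context.tok.en"]:
    with open(name) as fin, open(name + ".bpe", "w") as fout:
        for line in fin:
            fout.write(bpe.process_line(line))
```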

@Rooders
Author

Rooders commented Aug 30, 2020

> Using the same preprocessing steps as standard NMT systems is fine (e.g., tokenization and BPE for English). You may refer to the user manual at https://github.com/THUNLP-MT/THUMT for details.

Thank you very much.
I have run into another issue now. When I use the THUMT code to train a standard sentence-level Transformer model, it achieves only 29 BLEU on the validation set. My training set is the LDC zh-en corpus (2M sentence pairs) and the validation set is MT06, the same as in your paper, but I haven't reached the same BLEU score (48.09 in your paper).
So I suspect my parameter settings are not the same as yours. Mine are below; if possible, could you send me your parameter settings?
[screenshot: training hyperparameter settings]

@Rooders
Author

Rooders commented Aug 30, 2020


My e-mail: [email protected]
Thank you very much.

@Glaceon31
Collaborator

> When I use the THUMT code to train a standard sentence-level Transformer model, it achieves only 29 BLEU on the validation set ... the validation set is MT06, the same as in your paper, but I haven't reached the same BLEU score (48.09 in your paper).

It seems that you are using only one reference for validation, while the NIST test sets have four references. Using more references will result in higher BLEU scores.
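To illustrate the gap, here is a small sketch comparing single-reference and four-reference scoring with sacrebleu's Python API (an assumed scorer for illustration; the paper may use multi-bleu.perl or another tool, and the strings below are toy placeholders, not MT06 data).

```python
# Sketch: single- vs. multi-reference corpus BLEU with sacrebleu.
import sacrebleu

hyps = ["the cat sat on the mat"]

# One list per reference; NIST test sets provide 4 references per sentence.
refs = [
    ["the cat sat on the mat"],
    ["a cat was sitting on the mat"],
    ["the cat is on the mat"],
    ["there is a cat on the mat"],
]

# Scoring against a single reference ...
print(sacrebleu.corpus_bleu(hyps, refs[:1]).score)
# ... versus all four references; an n-gram matching any reference counts,
# so the four-reference score is never lower.
print(sacrebleu.corpus_bleu(hyps, refs).score)
```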
