
'Can't convert ['test.txt'] to Trainer' when training a BertWordPieceTokenizer #30

suchunxie opened this issue Apr 6, 2022 · 2 comments


@suchunxie

Hi Team,

I'm trying to train a Japanese BERT with my own data, based on your code and without modifying the structure.
However, every time I pass the training data path to train a tokenizer, something goes wrong:
the error is "Can't convert ['test.txt'] to Trainer".

Here is what I tried:

  1. Passing a single filename, or the content of a single file (in the same folder as train_tokenizers.py): the error appears.
  2. Passing a list of filenames like
    ['data_file/0.txt', 'data_file/1.txt', 'data_file/2.txt', 'data_file/3.txt', 'data_file/4.txt']
    or a list of single sentences: the same error occurs.

Can you give any advice on this situation? Thanks a lot.
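
For reference, here is a reduced version of the kind of call I expected to work. This is only a sketch assuming the Hugging Face tokenizers library's BertWordPieceTokenizer; the file paths, constructor options, and vocab_size are placeholders on my side, not your actual settings:

```python
import os
from tokenizers import BertWordPieceTokenizer

# Placeholder corpus files (plain text, one sentence per line).
files = ["data_file/0.txt", "data_file/1.txt", "data_file/2.txt"]

tokenizer = BertWordPieceTokenizer(
    handle_chinese_chars=False,  # assumption: don't pre-split every CJK character
    strip_accents=False,
    lowercase=False,
)

# train() should accept either a single path or a list of paths.
tokenizer.train(files=files, vocab_size=32000, min_frequency=2)

os.makedirs("tokenizer_out", exist_ok=True)
tokenizer.save_model("tokenizer_out")  # writes vocab.txt into tokenizer_out/
```

If I read the message correctly, "Can't convert [...] to Trainer" looks like the file list ending up in the argument position where a trainer object is expected, which I guess could be a version mismatch in the tokenizers library, but I'm not sure.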

@singletongue
Collaborator

Hi, @suchunxie.

It seems that you opened an issue in Hugging Face's tokenizers repository and have successfully resolved it. If you still have problems, please let us know.

Thank you!

@suchunxie
Author

suchunxie commented Apr 12, 2022

Hi @singletongue,

Thanks for your reply.
Yes, I used another tokenizer training method, so the input_file parameter works (though I'm not sure it works properly).
The method based on yours also worked after some effort, but I can only get it to accept one filename; if I pass a list of filenames, the error occurs again. Is there any way to pass several corpus files to the input_file parameter?
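
To make the question concrete, the workaround I have in mind is simply merging the corpus files into one file and passing that single path. A rough sketch with placeholder paths (data_file/ and merged_corpus.txt are my own names):

```python
from pathlib import Path

# Concatenate every corpus file into one file, so that a parameter which only
# accepts a single path can still see all of the data.
corpus_files = sorted(Path("data_file").glob("*.txt"))
with open("merged_corpus.txt", "w", encoding="utf-8") as merged:
    for path in corpus_files:
        merged.write(path.read_text(encoding="utf-8").rstrip("\n") + "\n")
```

I'm not sure whether this is the intended way, or whether input_file is supposed to take a list directly.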

Besides, I also have some questions about the settings of your model. I'm quite a newbie, so it would be highly appreciated and a great help if you could kindly give some advice.

  • Should the data be normalized before training the tokenizer? I tested it both with and without normalization, and found some differences in the resulting vocab file: about 150 words changed, including some of the 26 basic English letters; for example, 'j' and 'z' were missing in the normalized version. Can I add them back manually? (See the sketch after this list.)
  • How should I determine the vocab_size, and what are the tokens used for? If I change the vocab_size, do I need to change other training files when training the model, or would it be better to keep the vocab_size the same as yours?
  • About the training data for the BERT model: I currently have only 600 MB of data. Will this be anywhere near enough to train a practical BERT?
    How much data do you think is needed to ensure a workable BERT?
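
To illustrate the first question, this is the kind of workaround I was imagining, assuming the tokenizers library; the vocab_size, file path, and alphabet here are only example values, not your actual settings:

```python
import string
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(lowercase=False, strip_accents=False)

# Force the 26 basic Latin letters into the vocabulary even if normalization
# makes some of them (e.g. 'j', 'z') too rare to be learned from the corpus.
tokenizer.train(
    files=["data_file/0.txt"],          # placeholder corpus file
    vocab_size=32000,                   # example value only
    initial_alphabet=list(string.ascii_lowercase),
)
```

Would something like initial_alphabet be the right way to keep those characters, or is it better to just edit vocab.txt by hand after training?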

Yours sincerely.
