
'Can't convert ['test.txt'] to Trainer' when training a BertWordPieceTokenizer #30

suchunxie opened this issue Apr 6, 2022 · 2 comments


@suchunxie

Hi Team,

I'm trying to train a Japanese BERT with my own data, based on your code and without modifying the structure.
However, every time I pass the training data path to train a tokenizer, something goes wrong:
the error is "Can't convert ['test.txt'] to Trainer".

Here is what I tried:

  1. Passing a single filename, or the content of a single file (in the same folder as train_tokenizers.py): the error appears.
  2. Passing a list of filenames like
    ['data_file/0.txt', 'data_file/1.txt', 'data_file/2.txt', 'data_file/3.txt', 'data_file/4.txt']
    or a list of single sentences: the same error occurs.

Can you give any advice on this situation? Thanks a lot.
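
For reference, here is a reduced version of the kind of call I expected to work. This is only a sketch assuming the Hugging Face tokenizers library's BertWordPieceTokenizer; the file paths, constructor options, and vocab_size are placeholders on my side, not your actual settings:

```python
import os
from tokenizers import BertWordPieceTokenizer

# Placeholder corpus files (plain text, one sentence per line).
files = ["data_file/0.txt", "data_file/1.txt", "data_file/2.txt"]

tokenizer = BertWordPieceTokenizer(
    handle_chinese_chars=False,  # assumption: don't pre-split every CJK character
    strip_accents=False,
    lowercase=False,
)

# train() should accept either a single path or a list of paths.
tokenizer.train(files=files, vocab_size=32000, min_frequency=2)

os.makedirs("tokenizer_out", exist_ok=True)
tokenizer.save_model("tokenizer_out")  # writes vocab.txt into tokenizer_out/
```

If I read the message correctly, "Can't convert [...] to Trainer" looks like the file list ending up in the argument position where a trainer object is expected, which I guess could be a version mismatch in the tokenizers library, but I'm not sure.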

@singletongue
Collaborator

Hi, @suchunxie.

It seems that you opened an issue in Hugging Face's tokenizers repository and have successfully resolved it. If you still have problems, please let us know.

Thank you!

@suchunxie
Author

suchunxie commented Apr 12, 2022

Hi @singletongue,

Thanks for your reply.
Yes, I used another tokenizer training method, so the input_file parameter works (though I'm not sure it works properly).
The method based on yours also worked after some effort, but I can only get it to accept one filename; if I pass a list of filenames, the error occurs again. Is there any way to pass several corpus files to the input_file parameter?
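
To make the question concrete, the workaround I have in mind is simply merging the corpus files into one file and passing that single path. A rough sketch with placeholder paths (data_file/ and merged_corpus.txt are my own names):

```python
from pathlib import Path

# Concatenate every corpus file into one file, so that a parameter which only
# accepts a single path can still see all of the data.
corpus_files = sorted(Path("data_file").glob("*.txt"))
with open("merged_corpus.txt", "w", encoding="utf-8") as merged:
    for path in corpus_files:
        merged.write(path.read_text(encoding="utf-8").rstrip("\n") + "\n")
```

I'm not sure whether this is the intended way, or whether input_file is supposed to take a list directly.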

Besides, I also have some questions about the settings of your model. I'm quite a newbie, so it would be highly appreciated and a great help if you could kindly give some advice.

  • Should the data be normalized before training the tokenizer? I tested it both with and without normalization, and found some differences in the resulting vocab file: about 150 words changed, including some of the 26 basic English letters; for example, 'j' and 'z' were missing in the normalized version. Can I add them back manually? (See the sketch after this list.)
  • How should I determine the vocab_size, and what are the tokens used for? If I change the vocab_size, do I need to change other training files when training the model, or would it be better to keep the vocab_size the same as yours?
  • About the training data for the BERT model: I currently have only 600 MB of data. Will this be anywhere near enough to train a practical BERT?
    How much data do you think is needed to ensure a workable BERT?
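
To illustrate the first question, this is the kind of workaround I was imagining, assuming the tokenizers library; the vocab_size, file path, and alphabet here are only example values, not your actual settings:

```python
import string
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(lowercase=False, strip_accents=False)

# Force the 26 basic Latin letters into the vocabulary even if normalization
# makes some of them (e.g. 'j', 'z') too rare to be learned from the corpus.
tokenizer.train(
    files=["data_file/0.txt"],          # placeholder corpus file
    vocab_size=32000,                   # example value only
    initial_alphabet=list(string.ascii_lowercase),
)
```

Would something like initial_alphabet be the right way to keep those characters, or is it better to just edit vocab.txt by hand after training?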

Yours sincerely.
