I'm trying to train a Japanese BERT with my own data based on yours, and I didn't modify the model structure.
But when I pass the training data path to train a tokenizer, it fails every time;
the error is "Can't convert ['test.txt'] to Trainer".
Here's what I tried:
passing a single filename, or the content of a single file (in the same folder as the train_tokenizers.py file): the error appears;
passing a list of filenames like
['data_file/0.txt', 'data_file/1.txt', 'data_file/2.txt', 'data_file/3.txt', 'data_file/4.txt']
or a list of single sentences: the same error also occurs.
Can you give any advice on this situation? Thanks a lot.
It seems that you opened an issue in the Hugging Face tokenizers repository and have successfully resolved it there. If you still have problems, please let us know.
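For reference (this is a guess based on the error message, not the repo's actual train_tokenizers.py): "Can't convert ['test.txt'] to Trainer" usually means the list of file paths landed in the argument slot where the tokenizers library expects a Trainer object; the argument order of Tokenizer.train has changed between library versions. A minimal sketch against a recent tokenizers release:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Build a bare WordPiece tokenizer; the repo's script likely wires up its own
# normalizer and pre-tokenizer, so this is only illustrative.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=32000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Pass the file list and the trainer explicitly so neither ends up in the
# wrong argument slot.
tokenizer.train(["test.txt"], trainer=trainer)
```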
Thanks for your reply.
Yes, I used another tokenizer training method, so the input_file parameter works now (though I'm not sure it works properly).
The method based on yours also worked after some effort, but I can only get it to accept one file name; if I pass a list of filenames, the error occurs again. Is there any way to pass several corpus files to the input_file parameter?
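For what it's worth, the high-level BertWordPieceTokenizer in the tokenizers library accepts either a single path or a list of paths in its train call; a rough sketch (not the repo's exact script, and the file names are placeholders):

```python
from tokenizers import BertWordPieceTokenizer

# Sketch only: train on several corpus files at once by passing a list of paths.
tokenizer = BertWordPieceTokenizer()
tokenizer.train(
    files=["data_file/0.txt", "data_file/1.txt", "data_file/2.txt"],
    vocab_size=32000,
    min_frequency=2,
)
tokenizer.save_model("tokenizer_out")  # writes vocab.txt into this directory
```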
Besides, I have some questions about the settings of your model. I'm quite a newbie, so it would be highly appreciated and a great help if you could kindly give some advice.
Should the data be normalized before training the tokenizer? I tested both with and without normalization and found some differences in the resulting vocab file: about 150 tokens changed, including some of the 26 basic English letters; for example, 'j' and 'z' were missing in the normalized version. Can I add them back manually?
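One option, assuming the tokenizers library is used underneath, is the initial_alphabet argument of the training call, which guarantees that the listed characters end up in the vocabulary; a hedged sketch ("corpus.txt" is a placeholder):

```python
from tokenizers import BertWordPieceTokenizer

# Sketch: initial_alphabet forces specific characters into the vocabulary even
# if they become rare after normalization.
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=32000,
    initial_alphabet=["j", "z"],
)
```

Editing vocab.txt by hand also works, as long as the final number of lines still matches the vocab_size the model is configured with.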
How should I determine the vocab_size, and what is the totokens used for? If I change the vocab_size, do I need to change anything else in the training files for model training, or is it better to keep the vocab_size the same as yours?
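In general, whatever vocab_size the tokenizer is trained with has to be mirrored in the BERT config used for pre-training, otherwise the embedding matrix and the vocabulary disagree. A minimal sketch with the transformers library (32000 is only an example value):

```python
from transformers import BertConfig, BertForMaskedLM

# Sketch: the token embedding matrix is sized by the config's vocab_size, so it
# must match the tokenizer's vocabulary size.
config = BertConfig(vocab_size=32000)
model = BertForMaskedLM(config)
```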
About the training data for the BERT model: I currently have only 600 MB of data. Will that be anywhere near enough to train a practical BERT?
How much data do you think is needed to guarantee a workable BERT?