strange tokenizer results with self-pretrained model #32
Hi, @lightercs. Would you please show me the content of your tokenizer configuration? Thank you.
@singletongue, thank you for your reply! Which parameter should I change when training?
OK then, what about initializing the tokenizer with the following?

```python
tokenizer = BertJapaneseTokenizer.from_pretrained(
    model_name_or_path,
    do_lower_case=False,
    word_tokenizer_type="mecab",
    subword_tokenizer_type="wordpiece",
    mecab_kwargs={"mecab_option": "-d /content/drive/MyDrive/UniDic"}
)
```

You may have to modify some of the values for your configuration. Thank you.
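For what it's worth, a quick way to sanity-check that a custom dictionary is actually being used is to load the tokenizer with these kwargs and look at the segmentation of a sample sentence. This is only a sketch: the checkpoint directory, the Drive path to UniDic, and the sample sentence are assumptions, not values confirmed in this thread.

```python
from transformers import BertJapaneseTokenizer

# Hypothetical local checkpoint directory; substitute your own path.
model_name_or_path = "/content/drive/MyDrive/bert-japanese-model"

tokenizer = BertJapaneseTokenizer.from_pretrained(
    model_name_or_path,
    do_lower_case=False,
    word_tokenizer_type="mecab",
    subword_tokenizer_type="wordpiece",
    mecab_kwargs={"mecab_option": "-d /content/drive/MyDrive/UniDic"},
)

# If the custom dictionary is picked up, the word-level segmentation should
# follow UniDic's boundaries; compare this output with the default setting.
print(tokenizer.tokenize("日本語のトークナイザの動作を確認します。"))
```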
@singletongue, thank you for the suggestion! Local self-trained model, with the local model's tokenizer:

Local self-trained model, with tohoku/bert-base-v2:

Another interesting thing is that I can understand these two tokenizers can be exchanged to some degree, because their vocab sizes are the same, but when I try this pattern:
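For reference, a rough sketch like the following can show whether the two vocabularies actually contain the same tokens, not just the same number of them. The local path is a placeholder, and it assumes the Tohoku model referred to is cl-tohoku/bert-base-japanese-v2 on the Hub.

```python
from transformers import BertJapaneseTokenizer

# Placeholder paths/names: a local checkpoint and the Tohoku v2 model.
local_tok = BertJapaneseTokenizer.from_pretrained("/content/drive/MyDrive/bert-japanese-model")
tohoku_tok = BertJapaneseTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-v2")

local_vocab = set(local_tok.get_vocab())
tohoku_vocab = set(tohoku_tok.get_vocab())

print("vocab sizes:", len(local_vocab), len(tohoku_vocab))
print("shared tokens:", len(local_vocab & tohoku_vocab))
# Equal sizes with little token overlap would mean the same token ids point at
# different strings, which is enough to break predictions when swapping.
```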
Could you show me the full command you executed when you trained the tokenizer? Thank you.
Hi @singletongue, sorry for missing your update and responding late.
Thank you greatly!
Thank you for the information. I understand that, when training the tokenizer, you did not specify the MeCab dictionary path. Then, what about initializing the tokenizer as follows?

```python
tokenizer = BertJapaneseTokenizer.from_pretrained(
    model_name_or_path,
    do_lower_case=False,
    word_tokenizer_type="mecab",
    subword_tokenizer_type="wordpiece",
    mecab_kwargs={"mecab_dic": "unidic"},
)
```
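For reference, `mecab_kwargs={"mecab_dic": ...}` resolves the dictionary from an installed Python package (`ipadic`, `unidic_lite`, or `unidic`), whereas the earlier `mecab_option="-d <path>"` form points MeCab at an arbitrary dictionary directory. A rough sketch of checking whether the choice of dictionary changes the word-level segmentation; the model path and sample sentence are placeholders, not values from this thread:

```python
from transformers import BertJapaneseTokenizer

model_dir = "/content/drive/MyDrive/bert-japanese-model"  # placeholder path

# Dictionary resolved from the installed `unidic` Python package.
tok_unidic = BertJapaneseTokenizer.from_pretrained(
    model_dir,
    word_tokenizer_type="mecab",
    subword_tokenizer_type="wordpiece",
    mecab_kwargs={"mecab_dic": "unidic"},
)

# Default ipadic dictionary, for comparison.
tok_ipadic = BertJapaneseTokenizer.from_pretrained(
    model_dir,
    word_tokenizer_type="mecab",
    subword_tokenizer_type="wordpiece",
    mecab_kwargs={"mecab_dic": "ipadic"},
)

sentence = "日本語のトークナイザの動作を確認します。"
print(tok_unidic.tokenize(sentence))
print(tok_ipadic.tokenize(sentence))
```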
Hi @singletongue, thank you for your reply. I tried to change the `mecab_dic` setting as suggested. Is it because I changed the MeCab dictionary when training? If so, how should I train a new vocab with a new dictionary, and use it afterwards? Thanks in advance.
Yes, it seems that the inconsistent configuration of the tokenizers between your modified version and our (huggingface's) one is causing the problem. Could you show me the full traceback you get when the error occurs?
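One way to avoid that kind of mismatch (a sketch under assumed paths, not the procedure used in this repository) is to save the fully configured tokenizer once with `save_pretrained`, so that `word_tokenizer_type`, `subword_tokenizer_type`, and `mecab_kwargs` are recorded in `tokenizer_config.json` and reloaded identically at inference time:

```python
from transformers import BertJapaneseTokenizer

model_dir = "/content/drive/MyDrive/bert-japanese-model"  # placeholder path

# Configure the tokenizer once, exactly as it should behave at inference time...
tokenizer = BertJapaneseTokenizer.from_pretrained(
    model_dir,
    do_lower_case=False,
    word_tokenizer_type="mecab",
    subword_tokenizer_type="wordpiece",
    mecab_kwargs={"mecab_dic": "unidic"},
)
# ...and save it so these settings end up in tokenizer_config.json.
tokenizer.save_pretrained(model_dir)

# Later, loading from the same directory restores the identical configuration
# without re-specifying any keyword arguments.
tokenizer = BertJapaneseTokenizer.from_pretrained(model_dir)
```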
@singletongue, thank you greatly for your reply! The traceback is as follows:

Should I change the tokenizers and transformers library versions to this?

Thank you for your time.
Thank you for the information, @lightercs.
Hi @singletongue (Masatoshi Suzuki),

I tried to change the ```mecab_dic``` options in Colab's ```transformers/tokenization_utils_base.py``` script since your last advice, but it seems transformers still cannot accept the new ```mecab_dic``` option I added.

Did I change the wrong place? If so, what is the right way to specify a new ```mecab_dic``` when training tokenizers, and to use it after the BERT model is trained? (By the way, I tried to contact you by email a few days ago, but your Tohoku University email seems to be no longer available.)
Hi, I trained a new vocab and BERT model on my own datasets following your scripts, with the MeCab dictionary changed,
but when I examine it, quite strange results are returned every time. Would you please help me check on this and give me some advice?
Details as below:
My code:
The result:
The tokenize result is firstly quite odd, as below, followed by the prediction results.
But when I change to your pre-trained tokenizer bert-base-v2 (still using my model), the results change a lot.
My local bert folder looks like this:
![image](https://user-images.githubusercontent.com/107525324/187857391-d64267d4-5f3f-4b9c-b9dc-b63e47f3fc16.png)
Thank you in advance.
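The screenshots of the code and outputs were not preserved here; the following is only a generic sketch of the kind of check described above (tokenize a sentence, then run masked-word prediction with the same tokenizer/model pair). The path and sentences are placeholders, not the ones from the original report.

```python
from transformers import BertJapaneseTokenizer, BertForMaskedLM, pipeline

model_dir = "/content/drive/MyDrive/bert-japanese-model"  # placeholder path

tokenizer = BertJapaneseTokenizer.from_pretrained(model_dir)
model = BertForMaskedLM.from_pretrained(model_dir)

# First, inspect the raw tokenization.
text = "東北大学で自然言語処理を研究しています。"
print(tokenizer.tokenize(text))

# Then check masked-word prediction with the same tokenizer/model pair.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("東北大学で[MASK]を研究しています。"))
```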