
strange tokenizer results with self-pretrained model #32

Open
lightercs opened this issue Sep 1, 2022 · 12 comments

@lightercs

Hi, I trained a new vocab and BERT model on my own datasets following your scripts, with the MeCab dictionary changed.
But when I examine it, quite strange results are returned every time. Would you please help me check on this and give me some advice?

Details as below:
My code:

import torch
from transformers import BertJapaneseTokenizer, BertForMaskedLM

model_name_or_path = "/content/drive/MyDrive/bert/new_bert/" 
tokenizer = BertJapaneseTokenizer.from_pretrained(model_name_or_path, mecab_kwargs={"mecab_option": "-d /content/drive/MyDrive/UniDic"})

model = BertForMaskedLM.from_pretrained(model_name_or_path)
input_ids = tokenizer.encode(f"青葉山で{tokenizer.mask_token}の研究をしています。", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(input_ids[0].tolist()))

masked_index = torch.where(input_ids == tokenizer.mask_token_id)[1][0].tolist()
print(masked_index)

result = model(input_ids)
pred_ids = result[0][:, masked_index].topk(5).indices.tolist()[0]
for pred_id in pred_ids:
    output_ids = input_ids.tolist()[0]
    output_ids[masked_index] = pred_id
    print(tokenizer.decode(output_ids))

The result:

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'BertJapaneseTokenizer'.
Some weights of the model checkpoint at /content/drive/MyDrive/bert/new_bert/ were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
['[CLS]', '青葉', '##山', '##で', '[MASK]', 'の', '##研', '##究', '##を', '##して', '##いま', '##す', '。', '[SEP]']
4
[CLS] 青葉山で ヒダ の研究をしています 。 [SEP]
[CLS] 青葉山で 宿つ の研究をしています 。 [SEP]
[CLS] 青葉山で 法外 の研究をしています 。 [SEP]
[CLS] 青葉山で 頑丈 の研究をしています 。 [SEP]
[CLS] 青葉山で弱 の研究をしています 。 [SEP]

The tokenization result is firstly quite odd, as shown below, and then come the prediction results.

['[CLS]', '青葉', '##山', '##で', '[MASK]', 'の', '##研', '##究', '##を', '##して', '##いま', '##す', '。', '[SEP]']

But when I switch to your pre-trained tokenizer bert-base-v2 (still using my model), the result changes a lot.

Some weights of the model checkpoint at /content/drive/MyDrive/kindai_bert/kindai_bert/ were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
['[CLS]', '青葉', '山', 'で', '[MASK]', 'の', '研究', 'を', 'し', 'て', 'い', 'ます', '。', '[SEP]']
4
[CLS] 青葉 山 で 宮司 の 研究 を し て い ます 。 [SEP]
[CLS] 青葉 山 で 飛翔 の 研究 を し て い ます 。 [SEP]
[CLS] 青葉 山 で 旧来 の 研究 を し て い ます 。 [SEP]
[CLS] 青葉 山 で 生野 の 研究 を し て い ます 。 [SEP]
[CLS] 青葉 山 でד の 研究 を し て い ます 。 [SEP]

My local BERT folder looks like this:
[screenshot of the model folder contents]

Thank you in advance.

@singletongue
Collaborator

Hi, @lightercs.

Would you please show me the content of your config.json file?
There could be some misconfiguration in how the tokenizer is set up.
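
A quick way to check for that kind of misconfiguration (just a sketch, assuming the directory layout from your snippet) is to print the fields that decide which tokenizer class gets instantiated; the warning in your log suggests the saved config records BertTokenizer rather than BertJapaneseTokenizer:

import json
import os

model_name_or_path = "/content/drive/MyDrive/bert/new_bert/"
for fname in ("config.json", "tokenizer_config.json"):
    path = os.path.join(model_name_or_path, fname)
    if not os.path.exists(path):
        print(fname, "not found")
        continue
    with open(path) as f:
        cfg = json.load(f)
    # Fields relevant to how from_pretrained picks and configures the tokenizer.
    keys = ("tokenizer_class", "model_type", "vocab_size",
            "word_tokenizer_type", "subword_tokenizer_type", "mecab_kwargs")
    print(fname, {k: cfg[k] for k in keys if k in cfg})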

Thank you.

@lightercs
Author

@singletongue, thank you for your reply!
I remember I didn't change any parameters, I just used your ones; the config.json file is as in the picture below.
[screenshot of config.json]

Which parameter should I change when training?
(Only the datasets and the MeCab dictionary were changed; the vocab_size remains the same.)

@singletongue
Collaborator

OK then, what about initializing the tokenizer by the following line?

tokenizer = BertJapaneseTokenizer.from_pretrained(
    model_name_or_path,
    do_lower_case=False,
    word_tokenizer_type="mecab",
    subword_tokenizer_type="wordpiece",
    mecab_kwargs={"mecab_option": "-d /content/drive/MyDrive/UniDic"}
)

You may have to modify some of the values for your configuration.
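
If the tokenization then looks right, one optional follow-up (a sketch, not something you must do) is to save the tokenizer back, so that tokenizer_config.json records BertJapaneseTokenizer and the MeCab settings and later from_pretrained calls need no extra arguments:

# Quick check of the word-level segmentation, then persist the working settings.
print(tokenizer.tokenize("青葉山で研究をしています。"))
tokenizer.save_pretrained(model_name_or_path)  # writes tokenizer_config.json next to the vocab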

Thank you.

@lightercs
Author

@singletongue, thank you for the suggestion!
I applied the configuration you suggested, and the tokenizer.convert_ids_to_tokens(input_ids[0].tolist()) output now looks normal (below), but the prediction results still look quite different from those obtained with cl-tohoku/bert-base-v2.

Local self-trained model, with the local model's tokenizer:

['[CLS]', '青葉', '山', 'で', '[MASK]', 'の', '研究', 'を', 'し', 'て', 'います', '。', '[SEP]']
4
[CLS] 青葉 山 で ヒダ の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で 宿つ の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で 赤裸 の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で 石 の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で迹 の 研究 を し て います 。 [SEP] 

Local self-trained model, with tohoku/bert-base-v2 :

['[CLS]', '青葉', '山', 'で', '[MASK]', 'の', '研究', 'を', 'し', 'て', 'いま', '##す', '。', '[SEP]']
4
[CLS] 青葉 山 で 宮司 の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で 飛翔 の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で稽 の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で IBM の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で 高かっ の 研究 を し て います 。 [SEP]

Another interesting thing: I can understand that these two tokenizers are interchangeable to some degree because their vocab sizes are the same, but when I try the pattern of my local model + your bert-base-japanese (version 1, with vocab size 32000), although the vocab sizes don't match, the results below seem quite reasonable and much better than the above two (a quick size check is sketched after these outputs).

4
[CLS] 青葉 山 で ダイヤモンド の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で 粘膜 の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で蝶 の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で 師範 の 研究 を し て います 。 [SEP]
[CLS] 青葉 山 で 原動力 の 研究 を し て います 。 [SEP]
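
Here is the quick size check I mentioned (a sketch; I am assuming cl-tohoku/bert-base-japanese is the v1 model referred to above):

# Identical vocab sizes only mean the ID ranges line up; the same ID can still map
# to a completely different string in each vocabulary.
from transformers import AutoTokenizer, BertJapaneseTokenizer

tok_local = BertJapaneseTokenizer.from_pretrained("/content/drive/MyDrive/bert/new_bert/")
tok_v1 = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")
print(len(tok_local), len(tok_v1))
print(tok_local.convert_ids_to_tokens([500]), tok_v1.convert_ids_to_tokens([500]))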

What's your opinion on this? It reminds me of one more detail: when training the tokenizer, in order to give MeCab a new dictionary path, I specified the MeCab dict as `-d /content/drive/MyDrive/UniDic` and added it as a new `mecab_option = UniDic`. Is there any relation?
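
To see whether the dictionary itself changes the word boundaries, I suppose a raw MeCab comparison would tell (a rough sketch, assuming fugashi and the ipadic package are installed and the UniDic path is valid):

import fugashi
import ipadic

text = "青葉山で研究をしています。"
# Default ipadic setup vs. the custom UniDic directory; -Owakati prints space-separated tokens.
tagger_default = fugashi.GenericTagger(ipadic.MECAB_ARGS + " -Owakati")
tagger_custom = fugashi.GenericTagger('-Owakati -d "/content/drive/MyDrive/UniDic"')  # may also need -r pointing at the dictionary's mecabrc
print(tagger_default.parse(text))
print(tagger_custom.parse(text))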

Thank you.

@singletongue
Collaborator

Could you show me the full command you executed when you trained the tokenizer?
(i.e., python train_tokenizer.py <all_the_options_you_specified>)

Thank you.

@lightercs
Author

Hi @singletongue, sorry for missing your update and responding late.
I added os.environ["TOKENIZERS_PARALLELISM"] = "false" to the train_tokenizer.py script,
changed from transformers.tokenization_bert_japanese import BertJapaneseTokenizer
to from transformers import BertJapaneseTokenizer due to an import error, and below is my command:

python train_tokenizer.py --input_files=D:\Data\BERT\Data\total.txt --output_dir=D:\Data\BERT\Model\bert\Vocab\ --tokenizer_type=wordpiece --mecab_dic_type=unidic --vocab_size=32768 --limit_alphabet=6129 --num_unused_tokens=10

Thank you greatly!

@singletongue
Collaborator

Thank you for the information. I understand that, when training the tokenizer, you did not specify the MeCab path (/content/drive/MyDrive/UniDic) which you first mentioned.
Then, would you try initializing the tokenizer with the following code?

tokenizer = BertJapaneseTokenizer.from_pretrained(
    model_name_or_path,
    do_lower_case=False,
    word_tokenizer_type="mecab",
    subword_tokenizer_type="wordpiece",
    mecab_kwargs={"mecab_dic": "unidic"},
)
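
(For context, my understanding of the stock MecabTokenizer, not something specific to this thread: mecab_dic accepts only "ipadic", "unidic_lite", or "unidic", each resolved through the pip package of the same name, so "unidic" assumes you have installed the unidic package and run python -m unidic download; an arbitrary dictionary name or path is not a valid value for mecab_dic, and a custom directory has to go through mecab_option instead.)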

@lightercs
Author

Hi @singletongue, thank you for your reply.
Sorry for my poor explanation. I actually registered the new MeCab path as a new MeCab dictionary ("unidic") in the pre-tokenizers.py file and in some other places related to the MeCab options. The code is as shown in the picture.
[screenshot of the modified tokenization code]

And when I tried changing mecab_dic to unidic while examining, ValueError: Invalid mecab_dic is specified. was returned.

Is it because I changed mecab_option and mecab_dic_type in the local pre-tokenizers.py files when training the tokenizer, so that even though I change mecab_kwargs and specify the new dictionary path, the tokenizer still cannot behave the same as it did during training, since the BertJapaneseTokenizer.from_pretrained method in the transformers library on Colab is unmodified?

If so, what should I do if I want to train a new vocab with a new dictionary and then use it afterwards?

Thanks in advance.

@singletongue
Collaborator

Yes, it seems that the inconsistent tokenizer configuration between your modified version and ours (i.e., Hugging Face's) is causing the problem.

Could you show me the full traceback you get when ValueError: Invalid mecab_dic is specified is raised?
And could you tell me which version of the transformers library you're using?

@lightercs
Author

@singletongue, thank you greatly for your reply!
Then should I change the transformers-related dist-packages on Colab, like the tokenizer.py file, just as I did for local training?

The transformers library I'm using when examining is transformers==4.18.0,
and the traceback for the error is pasted below. (Since naming the new mecab_dic unidic would clash with another library, I renamed it to Unidic.)

The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'BertJapaneseTokenizer'.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-15-e6e9378d9665> in <module>
      7     word_tokenizer_type="mecab",
      8     subword_tokenizer_type="wordpiece",
----> 9     mecab_kwargs={"mecab_dic":"Unidic"}
     10 )
     11 

3 frames
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1785             use_auth_token=use_auth_token,
   1786             cache_dir=cache_dir,
-> 1787             **kwargs,
   1788         )
   1789 

/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in _from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, use_auth_token, cache_dir, *init_inputs, **kwargs)
   1913         # Instantiate tokenizer.
   1914         try:
-> 1915             tokenizer = cls(*init_inputs, **init_kwargs)
   1916         except OSError:
   1917             raise OSError(

/usr/local/lib/python3.7/dist-packages/transformers/models/bert_japanese/tokenization_bert_japanese.py in __init__(self, vocab_file, do_lower_case, do_word_tokenize, do_subword_tokenize, word_tokenizer_type, subword_tokenizer_type, never_split, unk_token, sep_token, pad_token, cls_token, mask_token, mecab_kwargs, **kwargs)
    150             elif word_tokenizer_type == "mecab":
    151                 self.word_tokenizer = MecabTokenizer(
--> 152                     do_lower_case=do_lower_case, never_split=never_split, **(mecab_kwargs or {})
    153                 )
    154             else:

/usr/local/lib/python3.7/dist-packages/transformers/models/bert_japanese/tokenization_bert_japanese.py in __init__(self, do_lower_case, never_split, normalize_text, mecab_dic, mecab_option)
    278 
    279             else:
--> 280                 raise ValueError("Invalid mecab_dic is specified.")
    281 
    282             mecabrc = os.path.join(dic_dir, "mecabrc")

ValueError: Invalid mecab_dic is specified. 

Should I change the tokenizers and transformers libraries to tokenizers==0.9.2 and transformers==3.4.0? I tried that, and the error traceback is below:

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'BertJapaneseTokenizer'.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-23-e6e9378d9665> in <module>
      7     word_tokenizer_type="mecab",
      8     subword_tokenizer_type="wordpiece",
----> 9     mecab_kwargs={"mecab_dic":"Unidic"}
     10 )
     11 

3 frames
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1785             return obj
   1786 
-> 1787         # add_type_field=True to allow dicts in the kwargs / differentiate from AddedToken serialization
   1788         tokenizer_config = convert_added_tokens(tokenizer_config, add_type_field=True)
   1789         with open(tokenizer_config_file, "w", encoding="utf-8") as f:

/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in _from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, use_auth_token, cache_dir, *init_inputs, **kwargs)
   1913         """
   1914         Find the correct padding/truncation strategy with backward compatibility
-> 1915         for old arguments (truncation_strategy and pad_to_max_length) and behaviors.
   1916         """
   1917         old_truncation_strategy = kwargs.pop("truncation_strategy", "do_not_truncate")

/usr/local/lib/python3.7/dist-packages/transformers/models/bert_japanese/tokenization_bert_japanese.py in __init__(self, vocab_file, do_lower_case, do_word_tokenize, do_subword_tokenize, word_tokenizer_type, subword_tokenizer_type, never_split, unk_token, sep_token, pad_token, cls_token, mask_token, mecab_kwargs, **kwargs)

/usr/local/lib/python3.7/dist-packages/transformers/models/bert_japanese/tokenization_bert_japanese.py in __init__(self, do_lower_case, never_split, normalize_text, mecab_dic, mecab_option)

ValueError: Invalid mecab_dic is specified.

Thank you for your time.

@singletongue
Collaborator

Thank you for the information, @lightercs.
Yes, it seems that you should use your custom tokenization files as you did when you performed training, since the transformers library does not know anything about your customization of the tokenization scripts.
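
Alternatively, if the goal is only to point MeCab at your UniDic directory, an unmodified transformers install might work with something like the following (a sketch I have not tested with your files; the mecabrc path is a guess):

from transformers import BertJapaneseTokenizer

tokenizer = BertJapaneseTokenizer.from_pretrained(
    "/content/drive/MyDrive/bert/new_bert/",
    do_lower_case=False,
    word_tokenizer_type="mecab",
    subword_tokenizer_type="wordpiece",
    mecab_kwargs={
        # mecab_dic=None keeps the stock tokenizer from prepending its own -d flag,
        # so the custom dictionary in mecab_option is the one MeCab actually uses.
        "mecab_dic": None,
        "mecab_option": '-d "/content/drive/MyDrive/UniDic" -r "/content/drive/MyDrive/UniDic/mecabrc"',
    },
)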

@lightercs
Author

lightercs commented Oct 7, 2022 via email
