Hello,
I've been using your model and tokenizer, specifically "lightonai/RITA_s", and I'm running into an issue with the tokenizer's vocabulary size. When I load the tokenizer and call tokenizer.vocab_size, it returns a vocabulary size of 1. However, tokenizer.get_vocab() returns a full vocabulary of 26 distinct tokens.
Here is my code:
```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("lightonai/RITA_s")
tokenizer = AutoTokenizer.from_pretrained("lightonai/RITA_s")

print(f"Tokenizer's vocab size is {tokenizer.vocab_size}.")
print(tokenizer.get_vocab())
```
The reported vocabulary size is 1, yet get_vocab() returns a full dictionary of tokens. This inconsistency is causing problems when I try to fine-tune the model on my dataset.
Could you please clarify whether this is intended behavior or a bug? If it is intended, could you provide some guidance on how to correctly fine-tune the model on a new dataset given this behavior of the tokenizer?
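In the meantime, here is a minimal sketch of the workaround I am considering: I take `len(tokenizer.get_vocab())` (or `len(tokenizer)`) as the effective vocabulary size instead of `tokenizer.vocab_size`, and sanity-check it against the model's embedding size before fine-tuning. This assumes that the dictionary returned by `get_vocab()` reflects the true vocabulary and that the model config exposes `vocab_size`; please correct me if either assumption is wrong.

```python
from transformers import AutoModel, AutoTokenizer

# trust_remote_code may be needed since RITA ships custom model code on the Hub
model = AutoModel.from_pretrained("lightonai/RITA_s", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("lightonai/RITA_s")

# tokenizer.vocab_size reports 1 here, which looks wrong
print(f"tokenizer.vocab_size       : {tokenizer.vocab_size}")

# the token-to-id mapping itself contains all 26 tokens
effective_vocab_size = len(tokenizer.get_vocab())
print(f"len(tokenizer.get_vocab()) : {effective_vocab_size}")
print(f"len(tokenizer)             : {len(tokenizer)}")

# sanity check against the embedding table before fine-tuning
# (assumes the model config exposes vocab_size, as most Hugging Face configs do)
print(f"model.config.vocab_size    : {model.config.vocab_size}")
assert effective_vocab_size <= model.config.vocab_size, (
    "tokenizer vocabulary is larger than the model's embedding table"
)
```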
Thank you for your time and your contributions to the community.
Best regards,
Pengfei