Issue with Tokenizer Vocabulary Size #13

pzhang84 · 2023-06-19T22:39:33Z

Hello,

I've been using your model and tokenizer, specifically "lightonai/RITA_s", and I'm running into an issue with the tokenizer's vocabulary size. When I load the tokenizer and read `tokenizer.vocab_size`, it returns 1. However, `tokenizer.get_vocab()` returns a full vocabulary of 26 distinct tokens.

Here is my code:
```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("lightonai/RITA_s")
tokenizer = AutoTokenizer.from_pretrained("lightonai/RITA_s")

print(f"Tokenizer's vocab size is {tokenizer.vocab_size}.")
print(tokenizer.get_vocab())
```

The output for the vocabulary size is 1, but the get_vocab() call returns a full dictionary of tokens. This inconsistency is causing issues for me when I'm trying to fine-tune the model on my dataset.
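To make the mismatch easier to inspect, here is a minimal sketch that puts the different counts side by side. In the `transformers` library, `vocab_size` is documented as the size of the base vocabulary without added tokens, while `len(tokenizer)` and `get_vocab()` include added tokens. Whether that is what happens for this checkpoint is an assumption, not something confirmed here. `summarize_vocab` and `StubTokenizer` are hypothetical names introduced only for illustration. The stub mimics the numbers reported above so the snippet runs without downloading anything.

```python
def summarize_vocab(tokenizer):
    """Collect the vocabulary counts a tokenizer-like object exposes."""
    vocab = tokenizer.get_vocab()  # token -> id mapping, added tokens included
    return {
        "vocab_size": tokenizer.vocab_size,  # base vocabulary only
        "len_get_vocab": len(vocab),         # every token the tokenizer knows
        "len_tokenizer": len(tokenizer),     # base + added tokens
        # Largest id + 1: the minimum embedding size the model would need.
        "max_id_plus_one": max(vocab.values()) + 1 if vocab else 0,
    }


class StubTokenizer:
    """Hypothetical stand-in that mimics the counts reported in this issue."""

    def __init__(self):
        # Assumption: one base token, everything else registered as added tokens.
        self._base = {"<unk>": 0}
        self._added = {chr(ord("A") + i): i + 1 for i in range(25)}

    @property
    def vocab_size(self):
        return len(self._base)

    def get_vocab(self):
        return {**self._base, **self._added}

    def __len__(self):
        return len(self._base) + len(self._added)


report = summarize_vocab(StubTokenizer())
print(report)
```

Passing the real tokenizer from the snippet above to `summarize_vocab` would show whether `len(tokenizer)` agrees with the 26 entries from `get_vocab()`. If it does, `len(tokenizer)` is probably the safer number to use when sizing embeddings for fine-tuning, for example with `model.resize_token_embeddings(len(tokenizer))`.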

Could you please clarify whether this is intended behavior or a possible bug? If it is intended, could you provide some guidance on how to correctly fine-tune the model on a new dataset given this tokenizer behavior?

Thank you for your time and your contributions to the community.

Best regards,
Pengfei
