How to use PretrainedTransformerTokenizer and PretrainedTransformerIndexer without the PretrainedTransformerEmbedder #5353
gabeorlanski asked this question in Q&A (Unanswered)
I am using the PretrainedTransformerTokenizer to tokenize the target sequences in a composed seq2seq model. I use the PretrainedTransformerEmbedder as the encoder and the AutoRegressiveDecoder for decoding. In the decoder, I would like to keep the target embeddings as basic text embeddings rather than use the PretrainedTransformerEmbedder, but I also want to use the PretrainedTransformerIndexer so that the targets get access to the full vocabulary of the pretrained model.

From my attempts so far, it seems that the PretrainedTransformerIndexer does not populate the vocabulary namespace until after the model has been created. For the plain embeddings, this means one of their dimensions is always 2, because the namespace is still empty at construction time. Is there an officially supported way to address this?
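Below is a minimal sketch (not from the original post) of the symptom described above, assuming AllenNLP 2.x, bert-base-uncased, and an illustrative target_tokens namespace: because PretrainedTransformerIndexer contributes nothing when the Vocabulary is built from instances, the namespace holds only the default padding/OOV entries, so an Embedding sized from it ends up with num_embeddings == 2.

```python
# Illustrative sketch of the issue above; the model name and the
# "target_tokens" namespace are assumptions, not the original configuration.
from allennlp.data import Instance, Vocabulary
from allennlp.data.fields import TextField
from allennlp.data.token_indexers import PretrainedTransformerIndexer
from allennlp.data.tokenizers import PretrainedTransformerTokenizer
from allennlp.modules.token_embedders import Embedding

model_name = "bert-base-uncased"
tokenizer = PretrainedTransformerTokenizer(model_name)
indexer = PretrainedTransformerIndexer(model_name, namespace="target_tokens")

tokens = tokenizer.tokenize("a target sequence")
instance = Instance({"target_tokens": TextField(tokens, {"tokens": indexer})})

# PretrainedTransformerIndexer.count_vocab_items is a no-op, so building the
# vocabulary from instances leaves this namespace with only padding/OOV.
vocab = Vocabulary.from_instances([instance])
print(vocab.get_vocab_size("target_tokens"))  # 2

# A plain Embedding sized from that namespace therefore has num_embeddings == 2,
# which is the "dimension is always 2" problem described in the question.
embedding = Embedding(embedding_dim=64, vocab_namespace="target_tokens", vocab=vocab)
print(embedding.weight.shape)  # torch.Size([2, 64])
```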
Replies: 1 comment · 4 replies

Hey @gabeorlanski, you could call …
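Whatever call the reply above goes on to suggest, one mechanism that exists in AllenNLP 2.x for pre-populating a namespace with a transformer's vocabulary before the model is constructed is Vocabulary.from_pretrained_transformer. The sketch below is a hedged illustration only, not the replier's answer, and the namespace name is again an assumption; the config-file equivalent is the vocabulary type registered as "from_pretrained_transformer".

```python
# Hedged illustration, not necessarily what the reply suggests: build a
# Vocabulary whose (assumed) "target_tokens" namespace already contains the
# pretrained model's wordpiece vocabulary, before any model is constructed.
from allennlp.data import Vocabulary

vocab = Vocabulary.from_pretrained_transformer(
    "bert-base-uncased", namespace="target_tokens"
)
print(vocab.get_vocab_size("target_tokens"))  # 30522 for bert-base-uncased
```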