The reason why the max_length of KLUE RoBERTa is 510

Issue description

Unlike the original (English) `roberta`, the max_length of our `klue/roberta` is 510, not 512.

Why did it happen?

1. According to the `roberta` pretraining guideline of `fairseq`, the suggested value of `max_positions` is 512.

TOTAL_UPDATES=125000    # Total number of training steps
WARMUP_UPDATES=10000    # Warmup the learning rate over this many updates
PEAK_LR=0.0005          # Peak learning rate, adjust as needed
TOKENS_PER_SAMPLE=512   # Max sequence length
MAX_POSITIONS=512       # Num. positional embeddings (usually same as above)
MAX_SENTENCES=16        # Number of sequences per batch (batch size)
UPDATE_FREQ=16          # Increase the batch size 16x

2. The default `padding_idx` is 1, unless the vocab defines it otherwise

class Dictionary:
    """A mapping from symbols to consecutive integers"""

    def __init__(
        self,
        *,  # begin keyword-only arguments
        bos="<s>",
        pad="<pad>",
        eos="</s>",
        unk="<unk>",
        extra_special_symbols=None,
    ):
        self.bos_word, self.unk_word, self.pad_word, self.eos_word = bos, unk, pad, eos
        self.symbols = []
        self.count = []
        self.indices = {}
        self.bos_index = self.add_symbol(bos)  # idx=0
        self.pad_index = self.add_symbol(pad)  # idx=1
        self.eos_index = self.add_symbol(eos)
        self.unk_index = self.add_symbol(unk)
        if extra_special_symbols:
            for s in extra_special_symbols:
                self.add_symbol(s)
        self.nspecial = len(self.symbols)

3. Based on the `fairseq` implementation, position embedding ids start from `padding_idx` + 1

def make_positions(tensor, padding_idx, onnx_trace=False):
    """Replace non-padding symbols with their position numbers.
    Position numbers begin at padding_idx+1. Padding symbols are ignored.
    """
    # The series of casts and type-conversions here are carefully
    # balanced to both work with ONNX export and XLA. In particular XLA
    # prefers ints, cumsum defaults to output longs, and ONNX doesn't know
    # how to handle the dtype kwarg in cumsum.
    mask = tensor.ne(padding_idx).int()
    return (
        torch.cumsum(mask, dim=1).type_as(mask) * mask
    ).long() + padding_idx
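The effect of `make_positions` is easier to see on a single sequence. The following is a pure-Python sketch of the same arithmetic, not the `fairseq` code itself; `make_positions_list` is a hypothetical helper and the token ids are made up.

```python
def make_positions_list(token_ids, padding_idx):
    """Pure-Python analogue of make_positions for one sequence."""
    positions = []
    seen = 0  # running count of non-padding tokens (the cumsum over the mask)
    for tok in token_ids:
        if tok == padding_idx:
            # Padding tokens keep padding_idx as their position id.
            positions.append(padding_idx)
        else:
            seen += 1
            positions.append(seen + padding_idx)
    return positions


# Four real tokens followed by two <pad> tokens (padding_idx = 1).
example = make_positions_list([0, 15, 27, 2, 1, 1], padding_idx=1)
print(example)  # [2, 3, 4, 5, 1, 1]
```

The first real token gets position id 2, so ids 0 and 1 are never assigned to real tokens.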

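Position ids therefore start at 2 (`padding_idx` + 1). With a position embedding table of 512 entries (`max_position_embeddings` = 512), the largest valid id is 511, so only ids 2 to 511 are usable,
which implies max_length=510.
The (English) roberta has no such issue because `max_position_embeddings` was set to 514.
- roberta-base/config.json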

{ 
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,  # max_length = 512
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "type_vocab_size": 1,
  "vocab_size": 50265
}

- klue/roberta-base/config.json

{
  "architectures": ["RobertaForMaskedLM"],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 512,  # max_length = 510
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "type_vocab_size": 1,
  "vocab_size": 32000,
  "tokenizer_class": "BertTokenizer"
}
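The relationship between the two configs can be checked with a little arithmetic. This is a sketch under the assumption that position ids 0 through `pad_token_id` are never used for real tokens; `usable_max_length` is a hypothetical helper, not a `transformers` function.

```python
def usable_max_length(max_position_embeddings, pad_token_id):
    # Real tokens use ids pad_token_id + 1 .. max_position_embeddings - 1,
    # so the first pad_token_id + 1 entries of the table are unavailable.
    return max_position_embeddings - pad_token_id - 1


# Values copied from the two config.json files above.
roberta_base = {"max_position_embeddings": 514, "pad_token_id": 1}
klue_roberta_base = {"max_position_embeddings": 512, "pad_token_id": 1}

roberta_len = usable_max_length(**roberta_base)
klue_len = usable_max_length(**klue_roberta_base)
print(roberta_len, klue_len)  # 512 510
```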

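4. The `huggingface` implementation, following `fairseq`, generates position embedding ids (`position_id`) from `padding_idx` + 1 up to, but not including, `padding_idx` + `max_length` + 1

Since huggingface derives the position ids from the sequence length, the largest id for a sequence of length n is `padding_idx` + n. We therefore have to use max_seq_length=510 to keep every id inside the 512-entry embedding table and avoid an IndexError.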

def create_position_ids_from_inputs_embeds(self, inputs_embeds):
    """
    We are provided embeddings directly. We cannot infer which are padded so just generate sequential position ids.
    Args:
        inputs_embeds: torch.Tensor
    Returns: torch.Tensor
    """
    input_shape = inputs_embeds.size()[:-1]
    sequence_length = input_shape[1]

    position_ids = torch.arange(
        self.padding_idx + 1, sequence_length + self.padding_idx + 1, dtype=torch.long, device=inputs_embeds.device
    )
    return position_ids.unsqueeze(0).expand(input_shape)
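The bounds can be verified without loading a model. The sketch below reproduces the `torch.arange` range from the function above with plain Python and checks it against the size of the embedding table; `position_ids_for_length` and `fits_table` are hypothetical helpers for illustration.

```python
def position_ids_for_length(sequence_length, padding_idx=1):
    # Same bounds as torch.arange(padding_idx + 1, sequence_length + padding_idx + 1).
    return list(range(padding_idx + 1, sequence_length + padding_idx + 1))


def fits_table(sequence_length, max_position_embeddings, padding_idx=1):
    # An embedding table with N rows accepts indices 0 .. N - 1 only.
    ids = position_ids_for_length(sequence_length, padding_idx)
    return max(ids) < max_position_embeddings


ids_510 = position_ids_for_length(510)
print(ids_510[0], ids_510[-1])  # 2 511

print(fits_table(510, 512))  # True: ids 2..511 fit a 512-row table
print(fits_table(512, 512))  # False: id 513 is out of range
print(fits_table(512, 514))  # True: the roberta-base setting
```

A sequence of length 511 already fails with a 512-row table, since its last position id would be 512.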
