-
An alternative solution to filling with zero or -inf could be to duplicate the vector from the corresponding output row of […]. The consequence would be that the output logits for […]
-
Hi @danbev, thanks for the very detailed notes.
Regarding your problem with n_vocab, what I observed is that the tensor shapes are:
```
language_model.model.embed_tokens.weight  [128264, 4096]
language_model.lm_head.weight             [128256, 4096]
```
So, as you found out, the output tensor has 8 fewer tokens than the embedding tensor.
However, instead of modifying llama.cpp internals to handle this exception, I propose that when converting safetensors to GGUF we simply extend lm_head with the missing tokens. We can set these parameters to 0 (or -inf? I'm not sure yet) so that their logits are always small: