-
An alternative solution to filling with zero or -inf could be to duplicate the vector from the corresponding output row of […]. The consequence would be that the output logits for […]
-
Hi @danbev, thanks for the very detailed notes.
Regarding your problem with n_vocab, what I observed is that the tensor shapes are:
```
language_model.model.embed_tokens.weight  [128264, 4096]
language_model.lm_head.weight             [128256, 4096]
```
So, as you found out, the output tensor has 8 fewer tokens than the embedding tensor.
However, instead of modifying llama.cpp internals to handle this exception, I propose that when converting safetensors to GGUF we simply extend lm_head with the missing tokens. We can set these parameters to 0 (or -inf? I'm not sure yet) so that their logits are always small: