allow packing multiple GGUFs (like an archive).
So I'm going to create two models for Llama 3.2 Vision Instruct and then take
a look at how packaging multiple GGUFs could be done.

The current `convert_hf_to_gguf.py` script only supports a single model as
output as it stands. But long term I think there will be more models that
contain more than one model in them. I'm thinking of text-to-speech models
which can contain a voice decoder model in addition to the language model.

So a language model like Llama 3.2 Vision Instruct would be registered using
the `@Model.register` decorator:
```python
@Model.register("MllamaForConditionalGeneration")
class MLlamaModel(Model):
    model_arch = gguf.MODEL_ARCH.MLLAMA
```
Now, if a model has a vision model in addition to the language model, we might
expect there to be a command line option to specify that the vision model
should be extracted into a separate model. So perhaps there should be a
`vision_model` attribute on the `MLlamaModel` class for the vision encoder,
which would have a different `model_arch` attribute, like
`gguf.MODEL_ARCH.MLLAMA_VISION`.

### Language model layers (tensors)
```console
"language_model.lm_head.weight"
...
```

One interesting thing with this model is that it has a vocab size specified as:
```console
"vocab_size": 128256
```
But the special token `<|image|>` is at index 128256, so the actual vocab size
is 128257. We can see this by inspecting the actual vocabulary in
`convert_hf_to_gguf.py`:
```python
        tokenizer = AutoTokenizer.from_pretrained(self.dir_model, trust_remote_code=is_cli_non_interactive)
        print(f'tokenizer len: {len(tokenizer.vocab)}')
        vocab_size = self.hparams.get("vocab_size", len(tokenizer.vocab))
        assert max(tokenizer.vocab.values()) <= vocab_size
```
```console
tokenizer len: 128257
```
This causes problems as there is a tensor that depends on the vocab size being
128256:
```console
  1:  525336576 |  4096, 128256,     1,     1 | Q6_K | output.weight
```

The image token needs to be in our model's vocab, in `vocab.id_to_token` that
is, so that it is resolved correctly and the correct token id is passed to the
model. But `id_to_token` is also how the vocab size is determined by other
parts of llama.cpp. For example, in `llama_decode_impl`:
```c++
    if (n_outputs_new) {
        ...
    }
```
So as far as I can tell we need to have the additional image token in the
actual vocab list, `id_to_token` in llama.cpp. The vocabulary size is
determined by calling:
```c++
int32_t llama_vocab_n_tokens(const struct llama_vocab * vocab) {
    return vocab->n_tokens();
}

uint32_t llama_vocab::n_tokens() const {
    return (uint32_t) pimpl->id_to_token.size();
}
```
And notice that this is using the size of the `id_to_token` vector to
determine the vocab size.
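Just to make the off-by-one concrete, here is a tiny self-contained sketch
(toy stand-ins for illustration, not llama.cpp code) of what happens if the
tokens array written to the GGUF only has the 128256 entries implied by
`vocab_size`, while the tokenizer can still produce token id 128256:
```python
# Toy stand-in for the vocabulary loaded from the GGUF tokens array. If the
# conversion wrote vocab_size (128256) entries, <|image|> is not among them.
id_to_token = [f"token_{i}" for i in range(128256)]

image_token_id = 128256  # id of <|image|> in the HF tokenizer

# Valid indices are 0..len(id_to_token) - 1, so id 128256 can never be
# resolved back to a token string:
print(image_token_id < len(id_to_token))  # False
```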
Now, this vector is resized in `llama-vocab.cpp`:
```c++
    uint32_t n_tokens = gguf_get_arr_n(ctx, token_idx);
    id_to_token.resize(n_tokens);
```
```console
(gdb) p n_tokens
$1 = 128256
```

I think a way to handle this is to leave the vocab size as 128256 when
converting the model, so that `id_to_token` will have the correct size, and
then add a special token for the image token.

The special token ids themselves are set up in `llama_vocab::impl::load`:
```c++
void llama_vocab::impl::load(llama_model_loader & ml, const LLM_KV & kv) {
    ...
        { LLM_KV_TOKENIZER_MIDDLE_ID, special_fim_mid_id },
    };
```

Hmm, this will still not work: if we print out the tokens for the following
prompt we will see that it does not use the correct image token id:
```console
prompt: <|image|>What is in this image?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

token = 27
token = 91
token = 1843
token = 91
token = 29
token = 3923
token = 374
token = 304
token = 420
token = 2217
token = 30
token = 128009
token = 128006
token = 78191
token = 128007
token = 271
```
The first five ids are `<` (27), `|` (91), `image` (1843), `|` (91) and `>`
(29), so `<|image|>` is being tokenized as plain text instead of being matched
as the single special token 128256.
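To see why it splits like this, recall that special tokens are matched by
partitioning the raw text on their literal strings before ordinary BPE runs;
if `<|image|>` is not in the special-token set, it falls through to plain-text
encoding. Below is a minimal sketch of that idea (a hypothetical `tokenize`
helper with a toy one-id-per-character stand-in for BPE, not llama.cpp's
actual tokenizer):
```python
import re

def tokenize(text: str, specials: dict[str, int], encode_plain) -> list[int]:
    """Partition text on literal special-token strings, then encode the rest."""
    if not specials:
        return encode_plain(text)
    pattern = "(" + "|".join(re.escape(s) for s in specials) + ")"
    ids: list[int] = []
    for part in re.split(pattern, text):
        if part in specials:
            ids.append(specials[part])      # matched as a single special id
        elif part:
            ids.extend(encode_plain(part))  # ordinary text is encoded as usual
    return ids

def toy_bpe(s: str) -> list[int]:
    # Toy stand-in for BPE: one id per character.
    return [ord(c) for c in s]

# With <|image|> registered as a special token it comes out as a single id:
print(tokenize("<|image|>What", {"<|image|>": 128256}, toy_bpe))
# [128256, 87, 104, 97, 116]

# Without it, the same string is treated as plain text and split up, which
# is what the token dump above shows:
print(tokenize("<|image|>What", {}, toy_bpe))
# [60, 124, 105, 109, 97, 103, 101, 124, 62, 87, 104, 97, 116]
```
So whatever the final fix looks like, the `<|image|>` string needs to end up
in the set of tokens that the tokenizer treats as special.

### Tasks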