From 80bc203f2b22c23ae6a96d27052c46de52520fd1 Mon Sep 17 00:00:00 2001 From: Daniel Bevenius Date: Thu, 23 Jan 2025 13:21:24 +0100 Subject: [PATCH] docs: add more mllama vision notes --- notes/llama.cpp/llama-3-2-vision.md | 397 ++++++++++++++++++++++++++++ 1 file changed, 397 insertions(+) diff --git a/notes/llama.cpp/llama-3-2-vision.md b/notes/llama.cpp/llama-3-2-vision.md index aedb2db..3c2609a 100644 --- a/notes/llama.cpp/llama-3-2-vision.md +++ b/notes/llama.cpp/llama-3-2-vision.md @@ -231,6 +231,401 @@ change: This is obviously not a solution but it will allow me to test the model. I'm going to ask for input about what best way to handle this is. +### New vision api issue +So I've modified the mllama version that worked with the first new vision api +and I've verified that the pre-processing produces the same output, and I've +also added the same logging to the new version to make sure it is identical. + +This is the output from the old version: +```console +token = 27 +token = 91 +token = 1843 +token = 91 +token = 29 +token = 3923 +token = 374 +token = 304 +token = 420 +token = 2217 +token = 30 +token = 128009 +token = 128006 +token = 78191 +token = 128007 +token = 271 +Calculating optimal canvas for image 1280x748 with max_tiles=4, tile_size=560 +Possible ratios and their canvas sizes: + Ratio 1x1 -> Canvas 560x560 (scale_w=0.438 scale_h=0.749 selected=0.438) + Ratio 1x2 -> Canvas 560x1120 (scale_w=0.438 scale_h=1.497 selected=0.438) + Ratio 1x3 -> Canvas 560x1680 (scale_w=0.438 scale_h=2.246 selected=0.438) + Ratio 1x4 -> Canvas 560x2240 (scale_w=0.438 scale_h=2.995 selected=0.438) + Ratio 2x1 -> Canvas 1120x560 (scale_w=0.875 scale_h=0.749 selected=0.749) + Ratio 2x2 -> Canvas 1120x1120 (scale_w=0.875 scale_h=1.497 selected=0.875) + Ratio 3x1 -> Canvas 1680x560 (scale_w=1.312 scale_h=0.749 selected=0.749) + Ratio 4x1 -> Canvas 2240x560 (scale_w=1.750 scale_h=0.749 selected=0.749) +Selected scale: 0.875000 (upscale=0) +Candidate canvas 1120x1120 (area=1254400) +Final selected canvas 1120x1120 +Get image size fit to canvas: img=1280x748, canvas=1120x1120, tile=560 +Now resize image to size: 1120x654 +Padding image to size 560x560 with aspect ratio 2x2 +Padded image to size 1120x1120 +Splitting into 2x2 tiles +split_to_tiles: img_width=1120, img_height=1120, tile_width=560, tile_height=560, tiles_x=2, tiles_y=2 + +Processing tile [0,0], source region: x=0-559, y=0-559 + Tile[0,0] at (0,0): src=(16,147,193) -> dst=(16,147,193) + Tile[0,0] at (1,0): src=(15,146,192) -> dst=(15,146,192) + Tile[0,0] at (2,0): src=(12,145,192) -> dst=(12,145,192) + Tile[0,0] at (0,1): src=(15,148,194) -> dst=(15,148,194) + Tile[0,0] at (1,1): src=(14,148,193) -> dst=(14,148,193) + Tile[0,0] at (2,1): src=(10,147,192) -> dst=(10,147,192) + Tile[0,0] at (0,2): src=(8,145,189) -> dst=(8,145,189) + Tile[0,0] at (1,2): src=(7,145,190) -> dst=(7,145,190) + Tile[0,0] at (2,2): src=(5,145,191) -> dst=(5,145,191) + +Processing tile [1,0], source region: x=560-1119, y=0-559 + Tile[1,0] at (0,0): src=(195,221,236) -> dst=(195,221,236) + Tile[1,0] at (1,0): src=(195,221,236) -> dst=(195,221,236) + Tile[1,0] at (2,0): src=(197,220,236) -> dst=(197,220,236) + Tile[1,0] at (0,1): src=(192,217,232) -> dst=(192,217,232) + Tile[1,0] at (1,1): src=(194,218,233) -> dst=(194,218,233) + Tile[1,0] at (2,1): src=(196,219,235) -> dst=(196,219,235) + Tile[1,0] at (0,2): src=(192,216,230) -> dst=(192,216,230) + Tile[1,0] at (1,2): src=(194,217,231) -> dst=(194,217,231) + Tile[1,0] at (2,2): src=(195,218,232) -> dst=(195,218,232) 
+ +Processing tile [0,1], source region: x=0-559, y=560-1119 + Tile[0,1] at (0,0): src=(38,34,35) -> dst=(38,34,35) + Tile[0,1] at (1,0): src=(25,21,23) -> dst=(25,21,23) + Tile[0,1] at (2,0): src=(0,0,0) -> dst=(0,0,0) + Tile[0,1] at (0,1): src=(24,20,21) -> dst=(24,20,21) + Tile[0,1] at (1,1): src=(18,14,15) -> dst=(18,14,15) + Tile[0,1] at (2,1): src=(0,0,0) -> dst=(0,0,0) + Tile[0,1] at (0,2): src=(13,9,10) -> dst=(13,9,10) + Tile[0,1] at (1,2): src=(11,7,8) -> dst=(11,7,8) + Tile[0,1] at (2,2): src=(16,11,13) -> dst=(16,11,13) + +Processing tile [1,1], source region: x=560-1119, y=560-1119 + Tile[1,1] at (0,0): src=(126,124,129) -> dst=(126,124,129) + Tile[1,1] at (1,0): src=(216,214,220) -> dst=(216,214,220) + Tile[1,1] at (2,0): src=(177,176,181) -> dst=(177,176,181) + Tile[1,1] at (0,1): src=(109,107,112) -> dst=(109,107,112) + Tile[1,1] at (1,1): src=(223,221,227) -> dst=(223,221,227) + Tile[1,1] at (2,1): src=(182,181,186) -> dst=(182,181,186) + Tile[1,1] at (0,2): src=(109,108,113) -> dst=(109,108,113) + Tile[1,1] at (1,2): src=(225,224,230) -> dst=(225,224,230) + Tile[1,1] at (2,2): src=(185,184,189) -> dst=(185,184,189) +Processing tile 0 +Processing tile 1 +Processing tile 2 +Processing tile 3 +nx=560, ny=2240 +aspect_ratio=6 +ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1) +ggml_gallocr_reserve_n: reallocating CUDA0 buffer from size 0.00 MiB to 2864.12 MiB +ggml_gallocr_reserve_n: reallocating CUDA_Host buffer from size 0.00 MiB to 376.05 MiB + +Tile 0 first 10 values: + [0] = -1.558688 + [1] = -1.573286 + [2] = -1.617081 + [3] = -1.675475 + [4] = -1.719270 + [5] = -1.733869 + [6] = -1.748467 + [7] = -1.763066 + [8] = -1.792263 + [9] = -1.792263 + +Tile 1 first 10 values: + [0] = 1.054431 + [1] = 1.054431 + [2] = 1.083627 + [3] = 1.083627 + [4] = 1.083627 + [5] = 1.098226 + [6] = 1.127423 + [7] = 1.142021 + [8] = 1.127423 + [9] = 1.112824 + +Tile 2 first 10 values: + [0] = -1.237522 + [1] = -1.427302 + [2] = -1.792263 + [3] = -0.288625 + [4] = -0.098845 + [5] = -1.047743 + [6] = -0.040451 + [7] = -1.164530 + [8] = -1.660877 + [9] = -1.558688 + +Tile 3 first 10 values: + [0] = 0.047139 + [1] = 1.360998 + [2] = 0.791659 + [3] = 0.587281 + [4] = 0.879250 + [5] = 0.061738 + [6] = -1.587885 + [7] = -1.704672 + [8] = -1.792263 + [9] = -1.792263 +n_positions bytes: 6404, n_positions: 1601 +check_node_graph_compatibility_and_refresh_copy_ops: disabling CUDA graphs due to batch size > 1 [embeddings_after_position_embd] [1280 1601 4 1] +check_node_graph_compatibility_and_refresh_copy_ops: disabling CUDA graphs due to batch size > 1 [embeddings_after_tile_position_embd] [1280 1601 4 1] +update_cuda_graph_executable: CUDA graph update failed +update_cuda_graph_executable: CUDA graph update failed +update_cuda_graph_executable: CUDA graph update failed +ggml_backend_cuda_graph_compute: disabling CUDA graphs due to too many consecutive updates +vision encoder output[0] = 9.583341 +vision encoder output[1] = 14.313586 +vision encoder output[2] = -3.192569 +vision encoder output[3] = 5.813879 +vision encoder output[4] = 0.386942 +vision encoder output[5] = -13.529299 +vision encoder output[6] = -2.128806 +vision encoder output[7] = 3.152669 +vision encoder output[8] = -7.955503 +vision encoder output[9] = -4.424203 +ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1) +ggml_gallocr_reserve_n: reallocating CUDA_Host buffer from size 16.01 MiB to 25.66 MiB +n_img_tokens = 1 
+ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
+ggml_gallocr_reserve_n: reallocating CUDA_Host buffer from size 25.66 MiB to 100.21 MiB
+ca_patch_emd[0] = 9.583341
+ca_patch_emd[1] = 14.313586
+ca_patch_emd[2] = -3.192569
+ca_patch_emd[3] = 5.813879
+ca_patch_emd[4] = 0.386942
+ca_patch_emd[5] = -13.529299
+ca_patch_emd[6] = -2.128806
+ca_patch_emd[7] = 3.152669
+ca_patch_emd[8] = -7.955503
+ca_patch_emd[9] = -4.424203
+The image depicts a cityscape, with a large body of water in the background. The
+city appears to be densely populated, with many tall buildings and skyscrapers. In
+the background, there is a large body of water, possibly an ocean or a lake. The
+sky above is cloudy and h
+main: decoded 60 tokens in 19.11 s, speed: 3.14 t/s
+```
+One thing that is different is that the `<|image|>` token is not resolved
+correctly by this version, but that is something I've fixed in the newest version.
+
+This is the output from the newest version:
+```console
+token = 128256
+token = 3923
+token = 374
+token = 304
+token = 420
+token = 2217
+token = 30
+token = 128009
+token = 128006
+token = 78191
+token = 128007
+token = 271
+ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
+check_node_graph_compatibility_and_refresh_copy_ops: disabling CUDA graphs due to batch size > 1 [l_out-9] [4096 4 1 1]
+Decoded prefix prompt
+loaded image examples/vision-mllama/ny.jpg, size = 1280 x 748
+Calculating optimal canvas for image 1280x748 with max_tiles=4, tile_size=560
+Possible ratios and their canvas sizes:
+ Ratio 1x1 -> Canvas 560x560 (scale_w=0.438 scale_h=0.749 selected=0.438)
+ Ratio 1x2 -> Canvas 560x1120 (scale_w=0.438 scale_h=1.497 selected=0.438)
+ Ratio 1x3 -> Canvas 560x1680 (scale_w=0.438 scale_h=2.246 selected=0.438)
+ Ratio 1x4 -> Canvas 560x2240 (scale_w=0.438 scale_h=2.995 selected=0.438)
+ Ratio 2x1 -> Canvas 1120x560 (scale_w=0.875 scale_h=0.749 selected=0.749)
+ Ratio 2x2 -> Canvas 1120x1120 (scale_w=0.875 scale_h=1.497 selected=0.875)
+ Ratio 3x1 -> Canvas 1680x560 (scale_w=1.312 scale_h=0.749 selected=0.749)
+ Ratio 4x1 -> Canvas 2240x560 (scale_w=1.750 scale_h=0.749 selected=0.749)
+Selected scale: 0.875000 (upscale=0)
+Candidate canvas 1120x1120 (area=1254400)
+Final selected canvas 1120x1120
+Get image size fit to canvas: img=1280x748, canvas=1120x1120, tile=560
+Now resize image to size: 1120x654
+Padding image to size 560x560 with aspect ratio 2x2
+Padded image to size 1120x1120
+Splitting into 2x2 tiles
+split_to_tiles: img_width=1120, img_height=1120, tile_width=560, tile_height=560, tiles_x=2, tiles_y=2
+
+Processing tile [0,0], source region: x=0-559, y=0-559
+ Tile[0,0] at (0,0): src=(16,147,193) -> dst=(16.00,147.00,193.00)
+ Tile[0,0] at (1,0): src=(15,146,192) -> dst=(15.00,146.00,192.00)
+ Tile[0,0] at (2,0): src=(12,145,192) -> dst=(12.00,145.00,192.00)
+ Tile[0,0] at (0,1): src=(15,148,194) -> dst=(15.00,148.00,194.00)
+ Tile[0,0] at (1,1): src=(14,148,193) -> dst=(14.00,148.00,193.00)
+ Tile[0,0] at (2,1): src=(10,147,192) -> dst=(10.00,147.00,192.00)
+ Tile[0,0] at (0,2): src=(8,145,189) -> dst=(8.00,145.00,189.00)
+ Tile[0,0] at (1,2): src=(7,145,190) -> dst=(7.00,145.00,190.00)
+ Tile[0,0] at (2,2): src=(5,145,191) -> dst=(5.00,145.00,191.00)
+
+Processing tile [1,0], source region: x=560-1119, y=0-559
+ Tile[1,0] at (0,0): src=(195,221,236) -> dst=(195.00,221.00,236.00)
+ Tile[1,0] at (1,0): src=(195,221,236) -> dst=(195.00,221.00,236.00)
+ Tile[1,0] at (2,0): 
src=(197,220,236) -> dst=(197.00,220.00,236.00) + Tile[1,0] at (0,1): src=(192,217,232) -> dst=(192.00,217.00,232.00) + Tile[1,0] at (1,1): src=(194,218,233) -> dst=(194.00,218.00,233.00) + Tile[1,0] at (2,1): src=(196,219,235) -> dst=(196.00,219.00,235.00) + Tile[1,0] at (0,2): src=(192,216,230) -> dst=(192.00,216.00,230.00) + Tile[1,0] at (1,2): src=(194,217,231) -> dst=(194.00,217.00,231.00) + Tile[1,0] at (2,2): src=(195,218,232) -> dst=(195.00,218.00,232.00) + +Processing tile [0,1], source region: x=0-559, y=560-1119 + Tile[0,1] at (0,0): src=(38,34,35) -> dst=(38.00,34.00,35.00) + Tile[0,1] at (1,0): src=(25,21,23) -> dst=(25.00,21.00,23.00) + Tile[0,1] at (2,0): src=(0,0,0) -> dst=(0.00,0.00,0.00) + Tile[0,1] at (0,1): src=(24,20,21) -> dst=(24.00,20.00,21.00) + Tile[0,1] at (1,1): src=(18,14,15) -> dst=(18.00,14.00,15.00) + Tile[0,1] at (2,1): src=(0,0,0) -> dst=(0.00,0.00,0.00) + Tile[0,1] at (0,2): src=(13,9,10) -> dst=(13.00,9.00,10.00) + Tile[0,1] at (1,2): src=(11,7,8) -> dst=(11.00,7.00,8.00) + Tile[0,1] at (2,2): src=(16,11,13) -> dst=(16.00,11.00,13.00) + +Processing tile [1,1], source region: x=560-1119, y=560-1119 + Tile[1,1] at (0,0): src=(126,124,129) -> dst=(126.00,124.00,129.00) + Tile[1,1] at (1,0): src=(216,214,220) -> dst=(216.00,214.00,220.00) + Tile[1,1] at (2,0): src=(177,176,181) -> dst=(177.00,176.00,181.00) + Tile[1,1] at (0,1): src=(109,107,112) -> dst=(109.00,107.00,112.00) + Tile[1,1] at (1,1): src=(223,221,227) -> dst=(223.00,221.00,227.00) + Tile[1,1] at (2,1): src=(182,181,186) -> dst=(182.00,181.00,186.00) + Tile[1,1] at (0,2): src=(109,108,113) -> dst=(109.00,108.00,113.00) + Tile[1,1] at (1,2): src=(225,224,230) -> dst=(225.00,224.00,230.00) + Tile[1,1] at (2,2): src=(185,184,189) -> dst=(185.00,184.00,189.00) +Processing tile 0 +Processing tile 1 +Processing tile 2 +Processing tile 3 +n_px=40, n_py=40 +px=560, py=2240 +aspect_ratio=6 +vision_image_encode_mllama: image_size = 560 +vision_image_encode_mllama: num_positions = 1601 +ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1) +ggml_gallocr_reserve_n: reallocating CUDA0 buffer from size 669.48 MiB to 2864.12 MiB +ggml_gallocr_reserve_n: reallocating CUDA_Host buffer from size 9.01 MiB to 376.05 MiB + +Tile 0 first 10 values: + [0] = -1.558688 + [1] = -1.573286 + [2] = -1.617081 + [3] = -1.675475 + [4] = -1.719270 + [5] = -1.733869 + [6] = -1.748467 + [7] = -1.763066 + [8] = -1.792263 + [9] = -1.792263 + +Tile 1 first 10 values: + [0] = 1.054431 + [1] = 1.054431 + [2] = 1.083627 + [3] = 1.083627 + [4] = 1.083627 + [5] = 1.098226 + [6] = 1.127423 + [7] = 1.142021 + [8] = 1.127423 + [9] = 1.112824 + +Tile 2 first 10 values: + [0] = -1.237522 + [1] = -1.427302 + [2] = -1.792263 + [3] = -0.288625 + [4] = -0.098845 + [5] = -1.047743 + [6] = -0.040451 + [7] = -1.164530 + [8] = -1.660877 + [9] = -1.558688 + +Tile 3 first 10 values: + [0] = 0.047139 + [1] = 1.360998 + [2] = 0.791659 + [3] = 0.587281 + [4] = 0.879250 + [5] = 0.061738 + [6] = -1.587885 + [7] = -1.704672 + [8] = -1.792263 + [9] = -1.792263 +n_positions bytes: 6404, n_positions: 1601 +check_node_graph_compatibility_and_refresh_copy_ops: disabling CUDA graphs due to batch size > 1 [embeddings_after_position_embd] [1280 1601 4 1] +check_node_graph_compatibility_and_refresh_copy_ops: disabling CUDA graphs due to batch size > 1 [embeddings_after_tile_position_embd] [1280 1601 4 1] +update_cuda_graph_executable: CUDA graph update failed +update_cuda_graph_executable: CUDA graph update failed 
+update_cuda_graph_executable: CUDA graph update failed +ggml_backend_cuda_graph_compute: disabling CUDA graphs due to too many consecutive updates +vision encoder output[0] = 9.583341 +vision encoder output[1] = 14.313586 +vision encoder output[2] = -3.192569 +vision encoder output[3] = 5.813879 +vision encoder output[4] = 0.386942 +vision encoder output[5] = -13.529299 +vision encoder output[6] = -2.128806 +vision encoder output[7] = 3.152669 +vision encoder output[8] = -7.955503 +vision encoder output[9] = -4.424203 +encoded image +image patch embeddings are in ctx_vision.vctx.output: +name: img_patch_embd +shape: [4096, 1601, 4, 1] +embd_tensor[0] = 9.583341 +embd_tensor[1] = 14.313586 +embd_tensor[2] = -3.192569 +embd_tensor[3] = 5.813879 +embd_tensor[4] = 0.386942 +embd_tensor[5] = -13.529299 +embd_tensor[6] = -2.128806 +embd_tensor[7] = 3.152669 +embd_tensor[8] = -7.955503 +embd_tensor[9] = -4.424203 +The image is a picture of a city skyline, specifically the New York City skyline. + +main: decoded 17 tokens in 3.72 s, speed: 4.57 t/s +``` +Now, the issue is that the output above was mostly "lucky" as other times it +will generate: +```console +I don't see an image, but I can try to help you if you describe the image or +tell me what it's supposed to be. +``` +I've inspected the output of the vision encoder and as far as I can tell they +are identical: +``` +Previous version: +vision encoder output[0] = 9.583341 +vision encoder output[1] = 14.313586 +vision encoder output[2] = -3.192569 +vision encoder output[3] = 5.813879 +vision encoder output[4] = 0.386942 +vision encoder output[5] = -13.529299 +vision encoder output[6] = -2.128806 +vision encoder output[7] = 3.152669 +vision encoder output[8] = -7.955503 +vision encoder output[9] = -4.424203 +Latest version: +vision encoder output[0] = 9.583341 +vision encoder output[1] = 14.313586 +vision encoder output[2] = -3.192569 +vision encoder output[3] = 5.813879 +vision encoder output[4] = 0.386942 +vision encoder output[5] = -13.529299 +vision encoder output[6] = -2.128806 +vision encoder output[7] = 3.152669 +vision encoder output[8] = -7.955503 +vision encoder output[9] = -4.424203 +``` +Hmm, but it could also be that it is only the first tile that is identical so +perhaps I should print out the first 10 values of all 4 tiles. + _work in progress_ ### Model conversion @@ -361,6 +756,8 @@ So I'm going to create two models for Llama 3.2 Vision Instruct and then take a look at how packaging multiple GGUFs could be done. Actually, using one .gguf for both will work so we will be converting into a single model. + + ### Language model layers (tensors) ```console "language_model.lm_head.weight"