TriePagedAttentionCache #632
Conversation
shortfin/python/shortfin_apps/llm/components/kvcache/trie_attention_cache.py
I think this could benefit from another set of eyes, but the overall interface and operations make sense to me
…orithm in conftest.py
…requests use that instead
@@ -0,0 +1,432 @@
import pytest
Source files should have copyright + license header comments
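For reference, the header used elsewhere in this repo looks roughly like the sketch below; copy the exact wording and year from an existing shortfin source file rather than from here:

```python
# Copyright 2024 Advanced Micro Devices, Inc.
#
# Licensed under the Apache License v2.0 with LLVM Exceptions.
# See https://llvm.org/LICENSE.txt for license information.
# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

import pytest
```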
# Try to allocate new sequence - should evict least recently used unpublished sequence
new_tokens = list(range(1000, 1000 + TEST_PAGE_SIZE))
print(f"\nAttempting to allocate new sequence: {new_tokens}")
new_alloc = trie_cache.acquire_pages_for_tokens(new_tokens, extra_token_slots=0)
This test is failing on Windows: https://github.com/nod-ai/shark-ai/actions/runs/12164613667/job/33926704492?pr=635#step:11:3315
(The Windows shortfin build was broken until #635; these test failures slipped in just before I got it passing.)
================================== FAILURES ===================================
____________________________ test_lru_eviction[1] _____________________________
trie_cache = <shortfin_apps.llm.components.kvcache.trie_attention_cache.TriePagedAttentionCache object at 0x000002464792FBF0>
access_count = 1
@pytest.mark.parametrize(
    "access_count", [1, TEST_POOL_CAPACITY // 2, TEST_POOL_CAPACITY - 1]
)
def test_lru_eviction(trie_cache, access_count):
    """Test LRU eviction with different access patterns"""
    print(f"\nStarting test_lru_eviction with access_count={access_count}")

    # Create mix of published and unpublished sequences
    keep_published = 3  # Number of sequences to keep published
    sequences = []

    # First add some sequences we'll keep published
    print("\nPublishing sequences to keep active:")
    for i in range(keep_published):
        tokens = list(range(i * 100, i * 100 + TEST_PAGE_SIZE))
        alloc = trie_cache.acquire_pages_for_tokens(tokens, extra_token_slots=0)
        alloc.publish_pages_for_tokens(alloc.tokens[:TEST_PAGE_SIZE])
        sequences.append(tokens)
        print(f"Published sequence {i} (keeping active)")
        print_tree_state(trie_cache, " ")

    # Then add sequences we'll publish but release (evictable)
    print("\nAdding releasable sequences:")
    for i in range(keep_published, TEST_POOL_CAPACITY):
        tokens = list(range(i * 100, i * 100 + TEST_PAGE_SIZE))
        alloc = trie_cache.acquire_pages_for_tokens(tokens, extra_token_slots=0)
        alloc.publish_pages_for_tokens(alloc.tokens[:TEST_PAGE_SIZE])
        alloc.release_pages()  # These can be evicted
        sequences.append(tokens)
        print(f"Added releasable sequence {i}")
        print_tree_state(trie_cache, " ")

    print("\nCache state before accessing sequences:")
    print_tree_state(trie_cache, " ")

    # Access some sequences to update their LRU status
    print(f"\nAccessing {access_count} sequences to update LRU order:")
    for i in range(access_count):
        print(f"\nAccessing sequence {i}:")
        alloc = trie_cache.acquire_pages_for_tokens(sequences[i], extra_token_slots=0)
        print_tree_state(trie_cache, " ")
        alloc.release_pages()
        print(f"After releasing allocation {i}:")
        print_tree_state(trie_cache, " ")

    print("\nCache state before attempting new allocation:")
    print_tree_state(trie_cache, " ")
    print("\nAvailable pages in pool:", len(trie_cache.page_pool.available_pages))

    # Try to allocate new sequence - should evict least recently used unpublished sequence
    new_tokens = list(range(1000, 1000 + TEST_PAGE_SIZE))
    print(f"\nAttempting to allocate new sequence: {new_tokens}")
>   new_alloc = trie_cache.acquire_pages_for_tokens(new_tokens, extra_token_slots=0)
tests\apps\llm\components\kvcache\trie_attention_cache_test.py:303:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
python\shortfin_apps\llm\components\kvcache\trie_attention_cache.py:371: in acquire_pages_for_tokens
    self._evict_pages(n_empty_pages - len(self.page_pool.available_pages))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <shortfin_apps.llm.components.kvcache.trie_attention_cache.TriePagedAttentionCache object at 0x000002464792FBF0>
max_pages = 1
def _evict_pages(self, max_pages: int) -> int:
    """Evict up to max_pages pages using LRU strategy.

    Evicts from unreferenced leaf nodes first, working up the trie
    as nodes become childless.

    Args:
        max_pages: Maximum number of pages to evict

    Returns:
        Number of pages actually evicted
    """
    pages_to_evict = []

    # Initialize heap with unreferenced leaves
    unused_leaf_heap = [
        (leaf.access_time, leaf)
        for leaf in self.leaves
        if leaf.ref_count.is_empty()
    ]
>   heapq.heapify(unused_leaf_heap)
E   TypeError: '<' not supported between instances of 'TrieNode' and 'TrieNode'

python\shortfin_apps\llm\components\kvcache\trie_attention_cache.py:407: TypeError
How is this even working on Linux? https://stackoverflow.com/questions/53554199/heapq-push-typeerror-not-supported-between-instances
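For what it's worth: `heapq` compares tuples element by element, so whenever two leaves share the same `access_time`, the comparison falls through to the `TrieNode` objects themselves, which define no `__lt__`. It presumably passes on Linux only because the timestamps there happen to be distinct, while the clock backing `access_time` on Windows is coarse enough to produce ties. A minimal, self-contained sketch of one common fix (a unique counter as a tiebreaker; this `TrieNode` is a stand-in, not the shortfin class):

```python
import heapq
import itertools


class TrieNode:
    """Stand-in for the real node type; defines no ordering, like the original."""

    def __init__(self, access_time: float):
        self.access_time = access_time


# Two leaves with identical access times reproduce the failure:
leaves = [TrieNode(1.0), TrieNode(1.0)]

# heapq.heapify([(n.access_time, n) for n in leaves])
# -> TypeError: '<' not supported between instances of 'TrieNode' and 'TrieNode'

# Fix: a monotonically increasing counter breaks each tie before the
# comparison ever reaches the TrieNode objects.
counter = itertools.count()
heap = [(n.access_time, next(counter), n) for n in leaves]
heapq.heapify(heap)  # no TypeError
oldest = heap[0][2]  # still ordered by access_time first
```

Defining `__lt__` on `TrieNode` (e.g. by `access_time`) would also work; the counter just keeps ordering concerns out of the node class.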
feat: Add TriePagedAttentionCache with initial implementation

Added TriePagedAttentionCache as an optional prefix sharing algorithm, selectable via:

config["paged_kv_cache"]["prefix_sharing_algorithm"] = "trie"

Current Status:
- Basic implementation and unit tests complete
- Integration test cases for both Base and Trie implementations, with the trie implementation xfailed due to pending cache allocation improvements
- BasePagedAttentionCache remains the default

Next Steps:
To achieve full functionality, we need to support cache re-allocations to extend the associated tokens & pages.
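For readers new to the approach, here is a minimal, self-contained sketch of the trie idea; the names, `PAGE_SIZE`, and structure are illustrative assumptions, not the shortfin implementation. Token sequences are split into fixed-size page blocks, the blocks key a trie, and a new request reuses the KV pages of its longest cached prefix:

```python
from __future__ import annotations

from dataclasses import dataclass, field

PAGE_SIZE = 4  # tokens per KV page (assumed for illustration)


@dataclass
class Node:
    children: dict[tuple[int, ...], Node] = field(default_factory=dict)
    page_id: int | None = None


root = Node()
next_page = 0


def acquire(tokens: list[int]) -> list[int]:
    """Return KV page ids for tokens, reusing pages for any cached prefix."""
    global next_page
    node, pages = root, []
    # Walk full pages only; a real cache also tracks the partial last page.
    for i in range(0, len(tokens) - len(tokens) % PAGE_SIZE, PAGE_SIZE):
        block = tuple(tokens[i : i + PAGE_SIZE])
        if block not in node.children:  # prefix diverges here: allocate a page
            node.children[block] = Node(page_id=next_page)
            next_page += 1
        node = node.children[block]
        pages.append(node.page_id)
    return pages


print(acquire([1, 2, 3, 4, 5, 6, 7, 8]))  # [0, 1]
print(acquire([1, 2, 3, 4, 9, 9, 9, 9]))  # [0, 2] -- first page is shared
```

In this picture, the pending "cache re-allocations" work corresponds to extending an existing allocation's token/page list in place as decoding appends tokens, rather than re-walking from the root.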