Update pyproject.toml + tools Falcon 3 addition #402
Conversation
Looks good. Which config would you recommend for the library?
checkpoint_repository:
  repo: tiiuae/Falcon3-10B-Instruct
  source: HF
max_seq_len: 8192
Should this match the runtime tokens?
enable_chunked_context: true
kv_cache_free_gpu_mem_fraction: 0.62
request_default_max_tokens: 1000
total_token_limit: 500000
Should this be a round number?
total_token_limit is a Briton setting for how many tokens may be queued inside the C++ runtime. If we queue too many requests, we overload the runtime. It's not critical as long as it's in a reasonable range; total_token_limit = 500000 is the default.
kv_cache_free_gpu_mem_fraction: 0.62, on the other hand, is a hard setting.
Falcon-3-10B needs around 20GB of VRAM for its weights, which leaves about 20GB free.
Of that remaining 20GB, we allocate 62% (~12.4GB) to Falcon-3-10B's KV cache. The other ~7.6GB goes to the 1B model: about 2GB for its weights and ~5.6GB for activations and KV cache.
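As a quick back-of-the-envelope check of that split (assuming a 40GB GPU as described above; this just restates the numbers in the comment, nothing is measured):

# Rough KV-cache memory budget implied by kv_cache_free_gpu_mem_fraction = 0.62
# (illustrative arithmetic only; assumes a 40GB GPU).
total_vram_gb = 40.0
falcon_10b_weights_gb = 20.0                       # ~20GB for Falcon-3-10B weights
free_gb = total_vram_gb - falcon_10b_weights_gb    # ~20GB left over

kv_cache_free_gpu_mem_fraction = 0.62
falcon_10b_kv_cache_gb = kv_cache_free_gpu_mem_fraction * free_gb   # ~12.4GB

remaining_gb = free_gb - falcon_10b_kv_cache_gb    # ~7.6GB for the 1B model
one_b_weights_gb = 2.0                             # ~2GB of weights
one_b_activations_kv_gb = remaining_gb - one_b_weights_gb           # ~5.6GB

print(f"10B KV cache: {falcon_10b_kv_cache_gb:.1f}GB, "
      f"1B activations/KV cache: {one_b_activations_kv_gb:.1f}GB")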
truss==0.9.49
via git! Git still pins revisions! Adds falcon: