
Update pyproject.toml + tools Falcon 3 addition #402

Merged: 5 commits merged into main from falcon-3-addition on Jan 23, 2025

Conversation

@michaelfeil (Contributor) commented:

  • The tools are severely outdated.
  • poetry pins truss==0.9.49 via git, and that git dependency still pins revisions.

Adds Falcon:

  • Falcon 3, in favor of Falcon-40B

@philipkiely-baseten (Member) left a comment:

Looks good. Which config would you recommend for the library?

checkpoint_repository:
  repo: tiiuae/Falcon3-10B-Instruct
  source: HF
max_seq_len: 8192
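
For context, this fragment would typically sit under the trt_llm.build section of a truss config.yaml. A minimal sketch, assuming the standard truss engine-builder layout (the exact nesting and the base_model value are assumptions, not taken from this PR):

trt_llm:
  build:
    base_model: falcon   # illustrative value, not confirmed by this PR
    checkpoint_repository:
      repo: tiiuae/Falcon3-10B-Instruct
      source: HF
    max_seq_len: 8192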

@philipkiely-baseten:

Should this match the runtime tokens?

@michaelfeil (Contributor, Author):

enable_chunked_context: true
kv_cache_free_gpu_mem_fraction: 0.62
request_default_max_tokens: 1000
total_token_limit: 500000

@philipkiely-baseten:

Should this be a round number?

@michaelfeil (Contributor, Author):

total_token_limit defines a Briton setting for how many tokens may be queued inside the C++ runtime. If we queue too many requests, we overload the runtime. It is not critical, as long as it is in a reasonable range.

total_token_limit = 500000 is the default.
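
As a rough illustration of the idea (a minimal Python sketch, not Briton's actual C++ implementation; all names below are hypothetical), a queue with a token budget works like this:

from collections import deque

TOTAL_TOKEN_LIMIT = 500_000  # the default mentioned above

class TokenBudgetQueue:
    def __init__(self, limit: int = TOTAL_TOKEN_LIMIT):
        self.limit = limit
        self.queued_tokens = 0
        self.queue = deque()

    def try_enqueue(self, request_id: str, num_tokens: int) -> bool:
        # Admit a request only if its tokens fit within the remaining budget;
        # otherwise the caller sheds load instead of overloading the runtime.
        if self.queued_tokens + num_tokens > self.limit:
            return False
        self.queue.append((request_id, num_tokens))
        self.queued_tokens += num_tokens
        return True

    def dequeue(self) -> str:
        # Release a request's tokens back to the budget when it is scheduled.
        request_id, num_tokens = self.queue.popleft()
        self.queued_tokens -= num_tokens
        return request_id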

kv_cache_free_gpu_mem_fraction: 0.62 is a hard setting.
Falcon-3-10B needs around 20 GB of VRAM for its weights, which leaves 20 GB of VRAM free.
Of that remaining 20 GB, we allocate 62% (12.4 GB) to Falcon-3-10B's KV cache. The other 7.6 GB goes to the 1B model: 2 GB for its weights, and 5.6 GB for activations and KV cache.
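
A back-of-the-envelope check of that split (the 40 GB GPU total is our assumption, inferred from the numbers above; variable names are ours):

# Memory split implied by kv_cache_free_gpu_mem_fraction = 0.62
gpu_vram_gb = 40.0
falcon_10b_weights_gb = 20.0                    # ~2 bytes/param x 10B params
free_gb = gpu_vram_gb - falcon_10b_weights_gb   # 20.0 GB left after weights

kv_cache_free_gpu_mem_fraction = 0.62
falcon_10b_kv_cache_gb = kv_cache_free_gpu_mem_fraction * free_gb  # 12.4 GB
one_b_model_gb = free_gb - falcon_10b_kv_cache_gb                  # 7.6 GB

one_b_weights_gb = 2.0
one_b_activations_and_kv_gb = one_b_model_gb - one_b_weights_gb    # 5.6 GB

print(falcon_10b_kv_cache_gb, one_b_model_gb, one_b_activations_and_kv_gb)
# -> 12.4 7.6 5.6 (approximately, up to floating-point rounding)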

@michaelfeil merged commit 22d8ada into main on Jan 23, 2025
2 checks passed
@michaelfeil deleted the falcon-3-addition branch on January 23, 2025 at 22:07