
Clean the duplicated processor in the quantized example #2705

Open

clearloop opened this issue Jan 7, 2025 · 0 comments

clearloop commented Jan 7, 2025

Hi team! Thanks for the awesome work bringing Rust to the game!

I found that the usage of LogitsProcessor in the quantized example is not quite right: the chat loop re-processes tokens unnecessarily, which hurts performance. People trying out the example may think the slowness is caused by candle itself (e.g. that candle's performance is just worse than llama.cpp's; I thought so myself before reviewing the code carefully).

Since we are using an open-ended interactive loop here:

```rust
for prompt_index in 0.. {
```

we don't have to run inference over all of the accumulated tokens again here:

```rust
let mut next_token = if !args.split_prompt {
```

Instead, we can move the LogitsProcessor out of the interactive loop and keep a cache of the tokens seen so far, including the users' prompts:

```rust
let mut logits_processor = {
```

This could be related to #1939: the example becomes extremely slow from the second prompt onward.
