
Clean the duplicated processor in the quantized example #2705

Open

clearloop opened this issue Jan 7, 2025 · 0 comments

clearloop commented Jan 7, 2025

Hi team! Thanks for the awesome work bringing Rust to the game!

I found that the usage of LogitsProcessor in the quantized example is not quite right: the chat loop re-processes tokens unnecessarily, which hurts performance. People trying out the example may think the slowness is caused by candle itself (e.g. that candle's performance is just worse than llama.cpp's; I thought so myself before reviewing the code carefully).

Since we are using an open-ended interactive loop here:

```rust
for prompt_index in 0.. {
```

we don't have to run inference over all of the accumulated tokens again here:

```rust
let mut next_token = if !args.split_prompt {
```

Instead, we can move the LogitsProcessor out of the interactive loop and keep a cache of the tokens seen so far, including the users' prompts:

```rust
let mut logits_processor = {
```

This could be related to #1939: the example becomes extremely slow from the second prompt onward.
