
Can't run the example with metal #1029

Open
feelingsonice opened this issue Jan 5, 2025 · 13 comments
Labels
bug Something isn't working

Comments

@feelingsonice commented Jan 5, 2025

Describe the bug

Using the basic example from the home page:

use anyhow::Result;
use mistralrs::{
    IsqType, PagedAttentionMetaBuilder, TextMessageRole, TextMessages, TextModelBuilder,
};

#[tokio::main]
async fn main() -> Result<()> {
    let model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct".to_string())
        .with_isq(IsqType::Q8_0)
        .with_logging()
        .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
        .build()
        .await?;

    let messages = TextMessages::new()
        .add_message(
            TextMessageRole::System,
            "You are an AI agent with a specialty in programming.",
        )
        .add_message(
            TextMessageRole::User,
            "Hello! How are you? Please write generic binary search function in Rust.",
        );

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}

This works, but for some reason, when I ran the exact same thing after standing up my own TCP connection, I got:

2025-01-05T20:32:32.524216Z ERROR mistralrs_core::engine: step - Model failed with error: Metal(KernelError(MatMulNonContiguous { lhs_stride: [36992, 1156, 34, 1], rhs_stride: [104448, 96, 3072, 1], mnk: (34, 96, 34) }))
Error: ChatModelError { msg: "Metal error Invalid matmul arguments [36992, 1156, 34, 1] [104448, 96, 3072, 1] (34, 96, 34)", incomplete_response: ChatCompletionResponse { id: "1", choices: [Choice { finish_reason: "error", index: 0, message: ResponseMessage { content: Some(""), role: "assistant", tool_calls: [] }, logprobs: None }], created: 1736109152, model: "microsoft/Phi-3.5-mini-instruct", system_fingerprint: "local", object: "chat.completion", usage: Usage { completion_tokens: 0, prompt_tokens: 34, total_tokens: 34, avg_tok_per_sec: 1360.0, avg_prompt_tok_per_sec: inf, avg_compl_tok_per_sec: NaN, total_time_sec: 0.025, total_prompt_time_sec: 0.0, total_completion_time_sec: 0.0 } } }

The only thing I did differently was stand up a TcpStream between loading the model and sending the message. Not really sure what's going on. The error obviously disappeared when I disabled Metal.
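Roughly, a minimal sketch of what I mean (the real code does more, the address is made up, and I've elided the ISQ and PagedAttention lines for brevity):

use anyhow::Result;
use mistralrs::{TextMessageRole, TextMessages, TextModelBuilder};
use tokio::net::TcpListener;

#[tokio::main]
async fn main() -> Result<()> {
    // Load the model as in the example above.
    let model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct".to_string())
        .with_logging()
        .build()
        .await?;

    // The only addition: stand up the TCP side between loading the model
    // and sending the request. (This waits for a client to connect.)
    let listener = TcpListener::bind("127.0.0.1:8080").await?;
    let (_stream, _addr) = listener.accept().await?;

    let messages =
        TextMessages::new().add_message(TextMessageRole::User, "Hello! How are you?");

    // This is the call that failed with the Metal matmul error.
    let response = model.send_chat_request(messages).await?;
    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    Ok(())
}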

Latest commit or version

mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git", features = [
    "metal",
] }
feelingsonice added the bug label on Jan 5, 2025
@EricLBuehler (Owner)

@feelingsonice this seems to be fixed by #1033 on my machine; can you please check if it works?

@feelingsonice (Author)

@EricLBuehler It's not crashing with that error anymore, but the request won't complete. It's been stuck for a solid 10 minutes now. It's not stuck if I remove the metal flag.

@EricLBuehler (Owner)

@feelingsonice does this happen if you comment out the .with_paged_attn line?

@feelingsonice (Author)

Commenting it out "un-stucks" it.
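For clarity, that's the builder from the original example with the PagedAttention line commented out:

let model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct".to_string())
    .with_isq(IsqType::Q8_0)
    .with_logging()
    // .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
    .build()
    .await?;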

@EricLBuehler (Owner)

Ok, that's interesting. Can you run interactive mode with this model: cargo run --features metal --release -- -i --paged-attn plain -m ...?

@feelingsonice (Author)

I'm in a different project, so I can't tack on --features metal, but I have:

mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git", branch = "master", features = [
    "metal",
] }

So running RUST_LOG=debug cargo run --release -- -i --paged-attn plain -m ... worked.

My log:

2025-01-07T21:09:21.414091Z  INFO candle_hf_hub: Token file not found "/Users/bli/.cache/huggingface/token"
2025-01-07T21:09:21.414139Z  INFO mistralrs_core::utils::tokens: Could not load token at "/Users/bli/.cache/huggingface/token", using no HF token.
2025-01-07T21:09:21.414597Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `microsoft/Phi-3.5-mini-instruct`
2025-01-07T21:09:21.414710Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` at `microsoft/Phi-3.5-mini-instruct`
2025-01-07T21:09:21.463110Z DEBUG ureq::stream: connecting to huggingface.co:443 at [2600:9000:24ba:e400:17:b174:6d00:93a1]:443
2025-01-07T21:09:21.495077Z DEBUG rustls::client::hs: No cached session for DnsName("huggingface.co")
2025-01-07T21:09:21.495596Z DEBUG rustls::client::hs: Not resuming any session
2025-01-07T21:09:21.534252Z DEBUG rustls::client::hs: Using ciphersuite TLS13_AES_128_GCM_SHA256
2025-01-07T21:09:21.534298Z DEBUG rustls::client::tls13: Not resuming
2025-01-07T21:09:21.534503Z DEBUG rustls::client::tls13: TLS1.3 encrypted extensions: [ServerNameAck]
2025-01-07T21:09:21.534517Z DEBUG rustls::client::hs: ALPN protocol is None
2025-01-07T21:09:21.535874Z DEBUG ureq::stream: created stream: Stream(RustlsStream)
2025-01-07T21:09:21.535897Z DEBUG ureq::unit: sending request GET https://huggingface.co/api/models/microsoft/Phi-3.5-mini-instruct/revision/main
2025-01-07T21:09:21.535937Z DEBUG ureq::unit: writing prelude: GET /api/models/microsoft/Phi-3.5-mini-instruct/revision/main HTTP/1.1
Host: huggingface.co
Accept: */*
User-Agent: unkown/None; candle-hf-hub/0.3.3; rust/unknown
accept-encoding: gzip
2025-01-07T21:09:21.681989Z DEBUG ureq::response: Body entirely buffered (length: 5704)
2025-01-07T21:09:21.682059Z DEBUG ureq::pool: adding stream to pool: https|huggingface.co|443 -> Stream(RustlsStream)
2025-01-07T21:09:21.682088Z DEBUG ureq::unit: response 200 to GET https://huggingface.co/api/models/microsoft/Phi-3.5-mini-instruct/revision/main
2025-01-07T21:09:21.682936Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]
2025-01-07T21:09:21.683979Z DEBUG ureq::pool: pulling stream from pool: https|huggingface.co|443 -> Stream(RustlsStream)
2025-01-07T21:09:21.684023Z DEBUG ureq::unit: sending request (reused connection) GET https://huggingface.co/api/models/microsoft/Phi-3.5-mini-instruct/revision/main
2025-01-07T21:09:21.684054Z DEBUG ureq::unit: writing prelude: GET /api/models/microsoft/Phi-3.5-mini-instruct/revision/main HTTP/1.1
Host: huggingface.co
Accept: */*
User-Agent: unkown/None; candle-hf-hub/0.3.3; rust/unknown
accept-encoding: gzip
2025-01-07T21:09:21.795476Z DEBUG ureq::response: Body entirely buffered (length: 5704)
2025-01-07T21:09:21.795519Z DEBUG ureq::pool: adding stream to pool: https|huggingface.co|443 -> Stream(RustlsStream)
2025-01-07T21:09:21.795532Z DEBUG ureq::unit: response 200 to GET https://huggingface.co/api/models/microsoft/Phi-3.5-mini-instruct/revision/main
2025-01-07T21:09:21.795680Z  INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `microsoft/Phi-3.5-mini-instruct`
2025-01-07T21:09:21.795973Z DEBUG ureq::pool: pulling stream from pool: https|huggingface.co|443 -> Stream(RustlsStream)
2025-01-07T21:09:21.795996Z DEBUG ureq::unit: sending request (reused connection) GET https://huggingface.co/api/models/microsoft/Phi-3.5-mini-instruct/revision/main
2025-01-07T21:09:21.796005Z DEBUG ureq::unit: writing prelude: GET /api/models/microsoft/Phi-3.5-mini-instruct/revision/main HTTP/1.1
Host: huggingface.co
Accept: */*
User-Agent: unkown/None; candle-hf-hub/0.3.3; rust/unknown
accept-encoding: gzip
2025-01-07T21:09:21.924264Z DEBUG ureq::response: Body entirely buffered (length: 5704)
2025-01-07T21:09:21.924330Z DEBUG ureq::pool: adding stream to pool: https|huggingface.co|443 -> Stream(RustlsStream)
2025-01-07T21:09:21.924356Z DEBUG ureq::unit: response 200 to GET https://huggingface.co/api/models/microsoft/Phi-3.5-mini-instruct/revision/main
2025-01-07T21:09:21.924620Z DEBUG ureq::pool: pulling stream from pool: https|huggingface.co|443 -> Stream(RustlsStream)
2025-01-07T21:09:21.924661Z DEBUG ureq::unit: sending request (reused connection) GET https://huggingface.co/api/models/microsoft/Phi-3.5-mini-instruct/revision/main
2025-01-07T21:09:21.924679Z DEBUG ureq::unit: writing prelude: GET /api/models/microsoft/Phi-3.5-mini-instruct/revision/main HTTP/1.1
Host: huggingface.co
Accept: */*
User-Agent: unkown/None; candle-hf-hub/0.3.3; rust/unknown
accept-encoding: gzip
2025-01-07T21:09:22.069944Z DEBUG ureq::response: Body entirely buffered (length: 5704)
2025-01-07T21:09:22.070008Z DEBUG ureq::pool: adding stream to pool: https|huggingface.co|443 -> Stream(RustlsStream)
2025-01-07T21:09:22.070033Z DEBUG ureq::unit: response 200 to GET https://huggingface.co/api/models/microsoft/Phi-3.5-mini-instruct/revision/main
2025-01-07T21:09:22.070194Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `microsoft/Phi-3.5-mini-instruct`
2025-01-07T21:09:22.070530Z DEBUG ureq::pool: pulling stream from pool: https|huggingface.co|443 -> Stream(RustlsStream)
2025-01-07T21:09:22.070565Z DEBUG ureq::unit: sending request (reused connection) GET https://huggingface.co/api/models/microsoft/Phi-3.5-mini-instruct/revision/main
2025-01-07T21:09:22.070583Z DEBUG ureq::unit: writing prelude: GET /api/models/microsoft/Phi-3.5-mini-instruct/revision/main HTTP/1.1
Host: huggingface.co
Accept: */*
User-Agent: unkown/None; candle-hf-hub/0.3.3; rust/unknown
accept-encoding: gzip
2025-01-07T21:09:22.209322Z DEBUG ureq::response: Body entirely buffered (length: 5704)
2025-01-07T21:09:22.209384Z DEBUG ureq::pool: adding stream to pool: https|huggingface.co|443 -> Stream(RustlsStream)
2025-01-07T21:09:22.209410Z DEBUG ureq::unit: response 200 to GET https://huggingface.co/api/models/microsoft/Phi-3.5-mini-instruct/revision/main
2025-01-07T21:09:22.209743Z DEBUG ureq::stream: dropping stream: Stream(RustlsStream)
2025-01-07T21:09:22.210239Z  INFO mistralrs_core::pipeline::normal: Loading model `microsoft/Phi-3.5-mini-instruct` on metal[4294968260].
2025-01-07T21:09:22.210856Z  INFO mistralrs_core::utils::log: Automatic loader type determined to be `phi3`
2025-01-07T21:09:22.221611Z  INFO mistralrs_core::utils::normal: DType selected is BF16.
2025-01-07T21:09:22.221718Z  INFO mistralrs_core::pipeline::normal: Model config: Config { vocab_size: 32064, hidden_act: Silu, hidden_size: 3072, intermediate_size: 8192, num_hidden_layers: 32, num_attention_heads: 32, num_key_value_heads: 32, rms_norm_eps: 1e-5, rope_theta: 10000.0, bos_token_id: Some(1), eos_token_id: Some(32000), rope_scaling: Some(Classic { short_factor: [1.0, 1.0199999809265137, 1.0299999713897705, 1.0299999713897705, 1.0499999523162842, 1.0499999523162842, 1.0499999523162842, 1.0499999523162842, 1.0499999523162842, 1.069999933242798, 1.0999999046325684, 1.1099998950958252, 1.1599998474121094, 1.1599998474121094, 1.1699998378753662, 1.2899998426437378, 1.339999794960022, 1.679999828338623, 1.7899998426437378, 1.8199998140335083, 1.8499997854232788, 1.879999756813049, 1.90999972820282, 1.9399996995925903, 1.9899996519088743, 2.0199997425079346, 2.0199997425079346, 2.0199997425079346, 2.0199997425079346, 2.0199997425079346, 2.0199997425079346, 2.0299997329711914, 2.0299997329711914, 2.0299997329711914, 2.0299997329711914, 2.0299997329711914, 2.0299997329711914, 2.0299997329711914, 2.0299997329711914, 2.0299997329711914, 2.0799996852874756, 2.0899996757507324, 2.189999580383301, 2.2199995517730713, 2.5899994373321533, 2.729999542236328, 2.749999523162842, 2.8399994373321533], long_factor: [1.0800000429153442, 1.1100000143051147, 1.1399999856948853, 1.340000033378601, 1.5899999141693115, 1.600000023841858, 1.6200000047683716, 2.620000123977661, 3.2300000190734863, 3.2300000190734863, 4.789999961853027, 7.400000095367432, 7.700000286102295, 9.09000015258789, 12.199999809265137, 17.670000076293945, 24.46000099182129, 28.57000160217285, 30.420001983642575, 30.840002059936523, 32.590003967285156, 32.93000411987305, 42.32000350952149, 44.96000289916992, 50.34000396728515, 50.45000457763672, 57.55000305175781, 57.93000411987305, 58.21000289916992, 60.1400032043457, 62.61000442504883, 62.62000274658203, 62.71000289916992, 63.1400032043457, 63.1400032043457, 63.77000427246094, 63.93000411987305, 63.96000289916992, 63.970001220703125, 64.02999877929688, 64.06999969482422, 64.08000183105469, 64.12000274658203, 64.41000366210938, 64.4800033569336, 64.51000213623047, 64.52999877929688, 64.83999633789063], scaling_type: Su }), max_position_embeddings: 131072, use_flash_attn: false, sliding_window: Some(262144), original_max_position_embeddings: 4096, quantization_config: None, tie_word_embeddings: false }
2025-01-07T21:09:23.176133Z  INFO mistralrs_core::pipeline::isq: Applying in-situ quantization into Some(Q8_0) to 129 tensors.
2025-01-07T21:09:23.176208Z  INFO mistralrs_core::pipeline::isq: Applying ISQ on 1 threads.
2025-01-07T21:09:26.970334Z  INFO mistralrs_core::pipeline::isq: Applied in-situ quantization into Some(Q8_0) to 129 tensors out of 129 total tensors. Took 3.79s
2025-01-07T21:09:26.995594Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<s>", eos_toks = "<|endoftext|>", "<|end|>", "<|assistant|>", unk_tok = <unk>
2025-01-07T21:09:26.998410Z  INFO mistralrs_core: Beginning dummy run.
2025-01-07T21:09:27.171115Z  INFO mistralrs_core: Dummy run completed in 0.172697333s.

@EricLBuehler (Owner)

> So running RUST_LOG=debug cargo run --release -- -i --paged-attn plain -m ... worked.

It seems the dummy run completes in a normal time, but just to confirm: does chatting with the model work?

If that's the case, how large is your request?

@feelingsonice (Author) commented Jan 7, 2025

So I switched to a test project and am running the exact code shown above (the same request). Running it with with_paged_attn and cargo run --release gives me:

Dummy run completed in 15.669873292s.

but the request itself does not complete.


Removing the with_paged_attn and running cargo run --release gives me:

Dummy run completed in 0.170387958s.

and the request does complete:

[src/main.rs:28:5] response.usage.avg_prompt_tok_per_sec = 300.88495
[src/main.rs:28:5] response.usage.avg_compl_tok_per_sec = 62.36403

Removing the with_paged_attn and running RUST_LOG=debug cargo run --release -- -i --paged-attn plain -m ... gives me:

Dummy run completed in 0.175141125s.

And the request does complete in:

[src/main.rs:28:5] response.usage.avg_prompt_tok_per_sec = 288.1356
[src/main.rs:28:5] response.usage.avg_compl_tok_per_sec = 62.58642

In all three of these I have:

mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git", branch = "master", features = [
    "metal",
] }

@EricLBuehler (Owner)

Ah, I think I see. When we allocate the PagedAttention cache we need to write zeros (it seems like this can be optimized, I'll look into it!), so if your Mac has a large amount of RAM it could take a long time.

How much RAM do you have?
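For intuition, a standalone sketch (not the actual mistral.rs allocation code) of why the zeroing cost scales with the reserved size:

use std::time::Instant;

fn main() {
    // Hypothetical stand-in for a large reserved KV cache: 8 GiB.
    let bytes: usize = 8 << 30;
    let start = Instant::now();
    let mut cache: Vec<u8> = Vec::with_capacity(bytes);
    // Write the zeros explicitly so every page is really touched;
    // vec![0u8; n] may get lazily-mapped zero pages from the OS instead.
    cache.resize(bytes, 0);
    println!("zeroed {} GiB in {:.2?}", bytes >> 30, start.elapsed());
}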

@feelingsonice (Author)

I'm on an M2 Max with 92 GB.

@EricLBuehler (Owner)

@feelingsonice I merged #1036, which speeds up the loading, but I can also reproduce the slowness on my M3 Max (64 GB). I'll see what the slowdown is and try to add a fix ASAP! In the meantime, removing the PagedAttention line as above will let you keep working.

@feelingsonice (Author)

> @feelingsonice I merged #1036, which speeds up the loading, but I can also reproduce the slowness on my M3 Max (64 GB). I'll see what the slowdown is and try to add a fix ASAP! In the meantime, removing the PagedAttention line as above will let you keep working.

Thank you so much!

@EricLBuehler (Owner)

@feelingsonice I think the memory-size detection logic may be incorrect on Mac, and we are reserving too much. If that's the case, some of the reservation will spill into paging, and access will be incredibly slow. How much memory is being reserved for you? If I manually set the amount in MB to a reasonable size (in the CLI with --pa-gpu-mem, in Rust with PagedAttentionMetaBuilder::with_gpu_memory), then everything works smoothly for me.
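For example, inside the same async main as the original example, a sketch of the Rust-side cap (MemoryGpuConfig::MbAmount and the 4096 MB figure are my assumptions here; check the current mistralrs API for the exact type):

use mistralrs::{IsqType, MemoryGpuConfig, PagedAttentionMetaBuilder, TextModelBuilder};

let model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct".to_string())
    .with_isq(IsqType::Q8_0)
    .with_logging()
    // Cap the PagedAttention KV-cache reservation at an explicit size
    // instead of trusting the automatic memory detection.
    .with_paged_attn(|| {
        PagedAttentionMetaBuilder::default()
            .with_gpu_memory(MemoryGpuConfig::MbAmount(4096))
            .build()
    })?
    .build()
    .await?;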
