
Can't run the example with metal #1029

Open
feelingsonice opened this issue Jan 5, 2025 · 13 comments
Labels
bug Something isn't working

Comments

@feelingsonice commented Jan 5, 2025

Describe the bug

Using the basic example from the home page:

use anyhow::Result;
use mistralrs::{
    IsqType, PagedAttentionMetaBuilder, TextMessageRole, TextMessages, TextModelBuilder,
};

#[tokio::main]
async fn main() -> Result<()> {
    let model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct".to_string())
        .with_isq(IsqType::Q8_0)
        .with_logging()
        .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
        .build()
        .await?;

    let messages = TextMessages::new()
        .add_message(
            TextMessageRole::System,
            "You are an AI agent with a specialty in programming.",
        )
        .add_message(
            TextMessageRole::User,
            "Hello! How are you? Please write generic binary search function in Rust.",
        );

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}

This works, but for some reason, when I ran the exact same thing after standing up my own TCP connection, I got:

2025-01-05T20:32:32.524216Z ERROR mistralrs_core::engine: step - Model failed with error: Metal(KernelError(MatMulNonContiguous { lhs_stride: [36992, 1156, 34, 1], rhs_stride: [104448, 96, 3072, 1], mnk: (34, 96, 34) }))
Error: ChatModelError { msg: "Metal error Invalid matmul arguments [36992, 1156, 34, 1] [104448, 96, 3072, 1] (34, 96, 34)", incomplete_response: ChatCompletionResponse { id: "1", choices: [Choice { finish_reason: "error", index: 0, message: ResponseMessage { content: Some(""), role: "assistant", tool_calls: [] }, logprobs: None }], created: 1736109152, model: "microsoft/Phi-3.5-mini-instruct", system_fingerprint: "local", object: "chat.completion", usage: Usage { completion_tokens: 0, prompt_tokens: 34, total_tokens: 34, avg_tok_per_sec: 1360.0, avg_prompt_tok_per_sec: inf, avg_compl_tok_per_sec: NaN, total_time_sec: 0.025, total_prompt_time_sec: 0.0, total_completion_time_sec: 0.0 } } }

The only thing I did differently was stand up a TcpStream between loading the model and sending the message. Not really sure what's going on. The error obviously disappeared when I disabled Metal.
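Roughly, a minimal sketch of what I mean (the real code does more, the address is made up, and I've elided the ISQ and PagedAttention lines for brevity):

use anyhow::Result;
use mistralrs::{TextMessageRole, TextMessages, TextModelBuilder};
use tokio::net::TcpListener;

#[tokio::main]
async fn main() -> Result<()> {
    // Load the model as in the example above.
    let model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct".to_string())
        .with_logging()
        .build()
        .await?;

    // The only addition: stand up the TCP side between loading the model
    // and sending the request. (This waits for a client to connect.)
    let listener = TcpListener::bind("127.0.0.1:8080").await?;
    let (_stream, _addr) = listener.accept().await?;

    let messages =
        TextMessages::new().add_message(TextMessageRole::User, "Hello! How are you?");

    // This is the call that failed with the Metal matmul error.
    let response = model.send_chat_request(messages).await?;
    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    Ok(())
}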

Latest commit or version

mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git", features = [
    "metal",
] }
feelingsonice added the bug label on Jan 5, 2025
@EricLBuehler (Owner)

@feelingsonice this seems to be fixed by #1033 on my machine; can you please check if it works?

@feelingsonice (Author)

@EricLBuehler It's not crashing with that error anymore, but the request won't complete. It's been stuck for a solid 10 minutes now. It's not stuck if I remove the metal flag.

@EricLBuehler (Owner)

@feelingsonice does this happen if you comment out the .with_paged_attn line?

@feelingsonice (Author)

Commenting it out "un-stucks" it.
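For clarity, that's the builder from the original example with the PagedAttention line commented out:

let model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct".to_string())
    .with_isq(IsqType::Q8_0)
    .with_logging()
    // .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
    .build()
    .await?;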

@EricLBuehler (Owner)

Ok, that's interesting. Can you run interactive mode with this model: cargo run --features metal --release -- -i --paged-attn plain -m ...?

@feelingsonice (Author)

I'm in a different project, so I can't tack on --features metal, but I have:

mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git", branch = "master", features = [
    "metal",
] }

So running RUST_LOG=debug cargo run --release -- -i --paged-attn plain -m ... worked.

My log:

2025-01-07T21:09:21.414091Z  INFO candle_hf_hub: Token file not found "/Users/bli/.cache/huggingface/token"
2025-01-07T21:09:21.414139Z  INFO mistralrs_core::utils::tokens: Could not load token at "/Users/bli/.cache/huggingface/token", using no HF token.
2025-01-07T21:09:21.414597Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `microsoft/Phi-3.5-mini-instruct`
2025-01-07T21:09:21.414710Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` at `microsoft/Phi-3.5-mini-instruct`
2025-01-07T21:09:21.463110Z DEBUG ureq::stream: connecting to huggingface.co:443 at [2600:9000:24ba:e400:17:b174:6d00:93a1]:443
2025-01-07T21:09:21.495077Z DEBUG rustls::client::hs: No cached session for DnsName("huggingface.co")
2025-01-07T21:09:21.495596Z DEBUG rustls::client::hs: Not resuming any session
2025-01-07T21:09:21.534252Z DEBUG rustls::client::hs: Using ciphersuite TLS13_AES_128_GCM_SHA256
2025-01-07T21:09:21.534298Z DEBUG rustls::client::tls13: Not resuming
2025-01-07T21:09:21.534503Z DEBUG rustls::client::tls13: TLS1.3 encrypted extensions: [ServerNameAck]
2025-01-07T21:09:21.534517Z DEBUG rustls::client::hs: ALPN protocol is None
2025-01-07T21:09:21.535874Z DEBUG ureq::stream: created stream: Stream(RustlsStream)
2025-01-07T21:09:21.535897Z DEBUG ureq::unit: sending request GET https://huggingface.co/api/models/microsoft/Phi-3.5-mini-instruct/revision/main
2025-01-07T21:09:21.535937Z DEBUG ureq::unit: writing prelude: GET /api/models/microsoft/Phi-3.5-mini-instruct/revision/main HTTP/1.1
Host: huggingface.co
Accept: */*
User-Agent: unkown/None; candle-hf-hub/0.3.3; rust/unknown
accept-encoding: gzip
2025-01-07T21:09:21.681989Z DEBUG ureq::response: Body entirely buffered (length: 5704)
2025-01-07T21:09:21.682059Z DEBUG ureq::pool: adding stream to pool: https|huggingface.co|443 -> Stream(RustlsStream)
2025-01-07T21:09:21.682088Z DEBUG ureq::unit: response 200 to GET https://huggingface.co/api/models/microsoft/Phi-3.5-mini-instruct/revision/main
2025-01-07T21:09:21.682936Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]
2025-01-07T21:09:21.683979Z DEBUG ureq::pool: pulling stream from pool: https|huggingface.co|443 -> Stream(RustlsStream)
2025-01-07T21:09:21.684023Z DEBUG ureq::unit: sending request (reused connection) GET https://huggingface.co/api/models/microsoft/Phi-3.5-mini-instruct/revision/main
2025-01-07T21:09:21.684054Z DEBUG ureq::unit: writing prelude: GET /api/models/microsoft/Phi-3.5-mini-instruct/revision/main HTTP/1.1
Host: huggingface.co
Accept: */*
User-Agent: unkown/None; candle-hf-hub/0.3.3; rust/unknown
accept-encoding: gzip
2025-01-07T21:09:21.795476Z DEBUG ureq::response: Body entirely buffered (length: 5704)
2025-01-07T21:09:21.795519Z DEBUG ureq::pool: adding stream to pool: https|huggingface.co|443 -> Stream(RustlsStream)
2025-01-07T21:09:21.795532Z DEBUG ureq::unit: response 200 to GET https://huggingface.co/api/models/microsoft/Phi-3.5-mini-instruct/revision/main
2025-01-07T21:09:21.795680Z  INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `microsoft/Phi-3.5-mini-instruct`
2025-01-07T21:09:21.795973Z DEBUG ureq::pool: pulling stream from pool: https|huggingface.co|443 -> Stream(RustlsStream)
2025-01-07T21:09:21.795996Z DEBUG ureq::unit: sending request (reused connection) GET https://huggingface.co/api/models/microsoft/Phi-3.5-mini-instruct/revision/main
2025-01-07T21:09:21.796005Z DEBUG ureq::unit: writing prelude: GET /api/models/microsoft/Phi-3.5-mini-instruct/revision/main HTTP/1.1
Host: huggingface.co
Accept: */*
User-Agent: unkown/None; candle-hf-hub/0.3.3; rust/unknown
accept-encoding: gzip
2025-01-07T21:09:21.924264Z DEBUG ureq::response: Body entirely buffered (length: 5704)
2025-01-07T21:09:21.924330Z DEBUG ureq::pool: adding stream to pool: https|huggingface.co|443 -> Stream(RustlsStream)
2025-01-07T21:09:21.924356Z DEBUG ureq::unit: response 200 to GET https://huggingface.co/api/models/microsoft/Phi-3.5-mini-instruct/revision/main
2025-01-07T21:09:21.924620Z DEBUG ureq::pool: pulling stream from pool: https|huggingface.co|443 -> Stream(RustlsStream)
2025-01-07T21:09:21.924661Z DEBUG ureq::unit: sending request (reused connection) GET https://huggingface.co/api/models/microsoft/Phi-3.5-mini-instruct/revision/main
2025-01-07T21:09:21.924679Z DEBUG ureq::unit: writing prelude: GET /api/models/microsoft/Phi-3.5-mini-instruct/revision/main HTTP/1.1
Host: huggingface.co
Accept: */*
User-Agent: unkown/None; candle-hf-hub/0.3.3; rust/unknown
accept-encoding: gzip
2025-01-07T21:09:22.069944Z DEBUG ureq::response: Body entirely buffered (length: 5704)
2025-01-07T21:09:22.070008Z DEBUG ureq::pool: adding stream to pool: https|huggingface.co|443 -> Stream(RustlsStream)
2025-01-07T21:09:22.070033Z DEBUG ureq::unit: response 200 to GET https://huggingface.co/api/models/microsoft/Phi-3.5-mini-instruct/revision/main
2025-01-07T21:09:22.070194Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `microsoft/Phi-3.5-mini-instruct`
2025-01-07T21:09:22.070530Z DEBUG ureq::pool: pulling stream from pool: https|huggingface.co|443 -> Stream(RustlsStream)
2025-01-07T21:09:22.070565Z DEBUG ureq::unit: sending request (reused connection) GET https://huggingface.co/api/models/microsoft/Phi-3.5-mini-instruct/revision/main
2025-01-07T21:09:22.070583Z DEBUG ureq::unit: writing prelude: GET /api/models/microsoft/Phi-3.5-mini-instruct/revision/main HTTP/1.1
Host: huggingface.co
Accept: */*
User-Agent: unkown/None; candle-hf-hub/0.3.3; rust/unknown
accept-encoding: gzip
2025-01-07T21:09:22.209322Z DEBUG ureq::response: Body entirely buffered (length: 5704)
2025-01-07T21:09:22.209384Z DEBUG ureq::pool: adding stream to pool: https|huggingface.co|443 -> Stream(RustlsStream)
2025-01-07T21:09:22.209410Z DEBUG ureq::unit: response 200 to GET https://huggingface.co/api/models/microsoft/Phi-3.5-mini-instruct/revision/main
2025-01-07T21:09:22.209743Z DEBUG ureq::stream: dropping stream: Stream(RustlsStream)
2025-01-07T21:09:22.210239Z  INFO mistralrs_core::pipeline::normal: Loading model `microsoft/Phi-3.5-mini-instruct` on metal[4294968260].
2025-01-07T21:09:22.210856Z  INFO mistralrs_core::utils::log: Automatic loader type determined to be `phi3`
2025-01-07T21:09:22.221611Z  INFO mistralrs_core::utils::normal: DType selected is BF16.
2025-01-07T21:09:22.221718Z  INFO mistralrs_core::pipeline::normal: Model config: Config { vocab_size: 32064, hidden_act: Silu, hidden_size: 3072, intermediate_size: 8192, num_hidden_layers: 32, num_attention_heads: 32, num_key_value_heads: 32, rms_norm_eps: 1e-5, rope_theta: 10000.0, bos_token_id: Some(1), eos_token_id: Some(32000), rope_scaling: Some(Classic { short_factor: [1.0, 1.0199999809265137, 1.0299999713897705, 1.0299999713897705, 1.0499999523162842, 1.0499999523162842, 1.0499999523162842, 1.0499999523162842, 1.0499999523162842, 1.069999933242798, 1.0999999046325684, 1.1099998950958252, 1.1599998474121094, 1.1599998474121094, 1.1699998378753662, 1.2899998426437378, 1.339999794960022, 1.679999828338623, 1.7899998426437378, 1.8199998140335083, 1.8499997854232788, 1.879999756813049, 1.90999972820282, 1.9399996995925903, 1.9899996519088743, 2.0199997425079346, 2.0199997425079346, 2.0199997425079346, 2.0199997425079346, 2.0199997425079346, 2.0199997425079346, 2.0299997329711914, 2.0299997329711914, 2.0299997329711914, 2.0299997329711914, 2.0299997329711914, 2.0299997329711914, 2.0299997329711914, 2.0299997329711914, 2.0299997329711914, 2.0799996852874756, 2.0899996757507324, 2.189999580383301, 2.2199995517730713, 2.5899994373321533, 2.729999542236328, 2.749999523162842, 2.8399994373321533], long_factor: [1.0800000429153442, 1.1100000143051147, 1.1399999856948853, 1.340000033378601, 1.5899999141693115, 1.600000023841858, 1.6200000047683716, 2.620000123977661, 3.2300000190734863, 3.2300000190734863, 4.789999961853027, 7.400000095367432, 7.700000286102295, 9.09000015258789, 12.199999809265137, 17.670000076293945, 24.46000099182129, 28.57000160217285, 30.420001983642575, 30.840002059936523, 32.590003967285156, 32.93000411987305, 42.32000350952149, 44.96000289916992, 50.34000396728515, 50.45000457763672, 57.55000305175781, 57.93000411987305, 58.21000289916992, 60.1400032043457, 62.61000442504883, 62.62000274658203, 62.71000289916992, 63.1400032043457, 63.1400032043457, 63.77000427246094, 63.93000411987305, 63.96000289916992, 63.970001220703125, 64.02999877929688, 64.06999969482422, 64.08000183105469, 64.12000274658203, 64.41000366210938, 64.4800033569336, 64.51000213623047, 64.52999877929688, 64.83999633789063], scaling_type: Su }), max_position_embeddings: 131072, use_flash_attn: false, sliding_window: Some(262144), original_max_position_embeddings: 4096, quantization_config: None, tie_word_embeddings: false }
2025-01-07T21:09:23.176133Z  INFO mistralrs_core::pipeline::isq: Applying in-situ quantization into Some(Q8_0) to 129 tensors.
2025-01-07T21:09:23.176208Z  INFO mistralrs_core::pipeline::isq: Applying ISQ on 1 threads.
2025-01-07T21:09:26.970334Z  INFO mistralrs_core::pipeline::isq: Applied in-situ quantization into Some(Q8_0) to 129 tensors out of 129 total tensors. Took 3.79s
2025-01-07T21:09:26.995594Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<s>", eos_toks = "<|endoftext|>", "<|end|>", "<|assistant|>", unk_tok = <unk>
2025-01-07T21:09:26.998410Z  INFO mistralrs_core: Beginning dummy run.
2025-01-07T21:09:27.171115Z  INFO mistralrs_core: Dummy run completed in 0.172697333s.

@EricLBuehler (Owner)

> So running RUST_LOG=debug cargo run --release -- -i --paged-attn plain -m ... worked.

It seems the dummy run completes in a normal time, but just to confirm: does chatting with the model work?

If that's the case, how large is your request?

@feelingsonice (Author) commented Jan 7, 2025

So I switched to a test project and am running the exact code shown above (the same request). Running it with with_paged_attn and cargo run --release gives me:

Dummy run completed in 15.669873292s.

but the request itself does not complete.


Removing the with_paged_attn and running cargo run --release gives me:

Dummy run completed in 0.170387958s.

and the request does complete:

[src/main.rs:28:5] response.usage.avg_prompt_tok_per_sec = 300.88495
[src/main.rs:28:5] response.usage.avg_compl_tok_per_sec = 62.36403

Removing the with_paged_attn and running RUST_LOG=debug cargo run --release -- -i --paged-attn plain -m ... gives me:

Dummy run completed in 0.175141125s.

And the request does complete in:

[src/main.rs:28:5] response.usage.avg_prompt_tok_per_sec = 288.1356
[src/main.rs:28:5] response.usage.avg_compl_tok_per_sec = 62.58642

In all three of these I have:

mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git", branch = "master", features = [
    "metal",
] }

@EricLBuehler (Owner)

Ah, I think I see. When we allocate the PagedAttention cache we need to write zeros (it seems like this can be optimized, I'll look into it!), so if your Mac has a large amount of RAM it could take a long time.

How much RAM do you have?
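For intuition, a standalone sketch (not the actual mistral.rs allocation code) of why the zeroing cost scales with the reserved size:

use std::time::Instant;

fn main() {
    // Hypothetical stand-in for a large reserved KV cache: 8 GiB.
    let bytes: usize = 8 << 30;
    let start = Instant::now();
    let mut cache: Vec<u8> = Vec::with_capacity(bytes);
    // Write the zeros explicitly so every page is really touched;
    // vec![0u8; n] may get lazily-mapped zero pages from the OS instead.
    cache.resize(bytes, 0);
    println!("zeroed {} GiB in {:.2?}", bytes >> 30, start.elapsed());
}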

@feelingsonice (Author)

I'm on an M2 Max with 92 GB.

@EricLBuehler (Owner)

@feelingsonice I merged #1036, which speeds up the loading, but I can also reproduce the slowness on my M3 Max (64 GB). I'll see what the slowdown is and try to add a fix ASAP! In the meantime, removing the PagedAttention line as above will let you keep working.

@feelingsonice (Author)

> @feelingsonice I merged #1036, which speeds up the loading, but I can also reproduce the slowness on my M3 Max (64 GB). I'll see what the slowdown is and try to add a fix ASAP! In the meantime, removing the PagedAttention line as above will let you keep working.

Thank you so much!

@EricLBuehler (Owner)

@feelingsonice I think the memory-size detection logic may be incorrect on Mac, and we are reserving too much. If that's the case, some of the reservation will spill into paging, and access will be incredibly slow. How much memory is being reserved for you? If I manually set the amount in MB to a reasonable size (in the CLI with --pa-gpu-mem, in Rust with PagedAttentionMetaBuilder::with_gpu_memory), then everything works smoothly for me.
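For example, inside the same async main as the original example, a sketch of the Rust-side cap (MemoryGpuConfig::MbAmount and the 4096 MB figure are my assumptions here; check the current mistralrs API for the exact type):

use mistralrs::{IsqType, MemoryGpuConfig, PagedAttentionMetaBuilder, TextModelBuilder};

let model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct".to_string())
    .with_isq(IsqType::Q8_0)
    .with_logging()
    // Cap the PagedAttention KV-cache reservation at an explicit size
    // instead of trusting the automatic memory detection.
    .with_paged_attn(|| {
        PagedAttentionMetaBuilder::default()
            .with_gpu_memory(MemoryGpuConfig::MbAmount(4096))
            .build()
    })?
    .build()
    .await?;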
