Can't run the example with metal #1029
Comments
@feelingsonice this seems to be fixed by #1033 on my machine, can you please check if it works?
@EricLBuehler It's not crashing with that error anymore but the request won't complete. It's stuck there for a solid 10 minutes now. It's not stuck if I remove the |
@feelingsonice does this happen if you comment out the line |
Commenting it out "un-stucks" it
Ok, that's interesting. Can you run interactive mode with this model: |
I'm in a different project so I can't tag on the
So running My log:
It seems the dummy run completes in a normal time, but just to confirm - does chatting with the model work? If that's the case, how large is your request?
So I changed to a testing lab and I'm running the exact code shown above (the same request). Running it with
but the request itself does not complete. Removing the
and the request does complete:
Removing the
And the request does complete in:
In all three of these I have:
Ah I think I may see. When we allocate the PagedAttention cache we need to write zeros (seems like this can be optimized, I'll look into this!), so if you have a large amount of RAM in your Mac it could take a long time. How much RAM do you have?
I'm on M2 Max with 92 GB
@feelingsonice I merged #1036 which speeds up the loading, but I can also reproduce the slowness on my M3 Max 64GB. I'll see what the slowdown is and try to add a fix ASAP! In the meantime, removing the PagedAttention line as above will enable you to work.
Thank you so much!
@feelingsonice I think that perhaps the memory size detection logic is incorrect on Mac, and we are reserving too much. If this is the case, then some memory will be put into paging, and access will be incredibly slow. How much memory is being reserved for you? If I set the amount in MB manually to a reasonable size (in the CLI, with --pa-gpu-mem, in Rust with the |
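For readers following along, the idea of capping the reservation manually can be sketched roughly as below. This is only an illustration: `cache_budget_mb`, `capped_budget_mb`, and the 90% figure are hypothetical and are not taken from mistral.rs, whose actual detection logic may work differently.

```rust
/// Hypothetical helper (not mistral.rs code): megabytes a cache would
/// reserve if it took `percent` of the total detected memory.
fn cache_budget_mb(total_mem_mb: u64, percent: u64) -> u64 {
    assert!(percent <= 100, "percent must be in 0..=100");
    total_mem_mb * percent / 100
}

/// Hypothetical helper (not mistral.rs code): apply an optional manual cap
/// in MB, similar in spirit to passing an explicit size instead of
/// trusting the detected value.
fn capped_budget_mb(detected_mb: u64, manual_cap_mb: Option<u64>) -> u64 {
    match manual_cap_mb {
        // Never reserve more than the user asked for.
        Some(cap) => detected_mb.min(cap),
        // No cap given: fall back to the detected budget.
        None => detected_mb,
    }
}
```

With 92 GB of unified memory (94208 MB) and an assumed 90% share, the detected budget would be about 84787 MB, which is far more than a single small request needs; a manual cap of a few thousand MB keeps the reservation well away from the point where the system starts paging.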
Describe the bug
Using the basic example from the home page:
works, but for some reason, when I tried the exact same thing while standing up my own TCP connection, I got:
And the only thing I did was stand up a TcpStream between loading the model and sending the message. Not really sure what's going on. Obviously disappeared when I disabled metal.
Latest commit or version