Failing when trying to run DeepSeek-R1-3bit on 3 Studios M2 Ultra with 128GB RAM each #1226

Open
sck-at-ucy opened this issue Jan 26, 2025 · 2 comments

@sck-at-ucy
Perhaps this is hopeless, but I thought it was worth asking. I know I am close to the memory limit. Is there any hope of fitting the model on 3 M2 Ultras with 128GB RAM each? The 4th node is being used for another project, so I was curious whether the model would fit on 3 nodes.

I have run sudo sysctl -w iogpu.wired_limit_mb=122000 on all three nodes.

The code runs and I can see memory usage increasing, but then it fails, apparently before the model finishes loading, because CPU utilization is still at 100% at the time of failure.

/opt/homebrew/bin/mpirun --mca oob_tcp_if_include bridge0 --mca btl_tcp_if_include bridge0 \
--map-by ppr:1:node --mca coll_tuned_use_dynamic_rules 1 \
--mca coll_tuned_allreduce_algorithm 5 --mca btl_tcp_links 4 \
--mca mpi_thread_multiple 0 --mca btl_tcp_eager_limit 4194304 \
--mca btl_tcp_sndbuf 8388608 --mca btl_tcp_rcvbuf 8388608 --mca btl self,tcp \
-x DYLD_LIBRARY_PATH=/opt/homebrew/lib/ \
-np 3 --host 10.0.0.1:1,10.0.0.3:1,localhost:1 \
/Users/m2/anaconda3/envs/pythonProject_StreamLit/bin/python /Users/m2/pipeline_generate.py \
--model /Volumes/PACIFIC-GROVE/DeepSeek-R1-3bit \
--prompt "What's better a straight or a flush in texas hold'em?" \
--max-tokens 1024
[WARNING] Generating with a model that requires 116168 MB which is close to the maximum recommended size of 122000 MB. This can be slow. See the documentation for possible work-arounds: https://github.com/ml-explore/mlx-examples/tree/main/llms#large-models
libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
[Mendocino:28151] *** Process received signal ***
[Mendocino:28151] Signal: Abort trap: 6 (6)
[Mendocino:28151] Signal code:  (0)
[Mendocino:28151] [ 0] 0   libsystem_platform.dylib            0x000000019e542e04 _sigtramp + 56
[Mendocino:28151] [ 1] 0   libsystem_pthread.dylib             0x000000019e50bf70 pthread_kill + 288
[Mendocino:28151] [ 2] 0   libsystem_c.dylib                   0x000000019e418908 abort + 128
[Mendocino:28151] [ 3] 0   libc++abi.dylib                     0x000000019e4c244c _ZN10__cxxabiv130__aligned_malloc_with_fallbackEm + 0
[Mendocino:28151] [ 4] 0   libc++abi.dylib                     0x000000019e4b0a24 _ZL28demangling_terminate_handlerv + 320
[Mendocino:28151] [ 5] 0   libobjc.A.dylib                     0x000000019e1593f4 _ZL15_objc_terminatev + 172
[Mendocino:28151] [ 6] 0   libc++abi.dylib                     0x000000019e4c1710 _ZSt11__terminatePFvvE + 16
[Mendocino:28151] [ 7] 0   libc++abi.dylib                     0x000000019e4c16b4 _ZSt9terminatev + 108
[Mendocino:28151] [ 8] 0   libdispatch.dylib                   0x000000019e359688 _dispatch_client_callout4 + 40
[Mendocino:28151] [ 9] 0   libdispatch.dylib                   0x000000019e375c88 _dispatch_mach_msg_invoke + 464
[Mendocino:28151] [10] 0   libdispatch.dylib                   0x000000019e360a38 _dispatch_lane_serial_drain + 352
[Mendocino:28151] [11] 0   libdispatch.dylib                   0x000000019e3769dc _dispatch_mach_invoke + 456
[Mendocino:28151] [12] 0   libdispatch.dylib                   0x000000019e360a38 _dispatch_lane_serial_drain + 352
[Mendocino:28151] [13] 0   libdispatch.dylib                   0x000000019e361764 _dispatch_lane_invoke + 432
[Mendocino:28151] [14] 0   libdispatch.dylib                   0x000000019e360a38 _dispatch_lane_serial_drain + 352
[Mendocino:28151] [15] 0   libdispatch.dylib                   0x000000019e361730 _dispatch_lane_invoke + 380
[Mendocino:28151] [16] 0   libdispatch.dylib                   0x000000019e36c9a0 _dispatch_root_queue_drain_deferred_wlh + 288
[Mendocino:28151] [17] 0   libdispatch.dylib                   0x000000019e36c1ec _dispatch_workloop_worker_thread + 540
[Mendocino:28151] [18] 0   libsystem_pthread.dylib             0x000000019e5083d8 _pthread_wqthread + 288
[Mendocino:28151] [19] 0   libsystem_pthread.dylib             0x000000019e5070f0 start_wqthread + 8
[Mendocino:28151] *** End of error message ***
--------------------------------------------------------------------------
prterun noticed that process rank 2 with PID 28151 on node Mendocino exited on
signal 6 (Abort trap: 6).
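
To make the "close to the memory limit" warning concrete, here is a minimal sketch of the headroom arithmetic. Both numbers are copied from this report (the size in the warning and the iogpu.wired_limit_mb value); the `headroom` helper is a hypothetical name used only for illustration, and the calculation says nothing about what else (KV cache, activations) has to fit in the remaining space.

```python
# Values copied from the warning and the sysctl command in this report.
required_mb = 116_168     # model size reported by the [WARNING] line
wired_limit_mb = 122_000  # iogpu.wired_limit_mb set on each node


def headroom(required_mb: int, limit_mb: int) -> tuple[int, float]:
    """Return (MB left under the limit, fraction of the limit used)."""
    return limit_mb - required_mb, required_mb / limit_mb


free_mb, used_fraction = headroom(required_mb, wired_limit_mb)
print(f"headroom: {free_mb} MB, limit used: {used_fraction:.1%}")
```

By this arithmetic the node that printed the warning has less than 6 GB left under the wired limit.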

@awni
Member

awni commented Jan 26, 2025

Yes. I added a fix for that in the most recent mlx-lm, but it's not on PyPI. If you build the package from source, it should work.

I think 3x128 GB is enough to run it in 3-bit (but not 4-bit). Also, I need to upload a bf16 version of the model; the fp16 version doesn't work as well, unfortunately. So just a heads-up in case you see any suspect behavior.
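
Since the fix is only in the source tree, it can help to confirm which mlx-lm each node's Python environment actually has installed before rerunning. The sketch below uses only the standard library; `installed_version` is a hypothetical helper, and the distribution name "mlx-lm" is assumed to match how the package was installed. Note that a version string alone may not distinguish a source build from a PyPI release.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# "mlx-lm" is the assumed distribution name; adjust if installed differently.
print("mlx-lm:", installed_version("mlx-lm") or "not installed")
```

Running this with the same interpreter that mpirun launches on every node would show whether all ranks see the same package.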

@sck-at-ucy
Author

Great, will try tomorrow and let you know.
