Failing when trying to run DeepSeek-R1-3bit on 3 Studios M2 Ultra with 128GB RAM each #1226

sck-at-ucy · 2025-01-26T22:44:48Z

Perhaps this is hopeless, but I thought it was worth asking. I know I am close to the memory limit. Is there still hope of fitting the model on 3 M2 Ultras with 128GB RAM each? The 4th node is being used on another project, so I was curious whether the model would fit on 3 nodes.

I have used sudo sysctl -w iogpu.wired_limit_mb=122000 on all three nodes.

The code runs and I can see memory usage increasing, but then it fails, apparently before it finishes loading the model, because CPU utilization is still at 100% at the time of failure.

/opt/homebrew/bin/mpirun --mca oob_tcp_if_include bridge0 --mca btl_tcp_if_include bridge0 \
--map-by ppr:1:node --mca coll_tuned_use_dynamic_rules 1 \
--mca coll_tuned_allreduce_algorithm 5 --mca btl_tcp_links 4 \
--mca mpi_thread_multiple 0 --mca btl_tcp_eager_limit 4194304 \
--mca btl_tcp_sndbuf 8388608 --mca btl_tcp_rcvbuf 8388608 --mca btl self,tcp \
-x DYLD_LIBRARY_PATH=/opt/homebrew/lib/ \
-np 3 --host 10.0.0.1:1,10.0.0.3:1,localhost:1 \
/Users/m2/anaconda3/envs/pythonProject_StreamLit/bin/python /Users/m2/pipeline_generate.py \
--model /Volumes/PACIFIC-GROVE/DeepSeek-R1-3bit \
--prompt "What's better a straight or a flush in texas hold'em?" \
--max-tokens 1024
[WARNING] Generating with a model that requires 116168 MB which is close to the maximum recommended size of 122000 MB. This can be slow. See the documentation for possible work-arounds: https://github.com/ml-explore/mlx-examples/tree/main/llms#large-models
libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
[Mendocino:28151] *** Process received signal ***
[Mendocino:28151] Signal: Abort trap: 6 (6)
[Mendocino:28151] Signal code:  (0)
[Mendocino:28151] [ 0] 0   libsystem_platform.dylib            0x000000019e542e04 _sigtramp + 56
[Mendocino:28151] [ 1] 0   libsystem_pthread.dylib             0x000000019e50bf70 pthread_kill + 288
[Mendocino:28151] [ 2] 0   libsystem_c.dylib                   0x000000019e418908 abort + 128
[Mendocino:28151] [ 3] 0   libc++abi.dylib                     0x000000019e4c244c _ZN10__cxxabiv130__aligned_malloc_with_fallbackEm + 0
[Mendocino:28151] [ 4] 0   libc++abi.dylib                     0x000000019e4b0a24 _ZL28demangling_terminate_handlerv + 320
[Mendocino:28151] [ 5] 0   libobjc.A.dylib                     0x000000019e1593f4 _ZL15_objc_terminatev + 172
[Mendocino:28151] [ 6] 0   libc++abi.dylib                     0x000000019e4c1710 _ZSt11__terminatePFvvE + 16
[Mendocino:28151] [ 7] 0   libc++abi.dylib                     0x000000019e4c16b4 _ZSt9terminatev + 108
[Mendocino:28151] [ 8] 0   libdispatch.dylib                   0x000000019e359688 _dispatch_client_callout4 + 40
[Mendocino:28151] [ 9] 0   libdispatch.dylib                   0x000000019e375c88 _dispatch_mach_msg_invoke + 464
[Mendocino:28151] [10] 0   libdispatch.dylib                   0x000000019e360a38 _dispatch_lane_serial_drain + 352
[Mendocino:28151] [11] 0   libdispatch.dylib                   0x000000019e3769dc _dispatch_mach_invoke + 456
[Mendocino:28151] [12] 0   libdispatch.dylib                   0x000000019e360a38 _dispatch_lane_serial_drain + 352
[Mendocino:28151] [13] 0   libdispatch.dylib                   0x000000019e361764 _dispatch_lane_invoke + 432
[Mendocino:28151] [14] 0   libdispatch.dylib                   0x000000019e360a38 _dispatch_lane_serial_drain + 352
[Mendocino:28151] [15] 0   libdispatch.dylib                   0x000000019e361730 _dispatch_lane_invoke + 380
[Mendocino:28151] [16] 0   libdispatch.dylib                   0x000000019e36c9a0 _dispatch_root_queue_drain_deferred_wlh + 288
[Mendocino:28151] [17] 0   libdispatch.dylib                   0x000000019e36c1ec _dispatch_workloop_worker_thread + 540
[Mendocino:28151] [18] 0   libsystem_pthread.dylib             0x000000019e5083d8 _pthread_wqthread + 288
[Mendocino:28151] [19] 0   libsystem_pthread.dylib             0x000000019e5070f0 start_wqthread + 8
[Mendocino:28151] *** End of error message ***
--------------------------------------------------------------------------
prterun noticed that process rank 2 with PID 28151 on node Mendocino exited on
signal 6 (Abort trap: 6).
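
For reference, the two figures in the warning above can be compared directly. This is only a minimal sketch: the helper name `wired_headroom` is hypothetical, and the numbers are copied from the log, not measured independently.

```python
def wired_headroom(model_mb: int, limit_mb: int) -> tuple[int, float]:
    """Return (MB left under the limit, fraction of the limit used)."""
    return limit_mb - model_mb, model_mb / limit_mb


# Figures taken from the [WARNING] line in the log above.
headroom_mb, used = wired_headroom(model_mb=116168, limit_mb=122000)
print(f"headroom: {headroom_mb} MB, used: {used:.1%}")
```

On the reporting node this leaves roughly 5.8 GB under the configured limit, before any memory needed beyond what the warning counts as the model's requirement.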

awni · 2025-01-26T22:47:50Z

Yes. I added a fix for that in the most recent mlx-lm, but it's not on PyPI. If you build the package from source it should work.
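
To confirm which mlx-lm build a given Python environment actually picks up (the PyPI release or a source build), one option is to query the installed package metadata. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the distribution name `mlx-lm` is assumed to match the installed package.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# "mlx-lm" is the assumed distribution name.
print("mlx-lm:", installed_version("mlx-lm") or "not installed")
```

With a multi-node `mpirun` launch, each host uses its own interpreter, so it is probably worth running this check with the same Python binary on every node.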

I think 3x128 GB is enough to run it in 3-bit (but not 4-bit). Also, I need to upload a bf16 version of the model; the fp16 version doesn't work as well, unfortunately. So just a heads-up in case you see any suspect behavior.

sck-at-ucy · 2025-01-26T22:53:14Z

Great, I will try tomorrow and let you know.
