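I deployed kuscia on k8s in runK mode and ran scripts/user/create_example_job.sh inside the alice party's pod. After the job was created, it hit an OOM at runtime. The k8s nodes have more than 40 GB of memory, and the ConfigMap sets the schedulable memory to 40 GB. When I deploy the same way on minikube, the job runs normally. Some of the relevant error logs from the k8s run: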
(SPURuntime(device_id=None, party=alice) pid=1674) [2025-03-04 06:44:52.613] [info] [bucket.cc:37] psi protocol=3, rank=1, inputs_size=9892
(SPURuntime(device_id=None, party=alice) pid=1674) [2025-03-04 06:44:52.613] [info] [bucket.cc:50] run psi bucket_idx=0, bucket_item_size=9892
(SPURuntime(device_id=None, party=alice) pid=1674) [2025-03-04 06:44:52.616] [info] [thread_pool.cc:30] Create a fixed thread pool with size 23
(SPURuntime(device_id=None, party=alice) pid=1674) [2025-03-04 06:44:52.622] [info] [rr22_psi.cc:163] out buffer begin
(SPURuntime(device_id=None, party=alice) pid=1674) [2025-03-04 06:44:52.622] [info] [rr22_psi.cc:167] out buffer end
(SPURuntime(device_id=None, party=alice) pid=1674) [2025-03-04 06:44:52.626] [info] [rr22_oprf.cc:470] a_,b_ size:12684 12684
(SPURuntime(device_id=None, party=alice) pid=1674) [2025-03-04 06:44:52.626] [info] [rr22_oprf.cc:472] begin vole recv
(SPURuntime(device_id=None, party=alice) pid=1674) [2025-03-04 06:44:52.626] [info] [rr22_oprf.cc:489] solve begin
(SPURuntime(device_id=None, party=alice) pid=1674) [2025-03-04 06:44:52.628] [info] [rr22_oprf.cc:491] solve end
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff1b9eb0ea2fd5acc21b4e228f01000000 Worker ID: a6856c01f38e50cac0a654813cb0cbcfd2dd9471b40313a93494e586 Node ID: fc6452ea87e31f656039caaa3869c31e7a822751814b583eced54513 Worker IP address: secretflow-task-20250304144315-single-psi-0-global.alice.svc Worker port: 10028 Worker PID: 1674 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
The problem reproduces every time. Is it related to the k8s configuration, or are there parameters that need to be adjusted?
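One way to narrow down cause (1) in the raylet message is to compare the memory limit the task container actually sees with the 40 GB configured in the ConfigMap. The snippet below is a minimal sketch, not part of kuscia or secretflow: it assumes the container exposes the standard cgroup v2 (`/sys/fs/cgroup/memory.max`) or cgroup v1 (`/sys/fs/cgroup/memory/memory.limit_in_bytes`) files, and the helper names `parse_limit` and `read_limit` are hypothetical.

```python
from pathlib import Path
from typing import Optional

# Standard cgroup locations; assumed to be visible inside the task container.
CGROUP_V2_LIMIT = Path("/sys/fs/cgroup/memory.max")
CGROUP_V1_LIMIT = Path("/sys/fs/cgroup/memory/memory.limit_in_bytes")


def parse_limit(raw: str) -> Optional[int]:
    """Return the memory limit in bytes, or None when no limit is set."""
    value = raw.strip()
    if value == "max":  # cgroup v2 spelling of "unlimited"
        return None
    limit = int(value)
    # cgroup v1 reports a very large number (close to 2**63) when unlimited.
    if limit >= 2**60:
        return None
    return limit


def read_limit() -> Optional[int]:
    """Read the first cgroup limit file that exists (hypothetical helper)."""
    for path in (CGROUP_V2_LIMIT, CGROUP_V1_LIMIT):
        if path.is_file():
            return parse_limit(path.read_text())
    return None


if __name__ == "__main__":
    limit = read_limit()
    if limit is None:
        print("no cgroup memory limit found (unlimited or file missing)")
    else:
        print(f"container memory limit: {limit / 2**30:.2f} GiB")
```

Running this inside the secretflow task pod on both the k8s cluster and minikube would show whether the two environments give the container the same limit. A value far below 40 GB on k8s would point at the pod's resource limits rather than at the node's capacity.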
Issue Type
Feature
Search for existing issues similar to yours
Yes
Kuscia Version
kuscia version v0.13.0b0-1-g6a3c90c
Link to Relevant Documentation
No response
Question Details