
OOM when running a PSI test job on k8s #601

Open
sfqq12345 opened this issue Mar 4, 2025 · 2 comments

@sfqq12345

Issue Type

Feature

Search for existing issues similar to yours

Yes

Kuscia Version

kuscia version v0.13.0b0-1-g6a3c90c

Link to Relevant Documentation

No response

Question Details

I deployed Kuscia in RunK mode on k8s and ran scripts/user/create_example_job.sh inside the alice party's pod. The job is created, but it hits an OOM at runtime. The k8s node has more than 40 GB of memory, and the schedulable memory in the ConfigMap is set to 40 GB. Deployed the same way on minikube, it runs fine. Part of the error-related log from the k8s run:

(SPURuntime(device_id=None, party=alice) pid=1674) [2025-03-04 06:44:52.613] [info] [bucket.cc:37] psi protocol=3, rank=1, inputs_size=9892
(SPURuntime(device_id=None, party=alice) pid=1674) [2025-03-04 06:44:52.613] [info] [bucket.cc:50] run psi bucket_idx=0, bucket_item_size=9892 
(SPURuntime(device_id=None, party=alice) pid=1674) [2025-03-04 06:44:52.616] [info] [thread_pool.cc:30] Create a fixed thread pool with size 23
(SPURuntime(device_id=None, party=alice) pid=1674) [2025-03-04 06:44:52.622] [info] [rr22_psi.cc:163] out buffer begin
(SPURuntime(device_id=None, party=alice) pid=1674) [2025-03-04 06:44:52.622] [info] [rr22_psi.cc:167] out buffer end
(SPURuntime(device_id=None, party=alice) pid=1674) [2025-03-04 06:44:52.626] [info] [rr22_oprf.cc:470] a_,b_ size:12684 12684
(SPURuntime(device_id=None, party=alice) pid=1674) [2025-03-04 06:44:52.626] [info] [rr22_oprf.cc:472] begin vole recv
(SPURuntime(device_id=None, party=alice) pid=1674) [2025-03-04 06:44:52.626] [info] [rr22_oprf.cc:489] solve begin
(SPURuntime(device_id=None, party=alice) pid=1674) [2025-03-04 06:44:52.628] [info] [rr22_oprf.cc:491] solve end
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff1b9eb0ea2fd5acc21b4e228f01000000 Worker ID: a6856c01f38e50cac0a654813cb0cbcfd2dd9471b40313a93494e586 Node ID: fc6452ea87e31f656039caaa3869c31e7a822751814b583eced54513 Worker IP address: secretflow-task-20250304144315-single-psi-0-global.alice.svc Worker port: 10028 Worker PID: 1674 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

The problem reproduces every time. Is it related to the k8s configuration, or are there parameters that need adjusting?
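
The raylet message above lists the OOM killer as root cause (1), and confirming that directly would narrow things down. A minimal sketch, assuming the task pod name from the log and an alice namespace visible to kubectl (adjust both to your deployment; kubectl top needs metrics-server):

    # Did the container get OOM-killed? Look for Reason: OOMKilled
    kubectl -n alice describe pod secretflow-task-20250304144315-single-psi-0-global | grep -iE -A3 "last state|oomkilled"

    # Live memory usage of the task pods while the job runs
    kubectl -n alice top pod

    # On the node that hosted the pod: kernel OOM-killer entries
    dmesg | grep -i "killed process"

If the container shows Reason: OOMKilled, it is the limit on the task pod that was exceeded, not the node's 40 GB.
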
@sfqq12345
Author

To add: the dataset files used by alice and bob are alice.csv and bob.csv from the deployment documentation.

@YanZhuangz
Contributor

> To add: the dataset files used by alice and bob are alice.csv and bob.csv from the deployment documentation.

I'm not sure whether the K8s environment has extra restrictions that minikube doesn't. You can try the following:

  1. Check how the data is provided. In RunK mode the data source must be remote, e.g. OSS/MinIO; with a K8s deployment, localfs only works under the runp runtime (the reference link is missing from the original comment). That said, this alone should not cause an OOM.
  2. If K8s imposes extra resource limits, configure the resources parameter in the AppImage to specify the resources allocated to the task pods it launches; see the sketch after this list.
  3. If the OOM persists, check the resources used by the task pod while it exists; if those look fine, then check whether the Kuscia node's available resources are the problem.
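
For point 2, here is a minimal sketch of raising the memory given to task pods via the AppImage. It assumes the stock secretflow-image AppImage shipped with Kuscia and the standard k8s container resource spec; the image name/tag and the sizes are placeholders, so check the real object first with kubectl get appimage secretflow-image -o yaml and edit that:

    apiVersion: kuscia.secretflow/v1alpha1
    kind: AppImage
    metadata:
      name: secretflow-image
    spec:
      deployTemplates:
      - name: secretflow
        replicas: 1
        spec:
          containers:
          - name: secretflow
            command:
            - sh
            resources:          # standard k8s requests/limits block
              requests:
                memory: 4Gi     # placeholder; size to your PSI input
                cpu: "2"
              limits:
                memory: 16Gi    # placeholder
                cpu: "8"
      image:
        name: secretflow/secretflow-lite-anolis8   # placeholder image
        tag: latest

After applying it with kubectl apply -f, re-run the job and watch the task pod's usage with the kubectl top command shown earlier.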
