
WC status and QP errors during stress testing #117

Open
power-more opened this issue Feb 22, 2025 · 4 comments

Comments

@power-more
Contributor

Stress test: bidirectional file reads transferring data.

int RdmaContext::poll(int num_entries, ibv_wc *wc, int cq_index) {
    int nr_poll = ibv_poll_cq(cq_list_[cq_index].native, num_entries, wc);
    if (nr_poll < 0) {
        LOG(ERROR) << "Failed to poll CQ " << cq_index << " of device "
                   << device_name_;
        return ERR_CONTEXT;
    }
    return nr_poll;
}

Here wc is declared as ibv_wc wc[kPollCount]; after polling, wc[i].status comes back as IBV_WC_GENERAL_ERR ("general error").
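For reference, a minimal sketch (mine, not from the repo) of how the failed completions could be dumped right after ibv_poll_cq, assuming the same nr_poll/wc values and glog-style logging as in the snippet above; ibv_wc_status_str is the libibverbs helper that turns the status into a readable string:

// Hypothetical diagnostic loop, not part of RdmaContext::poll as shipped.
for (int i = 0; i < nr_poll; ++i) {
    if (wc[i].status != IBV_WC_SUCCESS) {
        // Log the status string plus vendor code, work request id and QP
        // number to narrow down which slice and which QP failed.
        LOG(ERROR) << "WC error: " << ibv_wc_status_str(wc[i].status)
                   << ", vendor_err=" << wc[i].vendor_err
                   << ", wr_id=" << wc[i].wr_id
                   << ", qp_num=" << wc[i].qp_num;
    }
}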

And also:

epoll_event event;
int num_events = epoll_wait(context_.eventFd(), &event, 1, 100);
if (num_events > 0 && event.data.fd == context_.context()->async_fd) {
    ibv_async_event async_event;
    ibv_get_async_event(context_.context(), &async_event);
}

After ibv_get_async_event, the event_type is IBV_EVENT_QP_FATAL ("local work queue catastrophic error").
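As a small hedged addendum (not taken from the issue's code): libibverbs requires each retrieved async event to be acknowledged, and ibv_event_type_str gives the readable name that shows up in the log below; completing the fragment above would look roughly like:

// Hypothetical continuation: name and acknowledge the async event.
LOG(WARNING) << "Received context async event "
             << ibv_event_type_str(async_event.event_type);
ibv_ack_async_event(&async_event);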

@power-more
Contributor Author

E0222 20:07:21.119704 2179433 worker_pool.cpp:291] Worker: Process failed for slice (opcode: 0, source_addr: 0x7f5cfc000000, length: 65536, dest_addr: 140123979120640, local_nic: erdma_0, peer_nic: 10.0.0.41:7712@erdma_0, dest_rkey: 39908608, retry_cnt: 0): general error
W0222 20:07:21.119717 2179435 worker_pool.cpp:399] Worker: Received context async event local work queue catastrophic error for context erdma_0

@alogfans
Collaborator

IBV_WC_GENERAL_ERR: This event is generated when there is a transport error which cannot be described by the other specific events discussed here.

You could try to reproduce this in a different network environment (e.g. Mellanox or plain RoCEv2). Based on the information provided, this looks like it may be a driver-side issue.

Also, based on experience from resolving similar issues before, you can try tuning a few environment variables, in particular raising MC_MAX_CQE_PER_CTX and moderately lowering MC_MAX_WR, to check whether the cause is an incompletely fixed CQ overflow.

@power-more
Contributor Author

power-more commented Feb 24, 2025

That does not seem to help. I set MC_MAX_CQE_PER_CTX: 65535 and MC_MAX_WR: 1, which is the most extreme ratio possible, and the issue still reproduces. The behavior is also similar to what I observed with MC_MAX_CQE_PER_CTX: 2048 and MC_MAX_WR: 1024; the former did not even run noticeably longer. Judging from the test results alone, this is not necessarily the cause.
Taking a step back: even if it really were a matter of filling up the CQ cache, wouldn't a sustained stress test always fill it eventually?

@stmatengss
Collaborator

[Image: excerpt from the official spec]
This is an explanation from the official spec. One workaround is to re-establish the connection; this issue needs further testing.
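For concreteness, a rough sketch (my own, assuming the fatal QP is still reachable as a local ibv_qp* named qp) of what "re-establish the connection" amounts to: after IBV_EVENT_QP_FATAL the QP is in the error state, so it has to be moved back to RESET and brought up again (or destroyed and recreated) before new work requests can be posted.

// Hypothetical recovery path after IBV_EVENT_QP_FATAL; qp is assumed here.
ibv_qp_attr attr = {};
attr.qp_state = IBV_QPS_RESET;
if (ibv_modify_qp(qp, &attr, IBV_QP_STATE)) {
    LOG(ERROR) << "Failed to reset QP after fatal event";
}
// Then repeat the usual INIT -> RTR -> RTS transitions and exchange
// connection metadata with the peer again before resuming transfers.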
