
WC status and QP errors during stress testing #117

Open
power-more opened this issue Feb 22, 2025 · 4 comments

Comments

@power-more
Contributor

Stress test: bidirectional file reads transferring data.

int RdmaContext::poll(int num_entries, ibv_wc *wc, int cq_index) {
    int nr_poll = ibv_poll_cq(cq_list_[cq_index].native, num_entries, wc);
    if (nr_poll < 0) {
        LOG(ERROR) << "Failed to poll CQ " << cq_index << " of device "
                   << device_name_;
        return ERR_CONTEXT;
    }
    return nr_poll;
}

Here wc is declared as ibv_wc wc[kPollCount]; after polling, wc[i].status comes back as IBV_WC_GENERAL_ERR ("general error").
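For reference, a minimal sketch (mine, not from the repo) of how the failed completions could be dumped right after ibv_poll_cq, assuming the same nr_poll/wc values and glog-style logging as in the snippet above; ibv_wc_status_str is the libibverbs helper that turns the status into a readable string:

// Hypothetical diagnostic loop, not part of RdmaContext::poll as shipped.
for (int i = 0; i < nr_poll; ++i) {
    if (wc[i].status != IBV_WC_SUCCESS) {
        // Log the status string plus vendor code, work request id and QP
        // number to narrow down which slice and which QP failed.
        LOG(ERROR) << "WC error: " << ibv_wc_status_str(wc[i].status)
                   << ", vendor_err=" << wc[i].vendor_err
                   << ", wr_id=" << wc[i].wr_id
                   << ", qp_num=" << wc[i].qp_num;
    }
}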

And also:

epoll_event event;
int num_events = epoll_wait(context_.eventFd(), &event, 1, 100);
if (num_events > 0 && event.data.fd == context_.context()->async_fd) {
    ibv_async_event async_event;
    ibv_get_async_event(context_.context(), &async_event);
}

After ibv_get_async_event, the event_type is IBV_EVENT_QP_FATAL ("local work queue catastrophic error").
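As a small hedged addendum (not taken from the issue's code): libibverbs requires each retrieved async event to be acknowledged, and ibv_event_type_str gives the readable name that shows up in the log below; completing the fragment above would look roughly like:

// Hypothetical continuation: name and acknowledge the async event.
LOG(WARNING) << "Received context async event "
             << ibv_event_type_str(async_event.event_type);
ibv_ack_async_event(&async_event);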

@power-more
Contributor Author

E0222 20:07:21.119704 2179433 worker_pool.cpp:291] Worker: Process failed for slice (opcode: 0, source_addr: 0x7f5cfc000000, length: 65536, dest_addr: 140123979120640, local_nic: erdma_0, peer_nic: 10.0.0.41:7712@erdma_0, dest_rkey: 39908608, retry_cnt: 0): general error
W0222 20:07:21.119717 2179435 worker_pool.cpp:399] Worker: Received context async event local work queue catastrophic error for context erdma_0

@alogfans
Collaborator

IBV_WC_GENERAL_ERR: This event is generated when there is a transport error which cannot be described by the other specific events discussed here.

You could try to reproduce this in a different network environment (e.g. Mellanox or plain RoCEv2). Based on the information provided, this looks like it may be a driver-side issue.

Also, based on experience from resolving similar issues before, you can try tuning a few environment variables, in particular raising MC_MAX_CQE_PER_CTX and moderately lowering MC_MAX_WR, to check whether the cause is an incompletely fixed CQ overflow.

@power-more
Contributor Author

power-more commented Feb 24, 2025

That does not seem to help. I set MC_MAX_CQE_PER_CTX: 65535 and MC_MAX_WR: 1, which is the most extreme ratio possible, and the issue still reproduces. The behavior is also similar to what I observed with MC_MAX_CQE_PER_CTX: 2048 and MC_MAX_WR: 1024; the former did not even run noticeably longer. Judging from the test results alone, this is not necessarily the cause.
Taking a step back: even if it really were a matter of filling up the CQ cache, wouldn't a sustained stress test always fill it eventually?

@stmatengss
Collaborator

[Image: excerpt from the official spec]
This is an explanation from the official spec. One workaround is to re-establish the connection; this issue needs further testing.
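For concreteness, a rough sketch (my own, assuming the fatal QP is still reachable as a local ibv_qp* named qp) of what "re-establish the connection" amounts to: after IBV_EVENT_QP_FATAL the QP is in the error state, so it has to be moved back to RESET and brought up again (or destroyed and recreated) before new work requests can be posted.

// Hypothetical recovery path after IBV_EVENT_QP_FATAL; qp is assumed here.
ibv_qp_attr attr = {};
attr.qp_state = IBV_QPS_RESET;
if (ibv_modify_qp(qp, &attr, IBV_QP_STATE)) {
    LOG(ERROR) << "Failed to reset QP after fatal event";
}
// Then repeat the usual INIT -> RTR -> RTS transitions and exchange
// connection metadata with the peer again before resuming transfers.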
