Stress test hits bad WC status and QP fatal errors #117
E0222 20:07:21.119704 2179433 worker_pool.cpp:291] Worker: Process failed for slice (opcode: 0, source_addr: 0x7f5cfc000000, length: 65536, dest_addr: 140123979120640, local_nic: erdma_0, peer_nic: 10.0.0.41:7712@erdma_0, dest_rkey: 39908608, retry_cnt: 0): general error
You could try to reproduce this in a different network environment (e.g. Mellanox or plain RoCEv2). From the information provided, this looks like it may be a driver-side problem. Also, based on how similar issues were resolved before, you can try tuning some environment variables, in particular raising MC_MAX_CQE_PER_CTX and moderately lowering MC_MAX_WR, to check whether this comes from an incomplete fix of the CQ overflow problem.
That does not seem to help. I re-ran with MC_MAX_CQE_PER_CTX=65535 and MC_MAX_WR=1, which is already the most extreme ratio possible, and the error still reproduces. The behavior also looks the same as with MC_MAX_CQE_PER_CTX=2048 and MC_MAX_WR=1024; the former did not even run a few seconds longer. Judging purely from the test results, this is probably not the cause.
The workload is a bidirectional stress test that reads files and transfers the data.
int RdmaContext::poll(int num_entries, ibv_wc *wc, int cq_index) {
int nr_poll = ibv_poll_cq(cq_list_[cq_index].native, num_entries, wc);
if (nr_poll < 0) {
LOG(ERROR) << "Failed to poll CQ " << cq_index << " of device "
<< device_name_;
return ERR_CONTEXT;
}
return nr_poll;
}

Here wc is declared as ibv_wc wc[kPollCount];, and the failed entries come back with wc[i].status == IBV_WC_GENERAL_ERR ("general error").
In addition:
struct epoll_event event;
int num_events = epoll_wait(context_.eventFd(), &event, 1, 100);
if (num_events > 0 && event.data.fd == context_.context()->async_fd) {
    ibv_async_event async_event;
    if (ibv_get_async_event(context_.context(), &async_event) == 0) {
        // inspect async_event.event_type, then ibv_ack_async_event(&async_event);
    }
}
After ibv_get_async_event returns, the event_type is IBV_EVENT_QP_FATAL ("local work queue catastrophic error").