DAOS-16971 cart: add RPC origin address #15820
base: google/2.6
Ticket title is 'bulk cancel causing libfabric progress failure.'
Add cart RPC origin address to error handling.

Run-GHA: true
Allow-unstable-test: true
Signed-off-by: Di Wang <[email protected]>
db7cb67 to 24f7c0c
Few comments inline. Perhaps for the error cases we can include 'origin info' as part of RPC_ERROR and RPC_WARN as well - while there is some overhead it might be ok for error cases and help with debug.
uint32_t addr_size = 48;

crt_rpc_get_origin_addr(rpc, addr, &addr_size);
D_ERROR("rpc %p opc %d send reply, pmv %d, epoch " DF_X64 ", status %d from %s\n",
For a case like this it would be better to add a public variant of RPC_ERROR() macro and use it to print additional info, as it will print decoded opcode for example along with rpcid and other fields that are helpful in debug.
sure
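A public variant of the macro, as suggested above, might look roughly like the following. This is a minimal sketch only: `sketch_rpc`, `sketch_log`, and `SKETCH_RPC_ERROR` are hypothetical names, not CaRT's actual `RPC_ERROR` definition, and the message goes to a buffer rather than the DAOS log so that the result can be inspected.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for an RPC handle; the real crt_rpc_t differs. */
struct sketch_rpc {
	uint32_t    opc;    /* opcode */
	const char *origin; /* origin address, already resolved to a string */
};

/* Hypothetical sink; real code would write to the DAOS log instead. */
static char sketch_log[256];

/*
 * Hypothetical public variant of RPC_ERROR: every message gets the same
 * prefix (opcode, origin), so call sites only supply their own text.
 * ##__VA_ARGS__ is a GNU extension accepted by gcc and clang.
 */
#define SKETCH_RPC_ERROR(rpc, fmt, ...)                                      \
	snprintf(sketch_log, sizeof(sketch_log), "[opc=%#x origin=%s] " fmt, \
		 (unsigned int)(rpc)->opc, (rpc)->origin, ##__VA_ARGS__)

/* Usage: format one error line and check that the origin ended up in it. */
int
sketch_demo(void)
{
	struct sketch_rpc rpc = {.opc = 0x10, .origin = "10.0.0.1:31416"};

	SKETCH_RPC_ERROR(&rpc, "send reply, status %d\n", -1011);
	return strstr(sketch_log, "origin=10.0.0.1:31416") != NULL;
}
```

With a macro of this shape, the call site no longer declares its own address buffer or calls crt_rpc_get_origin_addr() directly.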
D_ERROR("rpc %p opc %d send reply, pmv %d, epoch " DF_X64 ", status %d from %s\n",
	rpc, opc_get(rpc->cr_opc), ioc->ioc_map_ver, orwo->orw_epoch, status, addr);
} else {
	D_DEBUG(DB_IO, "rpc %p opc %d send reply, pmv %d, epoch " DF_X64 ", status %d\n",
same here, but with a variant of RPC_DEBUG
sure
@@ -2375,6 +2376,18 @@ int crt_context_quota_limit_get(crt_context_t crt_ctx, crt_quota_type_t quota, i
int
crt_req_get_proto_ver(crt_rpc_t *req);

/**
 * Get the rpc original address.
s/original/origin/ ?
sure
 *
 * \param[in] rpc pointer to RPC request
 * \param[out] addr pointer to the converted buffer.
 * \param[in/out] addr_size the size of the converted buffer.
nit: 'addr_len' would be clearer to me than 'addr_size'.
sure
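For reference, one common convention for an in/out length parameter like this can be sketched as follows. It assumes, and the PR does not confirm this, that the value is the buffer capacity on entry and the number of bytes needed (including the terminating NUL) on return. `sketch_copy_addr` is a hypothetical helper, not the real crt_rpc_get_origin_addr().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: copy the address string `src` into `addr`.
 * On entry *addr_len is the capacity of `addr`; on return it is the number
 * of bytes needed, including the terminating NUL.
 * Returns 0 on success, -1 if the buffer is missing or too small.
 */
static int
sketch_copy_addr(const char *src, char *addr, uint32_t *addr_len)
{
	uint32_t need;

	assert(src != NULL && addr_len != NULL);
	need = (uint32_t)strlen(src) + 1;
	if (addr == NULL || *addr_len < need) {
		/* Tell the caller how large the buffer has to be. */
		*addr_len = need;
		return -1;
	}
	memcpy(addr, src, need);
	*addr_len = need;
	return 0;
}
```

Under this convention a caller with a fixed 48-byte buffer can detect truncation from the return value instead of printing a partially filled string.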
uint32_t addr_size = 48;

crt_rpc_get_origin_addr(cb_info->bci_bulk_desc->bd_rpc, addr, &addr_size);
D_ERROR("bulk transfer failed: %d from %s\n", cb_info->bci_rc, addr);
maybe we should then include the origin_addr info for any transfer failure inside crt_bulk_transfer/cb, so that it covers other places too?
ah, good suggestion.
sure
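The idea of reporting the origin once in the common bulk completion path, rather than in each caller's callback, can be sketched as below. All names here (`sketch_bulk_cb_info`, `sketch_bulk_complete`, `sketch_user_cb`) are hypothetical and much simpler than the real CaRT bulk structures; the log line is written to a buffer only so that the behaviour can be checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, trimmed-down completion info; the real crt_bulk_cb_info differs. */
struct sketch_bulk_cb_info {
	int         rc;     /* transfer result, 0 on success */
	const char *origin; /* origin address of the owning RPC, as a string */
};

typedef int (*sketch_bulk_cb_t)(const struct sketch_bulk_cb_info *info);

static char sketch_bulk_log[128]; /* hypothetical sink instead of the DAOS log */
static int  sketch_user_cb_calls;

/* Example caller-supplied callback: it needs no origin lookup of its own. */
static int
sketch_user_cb(const struct sketch_bulk_cb_info *info)
{
	sketch_user_cb_calls++;
	return info->rc;
}

/* Common completion path: report any failure once, then hand over to the caller. */
static int
sketch_bulk_complete(const struct sketch_bulk_cb_info *info, sketch_bulk_cb_t user_cb)
{
	assert(info != NULL);
	if (info->rc != 0)
		snprintf(sketch_bulk_log, sizeof(sketch_bulk_log),
			 "bulk transfer failed: %d from %s", info->rc, info->origin);
	return user_cb != NULL ? user_cb(info) : info->rc;
}
```

Every bulk user then gets the origin in its failure message without repeating the lookup at each call site.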
uint32_t addr_size = 48;

crt_rpc_get_origin_addr(rpc, addr, &addr_size);
D_ERROR(
same here, a public variant of RPC_ERROR (e.g. D_RPC_ERROR) would be handy here to add and use
Functional on EL 8.8 Test Results: 131 tests, 127 ✅, 1h 28m 48s ⏱️. Results for commit 24f7c0c.
I am personally not a big fan of spreading calls to crt_rpc_get_origin_addr() everywhere in the source code. Can't we use TLS, or even store the address in a field of the cart RPC structure?
yeah, we've been further discussing this and jeff also suggested storing resolved addr on first lookup in rpc_priv
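Caching the resolved address on first lookup could be sketched like this. The struct and function names are hypothetical (the real rpc_priv layout is not shown in this thread), the "resolve" step is a plain string copy standing in for the transport-level lookup, and the sketch is not thread-safe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical private RPC state with a cached origin string. */
struct sketch_rpc_priv {
	const char *raw;          /* stand-in for the transport-level address */
	char        origin[48];   /* resolved address, filled on first lookup */
	bool        origin_valid; /* true once origin[] has been filled */
};

static unsigned int sketch_resolve_calls; /* counts the (expensive) resolve steps */

/* Return the origin string, resolving it only on the first call. */
static const char *
sketch_origin(struct sketch_rpc_priv *p)
{
	assert(p != NULL && p->raw != NULL);
	if (!p->origin_valid) {
		size_t n = strlen(p->raw);

		/* Truncate if needed so that the cached string stays NUL-terminated. */
		if (n >= sizeof(p->origin))
			n = sizeof(p->origin) - 1;
		memcpy(p->origin, p->raw, n);
		p->origin[n] = '\0';
		p->origin_valid = true;
		sketch_resolve_calls++;
	}
	return p->origin;
}
```

Logging macros could then read the cached string, so repeated error messages for the same RPC pay for the resolution only once.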
Functional Hardware Large Test Results: 64 tests, 64 ✅, 29m 32s ⏱️. Results for commit 24f7c0c.
Add cart RPC origin address to error handling.