DAOS-14903 container: refine task process in pmap_refresh_cb (#13491) #13682

Merged
merged 1 commit into from
Feb 2, 2024

liuxuezhao
Contributor

@liuxuezhao liuxuezhao commented Jan 29, 2024

Register the completion callback before reinitializing the task; otherwise the completion callback may never be triggered.

This PR also includes a few other backports:
6d4e549 - DAOS-14788 pool: Fix some reinit usages (#13518)
91b93c8 - DAOS-13252 tests: set svcn for multiple_failure test (#13619)
d30e842 - DAOS-14903 object: fix bug in peer status check (#13585)
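
The ordering problem named in the summary (register the completion callback before reinitializing the task) can be illustrated with a small toy model. This is only a sketch under stated assumptions: every name below (`toy_task`, `toy_register_comp_cb`, `toy_reinit`, `toy_progress`) is hypothetical and is not the DAOS TSE API, and the "other thread" is simulated by an explicit call to `toy_progress()`.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model with hypothetical names; this is NOT the DAOS TSE API. */
typedef void (*toy_cb_t)(void *arg);

struct toy_task {
	int queued;        /* 1 while the task waits to be executed */
	toy_cb_t comp_cb;  /* one-shot completion callback, or NULL */
	void *comp_arg;
};

/* Register a one-shot callback for the task's next completion. */
static void toy_register_comp_cb(struct toy_task *t, toy_cb_t cb, void *arg)
{
	t->comp_cb = cb;
	t->comp_arg = arg;
}

/* Put the task back on the run queue; it may run at any time afterwards. */
static void toy_reinit(struct toy_task *t)
{
	t->queued = 1;
}

/* Stand-in for another thread progressing the scheduler. */
static void toy_progress(struct toy_task *t)
{
	toy_cb_t cb;

	if (!t->queued)
		return;
	t->queued = 0;      /* the task body runs and completes here */
	cb = t->comp_cb;
	t->comp_cb = NULL;  /* callbacks are consumed on completion */
	if (cb != NULL)
		cb(t->comp_arg);
}

static void count_cb(void *arg)
{
	(*(int *)arg)++;
}

/* Safe order: the callback is in place before the task can run again. */
static int register_then_reinit(void)
{
	struct toy_task t = {0};
	int fired = 0;

	toy_register_comp_cb(&t, count_cb, &fired);
	toy_reinit(&t);
	toy_progress(&t);   /* "other thread" completes the task */
	return fired;
}

/* Racy order: the task completes in the window before registration. */
static int reinit_then_register(void)
{
	struct toy_task t = {0};
	int fired = 0;

	toy_reinit(&t);
	toy_progress(&t);   /* completes with no callback registered */
	toy_register_comp_cb(&t, count_cb, &fired);
	toy_progress(&t);   /* nothing is queued, so nothing fires */
	return fired;
}
```

In this model, `register_then_reinit()` reports one callback invocation and `reinit_then_register()` reports none, which mirrors the "complete cb possibly cannot be triggered" case. The helpers can be called from any `main()`.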

Required-githooks: true

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follows the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate watchers.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

@liuxuezhao liuxuezhao requested a review from a team as a code owner January 29, 2024 14:05

Bug-tracker data:
Ticket title is 'fix bug in EC aggregation's peer update'
Status is 'Awaiting backport'
Job should run at elevated priority (1)
https://daosio.atlassian.net/browse/DAOS-14903

@github-actions github-actions bot added the priority Ticket has high priority (automatically managed) label Jan 29, 2024
Collaborator

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

Contributor

@daltonbohning daltonbohning left a comment

This should at least run the modified functional test and possibly some other features.

Features: EcodOnlineMultFail

should register completion callback before task reinit, or the complete
cb possibly cannot be triggered.

and a few other backports:
6d4e549 - DAOS-14788 pool: Fix some reinit usages (#13518)
91b93c8 - DAOS-13252 tests: set svcn for multiple_failure test (#13619)
d30e842 - DAOS-14903 object: fix bug in peer status check (#13585)

Required-githooks: true
Test-tag: pr ec_multiple_failure

Signed-off-by: Xuezhao Liu <[email protected]>
Signed-off-by: Li Wei <[email protected]>
@liuxuezhao liuxuezhao force-pushed the lxz/pmap_refresh_cb_2.4 branch from 890a562 to 1d6011d Compare January 30, 2024 01:16
Collaborator

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

Contributor

@daltonbohning daltonbohning left a comment

ftest LGTM

@liuxuezhao liuxuezhao added the clean-cherry-pick Cherry-pick from another branch that did not require additional edits label Feb 1, 2024
@@ -1974,6 +1981,7 @@ map_refresh(tse_task_t *task)

D_DEBUG(DB_MD, DF_UUID": %p: asking rank %u for version > %u\n",
DP_UUID(pool->dp_pool), task, rank, version);
dc_pool_put(pool);
Contributor

Can we hold the reference while the RPC is being processed on the server side, and then have map_refresh_cb() release it? Then map_refresh_cb() would not need to take the reference again. Otherwise, if the pool can be freed before map_refresh_cb() is triggered, dc_pool_get() in map_refresh_cb() may access invalid memory.

Contributor Author

I'll leave this to @liw to answer, as this part is backported from one of his PRs. Thanks.

Contributor

@Nasf-Fan, arg->mra_pool is the reference that prevents the dc_pool from being freed before map_refresh_cb accesses it. The extra references taken in this PR are for the tricky reinit cases, where another thread could execute the reinitialized task and (in theory) free the arg->mra_pool reference after the reinit call but before the function that calls reinit returns.
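
The reinit hazard described above can be sketched with a toy reference count. This is a minimal illustration, assuming hypothetical names (`toy_pool`, `toy_pool_get`, `toy_pool_put`, `other_thread_completes_task`) rather than the real dc_pool code, and it uses a `freed` flag instead of actually freeing memory so that the invalid access can be observed safely.

```c
#include <assert.h>

/* Toy model with hypothetical names; not the DAOS client code. */
struct toy_pool {
	int refs;   /* reference count */
	int freed;  /* set when the last reference is dropped */
};

static void toy_pool_get(struct toy_pool *p)
{
	p->refs++;
}

static void toy_pool_put(struct toy_pool *p)
{
	if (--p->refs == 0)
		p->freed = 1;  /* a real implementation would free here */
}

/*
 * Stand-in for another thread that runs the reinitialized task to its
 * final completion, which drops the reference owned by the task.
 */
static void other_thread_completes_task(struct toy_pool *task_ref)
{
	toy_pool_put(task_ref);
}

/*
 * Returns 1 if the pool is still valid when the function that called
 * reinit touches it again, 0 if it would be a use-after-free.
 */
static int pool_valid_after_reinit(int take_extra_ref)
{
	struct toy_pool p = { .refs = 1, .freed = 0 };  /* the task's reference */
	int valid;

	if (take_extra_ref)
		toy_pool_get(&p);  /* local reference for the rest of this function */

	/* The reinit call would be here; the task may now run elsewhere. */
	other_thread_completes_task(&p);

	valid = !p.freed;          /* the caller's later access to the pool */

	if (take_extra_ref)
		toy_pool_put(&p);
	return valid;
}
```

With the extra reference the pool stays valid until the calling function is done with it; without it, the access after reinit sees an already released pool.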

@Nasf-Fan Nasf-Fan self-requested a review February 1, 2024 14:59
@Nasf-Fan
Contributor

Nasf-Fan commented Feb 2, 2024

Consider the following scenario: map_refresh() takes the reference at the beginning of the function, then registers map_refresh_cb for the task. Before sending the RPC, it puts the reference on the pool. At that point, map_refresh_cb has not been called yet. Is it possible that someone frees the pool before map_refresh_cb() is triggered? If so, will calling dc_pool_get(pool) in map_refresh_cb() access invalid memory?

@liuxuezhao
Contributor Author

> Consider the following scenario: map_refresh() takes the reference at the beginning of the function, then registers map_refresh_cb for the task. Before sending the RPC, it puts the reference on the pool. At that point, map_refresh_cb has not been called yet. Is it possible that someone frees the pool before map_refresh_cb() is triggered? If so, will calling dc_pool_get(pool) in map_refresh_cb() access invalid memory?

As I understand it, before map_refresh() is called, dc_pool_create_map_refresh_task() takes one extra reference ("dc_pool_get(pool); a->mra_pool = pool;"), and that reference is released in the "!reinit" case of map_refresh_cb(). Would that avoid the case you are worried about? @Nasf-Fan
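
The lifetime rule in this reply (a reference taken at task creation and dropped only in the callback's "!reinit" case) can be sketched as follows. All names here are hypothetical stand-ins chosen for illustration (`toy_create_refresh_task`, `toy_refresh_cb`, and so on), not the actual DAOS functions, and the `freed` flag replaces a real free.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model with hypothetical names; not the DAOS client code. */
struct toy_pool {
	int refs;   /* reference count */
	int freed;  /* set when the last reference is dropped */
};

struct toy_refresh_arg {
	struct toy_pool *pool;  /* plays the role of mra_pool */
};

static void toy_pool_get(struct toy_pool *p)
{
	p->refs++;
}

static void toy_pool_put(struct toy_pool *p)
{
	if (--p->refs == 0)
		p->freed = 1;  /* a real implementation would free here */
}

/* Task creation pins the pool for the lifetime of the task. */
static void toy_create_refresh_task(struct toy_pool *p,
				    struct toy_refresh_arg *a)
{
	toy_pool_get(p);
	a->pool = p;
}

/* Completion callback: keep the reference if the task will run again. */
static void toy_refresh_cb(struct toy_refresh_arg *a, int reinit)
{
	if (!reinit) {
		toy_pool_put(a->pool);
		a->pool = NULL;
	}
}

/* Returns 1 if the pool stayed alive until the final callback. */
static int pool_survives_until_final_cb(void)
{
	struct toy_pool p = { .refs = 1, .freed = 0 };  /* caller's handle */
	struct toy_refresh_arg a = { NULL };
	int alive_during_retry;

	toy_create_refresh_task(&p, &a);  /* refs == 2 */
	toy_pool_put(&p);                 /* caller drops its handle: refs == 1 */
	toy_refresh_cb(&a, 1);            /* retry: the reference is kept */
	alive_during_retry = !p.freed;
	toy_refresh_cb(&a, 0);            /* final completion: refs == 0 */
	return alive_during_retry && p.freed;
}
```

In this model the pool outlives the caller's own handle and is released only by the final, non-reinit callback.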

@Nasf-Fan
Contributor

Nasf-Fan commented Feb 2, 2024

> > Consider the following scenario: map_refresh() takes the reference at the beginning of the function, then registers map_refresh_cb for the task. Before sending the RPC, it puts the reference on the pool. At that point, map_refresh_cb has not been called yet. Is it possible that someone frees the pool before map_refresh_cb() is triggered? If so, will calling dc_pool_get(pool) in map_refresh_cb() access invalid memory?

> As I understand it, before map_refresh() is called, dc_pool_create_map_refresh_task() takes one extra reference ("dc_pool_get(pool); a->mra_pool = pool;"), and that reference is released in the "!reinit" case of map_refresh_cb(). Would that avoid the case you are worried about? @Nasf-Fan

Thanks for the explanation. I am wondering whether map_refresh_cb() can properly distinguish the "reinit" case. For example, suppose someone triggers a reinit, as shown in the following excerpt:

tse_task_reinit_with_delay(tse_task_t *task, uint64_t delay)
{
...
        task->dt_result = 0;

        /** Move back to init list */
        if (delay == 0) {
                dtp->dtp_wakeup_time = 0;
                d_list_move_tail(&dtp->dtp_list, &dsp->dsp_init_list);
        } else {
                dtp->dtp_wakeup_time = daos_getutime() + delay;
                d_list_del_init(&dtp->dtp_list);
                tse_task_insert_sleeping(dtp, dsp);
        }

At that point, dt_result has been reset to zero, so whether the "reinit" flag is set depends on pool_tgt_query_map_out::po_rc. But the related (re-sent) RPC may not have been handled by the server yet, so could that also be zero?

This issue is not directly related to the backport patch, but it may be worth further consideration.

@liw
Contributor

liw commented Feb 2, 2024

> At that point, dt_result has been reset to zero, so whether the "reinit" flag is set depends on pool_tgt_query_map_out::po_rc. But the related (re-sent) RPC may not have been handled by the server yet, so could that also be zero?

@Nasf-Fan, the reinit variable in map_refresh_cb means "shall reinitialize the current task", not "the current task has been reinitialized before". Does that answer your concern? If not, what is your concern (sorry, I'm not confident that I get what you mean)?

@Nasf-Fan
Contributor

Nasf-Fan commented Feb 2, 2024

> > At that point, dt_result has been reset to zero, so whether the "reinit" flag is set depends on pool_tgt_query_map_out::po_rc. But the related (re-sent) RPC may not have been handled by the server yet, so could that also be zero?

> @Nasf-Fan, the reinit variable in map_refresh_cb means "shall reinitialize the current task", not "the current task has been reinitialized before". Does that answer your concern? If not, what is your concern (sorry, I'm not confident that I get what you mean)?

I see, thanks

@liuxuezhao liuxuezhao requested a review from a team February 2, 2024 09:42
@daltonbohning daltonbohning merged commit 60bbb3a into release/2.4 Feb 2, 2024
36 of 37 checks passed
@daltonbohning daltonbohning deleted the lxz/pmap_refresh_cb_2.4 branch February 2, 2024 18:43
Labels
clean-cherry-pick Cherry-pick from another branch that did not require additional edits priority Ticket has high priority (automatically managed)
5 participants