Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RSDK-9588 - Take reconfigurationLock at a higher level #4645

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

cheukt
Copy link
Member

@cheukt cheukt commented Dec 18, 2024

take the reconfigurationLock earlier and at a higher level. This may slow down reconfiguration, but ultimately should stop go routines interleaving and make reconfiguration easier to reason about

@cheukt cheukt marked this pull request as ready for review December 18, 2024 22:37
@viambot viambot added the safe to test This pull request is marked safe to test from a trusted zone label Dec 18, 2024
@benjirewis
Copy link
Member

This was brought up offline briefly, but why are we adding extra locking here? It sounds like we haven't seen an issue wrt multiple updateWeakDependents racing, and I would go so far as to say that adding the locking you have here makes us at least susceptible to a new kind of deadlock.

@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jan 3, 2025
@cheukt cheukt changed the title RSDK-9588 - Always run updateWeakDependent within a lock RSDK-9588 - Take reconfigurationLock at a higher level Jan 3, 2025
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jan 3, 2025
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jan 3, 2025
Copy link
Member Author

@cheukt cheukt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

every thing that was locked before is still locked, just running tests

@@ -55,9 +54,7 @@ type resourceManager struct {
opts resourceManagerOptions
logger logging.Logger

// resourceGraphLock manages access to the resource graph and nodes. If either may change, this lock should be taken.
resourceGraphLock sync.Mutex
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved the lock up to local_robot instead

@@ -172,7 +175,9 @@ func (r *localRobot) Close(ctx context.Context) error {
err = multierr.Combine(err, r.cloudConnSvc.Close(ctx))
}
if r.manager != nil {
r.reconfigurationLock.Lock()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we grab the lock here for longer compared to inside resourceManager.Close()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're just mentioning that as part of the warning in the PR description about "locking for longer/more often than we were locking before"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya, this is the biggest change/risk based on the amount of extra locking, glad the tests pass haha

@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jan 3, 2025
@@ -4487,7 +4489,7 @@ func TestRemovingOfflineRemotes(t *testing.T) {
// Ensure that the remote is not marked for removal while trying to connect to the remote
remote, ok := localRobot.manager.resources.Node(remoteName)
test.That(t, ok, test.ShouldBeTrue)
test.That(t, remote.MarkedForRemoval(), test.ShouldBeTrue)
test.That(t, remote.MarkedForRemoval(), test.ShouldBeFalse)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure why this was true to begin with, the comment says this should be false. In any case, false makes more sense - we shouldn't have gotten into Reconfigure yet at this point

Copy link
Member

@dgottlieb dgottlieb Jan 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty scary this was passing pre-PR!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree; how could it have been passing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think pre-PR, the Reconfigure function can progress until after the remote is marked for removal before it needs the lock, which means it can be true

@@ -1113,7 +1129,7 @@ func (r *localRobot) applyLocalModuleVersions(cfg *config.Config) {
}
}

func (r *localRobot) reconfigure(ctx context.Context, newConfig *config.Config, forceSync bool) {
func (r *localRobot) reconfigureInLock(ctx context.Context, newConfig *config.Config, forceSync bool) {
Copy link
Member

@dgottlieb dgottlieb Jan 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't comment off-diff. I'm not sure if there's a thing to do in this PR, or whether it's worth postponing.

But now that we have a lock for reconfigure, do we still need the localRobot.reconfiguring variable that this function sets to true while it's running? If all of the readers of that variable are now taking the reconfigurationLock, we can presumably remove it altogether.

Same question for the localRobot.configRevisionMu. And maybe localRobot.lastWeakDependentsRound?

I'm perfectly happy to postpone investigation after merging the PR. I very much want this change to go in. But I can't tell if those variables remain unchanged because changing them is incorrect, or if we haven't taken a closer look yet on how this bigger lock impacts their usage/necessity.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

localRobot.reconfiguring is still used as part of determining whether it's alright for viam agent to restart viam server, so that one, at least, is probably still necessary. You raise good points about the other two, though; I'd be happy to look at those as part of a separate ticket/PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need to call this method reconfigureInLock, though? That name now confuses me with the presence of reconfigurationLock.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed back to reconfigure, I think we need the variables because the readers of the variables do not always take the lock. having the configRevisionMu means we won't have to wait for reconfiguration to complete to get the config revision. lastWeakDependentsRound is still needed to prevent extra weak dependent updates

@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jan 3, 2025
@cheukt cheukt requested review from benjirewis and dgottlieb January 3, 2025 20:46
Copy link
Member

@benjirewis benjirewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks for taking this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
safe to test This pull request is marked safe to test from a trusted zone
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants