-
Notifications
You must be signed in to change notification settings - Fork 111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RSDK-9588 - Take reconfigurationLock at a higher level #4645
base: main
Are you sure you want to change the base?
Conversation
This was brought up offline briefly, but why are we adding extra locking here? It sounds like we haven't seen an issue wrt multiple |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
every thing that was locked before is still locked, just running tests
@@ -55,9 +54,7 @@ type resourceManager struct { | |||
opts resourceManagerOptions | |||
logger logging.Logger | |||
|
|||
// resourceGraphLock manages access to the resource graph and nodes. If either may change, this lock should be taken. | |||
resourceGraphLock sync.Mutex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved the lock up to local_robot instead
@@ -172,7 +175,9 @@ func (r *localRobot) Close(ctx context.Context) error { | |||
err = multierr.Combine(err, r.cloudConnSvc.Close(ctx)) | |||
} | |||
if r.manager != nil { | |||
r.reconfigurationLock.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we grab the lock here for longer compared to inside resourceManager.Close()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're just mentioning that as part of the warning in the PR description about "locking for longer/more often than we were locking before"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ya, this is the biggest change/risk based on the amount of extra locking, glad the tests pass haha
@@ -4487,7 +4489,7 @@ func TestRemovingOfflineRemotes(t *testing.T) { | |||
// Ensure that the remote is not marked for removal while trying to connect to the remote | |||
remote, ok := localRobot.manager.resources.Node(remoteName) | |||
test.That(t, ok, test.ShouldBeTrue) | |||
test.That(t, remote.MarkedForRemoval(), test.ShouldBeTrue) | |||
test.That(t, remote.MarkedForRemoval(), test.ShouldBeFalse) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure why this was true to begin with, the comment says this should be false. In any case, false makes more sense - we shouldn't have gotten into Reconfigure yet at this point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty scary this was passing pre-PR!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree; how could it have been passing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think pre-PR, the Reconfigure function can progress until after the remote is marked for removal before it needs the lock, which means it can be true
robot/impl/local_robot.go
Outdated
@@ -1113,7 +1129,7 @@ func (r *localRobot) applyLocalModuleVersions(cfg *config.Config) { | |||
} | |||
} | |||
|
|||
func (r *localRobot) reconfigure(ctx context.Context, newConfig *config.Config, forceSync bool) { | |||
func (r *localRobot) reconfigureInLock(ctx context.Context, newConfig *config.Config, forceSync bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't comment off-diff. I'm not sure if there's a thing to do in this PR, or whether it's worth postponing.
But now that we have a lock for reconfigure, do we still need the localRobot.reconfiguring
variable that this function sets to true while it's running? If all of the readers of that variable are now taking the reconfigurationLock
, we can presumably remove it altogether.
Same question for the localRobot.configRevisionMu
. And maybe localRobot.lastWeakDependentsRound
?
I'm perfectly happy to postpone investigation after merging the PR. I very much want this change to go in. But I can't tell if those variables remain unchanged because changing them is incorrect, or if we haven't taken a closer look yet on how this bigger lock impacts their usage/necessity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
localRobot.reconfiguring
is still used as part of determining whether it's alright for viam agent to restart viam server, so that one, at least, is probably still necessary. You raise good points about the other two, though; I'd be happy to look at those as part of a separate ticket/PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need to call this method reconfigureInLock
, though? That name now confuses me with the presence of reconfigurationLock
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed back to reconfigure
, I think we need the variables because the readers of the variables do not always take the lock. having the configRevisionMu means we won't have to wait for reconfiguration to complete to get the config revision. lastWeakDependentsRound is still needed to prevent extra weak dependent updates
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM thanks for taking this!
take the reconfigurationLock earlier and at a higher level. This may slow down reconfiguration, but ultimately should stop go routines interleaving and make reconfiguration easier to reason about