
RSDK-9591 - Kill all lingering module processes before exiting #4657

Merged
merged 11 commits into viamrobotics:main from close-faster
Jan 10, 2025

Conversation

cheukt
Member

@cheukt cheukt commented Dec 27, 2024

This is part two of two PRs that will hopefully help with shutting down all module processes before viam-server exits. Part one is here

Before doing this, I looked into assigning module processes to the same process group as the viam-server and just killing the process group. However, we already have each module and process assigned to unique process groups, and we use that property to kill each module and process separately when necessary. Changing that behavior would be risky, so I did not pursue that path further.

We could kill each process in the mod manager directly using the exposed unixpid, but I figured we could just do it within each managed process; that way we get Windows support as well. It does mean I added Kill() to a few interfaces, but it should hopefully be extensible in case anything else needs killing.

The idea is for a Kill() call to propagate from the viam-server at the end of the 90s shutdown window, and it should not block on anything if possible. Kill() does not care about the resource graph, only that we kill the processes/module processes spawned by the server. I did not do the killing in parallel, since the calls will not block. I can see things racing with Close(), but I think the mitigation is to make sure that kill/close is idempotent and will not panic if the two overlap. This Kill() call happens in the same goroutine that eventually calls log.Fatal; is that good enough for now, or should we create a different goroutine so that we can guarantee the viam-server exits by the 90s mark?
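
For reference, a minimal, self-contained sketch of the intended propagation chain (the types below are simplified stand-ins for illustration, not the actual rdk structs):

package main

import "fmt"

// managedProcess stands in for a process spawned for a module.
type managedProcess struct{ name string }

// Kill forcefully stops the underlying OS process (group); best effort, no waiting.
func (p *managedProcess) Kill() { fmt.Println("killing process group for", p.name) }

// modManager stands in for the module manager.
type modManager struct{ procs []*managedProcess }

func (m *modManager) Kill() {
	for _, p := range m.procs {
		p.Kill() // calls do not block, so no need to kill in parallel
	}
}

// resourceManager stands in for the resource manager.
type resourceManager struct{ moduleManager *modManager }

func (rm *resourceManager) Kill() { rm.moduleManager.Kill() }

// localRobot stands in for the robot; Kill ignores the resource graph entirely.
type localRobot struct{ manager *resourceManager }

func (r *localRobot) Kill() { r.manager.Kill() }

func main() {
	r := &localRobot{&resourceManager{&modManager{[]*managedProcess{{"my-python-module"}}}}}
	r.Kill() // invoked from the server's force-shutdown path at the 90s mark
}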

Ideas for testing? I've tested on a Python module and observed that the module process does get killed; it would be good to test on setups where this is happening.

@viambot viambot added the safe to test This pull request is marked safe to test from a trusted zone label Dec 27, 2024
@@ -280,6 +283,9 @@ func (s *robotServer) serveWeb(ctx context.Context, cfg *config.Config) (err err
case <-doneServing:
Member

While we're here, I'd recommend removing this whole case <-doneServing stuff (and incidentally the select statement) and moving straight to the killing/logging.

Member Author

Why is that? I thought the justification above for the code as-is makes sense.

Member

Ug -- it does. But it's a self-inflicted mess. I'll make a change after this goes in.

@@ -280,6 +283,9 @@ func (s *robotServer) serveWeb(ctx context.Context, cfg *config.Config) (err err
	case <-doneServing:
		return true
	default:
		if myRobot != nil {
Member

If myRobot can be nil here -- this is a data race.

Member Author

would that be an issue? I can see that myRobot could have started some processes but not yet returned, but I don't know if we can protect against that completely.

Member

Same remark as the other cases

Member Author

added some locking around this
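
A self-contained sketch of the guarded-access pattern under discussion, assuming a simplified stand-in for robot.LocalRobot; the theRobotLock name mirrors a later snippet in this thread:

package main

import "sync"

// killable is a stand-in for the relevant part of robot.LocalRobot.
type killable interface{ Kill() }

type fakeRobot struct{}

func (fakeRobot) Kill() {}

var (
	theRobotLock sync.Mutex
	myRobot      killable // assigned only once startup has produced a robot
)

// killIfStarted is what the force-shutdown path does: read (and kill) under
// the lock so the read of myRobot no longer races with the write below.
func killIfStarted() {
	theRobotLock.Lock()
	if myRobot != nil {
		myRobot.Kill()
	}
	theRobotLock.Unlock()
}

func main() {
	go killIfStarted()
	// Startup path: publish the robot under the same lock.
	theRobotLock.Lock()
	myRobot = fakeRobot{}
	theRobotLock.Unlock()
}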

// Kill will attempt to kill any processes on the system started by the robot as quickly as possible.
// This operation is not clean and will not wait for completion.
func (r *localRobot) Kill() {
	if r.manager != nil {
Member

Can we justify this isn't a data race?

Member Author

r.manager could be nil if startup fails/hangs, but yes, it could also be a data race.

Member

We talked offline -- I agree that a mutex doesn't fix the "logical" race where we may observe that the manager is nil a moment before it gets assigned.

But TSAN/Go's data race detection will notice this. Strictly speaking, if one has two threads reading and writing a variable/memory address at the same time:

X initialized to 0
| Writer       | Reader |
|--------------+--------|
| Write(X = 1) | Read X |

While our program may be OK with the reader seeing 0 or the reader seeing 1, it's not necessarily the case that for all architectures the Reader can only see 0 or 1.
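
A minimal illustration of the point: even though the program would tolerate either observed value, go run -race (or go test -race) reports the unsynchronized write/read pair below as a data race:

package main

import "fmt"

func main() {
	x := 0
	done := make(chan struct{})
	go func() {
		x = 1 // writer
		close(done)
	}()
	fmt.Println(x) // reader: races with the write above
	<-done
}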

Member Author

Looking at this again, this isn't a data race - there's no chance for Kill() to be called before r.manager is assigned (Kill() can only be called if the robot exists, and the robot only exists once r.manager is assigned).

Member

Great -- much prefer to be able to assume members are non-nil.


	// TODO: Kill processes in processManager as well.

	// moduleManager may be nil in tests
	if manager.moduleManager != nil {
Member

I suspect this can be a data race as well?

Member Author

yep! added some locking around this
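
A sketch of what that guarded access can look like, assuming a simplified resourceManager; the field and lock names follow the diff shown later in this PR:

package main

import "sync"

type modManager struct{}

// Kill forcefully stops module process groups; best effort, no waiting.
func (m *modManager) Kill() {}

type resourceManager struct {
	// modManagerLock guards moduleManager: a Kill() propagated from the local
	// robot's force-shutdown path can run concurrently with resource manager
	// code that assigns or closes the module manager.
	modManagerLock sync.Mutex
	moduleManager  *modManager // may be nil in tests or if startup fails/hangs
}

func (rm *resourceManager) Kill() {
	// TODO: Kill processes in processManager as well.
	rm.modManagerLock.Lock()
	defer rm.modManagerLock.Unlock()
	if rm.moduleManager != nil {
		rm.moduleManager.Kill()
	}
}

func main() {
	(&resourceManager{}).Kill() // safe even when moduleManager is nil
}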


@@ -261,6 +262,8 @@ func (s *robotServer) serveWeb(ctx context.Context, cfg *config.Config) (err err
	forceShutdown := make(chan struct{})
	defer func() { <-forceShutdown }()

	var myRobot robot.LocalRobot

	utils.PanicCapturingGo(func() {
Member

It would be awesome if this force-shutdown goroutine were its own method/function, as we've been doing for our various unnamed async lambdas.

Member Author

I agree, but I would like to defer that work to https://viam.atlassian.net/browse/RSDK-9708. There's a bit of refactoring that has to be done (it'd look pretty ugly unless we add some vars to the server object), and I'd rather do it separately.

// Kill will attempt to kill any processes on the system started by the robot as quickly as possible.
// This operation is not clean and will not wait for completion.
func (r *localRobot) Kill() {
	r.manager.Kill()
Member

It feels a little awkward that localRobot.Kill only calls kill on the resource manager.

And the resource manager only calls kill on the mod manager.

And localRobot already has a handle on the modmanager. So why doesn't it call kill directly? Or just have the part that's about to do the log.Fatal/os.Exit call kill on the modmanager?

Member Author

Keeping the abstractions is better in the long run, I think - the resource manager will eventually call kill on the process manager.
The part that's about to do log.Fatal doesn't have a handle to the modmanager, since it only has access to the LocalRobot interface.

Member

keeping the abstractions I think is better in the long run

If I'm adding some plumbing from local robot -> modmanager, when should I use localRobot.modmanager directly and when should I add additional plumbing through localRobot.manager?

Member

FWIW, I'm fine with leaving this as-is. The above is me just coming to the realization that there's some abstraction that I didn't know existed.

Member

localRobot.modmanager

I'm not aware of an existing localRobot.modmanager field? Or are you suggesting adding one?

Member Author

I think @dgottlieb is asking whether he should use localRobot.manager.modmanager.

To @dgottlieb's question, I don't have a good answer. I feel like localRobot and the resource manager are already a bit mangled, and I think adding the plumbing will end up making the unmangling slightly more manageable if we ever decide to do it.

@cheukt cheukt marked this pull request as ready for review January 10, 2025 15:41
@cheukt
Member Author

cheukt commented Jan 10, 2025

This is ready for full review! I added two small-ish tests that exercise the expected behavior. Hopefully they pass on CI, or I might just skip them (I ran into a lot of issues with the manage goroutine not returning from its Wait() in the goutils tests).

	if robot != nil {
		robot.Kill()
	}
	theRobotLock.Unlock()
Member

@dgottlieb dgottlieb Jan 10, 2025

I believe this is accidental? Unless Kill is funny, this looks like a double unlock.

Member Author

thanks for catching!

mgr.Kill()

testutils.WaitForAssertion(t, func(tb testing.TB) {
	test.That(tb, logs.FilterMessageSnippet("Killing module").Len(),
Member Author

not the most robust test, but using the Status() method didn't work
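
The assertion above is cut off in the diff view; a plausible completed form is sketched below (the comparison and expected count are assumptions for illustration, not taken from the PR):

testutils.WaitForAssertion(t, func(tb testing.TB) {
	// Hypothetical completion: wait until at least one "Killing module"
	// log line has been observed for the killed module process group.
	test.That(tb, logs.FilterMessageSnippet("Killing module").Len(),
		test.ShouldBeGreaterThanOrEqualTo, 1)
})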

Member

Using the Status() method didn't work.

Any idea why? Still reported being alive?

Member Author

err is returning nil, so I assume so. Maybe a quirk of how it works on CI, because the test with Status() worked fine locally.

Member

@benjirewis benjirewis left a comment

Looking good! Just some final nits...

@@ -211,6 +211,21 @@ func (mgr *Manager) Close(ctx context.Context) error {
return err
}

// Kill kills module processes. This is best effort as we do not
Member

Kill kills modules processes.

Per my nits on the other PR, we're technically killing module process groups here. Could we make sure that's at least clear in documentation here and wherever else we're calling managedProcess.Kill and describing what we're doing? The group vs. process thing led to some initial (and perhaps continued) confusion for me, so, while this looks pretty good code-wise, I'd love to not be confused looking back at this stuff.

Member Author

updated comments
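
For context on the process-group wording, a hedged Unix-only sketch of what killing a module's process group typically looks like (the real managedProcess.Kill in pexec handles this, plus Windows; the helper below is illustrative only):

//go:build unix

package main

import "syscall"

// killProcessGroup forcefully stops pid and everything it spawned, assuming
// the process was started as the leader of its own process group (Setpgid).
// Best effort: SIGKILL cannot be caught, and we do not wait for exit.
func killProcessGroup(pid int) error {
	// A negative pid targets the whole process group whose pgid == pid.
	return syscall.Kill(-pid, syscall.SIGKILL)
}

func main() {
	_ = killProcessGroup(12345) // hypothetical PID, for illustration only
}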



func TestKill(t *testing.T) {
	// this test will not pass in CI as the managed process's manage goroutine
	// will not return from Wait() and thus fail the goroutine leak detection.
	t.Skip()
Member

I appreciate that this test exists, but is it even meaningful to have it around? A t.Skip with no TODO to unskip feels wrong to me. Or were you meaning to remove this t.Skip?

Member Author

It's meaningful to be able to run it locally, but I added a ticket as well: https://viam.atlassian.net/browse/RSDK-9722

@@ -50,6 +51,8 @@ type resourceManager struct {
	resources      *resource.Graph
	processManager pexec.ProcessManager
	processConfigs map[string]pexec.ProcessConfig
	// modManagerLock controls access to the moduleManager
Member

[nit] Can we just elaborate a bit on the comment, explaining that a Kill from the local robot may try to access the module manager concurrently with resource manager accesses?

Member Author

added

@cheukt cheukt requested a review from benjirewis January 10, 2025 20:43
Member

@benjirewis benjirewis left a comment

Thanks!

@cheukt cheukt merged commit 819123b into viamrobotics:main Jan 10, 2025
16 checks passed
@cheukt cheukt deleted the close-faster branch January 10, 2025 21:05