
run worker process in launcher pod #612

Merged
merged 5 commits on Feb 26, 2024

Conversation

kuizhiqing
Member

@kuizhiqing kuizhiqing commented Jan 2, 2024

Feature:
One can run a worker process in the launcher pod by declaring the same resource requirements for launcher and worker.

Why this?
In a cluster fully equipped with GPU nodes, the launcher pod occupies CPU resources, which prevents the workers from fully using all of the physical resources.

In a cluster with both CPU-only nodes and GPU nodes, the job may also suffer from poor efficiency because of the network between the two kinds of nodes in most cases.

How?
We introduce a new entry in the spec: spec.runLauncherAsWorker.

Don't forget to run the sshd service in the launcher in this mode.

This PR is part of project #611.

@alculquicondor
Collaborator

alculquicondor commented Jan 2, 2024

If you really want to put a launcher process in the same node as a worker, you can simply set requests to empty or something small, with the same behavior.

@alculquicondor
Collaborator

Without introducing another configuration, we determine whether to run the worker process in the launcher node by checking the resource declarations of launcher and worker. Worker-in-launcher is enabled if they are set to be equal.

This is not intuitive at all.

@kuizhiqing
Member Author

If you really want to put a launcher process in the same node as a worker, you can simply set requests to empty or something small, with the same behavior.

Well, there are some reasons worth noting:

  1. On some strictly managed platforms, a pod without a requests declaration may not be allowed to schedule, since it may run into problems; in that case we cannot leave it empty.
  2. If we have a cluster with only GPU nodes, say one job takes 1000 nodes and the CPU launcher takes 2 cores, then to keep the job homogeneous every node will reserve 2 cores each (2000 cores across the job), which is expensive.
  3. If the CPU launcher takes only small resources, it is not efficient in the start phase, while it still wastes resources during the long training phase in most cases.
  4. If we schedule the launcher to a CPU-only node, the network performance may suffer, since CPU-to-GPU node networking is usually worse than GPU-to-GPU node networking.

I'm not sure whether issue #503 requires this feature, but it may be related to this.

@kuizhiqing
Member Author

Without introducing another configuration, we determine whether to run the worker process in the launcher node by checking the resource declarations of launcher and worker. Worker-in-launcher is enabled if they are set to be equal.

This is not intuitive at all.

I agree that introducing a configuration field in the CRD may be better, but modifying the CRD is somewhat expensive.

Personally though, I'm totally OK with the statement that

There is definitely no need to set the same resources in practice if we do not run the worker process in the launcher.

So I accept the design.

@alculquicondor
Collaborator

An explicit API will almost always be better. Maybe something like useLauncherAsWorker: true.

But ok, I can understand that your cluster might be entirely homogeneous, in which case a separate pod could be detrimental.

Any concerns @tenzen-y ?

@kuizhiqing
Member Author

@alculquicondor @tenzen-y
Well, what's your opinion on this feature: should I continue with a version introducing useLauncherAsWorker: true, or just hold it?

@tenzen-y
Member

tenzen-y commented Jan 3, 2024

@alculquicondor @kuizhiqing First of all, I believe that this feature would be worth it.
We can avoid falling into some rabbit holes due to mixed network protocols (well-known generic Ethernet and RoCE (or InfiniBand)) since we could schedule both Launcher and Worker pods to the GPU nodes with a single network protocol (RoCE or InfiniBand).

Actually, in my cluster, the launcher is scheduled to CPU nodes with only Ethernet, and the workers are scheduled to GPU nodes with Ethernet and RoCE networks. Given the mixed network protocols, I sometimes fall into rabbit holes :(

My only concern is the situation in which the launcher-with-worker pod fails due to a worker process error.
When the job uses horovod with elastic semantics, the points of failure would increase, and the job could lose its elasticity in that situation. However, we can mention this disadvantage in the documentation to notify users.

Well, what's your opinion on this feature: should I continue with a version introducing useLauncherAsWorker: true, or just hold it?

I think we should introduce a new field for this feature in this PR.

As alternative API names, we could add runLauncherWithinWorker: <bool> or launcherAsWorker: <bool> in .runPolicy.

Or, I think we could create a new field spawnStrategy under runPolicy as follows:

runPolicy:
    spawnStrategy:
        launcher: <asWorker (withinWorker)|Independent>

@alculquicondor @kuizhiqing WDYT?

@alculquicondor
Collaborator

alculquicondor commented Jan 3, 2024

I wouldn't touch runPolicy so that it stays the same as the other operators.

The name should match the behavior. What would you rather have?

  1. The launcher is added to the hostfile, effectively making it an additional worker. We should allow having zero explicit workers in this case.
  2. There is no launcher job. Worker 0 has special logic in the entry-point.

I think option 1 is cleaner. Option 2 assumes that the operator has control over the entry-point, which is not true. And we would have to re-implement retry semantics, as opposed to using the Job API.
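
To make option 1 concrete, here is a minimal sketch in Go of how the hostfile could be assembled when the launcher counts as an additional worker. The struct, the field names, and the service naming are simplified assumptions for illustration, not the controller's actual types:

package main

import (
	"bytes"
	"fmt"
)

// jobInfo is a simplified stand-in for the relevant MPIJob fields.
type jobInfo struct {
	Name                string
	Namespace           string
	Service             string // assumed Service through which launcher and workers resolve
	SlotsPerWorker      int
	WorkerReplicas      int
	RunLauncherAsWorker bool
}

// buildHostfile lists the launcher first (when it also runs a worker process)
// and then the explicit workers, so zero explicit workers is a valid setup.
func buildHostfile(j jobInfo) string {
	var buf bytes.Buffer
	if j.RunLauncherAsWorker {
		fmt.Fprintf(&buf, "%s-launcher.%s.%s.svc slots=%d\n", j.Name, j.Service, j.Namespace, j.SlotsPerWorker)
	}
	for i := 0; i < j.WorkerReplicas; i++ {
		fmt.Fprintf(&buf, "%s-worker-%d.%s.%s.svc slots=%d\n", j.Name, i, j.Service, j.Namespace, j.SlotsPerWorker)
	}
	return buf.String()
}

func main() {
	fmt.Print(buildHostfile(jobInfo{
		Name: "pi", Namespace: "default", Service: "pi",
		SlotsPerWorker: 1, WorkerReplicas: 2, RunLauncherAsWorker: true,
	}))
}

With the values above this prints three entries: the launcher plus two workers, each with one slot.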

@tenzen-y
Member

tenzen-y commented Jan 3, 2024

I wouldn't touch runPolicy so that it stays the same as the other operators.

The name should match the behavior. What would you rather have?

  1. The launcher is added to the hostfile, effectively making it an additional worker. We should allow having zero explicit workers in this case.
  2. There is no launcher job. Worker 0 has special logic in the entry-point.

I think option 1 is cleaner. Option 2 assumes that the operator has control over the entry-point, which is not true. And we would have to re-implement retry semantics, as opposed to using the Job API.

It sounds reasonable. I prefer option 1, too.
@alculquicondor Does option 1 mean that the MPIJob has a nil replicaSpec for the Worker? Or does the MPIJob have a replicaSpec with replicas=0 for the Worker?

@alculquicondor
Collaborator

Does option 1 mean that the MPIJob has a nil replicaSpec for the Worker?

Yes, optionally it could be nil. Setting replicas=0 would require you to define a Pod spec, which is unnecessary.
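
As a small illustration of that choice, here is a sketch of reading "zero explicit workers" from the spec; the types are simplified stand-ins, not the real v2beta1 API:

package main

import "fmt"

// replicaSpec is a simplified stand-in; the Pod template is omitted.
type replicaSpec struct {
	Replicas *int32
}

// workerReplicas treats a missing or nil Worker entry (or nil replicas) as
// zero, so users never have to define a Pod template just to set replicas=0.
func workerReplicas(specs map[string]*replicaSpec) int32 {
	if w, ok := specs["Worker"]; ok && w != nil && w.Replicas != nil {
		return *w.Replicas
	}
	return 0
}

func main() {
	fmt.Println(workerReplicas(map[string]*replicaSpec{"Launcher": {}})) // prints 0: no explicit workers
}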

@kuizhiqing
Member Author

@tenzen-y
I think we should not touch runPolicy since other operators and run modes do not have such a config.

@alculquicondor

  1. The launcher is added to the hostfile, effectively making it an additional worker. We should allow having zero explicit workers in this case.
    I prefer option 1.

OK, let me continue working on it; tell me if you have more suggestions.

@tenzen-y
Member

tenzen-y commented Jan 4, 2024

Does option 1 mean that the MPIJob has a nil replicaSpec for the Worker?

Yes, optionally it could be nil. Setting replicas=0 would require you to define a Pod spec, which is unnecessary.

SGTM

@kuizhiqing
Member Author

@alculquicondor @tenzen-y ready to review, PTAL

@kuizhiqing
Member Author

/assign @alculquicondor
/assign @tenzen-y

@tenzen-y
Member

/assign @alculquicondor /assign @tenzen-y

ACK
First of all, I can review this PR. Aldo will review this after my review.

@tenzen-y
Member

@kuizhiqing Can you address CI issues?

@kuizhiqing
Member Author

kuizhiqing commented Jan 22, 2024

@kuizhiqing Can you address CI issues?

@tenzen-y Done

@tenzen-y
Member

tenzen-y commented Feb 1, 2024

@kuizhiqing Can you address CI issues?

@tenzen-y Done

Sorry for the late response. I will come back here after the k/k enhancement freeze.
I have a lot of work...

@tenzen-y
Member

I've come back to this now.

Member

@tenzen-y tenzen-y left a comment

Also, could you implement an integration test here: https://github.com/kubeflow/mpi-operator/blob/master/test/integration/mpi_job_controller_test.go?

Furthermore, could we support zero explicit workers?

#612 (comment)

switch mpiJob.Spec.MPIImplementation {
case kubeflow.MPIImplementationOpenMPI:
buffer.WriteString(fmt.Sprintf("%s%s-%d.%s.%s.svc slots=%d\n", mpiJob.Name, workerSuffix, i, workersService, mpiJob.Namespace, slots))
buffer.WriteString(fmt.Sprintf("%s.%s.%s.svc slots=%d\n", name, workersService, mpiJob.Namespace, slots))
Member

This is a nice refactoring :)

@@ -1291,6 +1302,13 @@ func updateDiscoverHostsInConfigMap(configMap *corev1.ConfigMap, mpiJob *kubeflo

var buffer bytes.Buffer
buffer.WriteString("#!/bin/sh\n")

// We donnot check if launcher is running here, launcher should always be there or the job failed
Member

Suggested change
// We donnot check if launcher is running here, launcher should always be there or the job failed
// We don't check if launcher is running here, launcher should always be there or the job failed

Member

Also, what happens if we use the LauncherCreationPolicy=WaitForWorkersReady?

Member Author

Since we leave it to the user to start the launcher process and the worker-0 sshd, it does NOT change the workflow.

Member

That makes sense.

@kuizhiqing
Member Author

Also, could you implement an integration test here: https://github.com/kubeflow/mpi-operator/blob/master/test/integration/mpi_job_controller_test.go?

Furthermore, could we support zero explicit workers?

#612 (comment)

@tenzen-y Thanks for your time reviewing. I've addressed your comments; please let me know if more modifications are needed.

We do support the zero-explicit-workers setting; I tested it with the following example:

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: pi
spec:
  runLauncherAsWorker: true
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
    ttlSecondsAfterFinished: 60
  sshAuthMountPath: /home/mpiuser/.ssh
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: mpioperator/mpi-pi:openmpi
            name: mpi-launcher
            securityContext:
              runAsUser: 1000
            command:
            - bash
            args:
            - -c
            - "/usr/sbin/sshd -f /home/mpiuser/.sshd_config && mpirun /home/mpiuser/pi"
            resources:
              limits:
                cpu: 1
                memory: 1Gi

As for the integration test: after I finished it, I realized that this feature does NOT change the overall workflow, so an integration test would not look any different, while the unit test does.

Actually, this feature does two things:

  • add a svc for the launcher
  • add the launcher to the hostfile

Note that

  • the user should take care of starting the sshd service and running the real command in the launcher
  • within the operator's scope, workers do nothing but start the sshd service
    so introducing this feature doesn't really change the RUNNING process; it just makes it possible for the user by altering the configuration, i.e. the svc and hostfile.

Member

@tenzen-y tenzen-y left a comment

@kuizhiqing Also, could you address my other comment below?

Also, could you implement an integration test here: https://github.com/kubeflow/mpi-operator/blob/master/test/integration/mpi_job_controller_test.go?

@@ -1291,6 +1302,13 @@ func updateDiscoverHostsInConfigMap(configMap *corev1.ConfigMap, mpiJob *kubeflo

var buffer bytes.Buffer
buffer.WriteString("#!/bin/sh\n")

// We donnot check if launcher is running here, launcher should always be there or the job failed
Member

That makes sense.

Comment on lines 662 to 664
if (mpiJob.Spec.RunLauncherAsWorker != nil && *mpiJob.Spec.RunLauncherAsWorker) ||
mpiJob.Spec.MPIImplementation == kubeflow.MPIImplementationIntel ||
mpiJob.Spec.MPIImplementation == kubeflow.MPIImplementationMPICH {
Member

Uhm, I'm wondering if we can extract this condition into a separate function like workersCanHaveDedicatedService, since we're using the same condition in several places.

Member Author

Thanks for introducing me to ptr.Deref. The condition is used once in the controller and twice in the tests now, so I'm OK with keeping it inline in this PR without adding another function. WDYT?

Member

I'm ok with either way.
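
For reference, a sketch of what the helper floated above might look like, using ptr.Deref; the condition was ultimately kept inline in this PR, so the function name and placement here are only illustrative:

package controller

import (
	"k8s.io/utils/ptr"

	kubeflow "github.com/kubeflow/mpi-operator/pkg/apis/kubeflow/v2beta1"
)

// workersCanHaveDedicatedService wraps the condition quoted above: per-pod
// hostname/Service handling is needed when the launcher also runs a worker
// process, or when the MPI implementation is Intel MPI or MPICH.
func workersCanHaveDedicatedService(mpiJob *kubeflow.MPIJob) bool {
	return ptr.Deref(mpiJob.Spec.RunLauncherAsWorker, false) ||
		mpiJob.Spec.MPIImplementation == kubeflow.MPIImplementationIntel ||
		mpiJob.Spec.MPIImplementation == kubeflow.MPIImplementationMPICH
}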

@tenzen-y
Member

We do support the zero-explicit-workers setting; I tested it with the following example:

Oh, I see. Thank you for the confirmation!

@tenzen-y
Member

As for the integration test: after I finished it, I realized that this feature does NOT change the overall workflow, so an integration test would not look any different, while the unit test does.

Actually, this feature does two things:

add a svc for the launcher
add the launcher to the hostfile
Note that

the user should take care of starting the sshd service and running the real command in the launcher
within the operator's scope, workers do nothing but start the sshd service
so introducing this feature doesn't really change the RUNNING process; it just makes it possible for the user by altering the configuration, i.e. the svc and hostfile.

Uhm, it sounds reasonable considering we have a unit test, but let me know what @alculquicondor thinks.

Signed-off-by: kuizhiqing <[email protected]>
@tenzen-y
Member

@kuizhiqing Let me know once building errors are fixed.

Signed-off-by: kuizhiqing <[email protected]>
@kuizhiqing
Member Author

@kuizhiqing Let me know once building errors are fixed.

@tenzen-y Done

Member

@tenzen-y tenzen-y left a comment

/lgtm
/assign @alculquicondor

@google-oss-prow google-oss-prow bot added the lgtm label Feb 20, 2024
@kuizhiqing
Member Author

Thx @tenzen-y @alculquicondor, and I'll address the following in another PR

SA1019: "k8s.io/utils/pointer" is deprecated...

@tenzen-y
Member

Thx @tenzen-y @alculquicondor, and I'll address the following in another PR

lgtm

@@ -656,13 +657,14 @@ func (c *MPIJobController) syncHandler(key string) error {
return err
}
}
if mpiJob.Spec.MPIImplementation == kubeflow.MPIImplementationIntel ||
// If we want to run process in launcher, we should create a service for launcher.
Collaborator

We shouldn't create an additional service in the case of OpenMPI. Just allowing the existing Service to match the launcher pod should be enough.

Member Author

The existing Service is created with the selector training.kubeflow.org/job-role: worker, so it can't be used directly. Without creating a Service for the launcher, we would have to change the worker Service and remove that selector.
Creating a service is preferable to me. WDYT @tenzen-y @alculquicondor

Collaborator

I think it's more efficient to just change the selector.

Member Author

@alculquicondor Do you mean changing the selector only in the case of OpenMPI, or in all cases?

Collaborator

I would just change it for all

Collaborator

Yes. Although I'm not sure if IntelMPI would work with just one Service, as it's very picky about the hostname. Worth trying.

Member Author

OK, I will try it. Should we do the service refactor in this PR or in another one, given that the original service design already provides a working way to run?

Collaborator

here should be fine

Member

SGTM

Member Author

@kuizhiqing kuizhiqing Feb 22, 2024

@alculquicondor @tenzen-y I've added a single service for both the launcher and the workers; for IntelMPI, which needs to access the launcher by hostname, I modified the searches part of the DNSConfig.
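
For readers following along, here is a rough sketch of the shape this takes: a single headless Service whose selector matches both launcher and worker pods, plus a DNS search entry so IntelMPI can resolve the launcher by its short hostname. The label key, the search-domain layout, and the function names are assumptions for illustration; see the diff for the merged version:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newJobService returns a headless Service selecting only on the job-name
// label, so launcher and worker pods are both resolvable through it.
func newJobService(jobName, namespace string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: jobName, Namespace: namespace},
		Spec: corev1.ServiceSpec{
			ClusterIP: corev1.ClusterIPNone, // headless: per-pod DNS records
			Selector: map[string]string{
				// Assumed label key; note there is no job-role selector here.
				"training.kubeflow.org/job-name": jobName,
			},
		},
	}
}

// withJobDNSSearch adds the job Service's domain to a pod's DNS search list,
// so a short name like "<job>-launcher" resolves for IntelMPI.
func withJobDNSSearch(pod *corev1.PodSpec, jobName, namespace string) {
	pod.DNSConfig = &corev1.PodDNSConfig{
		Searches: []string{fmt.Sprintf("%s.%s.svc.cluster.local", jobName, namespace)},
	}
}

func main() {
	svc := newJobService("pi", "default")
	fmt.Println(svc.Name, svc.Spec.Selector)
}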

Signed-off-by: kuizhiqing <[email protected]>
@google-oss-prow google-oss-prow bot removed the lgtm label Feb 21, 2024
Collaborator

@alculquicondor alculquicondor left a comment

Implementation looks good, but can you check something manually?

  1. Create a long-running MPIJob using the released controller.
  2. Upgrade the controller to a build containing this PR
  3. Check that the MPIJob continues running.

I believe it should continue to run, but let's better be safe.

@kuizhiqing
Member Author

Implementation looks good, but can you check something manually?

  1. Create a long-running MPIJob using the released controller.
  2. Upgrade the controller to a build containing this PR
  3. Check that the MPIJob continues running.

I believe it should continue to run, but let's better be safe.

@alculquicondor
I know you want to make sure that an in-place upgrade does not break existing jobs, so I did check it and it worked.
Note that one more svc with the same name as the job will be created for all existing jobs, though it will not be used. For newly created jobs, only the new svc will be created and used.

@alculquicondor
Collaborator

Note that one more svc with the same name as the job will be created for all existing jobs, though it will not be used. For newly created jobs, only the new svc will be created and used.

That's what I thought, and it's ok.

/approve

@tenzen-y anything else?


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@alculquicondor
Collaborator

ah, @tenzen-y already gave lgtm, so it will merge :)

@tenzen-y
Member

Implementation looks good, but can you check something manually?

  1. Create a long-running MPIJob using the released controller.
  2. Upgrade the controller to a build containing this PR
  3. Check that the MPIJob continues running.

I believe it should continue to run, but let's better be safe.

Maybe we can add an integration test for this situation by enabling and disabling this feature, but I'm ok with working on it in another PR.

Member

@tenzen-y tenzen-y left a comment

/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Feb 26, 2024
@google-oss-prow google-oss-prow bot merged commit a6c2da8 into kubeflow:master Feb 26, 2024
11 checks passed