
run worker process in launcher pod #612

Merged
merged 5 commits on Feb 26, 2024

Conversation

kuizhiqing
Member

@kuizhiqing kuizhiqing commented Jan 2, 2024

Feature:
One can run a worker process in the launcher pod by declaring the same resource requirements for launcher and worker.

Why this?
In a cluster fully equipped with GPU nodes, the launcher pod occupies CPU resources, which prevents the workers from fully using all of the physical resources.

In a cluster with both CPU-only nodes and GPU nodes, the job may also suffer from poor efficiency because of the network between the two kinds of nodes in most cases.

How?
We introduce a new entry in the spec: spec.runLauncherAsWorker.

Don't forget to run the sshd service in the launcher in this mode.

This PR is part of project #611.

@alculquicondor
Collaborator

alculquicondor commented Jan 2, 2024

If you really want to put a launcher process in the same node as a worker, you can simply set requests to empty or something small, with the same behavior.

@alculquicondor
Collaborator

Without introducing another configuration, we determine whether to run the worker process in the launcher node by checking the resource declarations of launcher and worker. Worker-in-launcher is enabled if they are set to be equal.

This is not intuitive at all.

@kuizhiqing
Member Author

If you really want to put a launcher process in the same node as a worker, you can simply set requests to empty or something small, with the same behavior.

Well, there are some reasons worth noting:

  1. On some strictly managed platforms, a pod without a requests declaration may not be allowed to schedule, since it may run into problems; in that case we cannot leave it empty.
  2. If we have a cluster with only GPU nodes, say one job takes 1000 nodes and the CPU launcher takes 2 cores, then to keep the job homogeneous every node will reserve 2 cores each (2000 cores across the job), which is expensive.
  3. If the CPU launcher takes only small resources, it is not efficient in the start phase, while it still wastes resources during the long training phase in most cases.
  4. If we schedule the launcher to a CPU-only node, the network performance may suffer, since CPU-to-GPU node networking is usually worse than GPU-to-GPU node networking.

I'm not sure whether issue #503 requires this feature, but it may be related to this.

@kuizhiqing
Member Author

Without introducing another configuration, we determine whether to run the worker process in the launcher node by checking the resource declarations of launcher and worker. Worker-in-launcher is enabled if they are set to be equal.

This is not intuitive at all.

I agree that introducing a configuration field in the CRD may be better, but modifying the CRD is somewhat expensive.

Personally though, I'm totally OK with the statement that

There is definitely no need to set the same resources in practice if we do not run the worker process in the launcher.

So I accept the design.

@alculquicondor
Collaborator

An explicit API will almost always be better. Maybe something like useLauncherAsWorker: true.

But ok, I can understand that your cluster might be entirely homogeneous, in which case a separate pod could be detrimental.

Any concerns @tenzen-y ?

@kuizhiqing
Member Author

@alculquicondor @tenzen-y
Well, what's your opinion on this feature: should I continue with a version introducing useLauncherAsWorker: true, or just hold it?

@tenzen-y
Member

tenzen-y commented Jan 3, 2024

@alculquicondor @kuizhiqing First of all, I believe that this feature would be worth it.
We can avoid falling into some rabbit holes due to mixed network protocols (well-known generic Ethernet and RoCE (or InfiniBand)) since we could schedule both Launcher and Worker pods to the GPU nodes with a single network protocol (RoCE or InfiniBand).

Actually, in my cluster, the launcher is scheduled to CPU nodes with only Ethernet, and the workers are scheduled to GPU nodes with Ethernet and RoCE networks. Given the mixed network protocols, I sometimes fall into rabbit holes :(

My only concern is the situation in which the launcher-with-worker pod fails due to a worker process error.
When the job uses horovod with elastic semantics, the points of failure would increase, and the job could lose its elasticity in that situation. However, we can mention this disadvantage in the documentation to notify users.

Well, what's your opinion on this feature: should I continue with a version introducing useLauncherAsWorker: true, or just hold it?

I think we should introduce a new field for this feature in this PR.

As alternative API names, we could add runLauncherWithinWorker: <bool> or launcherAsWorker: <bool> in .runPolicy.

Or, I think we could create a new field spawnStrategy under runPolicy as follows:

runPolicy:
    spawnStrategy:
        launcher: <asWorker (withinWorker)|Independent>

@alculquicondor @kuizhiqing WDYT?

@alculquicondor
Collaborator

alculquicondor commented Jan 3, 2024

I wouldn't touch runPolicy so that it stays the same as the other operators.

The name should match the behavior. What would you rather have?

  1. The launcher is added to the hostfile, effectively making it an additional worker. We should allow having zero explicit workers in this case.
  2. There is no launcher job. Worker 0 has special logic in the entry-point.

I think option 1 is cleaner. Option 2 assumes that the operator has control over the entry-point, which is not true. And we would have to re-implement retry semantics, as opposed to using the Job API.
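
To make option 1 concrete, here is a minimal sketch in Go of how the hostfile could be assembled when the launcher counts as an additional worker. The struct, the field names, and the service naming are simplified assumptions for illustration, not the controller's actual types:

package main

import (
	"bytes"
	"fmt"
)

// jobInfo is a simplified stand-in for the relevant MPIJob fields.
type jobInfo struct {
	Name                string
	Namespace           string
	Service             string // assumed Service through which launcher and workers resolve
	SlotsPerWorker      int
	WorkerReplicas      int
	RunLauncherAsWorker bool
}

// buildHostfile lists the launcher first (when it also runs a worker process)
// and then the explicit workers, so zero explicit workers is a valid setup.
func buildHostfile(j jobInfo) string {
	var buf bytes.Buffer
	if j.RunLauncherAsWorker {
		fmt.Fprintf(&buf, "%s-launcher.%s.%s.svc slots=%d\n", j.Name, j.Service, j.Namespace, j.SlotsPerWorker)
	}
	for i := 0; i < j.WorkerReplicas; i++ {
		fmt.Fprintf(&buf, "%s-worker-%d.%s.%s.svc slots=%d\n", j.Name, i, j.Service, j.Namespace, j.SlotsPerWorker)
	}
	return buf.String()
}

func main() {
	fmt.Print(buildHostfile(jobInfo{
		Name: "pi", Namespace: "default", Service: "pi",
		SlotsPerWorker: 1, WorkerReplicas: 2, RunLauncherAsWorker: true,
	}))
}

With the values above this prints three entries: the launcher plus two workers, each with one slot.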

@tenzen-y
Member

tenzen-y commented Jan 3, 2024

I wouldn't touch runPolicy so that it stays the same as the other operators.

The name should match the behavior. What would you rather have?

  1. The launcher is added to the hostfile, effectively making it an additional worker. We should allow having zero explicit workers in this case.
  2. There is no launcher job. Worker 0 has special logic in the entry-point.

I think option 1 is cleaner. Option 2 assumes that the operator has control over the entry-point, which is not true. And we would have to re-implement retry semantics, as opposed to using the Job API.

It sounds reasonable. I prefer option 1, too.
@alculquicondor Does option 1 mean that the MPIJob has a nil replicaSpec for the Worker? Or does the MPIJob have a replicaSpec with replicas=0 for the Worker?

@alculquicondor
Collaborator

Does option 1 mean that the MPIJob has a nil replicaSpec for the Worker?

Yes, optionally it could be nil. Setting replicas=0 would require you to define a Pod spec, which is unnecessary.
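
As a small illustration of that choice, here is a sketch of reading "zero explicit workers" from the spec; the types are simplified stand-ins, not the real v2beta1 API:

package main

import "fmt"

// replicaSpec is a simplified stand-in; the Pod template is omitted.
type replicaSpec struct {
	Replicas *int32
}

// workerReplicas treats a missing or nil Worker entry (or nil replicas) as
// zero, so users never have to define a Pod template just to set replicas=0.
func workerReplicas(specs map[string]*replicaSpec) int32 {
	if w, ok := specs["Worker"]; ok && w != nil && w.Replicas != nil {
		return *w.Replicas
	}
	return 0
}

func main() {
	fmt.Println(workerReplicas(map[string]*replicaSpec{"Launcher": {}})) // prints 0: no explicit workers
}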

@kuizhiqing
Member Author

@tenzen-y
I think we should not touch runPolicy since other operators and run modes do not have such a config.

@alculquicondor

  1. The launcher is added to the hostfile, effectively making it an additional worker. We should allow having zero explicit workers in this case.
    I prefer option 1.

OK, let me continue working on it; tell me if you have more suggestions.

@tenzen-y
Member

tenzen-y commented Jan 4, 2024

Does option 1 mean that the MPIJob has a nil replicaSpec for the Worker?

Yes, optionally it could be nil. Setting replicas=0 would require you to define a Pod spec, which is unnecessary.

SGTM

@kuizhiqing
Member Author

@alculquicondor @tenzen-y ready to review, PTAL

@kuizhiqing
Member Author

/assign @alculquicondor
/assign @tenzen-y

@tenzen-y
Member

/assign @alculquicondor /assign @tenzen-y

ACK
First of all, I can review this PR. Aldo will review this after my review.

@tenzen-y
Member

@kuizhiqing Can you address CI issues?

@kuizhiqing
Member Author

kuizhiqing commented Jan 22, 2024

@kuizhiqing Can you address CI issues?

@tenzen-y Done

@tenzen-y
Member

tenzen-y commented Feb 1, 2024

@kuizhiqing Can you address CI issues?

@tenzen-y Done

Sorry for the late response. I will come back here after the k/k enhancement freeze.
I have a lot of work...

@tenzen-y
Member

I've come back to this now.

Member

@tenzen-y tenzen-y left a comment

Also, could you implement an integration test here: https://github.com/kubeflow/mpi-operator/blob/master/test/integration/mpi_job_controller_test.go?

Furthermore, could we support zero explicit workers?

#612 (comment)

switch mpiJob.Spec.MPIImplementation {
case kubeflow.MPIImplementationOpenMPI:
buffer.WriteString(fmt.Sprintf("%s%s-%d.%s.%s.svc slots=%d\n", mpiJob.Name, workerSuffix, i, workersService, mpiJob.Namespace, slots))
buffer.WriteString(fmt.Sprintf("%s.%s.%s.svc slots=%d\n", name, workersService, mpiJob.Namespace, slots))
Member

This is a nice refactoring :)

@@ -1291,6 +1302,13 @@ func updateDiscoverHostsInConfigMap(configMap *corev1.ConfigMap, mpiJob *kubeflo

var buffer bytes.Buffer
buffer.WriteString("#!/bin/sh\n")

// We donnot check if launcher is running here, launcher should always be there or the job failed
Member

Suggested change
// We donnot check if launcher is running here, launcher should always be there or the job failed
// We don't check if launcher is running here, launcher should always be there or the job failed

Member

Also, what happens if we use the LauncherCreationPolicy=WaitForWorkersReady?

Member Author

Since we leave it to the user to start the launcher process and the worker-0 sshd, it does NOT change the workflow.

Member

That makes sense.

@kuizhiqing
Member Author

Also, could you implement an integration test here: https://github.com/kubeflow/mpi-operator/blob/master/test/integration/mpi_job_controller_test.go?

Furthermore, could we support zero explicit workers?

#612 (comment)

@tenzen-y Thanks for your time reviewing. I've addressed your comments; please let me know if more modifications are needed.

We do support the zero-explicit-workers setting; I tested it with the following example:

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: pi
spec:
  runLauncherAsWorker: true
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
    ttlSecondsAfterFinished: 60
  sshAuthMountPath: /home/mpiuser/.ssh
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: mpioperator/mpi-pi:openmpi
            name: mpi-launcher
            securityContext:
              runAsUser: 1000
            command:
            - bash
            args:
            - -c
            - "/usr/sbin/sshd -f /home/mpiuser/.sshd_config && mpirun /home/mpiuser/pi"
            resources:
              limits:
                cpu: 1
                memory: 1Gi

As for the integration test: after I finished it, I realized that this feature does NOT change the overall workflow, so an integration test would not look any different, while the unit test does.

Actually, this feature does two things:

  • add a svc for the launcher
  • add the launcher to the hostfile

Note that

  • the user should take care of starting the sshd service and running the real command in the launcher
  • within the operator's scope, workers do nothing but start the sshd service
    so introducing this feature doesn't really change the RUNNING process; it just makes it possible for the user by altering the configuration, i.e. the svc and hostfile.

Member

@tenzen-y tenzen-y left a comment

@kuizhiqing Also, could you address my other comment below?

Also, could you implement an integration test here: https://github.com/kubeflow/mpi-operator/blob/master/test/integration/mpi_job_controller_test.go?

@@ -1291,6 +1302,13 @@ func updateDiscoverHostsInConfigMap(configMap *corev1.ConfigMap, mpiJob *kubeflo

var buffer bytes.Buffer
buffer.WriteString("#!/bin/sh\n")

// We donnot check if launcher is running here, launcher should always be there or the job failed
Member

That makes sense.

Comment on lines 662 to 664
if (mpiJob.Spec.RunLauncherAsWorker != nil && *mpiJob.Spec.RunLauncherAsWorker) ||
mpiJob.Spec.MPIImplementation == kubeflow.MPIImplementationIntel ||
mpiJob.Spec.MPIImplementation == kubeflow.MPIImplementationMPICH {
Member

Uhm, I'm wondering if we can extract this condition into a separate function like workersCanHaveDedicatedService, since we're using the same condition in several places.

Member Author

Thanks for introducing me to ptr.Deref. The condition is used once in the controller and twice in the tests now, so I'm OK with keeping it inline in this PR without adding another function. WDYT?

Member

I'm ok with either way.
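
For reference, a sketch of what the helper floated above might look like, using ptr.Deref; the condition was ultimately kept inline in this PR, so the function name and placement here are only illustrative:

package controller

import (
	"k8s.io/utils/ptr"

	kubeflow "github.com/kubeflow/mpi-operator/pkg/apis/kubeflow/v2beta1"
)

// workersCanHaveDedicatedService wraps the condition quoted above: per-pod
// hostname/Service handling is needed when the launcher also runs a worker
// process, or when the MPI implementation is Intel MPI or MPICH.
func workersCanHaveDedicatedService(mpiJob *kubeflow.MPIJob) bool {
	return ptr.Deref(mpiJob.Spec.RunLauncherAsWorker, false) ||
		mpiJob.Spec.MPIImplementation == kubeflow.MPIImplementationIntel ||
		mpiJob.Spec.MPIImplementation == kubeflow.MPIImplementationMPICH
}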

@tenzen-y
Member

We do support the zero-explicit-workers setting; I tested it with the following example:

Oh, I see. Thank you for the confirmation!

@tenzen-y
Member

As for the integration test: after I finished it, I realized that this feature does NOT change the overall workflow, so an integration test would not look any different, while the unit test does.

Actually, this feature does two things:

add a svc for the launcher
add the launcher to the hostfile
Note that

the user should take care of starting the sshd service and running the real command in the launcher
within the operator's scope, workers do nothing but start the sshd service
so introducing this feature doesn't really change the RUNNING process; it just makes it possible for the user by altering the configuration, i.e. the svc and hostfile.

Uhm, it sounds reasonable considering we have a unit test, but let me know what @alculquicondor thinks.

Signed-off-by: kuizhiqing <[email protected]>
@tenzen-y
Member

@kuizhiqing Let me know once building errors are fixed.

Signed-off-by: kuizhiqing <[email protected]>
@kuizhiqing
Member Author

@kuizhiqing Let me know once building errors are fixed.

@tenzen-y Done

Member

@tenzen-y tenzen-y left a comment

/lgtm
/assign @alculquicondor

@google-oss-prow google-oss-prow bot added the lgtm label Feb 20, 2024
@kuizhiqing
Member Author

Thx @tenzen-y @alculquicondor, and I'll address the following in another PR

SA1019: "k8s.io/utils/pointer" is deprecated...

@tenzen-y
Member

Thx @tenzen-y @alculquicondor, and I'll address the following in another PR

lgtm

@@ -656,13 +657,14 @@ func (c *MPIJobController) syncHandler(key string) error {
return err
}
}
if mpiJob.Spec.MPIImplementation == kubeflow.MPIImplementationIntel ||
// If we want to run process in launcher, we should create a service for launcher.
Collaborator

We shouldn't create an additional service in the case of OpenMPI. Just allowing the existing Service to match the launcher pod should be enough.

Member Author

The existing Service is created with the selector training.kubeflow.org/job-role: worker, so it can't be used directly. Without creating a Service for the launcher, we would have to change the worker Service and remove that selector.
Creating a service is preferable to me. WDYT @tenzen-y @alculquicondor

Collaborator

I think it's more efficient to just change the selector.

Member Author

@alculquicondor Do you mean changing the selector only in the case of OpenMPI, or in all cases?

Collaborator

I would just change it for all

Collaborator

Yes. Although I'm not sure if IntelMPI would work with just one Service, as it's very picky about the hostname. Worth trying.

Member Author

OK, I will try it. Should we do the service refactor in this PR or in another one, given that the original service design already provides a working way to run?

Collaborator

here should be fine

Member

SGTM

Member Author

@kuizhiqing kuizhiqing Feb 22, 2024

@alculquicondor @tenzen-y I've added a single service for both the launcher and the workers; for IntelMPI, which needs to access the launcher by hostname, I modified the searches part of the DNSConfig.
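
For readers following along, here is a rough sketch of the shape this takes: a single headless Service whose selector matches both launcher and worker pods, plus a DNS search entry so IntelMPI can resolve the launcher by its short hostname. The label key, the search-domain layout, and the function names are assumptions for illustration; see the diff for the merged version:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newJobService returns a headless Service selecting only on the job-name
// label, so launcher and worker pods are both resolvable through it.
func newJobService(jobName, namespace string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: jobName, Namespace: namespace},
		Spec: corev1.ServiceSpec{
			ClusterIP: corev1.ClusterIPNone, // headless: per-pod DNS records
			Selector: map[string]string{
				// Assumed label key; note there is no job-role selector here.
				"training.kubeflow.org/job-name": jobName,
			},
		},
	}
}

// withJobDNSSearch adds the job Service's domain to a pod's DNS search list,
// so a short name like "<job>-launcher" resolves for IntelMPI.
func withJobDNSSearch(pod *corev1.PodSpec, jobName, namespace string) {
	pod.DNSConfig = &corev1.PodDNSConfig{
		Searches: []string{fmt.Sprintf("%s.%s.svc.cluster.local", jobName, namespace)},
	}
}

func main() {
	svc := newJobService("pi", "default")
	fmt.Println(svc.Name, svc.Spec.Selector)
}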

Signed-off-by: kuizhiqing <[email protected]>
@google-oss-prow google-oss-prow bot removed the lgtm label Feb 21, 2024
Collaborator

@alculquicondor alculquicondor left a comment

Implementation looks good, but can you check something manually?

  1. Create a long-running MPIJob using the released controller.
  2. Upgrade the controller to a build containing this PR
  3. Check that the MPIJob continues running.

I believe it should continue to run, but let's better be safe.

@kuizhiqing
Member Author

Implementation looks good, but can you check something manually?

  1. Create a long-running MPIJob using the released controller.
  2. Upgrade the controller to a build containing this PR
  3. Check that the MPIJob continues running.

I believe it should continue to run, but let's better be safe.

@alculquicondor
I know you want to make sure that an in-place upgrade does not break existing jobs, so I did check it and it worked.
Note that one more svc with the same name as the job will be created for all existing jobs, though it will not be used. For newly created jobs, only the new svc will be created and used.

@alculquicondor
Collaborator

Note that one more svc with the same name as the job will be created for all existing jobs, though it will not be used. For newly created jobs, only the new svc will be created and used.

That's what I thought, and it's ok.

/approve

@tenzen-y anything else?


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@alculquicondor
Collaborator

ah, @tenzen-y already gave lgtm, so it will merge :)

@tenzen-y
Member

Implementation looks good, but can you check something manually?

  1. Create a long-running MPIJob using the released controller.
  2. Upgrade the controller to a build containing this PR
  3. Check that the MPIJob continues running.

I believe it should continue to run, but let's better be safe.

Maybe we can add an integration test for this situation by enabling and disabling this feature, but I'm ok with working on it in another PR.

Member

@tenzen-y tenzen-y left a comment

/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Feb 26, 2024
@google-oss-prow google-oss-prow bot merged commit a6c2da8 into kubeflow:master Feb 26, 2024
11 checks passed