use mixin class for osu #222

Merged
casparvl merged 8 commits into EESSI:main on Jan 21, 2025

Conversation

smoors
Collaborator

@smoors smoors commented Dec 30, 2024

EESSI_Mixin class:

  • change measure_mem_usage() to @run_before('setup', always_last=True) to avoid perf_variables being overwritten by the hpctestlib (see the sketch below)
  • set default num_tasks_per_compute_unit
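
A minimal sketch of that hook change (decorator and class skeleton only; the method body and the rest of the mixin are elided, so this is not the literal diff):

import reframe as rfm
from reframe.core.builtins import run_before


class EESSI_Mixin(rfm.RegressionMixin):
    # this PR also sets a default for num_tasks_per_compute_unit (value not shown here)

    @run_before('setup', always_last=True)
    def measure_mem_usage(self):
        # running as the very last pre-setup hook keeps the hpctestlib from
        # overwriting the perf_variables that this method sets
        ...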

OSU test:

  • use mixin class (see the usage sketch below)
  • create base class to minimize code duplication
  • reduce required memory per node to 1GB per task
  • add partial node scales for the pt2pt GPU test and skip the ones that don't have exactly 2 GPUs

fixes #145
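
A minimal usage sketch of that structure (module paths, class names and the memory accounting are simplified assumptions, not the literal diff of this PR):

import reframe as rfm
import reframe.utility.sanity as sn
from reframe.core.builtins import sanity_function

from eessi.testsuite.eessi_mixin import EESSI_Mixin
from eessi.testsuite.constants import COMPUTE_UNIT, DEVICE_TYPES, CPU, NODE


class EESSI_OSU_base(rfm.RunOnlyRegressionTest, EESSI_Mixin):
    """Settings shared by the OSU variants, so each concrete test only declares what differs."""
    time_limit = '30m'

    def required_mem_per_node(self):
        # roughly 1 GB per task, as reduced in this PR (the exact accounting is an assumption)
        return self.num_tasks_per_node * 1024


@rfm.simple_test
class EESSI_OSU_pt2pt_cpu_sketch(EESSI_OSU_base):
    device_type = DEVICE_TYPES[CPU]
    compute_unit = COMPUTE_UNIT[NODE]
    executable = 'osu_latency'

    @sanity_function
    def assert_ran(self):
        return sn.assert_found(r'^# OSU MPI Latency Test', self.stdout)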

@smoors smoors added this to the v0.4.1 milestone Dec 30, 2024
from eessi.testsuite.utils import find_modules, log


def filter_scales_pt2pt():
"""
Filtering function for filtering scales for the pt2pt OSU test
returns all scales with either 2 cores, 1 full node, or 2 full nodes
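
A sketch of how such a filter could look, assuming the SCALES layout from eessi.testsuite.constants (the num_cpus_per_node and node_part keys are assumptions, so this is not necessarily the code under review):

from eessi.testsuite.constants import SCALES


def filter_scales_pt2pt():
    """Return the names of all scales with either 2 cores (in total), 1 full node, or 2 full nodes."""
    return [
        name for name, scale in SCALES.items()
        if scale['num_nodes'] * scale.get('num_cpus_per_node', 0) == 2
        or (scale.get('node_part') == 1 and scale['num_nodes'] in (1, 2))
    ]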
Collaborator

I was a bit surprised that it runs on 2 cores, a full node and 2 full nodes, but not e.g. on 2 GPUs (which on Snellius happens to be 1_2_node). This was already the case before your changes here, so it's not introduced in this PR, but maybe @satishskamath can comment on why this is the case? I do seem to remember him saying he thought exclusive nodes would be good for reproducibility for these tests, but by that argument the 2-cores size should also have been filtered out - I'd say having 2 GPUs is the equivalent.

Now, I do think it is difficult to filter the GPUs down to 2-GPU setups, as you don't know this until after the setup phase. I.e. we'd basically have to accept everything single-node in the init phase, and then skip the ones that would provide more than 2 GPUs.

Collaborator Author

will fix this in a follow-up PR, see #224

Collaborator Author

edit: i decided to add it to this PR anyway.

Collaborator

My initial attempt kept the 2_cores and 1cpn_2nodes cases for GPUs in the point-to-point tests because these cases are valid, and one needs at least one host CPU to assign to a device. But these scales are mainly meant for CPUs and can therefore be a bit misleading, and I do lean towards the view that network-based tests are best performed on two full exclusive nodes or 1 full node, which is the objective of this test as well.

Collaborator

@casparvl casparvl left a comment

Tested, runs fine. My main question would be the required_mem_per_node remark; otherwise, I think this is ready to be merged.

@casparvl
Collaborator

I found an issue with the OSU test that I didn't realize was there before. It does NOT pop up if you have the ALWAYS_REQUEST_GPUS feature set. But if you don't, then test.num_gpus_per_node doesn't get set anywhere, and that results in this error:

[2025-01-14T11:45:49] info: reframe:   * Reason: type error: ../test-suite/eessi/testsuite/tests/apps/osu.py:151: unsupported operand type(s) for *: 'NoneType' and 'int'
        num_gpus = self.num_gpus_per_node * self.num_nodes

[2025-01-14T11:45:49] verbose: reframe: Traceback (most recent call last):
  File "/gpfs/home4/casparl/EESSI/reframe/reframe/frontend/executors/__init__.py", line 318, in _safe_call
    return fn(*args, **kwargs)
  File "/gpfs/home4/casparl/EESSI/reframe/reframe/core/hooks.py", line 111, in _fn
    getattr(obj, h.__name__)()
  File "/gpfs/home4/casparl/EESSI/reframe/reframe/core/hooks.py", line 38, in _fn
    func(*args, **kwargs)
  File "/gpfs/home4/casparl/EESSI/test-suite/eessi/testsuite/tests/apps/osu.py", line 151, in skip_test_1gpu
    num_gpus = self.num_gpus_per_node * self.num_nodes
TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'

Now, it makes total sense to use

compute_unit = COMPUTE_UNIT[NODE]

and set

    @run_after('init')
    def set_num_tasks_per_compute_unit(self):
        """ Setting number of tasks per compute unit and cpus per task. This sets num_cpus_per_task
        for 1 node and 2 node options where the request is for full nodes."""
        if SCALES.get(self.scale).get('num_nodes') == 1:
            self.num_tasks_per_compute_unit = 2

in order to make sure the right number of tasks is launched per node. So, I'm a bit at a loss where we messed this up. Should we somehow have this test set test.num_gpus_per_node = test.default_num_gpus_per_node manually or something, now that the hook won't do it? Or should we consider this a 'bug' in the hook? I don't think it is a bug in the hook, tbh: it had no specific reason to explicitly ask for a num_gpus_per_node, as we didn't define `ALWAYS_REQUEST_GPUS`. I have the feeling it's the test's responsibility to say 'yeah, I'm asking for node as a compute unit, BUT I still want to see how many GPUs per node I'm expecting to get, and skip if that's only 1'.

@casparvl
Collaborator

Note that it is not just important for the skip_if: we need to set

    @run_after('setup')
    def set_gpus_per_node(self):
        self.num_gpus_per_node = self.default_num_gpus_per_node

in order for those GPUs to even be requested by the job... otherwise, you'd get a CPU-only allocation on a GPU node.

Even more accurate might be:

    @run_after('setup')
    def skip_test_1gpu(self):
        num_gpus = self.default_num_gpus_per_node * self.num_nodes
        self.skip_if(
            num_gpus != 2 and self.scale not in ['1_node', '2_nodes'],
            f"Skipping test : {num_gpus} GPU(s) available for this test case, need exactly 2"
        )

    @run_after('setup')
    def set_gpus_per_node(self):
        if self.scale not in ['1_node', '2_nodes']:
            self.num_gpus_per_node = 2  # make sure to allocate 2 GPUs
        elif self.scale in ['1_node']:
            self.num_gpus_per_node = self.default_num_gpus_per_node  # just allocate exclusively
        else:
            self.num_gpus_per_node = 1  # for multinode tests, we use 1 GPU per node

For collectives, there is no problem, because there we map the DEVICE_TYPES[GPU] to COMPUTE_UNIT[GPU], thus defining the self.num_gpus_per_node just fine.
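
For contrast, a minimal sketch of that mapping on the collective side (constant names as used in this thread; the class is purely illustrative, not the actual test):

import reframe as rfm
from eessi.testsuite.constants import COMPUTE_UNIT, DEVICE_TYPES, GPU


class EESSI_OSU_coll_gpu_sketch(rfm.RunOnlyRegressionTest):
    # with the GPU itself as the compute unit, the hooks derive num_gpus_per_node,
    # so no manual setup-phase fix-up like the one above is needed
    device_type = DEVICE_TYPES[GPU]
    compute_unit = COMPUTE_UNIT[GPU]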

@casparvl
Collaborator

Ok, this doesn't account for the fact that the node could have only 1 GPU per node. We should do:

    @run_after('setup')
    def skip_test_1gpu(self):
        num_gpus = self.default_num_gpus_per_node * self.num_nodes
        # On a partial node allocation, run this test only if exactly 2 GPUs are allocated
        if self.scale not in ['1_node', '2_nodes']:
            self.skip_if(
                num_gpus != 2,
                f"Skipping test : {num_gpus} GPU(s) available for this test case, need exactly 2"
            )
        # If the scale is 1_node, make sure there are at least 2 GPUs
        elif self.scale == '1_node':
            self.skip_if(num_gpus < 2, "Skipping GPU test : only 1 GPU available for this test case")

    @run_after('setup')
    def set_gpus_per_node(self):
        if self.scale not in ['1_node', '2_nodes']:
            self.num_gpus_per_node = 2  # make sure to allocate 2 GPUs
        elif self.scale == '1_node':
            self.num_gpus_per_node = self.default_num_gpus_per_node  # just allocate exclusively
        else:
            self.num_gpus_per_node = 1  # for multinode tests, we use 1 GPU per node

@casparvl
Collaborator

Note: I made a PR to your feature branch @smoors

@satishskamath
Collaborator

satishskamath commented Jan 16, 2025

> I found an issue with the OSU test that I didn't realize was there before. [...] I have the feeling it's the test's responsibility to say 'yeah, I'm asking for node as a compute unit, BUT I still want to see how many GPUs per node I'm expecting to get, and skip if that's only 1'.

@casparvl In the version in the main branch, I used to set this in the test itself.

@smoors
Collaborator Author

smoors commented Jan 19, 2025

@casparvl good catch!
Your proposed solution should work, but I think there's a more elegant solution that may also be useful for other tests.

I'll update this PR and copy the other improvements in your PR that are needed or useful.

EDIT: see c33114e, which adds an extra variable always_request_gpus that can be set by GPU-only tests to ensure GPUs are always requested.
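
A minimal sketch of the idea (the variable name comes from the comment above; how the hooks consume it is an assumption, not the contents of c33114e):

import reframe as rfm


class EESSI_OSU_pt2pt_gpu_sketch(rfm.RunOnlyRegressionTest):
    # a GPU-only test opts in, so GPUs are requested even when the compute unit is a full node
    always_request_gpus = True


def request_gpus_sketch(test):
    """Hook-side behaviour, roughly as described above (assumed, not the actual hook code)."""
    if getattr(test, 'always_request_gpus', False) and not test.num_gpus_per_node:
        test.num_gpus_per_node = test.default_num_gpus_per_node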

Collaborator

@casparvl casparvl left a comment

Looks good to me. We should realize though that setting always_request_gpus on scales for which node_part is not set means the maximum number of GPUs will be requested. E.g. for a 1_core scale, you'd still request all GPUs.

It doesn't matter for this test, we only request sizes for pt2pt that have node_part set. So all good. But just to keep in mind.

@casparvl casparvl merged commit 6785a0d into EESSI:main Jan 21, 2025
17 checks passed
@smoors
Collaborator Author

smoors commented Jan 21, 2025

> Looks good to me. We should realize though that setting always_request_gpus on scales for which node_part is not set means the maximum number of GPUs will be requested. E.g. for a 1_core scale, you'd still request all GPUs.
>
> It doesn't matter for this test, we only request sizes for pt2pt that have node_part set. So all good. But just to keep in mind.

Hm, indeed not an issue for this test, but I'll see if we can fix this so it won't bite us later on.

EDIT: checking the hooks code, this shouldn't happen, because the 1_core scale has num_gpus_per_node defined and set_tag_scale() sets default_num_gpus_per_node to this value if it is defined. To be verified.
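
For reference, a sketch of that behaviour (the example SCALES entries and the hook logic are assumptions pieced together from this thread, not the suite's code):

SCALES_EXAMPLE = {
    # partial-node scale: explicit counts, including a pinned GPU count
    '1_core': {'num_nodes': 1, 'num_cpus_per_node': 1, 'num_gpus_per_node': 1},
    # full-node scale: node_part is set, resources are derived from the node itself
    '1_node': {'num_nodes': 1, 'node_part': 1},
}


def set_tag_scale_sketch(test, scales=SCALES_EXAMPLE):
    """If the scale pins num_gpus_per_node (as '1_core' does), use it as the default."""
    scale = scales[test.scale]
    if 'num_gpus_per_node' in scale:
        # so always_request_gpus on a 1_core job requests 1 GPU, not a whole node's worth
        test.default_num_gpus_per_node = scale['num_gpus_per_node']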

Successfully merging this pull request may close these issues.

OSU pt2pt test don't run on node-parts for GPUs.