use mixin class for osu #222

Merged
casparvl merged 8 commits into EESSI:main on Jan 21, 2025

Conversation

smoors
Collaborator

@smoors smoors commented Dec 30, 2024

EESSI_Mixin class:

  • change measure_mem_usage() to @run_before('setup', always_last=True) to avoid perf_variables being overwritten by the hpctestlib (see the sketch below)
  • set default num_tasks_per_compute_unit
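
A minimal sketch of that hook change (decorator and class skeleton only; the method body and the rest of the mixin are elided, so this is not the literal diff):

import reframe as rfm
from reframe.core.builtins import run_before


class EESSI_Mixin(rfm.RegressionMixin):
    # this PR also sets a default for num_tasks_per_compute_unit (value not shown here)

    @run_before('setup', always_last=True)
    def measure_mem_usage(self):
        # running as the very last pre-setup hook keeps the hpctestlib from
        # overwriting the perf_variables that this method sets
        ...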

OSU test:

  • use mixin class (see the usage sketch below)
  • create base class to minimize code duplication
  • reduce required memory per node to 1GB per task
  • add partial node scales for the pt2pt GPU test and skip the ones that don't have exactly 2 GPUs

fixes #145
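
A minimal usage sketch of that structure (module paths, class names and the memory accounting are simplified assumptions, not the literal diff of this PR):

import reframe as rfm
import reframe.utility.sanity as sn
from reframe.core.builtins import sanity_function

from eessi.testsuite.eessi_mixin import EESSI_Mixin
from eessi.testsuite.constants import COMPUTE_UNIT, DEVICE_TYPES, CPU, NODE


class EESSI_OSU_base(rfm.RunOnlyRegressionTest, EESSI_Mixin):
    """Settings shared by the OSU variants, so each concrete test only declares what differs."""
    time_limit = '30m'

    def required_mem_per_node(self):
        # roughly 1 GB per task, as reduced in this PR (the exact accounting is an assumption)
        return self.num_tasks_per_node * 1024


@rfm.simple_test
class EESSI_OSU_pt2pt_cpu_sketch(EESSI_OSU_base):
    device_type = DEVICE_TYPES[CPU]
    compute_unit = COMPUTE_UNIT[NODE]
    executable = 'osu_latency'

    @sanity_function
    def assert_ran(self):
        return sn.assert_found(r'^# OSU MPI Latency Test', self.stdout)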

@smoors smoors added this to the v0.4.1 milestone Dec 30, 2024
from eessi.testsuite.utils import find_modules, log


def filter_scales_pt2pt():
"""
Filtering function for filtering scales for the pt2pt OSU test
returns all scales with either 2 cores, 1 full node, or 2 full nodes
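
A sketch of how such a filter could look, assuming the SCALES layout from eessi.testsuite.constants (the num_cpus_per_node and node_part keys are assumptions, so this is not necessarily the code under review):

from eessi.testsuite.constants import SCALES


def filter_scales_pt2pt():
    """Return the names of all scales with either 2 cores (in total), 1 full node, or 2 full nodes."""
    return [
        name for name, scale in SCALES.items()
        if scale['num_nodes'] * scale.get('num_cpus_per_node', 0) == 2
        or (scale.get('node_part') == 1 and scale['num_nodes'] in (1, 2))
    ]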
Collaborator

I was a bit surprised that it runs on 2 cores, a full node and 2 full nodes, but not e.g. on 2 GPUs (which on Snellius happens to be 1_2_node). This was already the case before your changes here, so it's not introduced in this PR, but maybe @satishskamath can comment on why this is the case? I do seem to remember him saying he thought exclusive nodes would be good for reproducibility for these tests, but by that argument the 2-cores size should also have been filtered out - I'd say having 2 GPUs is the equivalent.

Now, I do think it is difficult to filter the GPUs down to 2-GPU setups, as you don't know this until after the setup phase. I.e. we'd basically have to accept everything single-node in the init phase, and then skip the ones that would provide more than 2 GPUs.

Collaborator Author

will fix this in a follow-up PR, see #224

Collaborator Author

edit: i decided to add it to this PR anyway.

Collaborator

My initial attempt kept the 2_cores and 1cpn_2nodes cases for GPUs in the point-to-point tests because these cases are valid, and one needs at least one host CPU to assign to a device. But these scales are mainly meant for CPUs and can therefore be a bit misleading, and I do lean towards the view that network-based tests are best performed on two full exclusive nodes or 1 full node, which is the objective of this test as well.

Collaborator

@casparvl casparvl left a comment

Tested, runs fine. My main question would be the required_mem_per_node remark; otherwise, I think this is ready to be merged.

@casparvl
Collaborator

I found an issue with the OSU test that I didn't realize was there before. It does NOT pop up if you have the ALWAYS_REQUEST_GPUS feature set. But if you don't, then test.num_gpus_per_node doesn't get set anywhere, and that results in this error:

[2025-01-14T11:45:49] info: reframe:   * Reason: type error: ../test-suite/eessi/testsuite/tests/apps/osu.py:151: unsupported operand type(s) for *: 'NoneType' and 'int'
        num_gpus = self.num_gpus_per_node * self.num_nodes

[2025-01-14T11:45:49] verbose: reframe: Traceback (most recent call last):
  File "/gpfs/home4/casparl/EESSI/reframe/reframe/frontend/executors/__init__.py", line 318, in _safe_call
    return fn(*args, **kwargs)
  File "/gpfs/home4/casparl/EESSI/reframe/reframe/core/hooks.py", line 111, in _fn
    getattr(obj, h.__name__)()
  File "/gpfs/home4/casparl/EESSI/reframe/reframe/core/hooks.py", line 38, in _fn
    func(*args, **kwargs)
  File "/gpfs/home4/casparl/EESSI/test-suite/eessi/testsuite/tests/apps/osu.py", line 151, in skip_test_1gpu
    num_gpus = self.num_gpus_per_node * self.num_nodes
TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'

Now, it makes total sense to use

compute_unit = COMPUTE_UNIT[NODE]

and set

    @run_after('init')
    def set_num_tasks_per_compute_unit(self):
        """ Setting number of tasks per compute unit and cpus per task. This sets num_cpus_per_task
        for 1 node and 2 node options where the request is for full nodes."""
        if SCALES.get(self.scale).get('num_nodes') == 1:
            self.num_tasks_per_compute_unit = 2

in order to make sure the right number of tasks is launched per node. So, I'm a bit at a loss where we messed this up. Should we somehow have this test set test.num_gpus_per_node = test.default_num_gpus_per_node manually or something, now that the hook won't do it? Or should we consider this a 'bug' in the hook? I don't think it is a bug in the hook, tbh: it had no specific reason to explicitly ask for a num_gpus_per_node, as we didn't define `ALWAYS_REQUEST_GPUS`. I have the feeling it's the test's responsibility to say 'yeah, I'm asking for node as a compute unit, BUT I still want to see how many GPUs per node I'm expecting to get, and skip if that's only 1'.

@casparvl
Collaborator

Note that it is not just important for the skip_if: we need to set

    @run_after('setup')
    def set_gpus_per_node(self):
        self.num_gpus_per_node = self.default_num_gpus_per_node

in order for those GPUs to even be requested by the job... otherwise, you'd get a CPU-only allocation on a GPU node.

Even more accurate might be:

    @run_after('setup')
    def skip_test_1gpu(self):
        num_gpus = self.default_num_gpus_per_node * self.num_nodes
        self.skip_if(
            num_gpus != 2 and self.scale not in ['1_node', '2_nodes'],
            f"Skipping test : {num_gpus} GPU(s) available for this test case, need exactly 2"
        )

    @run_after('setup')
    def set_gpus_per_node(self):
        if self.scale not in ['1_node', '2_nodes']:
            self.num_gpus_per_node = 2  # make sure to allocate 2 GPUs
        elif self.scale in ['1_node']:
            self.num_gpus_per_node = self.default_num_gpus_per_node  # just allocate exclusively
        else:
            self.num_gpus_per_node = 1  # for multinode tests, we use 1 GPU per node

For collectives, there is no problem, because there we map the DEVICE_TYPES[GPU] to COMPUTE_UNIT[GPU], thus defining the self.num_gpus_per_node just fine.
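
For contrast, a minimal sketch of that mapping on the collective side (constant names as used in this thread; the class is purely illustrative, not the actual test):

import reframe as rfm
from eessi.testsuite.constants import COMPUTE_UNIT, DEVICE_TYPES, GPU


class EESSI_OSU_coll_gpu_sketch(rfm.RunOnlyRegressionTest):
    # with the GPU itself as the compute unit, the hooks derive num_gpus_per_node,
    # so no manual setup-phase fix-up like the one above is needed
    device_type = DEVICE_TYPES[GPU]
    compute_unit = COMPUTE_UNIT[GPU]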

@casparvl
Collaborator

Ok, this doesn't account for the fact that the node could have only 1 GPU per node. We should do:

    @run_after('setup')
    def skip_test_1gpu(self):
        num_gpus = self.default_num_gpus_per_node * self.num_nodes
        # On a partial node allocation, run this test only if exactly 2 GPUs are allocated
        if self.scale not in ['1_node', '2_nodes']:
            self.skip_if(
                num_gpus != 2,
                f"Skipping test : {num_gpus} GPU(s) available for this test case, need exactly 2"
            )
        # If the scale is 1_node, make sure there are at least 2 GPUs
        elif self.scale == '1_node':
            self.skip_if(num_gpus < 2, "Skipping GPU test : only 1 GPU available for this test case")

    @run_after('setup')
    def set_gpus_per_node(self):
        if self.scale not in ['1_node', '2_nodes']:
            self.num_gpus_per_node = 2  # make sure to allocate 2 GPUs
        elif self.scale == '1_node':
            self.num_gpus_per_node = self.default_num_gpus_per_node  # just allocate exclusively
        else:
            self.num_gpus_per_node = 1  # for multinode tests, we use 1 GPU per node

@casparvl
Collaborator

Note: I made a PR to your feature branch @smoors

@satishskamath
Collaborator

satishskamath commented Jan 16, 2025

> I found an issue with the OSU test that I didn't realize was there before. [...] I have the feeling it's the test's responsibility to say 'yeah, I'm asking for node as a compute unit, BUT I still want to see how many GPUs per node I'm expecting to get, and skip if that's only 1'.

@casparvl In the version in the main branch, I used to set this in the test itself.

@smoors
Collaborator Author

smoors commented Jan 19, 2025

@casparvl good catch!
Your proposed solution should work, but I think there's a more elegant solution that may also be useful for other tests.

I'll update this PR and copy the other improvements in your PR that are needed or useful.

EDIT: see c33114e, which adds an extra variable always_request_gpus that can be set by GPU-only tests to ensure GPUs are always requested.
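
A minimal sketch of the idea (the variable name comes from the comment above; how the hooks consume it is an assumption, not the contents of c33114e):

import reframe as rfm


class EESSI_OSU_pt2pt_gpu_sketch(rfm.RunOnlyRegressionTest):
    # a GPU-only test opts in, so GPUs are requested even when the compute unit is a full node
    always_request_gpus = True


def request_gpus_sketch(test):
    """Hook-side behaviour, roughly as described above (assumed, not the actual hook code)."""
    if getattr(test, 'always_request_gpus', False) and not test.num_gpus_per_node:
        test.num_gpus_per_node = test.default_num_gpus_per_node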

Collaborator

@casparvl casparvl left a comment

Looks good to me. We should realize though that setting always_request_gpus on scales for which node_part is not set means the maximum number of GPUs will be requested. E.g. for a 1_core scale, you'd still request all GPUs.

It doesn't matter for this test, we only request sizes for pt2pt that have node_part set. So all good. But just to keep in mind.

@casparvl casparvl merged commit 6785a0d into EESSI:main Jan 21, 2025
17 checks passed
@smoors
Collaborator Author

smoors commented Jan 21, 2025

> Looks good to me. We should realize though that setting always_request_gpus on scales for which node_part is not set means the maximum number of GPUs will be requested. E.g. for a 1_core scale, you'd still request all GPUs.
>
> It doesn't matter for this test, we only request sizes for pt2pt that have node_part set. So all good. But just to keep in mind.

Hm, indeed not an issue for this test, but I'll see if we can fix this so it won't bite us later on.

EDIT: checking the hooks code, this shouldn't happen, because the 1_core scale has num_gpus_per_node defined and set_tag_scale() sets default_num_gpus_per_node to this value if it is defined. To be verified.
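
For reference, a sketch of that behaviour (the example SCALES entries and the hook logic are assumptions pieced together from this thread, not the suite's code):

SCALES_EXAMPLE = {
    # partial-node scale: explicit counts, including a pinned GPU count
    '1_core': {'num_nodes': 1, 'num_cpus_per_node': 1, 'num_gpus_per_node': 1},
    # full-node scale: node_part is set, resources are derived from the node itself
    '1_node': {'num_nodes': 1, 'node_part': 1},
}


def set_tag_scale_sketch(test, scales=SCALES_EXAMPLE):
    """If the scale pins num_gpus_per_node (as '1_core' does), use it as the default."""
    scale = scales[test.scale]
    if 'num_gpus_per_node' in scale:
        # so always_request_gpus on a 1_core job requests 1 GPU, not a whole node's worth
        test.default_num_gpus_per_node = scale['num_gpus_per_node']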

Successfully merging this pull request may close these issues.

OSU pt2pt test don't run on node-parts for GPUs.