Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix overflow bug in printing of Legendre polynomial arrays #202

Merged
merged 1 commit into from
Jan 28, 2025

Conversation

samhatfield
Copy link
Collaborator

For large problems e.g. TCO2559, the sizes (in bytes) of arrays FG%ZAA and FG%ZAS can be reported as negative in SETUP_TRANS. This is because the SIZE function returns a 4-byte signed integer value, maximum about 2 billion, by default. For certain configurations the sizes of ZAS and ZAA (in number of elements) can go above this value, leading to an integer overflow. This number is then multiplied by the result of C_SIZEOF which I think returns a C_SIZE_T. It seems that the result of this product between two different integer types is an 8-byte integer, for some reason.

This PR fixes this bug by requesting a wider integer value from the SIZE function.

How to reproduce

Write a simple program that only does a setup:

PROGRAM TEST_PROGRAM

USE PARKIND1, ONLY: JPIM, JPRB
USE MPL_MODULE, ONLY: MPL_INIT, MPL_MYRANK, MPL_NPROC

IMPLICIT NONE

INTEGER(KIND=JPIM) :: NPROC, MYPROC, NPRTRV, NPRTRW, ISQR, JA, IB, NPRGPNS, NPRGPEW

INTEGER(JPIM) :: TRUNC = 2559

#include "setup_trans0.h"
#include "setup_trans.h"

CALL MPL_INIT

MYPROC = MPL_MYRANK()
NPROC  = MPL_NPROC()

! Only output to stdout on first task
IF (NPROC > 1) THEN
    IF (MYPROC /= 1) THEN
      OPEN(UNIT=6, FILE='/dev/null')
    ENDIF
ENDIF

ISQR = INT(SQRT(REAL(NPROC,JPRB)))
DO JA = ISQR, NPROC
  IB = NPROC/JA
  IF (JA*IB == NPROC) THEN
    NPRGPNS = MAX(JA,IB)
    NPRGPEW = MIN(JA,IB)
    EXIT
  ENDIF
ENDDO

NPRTRV = NPROC
NPRTRW = NPROC / NPRTRV

WRITE(6,*) "NPROC = ", NPROC
WRITE(6,*) "NPRGPNS X NPRGPEW", NPRGPNS, NPRGPEW
WRITE(6,*) "NPRTRW X NPRTRV= ", NPRTRW, NPRTRV

! Initialise ecTrans (resolution-agnostic aspects)
CALL SETUP_TRANS0(LDMPOFF=.FALSE., KPRTRW=NPRTRW, KPRGPNS=NPRGPNS, KPRGPEW=NPRGPEW)

! Initialise ecTrans (resolution-specific aspects)
CALL SETUP_TRANS(KSMAX=TRUNC, KDGL=2 * (TRUNC + 1), LDUSERPNM=.FALSE.)

END PROGRAM TEST_PROGRAM

This sets NPRTRV to the number of MPI tasks which maximises the memory consumption of the Legendre polynomials on each task.

Run this test with, e.g. 256 MPI tasks (which needs about 128 nodes on the ECMWF HPC to avoid running out of memory).

The test can be run even without GPUs available by commenting out all OpenACC or OpenMP target directives from SETUP_TRANS.

The following will be printed to stdout:

    FG%ZAS: -297877504B
    FG%ZAA: -297877504B
   FG%ZAS0:   26234880B
   FG%ZAA0:   26214400B
 FG%ZEPSNM:   26234880B

After this PR, the following will be printed:

    FG%ZAS:  16881991680B
    FG%ZAA:  16881991680B
   FG%ZAS0:     26234880B
   FG%ZAA0:     26214400B
 FG%ZEPSNM:     26234880B

@samhatfield samhatfield added the bug Something isn't working label Jan 27, 2025
@samhatfield samhatfield requested a review from lukasm91 January 27, 2025 17:00
Copy link
Collaborator

@lukasm91 lukasm91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense, good catch!

@marsdeno
Copy link
Collaborator

This looks fine to me, but possibly naive question, why don't we just do something like
STORAGE_SIZE(FG%ZAA) ?

@samhatfield
Copy link
Collaborator Author

Good point. Or even STORAGE_SIZE(FG%ZAA, KIND=JPIB)/8.

@samhatfield
Copy link
Collaborator Author

I don't think STORAGE_SIZE will work. Here's why:

From GNU Fortran docs:

The result value is the size expressed in bits for an element of an array that has the dynamic type and type parameters of A.

When I tried out STORAGE_SIZE(FG%ZAA, KIND=JPIB)/8 it printed 4 which is the storage size of a single element of FG%ZAA. Hence I don't think STORAGE_SIZE returns the size of the full array, but rather just one element.

@samhatfield samhatfield merged commit e740d80 into develop Jan 28, 2025
25 checks passed
@samhatfield samhatfield deleted the fix_work_array_print_overflow branch January 28, 2025 11:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants