MPI parallelization for HPC #169

Open

skvarjun opened this issue Sep 14, 2024 · 2 comments
Labels: p: normal, t: task

Comments

skvarjun commented Sep 14, 2024

I was trying to run MALAMUTE on an HPC cluster with 24 cores. The program runs smoothly in serial with the command:

malamute-opt -i dcs5_5_mm_constant_properties.i >& log.out

For the MPI run, the execution line is as follows:

mpirun -n 24 malamute-opt -i dcs5_5_mm_constant_properties.i >& log.out
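
For anyone trying to reproduce this on a similar cluster, a minimal Slurm batch script for the MPI case might look like the sketch below (job name, partition, walltime, and module names are placeholders, not the actual cluster settings):

#!/bin/bash
#SBATCH --job-name=malamute_mpi      # job name (placeholder)
#SBATCH --ntasks=24                  # 24 MPI ranks, matching the mpirun command above
#SBATCH --time=04:00:00              # walltime (placeholder)
#SBATCH --partition=general          # partition name (placeholder)

# Load the MPI/compiler environment used to build MALAMUTE (module name is a placeholder)
module load openmpi

# Same input file as above, launched on 24 ranks with stdout/stderr captured
mpirun -n 24 malamute-opt -i dcs5_5_mm_constant_properties.i >& log.out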

The log file shows the error below:

[MALAMUTE ASCII art banner]

MALAMUTE: MOOSE Application Library for Advanced Manufacturing UTilitiEs

Copyright 2021 - 2024, Battelle Energy Alliance, LLC
ALL RIGHTS RESERVED

[Government rights and license notice omitted]

In UnstructuredMesh::stitch_meshes:
This mesh has 63 nodes on boundary 'bottom_ram_spacer_top' (2). Other mesh has 21 nodes on boundary ' (4).
Minimum edge length on both surfaces is 0.001.
In UnstructuredMesh::stitch_meshes:
Found 21 matching nodes.

In UnstructuredMesh::stitch_meshes:
This mesh has 81 nodes on boundary 'bottom_sinter_spacer_top' (14). Other mesh has 55 nodes on boundary ' (16).
Minimum edge length on both surfaces is 0.0005.
In UnstructuredMesh::stitch_meshes:
Found 55 matching nodes.

In UnstructuredMesh::stitch_meshes:
This mesh has 63 nodes on boundary 'top_ram_spacer_bottom' (44). Other mesh has 21 nodes on boundary ' (50).
Minimum edge length on both surfaces is 0.001.
In UnstructuredMesh::stitch_meshes:
Found 21 matching nodes.

In UnstructuredMesh::stitch_meshes:
This mesh has 81 nodes on boundary 'top_sinter_spacer_bottom' (32). Other mesh has 55 nodes on boundary ' (38).
Minimum edge length on both surfaces is 0.0005.
In UnstructuredMesh::stitch_meshes:
Found 55 matching nodes.

Framework Information:
MOOSE Version: git commit 4f939da0bd on 2024-09-10
LibMesh Version:
PETSc Version: 3.21.4
SLEPc Version: 3.21.1
Current Time: Fri Sep 13 22:37:17 2024
Executable Timestamp: Wed Sep 11 13:34:12 2024

Checkpoint:
Wall Time Interval: Every 3600 s
User Checkpoint: Disabled
Checkpoints Kept: 2
Execute On: TIMESTEP_END

Parallelism:
Num Processors: 24
Num Threads: 1

Mesh:
Parallel Type: replicated
Mesh Dimension: 2
Spatial Dimension: 2
Nodes:
Total: 21788
Local: 949
Min/Max/Avg: 846/989/907
Elems:
Total: 5744
Local: 242
Min/Max/Avg: 218/266/239
Num Subdomains: 34
Num Partitions: 24
Partitioner: metis

Nonlinear System:
Num DOFs: 44636
Num Local DOFs: 1952
Variables: { "temperature" "potential" } { "temperature_bottom_ram_cc_lm" "potential_bottom_ram_cc_lm"
} { "temperature_bottom_cc_sinter_lm" "potential_bottom_cc_sinter_lm" } { "temperature_bottom_sinter_punch_lm"
"potential_bottom_sinter_punch_lm" } { "temperature_bottom_punch_powder_lm"
"potential_bottom_punch_powder_lm" } { "temperature_powder_top_punch_lm"
"potential_powder_top_punch_lm" } { "temperature_top_punch_sinter_lm" "potential_top_punch_sinter_lm"
} { "temperature_top_sinter_cc_lm" "potential_top_sinter_cc_lm" } { "temperature_top_cc_ram_lm"
"potential_top_cc_ram_lm" } { "temperature_inside_low_punch_lm" "potential_inside_low_punch_lm"
} { "temperature_inside_powder_lm" "potential_inside_powder_lm" } { "temperature_inside_top_punch_lm"
"potential_inside_top_punch_lm" } "temperature_gap_top_sinter_die_lm" "temperature_gap_bottom_sinter_die_lm"

Finite Element Types: "LAGRANGE" "LAGRANGE" "LAGRANGE" "LAGRANGE" "LAGRANGE" "LAGRANGE" "LAGRANGE"
"LAGRANGE" "LAGRANGE" "LAGRANGE" "LAGRANGE" "LAGRANGE" "LAGRANGE" "LAGRANGE"

Approximation Orders: "SECOND" "SECOND" "SECOND" "SECOND" "SECOND" "SECOND" "SECOND" "SECOND" "SECOND"
"SECOND" "SECOND" "SECOND" "SECOND" "SECOND"

Auxiliary System:
Num DOFs: 53289
Num Local DOFs: 2277
Variables: "heat_transfer_radiation" { "electric_field_x" "electric_field_y" } "interface_normal_lm"

Finite Element Types: "LAGRANGE" "MONOMIAL" "LAGRANGE"
Approximation Orders: "SECOND" "FIRST" "FIRST"

Execution Information:
Executioner: Transient
TimeStepper: ConstantDT
TimeIntegrator: ImplicitEuler
Solver Mode: NEWTON
PETSc Preconditioner: lu mat_superlu_dist_fact: SamePattern mat_superlu_dist_replacetinypivot: true
MOOSE Preconditioner: SMP

LEGACY MODES ENABLED:
This application uses the legacy initial residual evaluation behavior. The legacy behavior performs an often times redundant residual evaluation before the solution modifying objects are executed prior to the initial (0th nonlinear iteration) residual evaluation. The new behavior skips that redundant residual evaluation unless the parameter Executioner/use_pre_smo_residual is set to true. To remove this message and enable the new behavior, set the parameter 'use_legacy_initial_residual_evaluation_behavior' to false in *App.C. Some tests that rely on the side effects of the legacy behavior may fail/diff and should be re-golded.

Time Step 0, time = 0

Postprocessor Values:
+----------------+----------------+----------------+
| time | applied_current| pyrometer_point|
+----------------+----------------+----------------+
| 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
+----------------+----------------+----------------+

Time Step 1, time = 6, dt = 6
Pre-SMO residual: -nan
Nonlinear solve did not converge due to DIVERGED_FNORM_NAN iterations 0
Solve Did NOT Converge!
Aborting as solve did not converge

Time Step 1, time = 3, dt = 3
Pre-SMO residual: -nan
Nonlinear solve did not converge due to DIVERGED_FNORM_NAN iterations 0
Solve Did NOT Converge!
Aborting as solve did not converge
.
.
.
Time Step 1, time = 1.36424e-12, dt = 1.36424e-12
Pre-SMO residual: -nan
Nonlinear solve did not converge due to DIVERGED_FNORM_NAN iterations 0
Solve Did NOT Converge!
Aborting as solve did not converge

Time Step 1, time = 1e-12, dt = 1e-12
Pre-SMO residual: -nan
Nonlinear solve did not converge due to DIVERGED_FNORM_NAN iterations 0
Solve Did NOT Converge!
Aborting as solve did not converge

*** ERROR ***
The following error occurred in the TimeStepper 'ConstantDT' of type ConstantDT.

Solve failed and timestep already at or below dtmin, cannot continue!

Abort(1) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0

It would be helpful if someone could point out the issue. Thank you!

skvarjun added the p: normal and t: task labels Sep 14, 2024
sapitts (Collaborator) commented Sep 16, 2024

@cticenhour do you think this issue might be related to the lower-dimensional projection discussion in MOOSE (idaholab/moose#28599)?

cticenhour (Member) commented

Without taking a closer look, that's a decent guess.
