Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory Leak Fixes #28

Merged
merged 19 commits into from
Jul 10, 2024
Merged

Memory Leak Fixes #28

merged 19 commits into from
Jul 10, 2024

Conversation

hellkite500
Copy link
Contributor

@hellkite500 hellkite500 commented Jun 13, 2024

Serious memory leaks exist in this implementation, as noted in #27. Running through the llvm address-sanitizer with leak detection, the total memory leaked by the bmi test using

lasam_standalone ./configs/config_lasam_Bushland.txt

is nearly 30 MB!

SUMMARY: AddressSanitizer: 26907277 byte(s) leaked in 512522 allocation(s).

This happens predominantly because the linked list data structure supporting the layered wetting fronts dynamically allocates list elements using malloc, but these are never freed/deleted. These lists are then copied using listCopy, between each time step to track previous state and current state, but once copied, the old memory is inaccessible and "leaks". This is extremely problematic when used in the NextGen model engine (ngen) since we potentially initialize over 800k models, and then run each for a period of time ranging from hours to decades! These memory leaks at that scale quickly exhaust system resources (as evident by the 10 day CONUS tests where #27 was first identified!).

Back of the envelop math implies this could leak up to 24 TB!!!! of memory in only a year long simulation (the length of time of the simulation used in the test above)!

This PR address the bulk of the memory leaks, especially those occurring in the BMI implementation. With these changes, the memory leaks are now minimal and seem to be related mostly to the use of fprintf and scanf and shouldn't be leaking memory for every simulation time step!

SUMMARY: AddressSanitizer: 949 byte(s) leaked in 11 allocation(s).

Care should be taken in future developments and work with this customized linked list to ensure that dynamic memory allocations are properly freed. This PR also shows where other model states and temporaries have dynamic memory allocation and had no corresponding cleanup code. Particularly with dynamic allocation in loops (such as in Update which had nested loops in a function intended to be called in a loop) dynamic memory use must be handled correctly.

Additions

  • listDelete in linked_list.cxx
  • bmi Finalize implemenation
  • destructor for BMI class

Changes

  • Ensure dynamic memory allocations are deleted before the pointers are lost

Testing

  1. Ran the standalone code with address sanitizer to confirm fixed memory leaks

Notes

  • For some weird reason, running the standalone test with this PR code produces the output twice, but the values are the same and consistent with the output variables run from master before these changes are applied.

Todos

  • Noted a couple other odd things in comments in the code for anyone curious.

Checklist

  • PR has an informative and human-readable title
  • Changes are limited to a single goal (no scope creep)
  • Code can be automatically merged (no conflicts)
  • Code follows project standards (link if applicable)
  • Passes all existing automated tests
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future todos are captured in comments
  • Project documentation has been updated (including the "Unreleased" section of the CHANGELOG)
  • Reviewers requested with the Reviewers tool ➡️

Target Environment support

  • Linux
  • MacOS

Merging this PR closes #27.

@hellkite500 hellkite500 force-pushed the mem-fix branch 2 times, most recently from 34c7b5b to 2288e90 Compare June 13, 2024 02:57
@hellkite500 hellkite500 added the bug Something isn't working label Jun 13, 2024
@stcui007
Copy link
Contributor

I just cloned the PR code and did a run using Bushland and Phillipsburg initial configs.
Valgrind report for Bushland:

==106320== LEAK SUMMARY:
==106320==    definitely lost: 0 bytes in 0 blocks
==106320==    indirectly lost: 0 bytes in 0 blocks
==106320==      possibly lost: 0 bytes in 0 blocks
==106320==    still reachable: 1,824 bytes in 27 blocks
==106320==         suppressed: 0 bytes in 0 blocks

For Phillipsburg:

==107020== LEAK SUMMARY:
==107020==    definitely lost: 6,701,496 bytes in 279,229 blocks
==107020==    indirectly lost: 0 bytes in 0 blocks
==107020==      possibly lost: 48 bytes in 2 blocks
==107020==    still reachable: 2,424 bytes in 52 blocks
==107020==         suppressed: 0 bytes in 0 blocks

Don't understand why the initial config causes the difference.

@stcui007
Copy link
Contributor

I put the full report here for convenience:

 valgrind --leak-check=full ./build/lasam_standalone configs/config_lasam_Phillipsburg.txt
==110617== Memcheck, a memory error detector
==110617== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==110617== Using Valgrind-3.14.0 and LibVEX; rerun with -h for copyright info
==110617== Command: ./build/lasam_standalone configs/config_lasam_Phillipsburg.txt
==110617==
==110617== Conditional jump or move depends on uninitialised value(s)
==110617==    at 0x5E2E852: __ieee754_pow_sse2 (e_pow.c:75)
==110617==    by 0x5E403EB: pow (w_pow.c:27)
==110617==    by 0x4172E0: calc_theta_from_h(double, double, double, double, double, double) (soil_funcs.cxx:149)
==110617==    by 0x41530B: lgar_read_vG_param_file(char const*, int, double, soil_properties_*) (lgar.cxx:2299)
==110617==    by 0x40EC95: InitFromConfigFile(std::string, model_state*) (lgar.cxx:542)
==110617==    by 0x40C935: lgar_initialize(std::string, model_state*) (lgar.cxx:77)
==110617==    by 0x40748C: BmiLGAR::Initialize(std::string) (bmi_lgar.cxx:39)
==110617==    by 0x4038B6: main (bmi_main_lgar.cxx:50)
==110617==
==110617== Conditional jump or move depends on uninitialised value(s)
==110617==    at 0x5E2E88A: __ieee754_pow_sse2 (e_pow.c:86)
==110617==    by 0x5E403EB: pow (w_pow.c:27)
==110617==    by 0x4172E0: calc_theta_from_h(double, double, double, double, double, double) (soil_funcs.cxx:149)
==110617==    by 0x41530B: lgar_read_vG_param_file(char const*, int, double, soil_properties_*) (lgar.cxx:2299)
==110617==    by 0x40EC95: InitFromConfigFile(std::string, model_state*) (lgar.cxx:542)
==110617==    by 0x40C935: lgar_initialize(std::string, model_state*) (lgar.cxx:77)
==110617==    by 0x40748C: BmiLGAR::Initialize(std::string) (bmi_lgar.cxx:39)
==110617==    by 0x4038B6: main (bmi_main_lgar.cxx:50)
==110617==
==110617== Conditional jump or move depends on uninitialised value(s)
==110617==    at 0x5E2E8B0: __ieee754_pow_sse2 (e_pow.c:95)
==110617==    by 0x5E403EB: pow (w_pow.c:27)
==110617==    by 0x4172E0: calc_theta_from_h(double, double, double, double, double, double) (soil_funcs.cxx:149)
==110617==    by 0x41530B: lgar_read_vG_param_file(char const*, int, double, soil_properties_*) (lgar.cxx:2299)
==110617==    by 0x40EC95: InitFromConfigFile(std::string, model_state*) (lgar.cxx:542)
==110617==    by 0x40C935: lgar_initialize(std::string, model_state*) (lgar.cxx:77)
==110617==    by 0x40748C: BmiLGAR::Initialize(std::string) (bmi_lgar.cxx:39)
==110617==    by 0x4038B6: main (bmi_main_lgar.cxx:50)
==110617==
==110617== Conditional jump or move depends on uninitialised value(s)
==110617==    at 0x5E2E8C7: __ieee754_pow_sse2 (e_pow.c:95)
==110617==    by 0x5E403EB: pow (w_pow.c:27)
==110617==    by 0x4172E0: calc_theta_from_h(double, double, double, double, double, double) (soil_funcs.cxx:149)
==110617==    by 0x41530B: lgar_read_vG_param_file(char const*, int, double, soil_properties_*) (lgar.cxx:2299)
==110617==    by 0x40EC95: InitFromConfigFile(std::string, model_state*) (lgar.cxx:542)
==110617==    by 0x40C935: lgar_initialize(std::string, model_state*) (lgar.cxx:77)
==110617==    by 0x40748C: BmiLGAR::Initialize(std::string) (bmi_lgar.cxx:39)
==110617==    by 0x4038B6: main (bmi_main_lgar.cxx:50)
==110617==
==110617== Conditional jump or move depends on uninitialised value(s)
==110617==    at 0x5E2ED91: __ieee754_pow_sse2 (e_pow.c:96)
==110617==    by 0x5E403EB: pow (w_pow.c:27)
==110617==    by 0x4172E0: calc_theta_from_h(double, double, double, double, double, double) (soil_funcs.cxx:149)
==110617==    by 0x41530B: lgar_read_vG_param_file(char const*, int, double, soil_properties_*) (lgar.cxx:2299)
==110617==    by 0x40EC95: InitFromConfigFile(std::string, model_state*) (lgar.cxx:542)
==110617==    by 0x40C935: lgar_initialize(std::string, model_state*) (lgar.cxx:77)
==110617==    by 0x40748C: BmiLGAR::Initialize(std::string) (bmi_lgar.cxx:39)
==110617==    by 0x4038B6: main (bmi_main_lgar.cxx:50)
==110617==

*********************************************************
-------------------- Simulation Summary -----------------
------------------------ Mass balance -------------------
Initial water in soil     =  45.1158503556 cm
Total precipitation       =  99.0854000000 cm
Total infiltration        =  88.8044496039 cm
Final water in soil       =  45.6315405858 cm
Surface ponded water      =   0.0000000000 cm
Surface runoff            =  10.2809503961 cm
GIUH runoff               =  10.2809503961 cm
Total percolation         =   0.0000000000 cm
Total AET                 =  88.2887593432 cm
Total PET                 = 147.9532940266 cm
Total discharge (Q)       =  10.2809503961 cm
Vol change (calibration)  =   0.0000000000 cm
Global balance            =   3.057077e-08 cm

*********************************************************
-------------------- Simulation Summary -----------------
------------------------ Mass balance -------------------
Initial water in soil     =  45.1158503556 cm
Total precipitation       =  99.0854000000 cm
Total infiltration        =  88.8044496039 cm
Final water in soil       =  45.6315405858 cm
Surface ponded water      =   0.0000000000 cm
Surface runoff            =  10.2809503961 cm
GIUH runoff               =  10.2809503961 cm
Total percolation         =   0.0000000000 cm
Total AET                 =  88.2887593432 cm
Total PET                 = 147.9532940266 cm
Total discharge (Q)       =  10.2809503961 cm
Vol change (calibration)  =   0.0000000000 cm
Global balance            =   3.057077e-08 cm
Time                      =   80.11 sec
==110617==
==110617== HEAP SUMMARY:
==110617==     in use at exit: 6,703,968 bytes in 279,283 blocks
==110617==   total heap usage: 797,221 allocs, 517,938 frees, 39,024,780 bytes allocated
==110617==
==110617== 24 bytes in 1 blocks are possibly lost in loss record 12 of 35
==110617==    at 0x4C29EC3: malloc (vg_replace_malloc.c:309)
==110617==    by 0x411BC6: lgar_move_wetting_fronts(double, double*, int, double, int, double*, double*, int*, double*, wetting_front**, wetting_front*, soil_properties_*) (lgar.cxx:1226)
==110617==    by 0x408932: BmiLGAR::Update() (bmi_lgar.cxx:357)
==110617==    by 0x403F9D: main (bmi_main_lgar.cxx:129)
==110617==
==110617== 24 bytes in 1 blocks are possibly lost in loss record 13 of 35
==110617==    at 0x4C29EC3: malloc (vg_replace_malloc.c:309)
==110617==    by 0x411BE4: lgar_move_wetting_fronts(double, double*, int, double, int, double*, double*, int*, double*, wetting_front**, wetting_front*, soil_properties_*) (lgar.cxx:1227)
==110617==    by 0x408932: BmiLGAR::Update() (bmi_lgar.cxx:357)
==110617==    by 0x403F9D: main (bmi_main_lgar.cxx:129)
==110617==
==110617== 5,736 bytes in 239 blocks are definitely lost in loss record 32 of 35
==110617==    at 0x4C29EC3: malloc (vg_replace_malloc.c:309)
==110617==    by 0x411BC6: lgar_move_wetting_fronts(double, double*, int, double, int, double*, double*, int*, double*, wetting_front**, wetting_front*, soil_properties_*) (lgar.cxx:1226)
==110617==    by 0x4083EA: BmiLGAR::Update() (bmi_lgar.cxx:268)
==110617==    by 0x403F9D: main (bmi_main_lgar.cxx:129)
==110617==
==110617== 5,736 bytes in 239 blocks are definitely lost in loss record 33 of 35
==110617==    at 0x4C29EC3: malloc (vg_replace_malloc.c:309)
==110617==    by 0x411BE4: lgar_move_wetting_fronts(double, double*, int, double, int, double*, double*, int*, double*, wetting_front**, wetting_front*, soil_properties_*) (lgar.cxx:1227)
==110617==    by 0x4083EA: BmiLGAR::Update() (bmi_lgar.cxx:268)
==110617==    by 0x403F9D: main (bmi_main_lgar.cxx:129)
==110617==
==110617== 3,344,688 bytes in 139,362 blocks are definitely lost in loss record 34 of 35
==110617==    at 0x4C29EC3: malloc (vg_replace_malloc.c:309)
==110617==    by 0x411BC6: lgar_move_wetting_fronts(double, double*, int, double, int, double*, double*, int*, double*, wetting_front**, wetting_front*, soil_properties_*) (lgar.cxx:1226)
==110617==    by 0x408932: BmiLGAR::Update() (bmi_lgar.cxx:357)
==110617==    by 0x403F9D: main (bmi_main_lgar.cxx:129)
==110617==
==110617== 3,345,216 bytes in 139,384 blocks are definitely lost in loss record 35 of 35
==110617==    at 0x4C29EC3: malloc (vg_replace_malloc.c:309)
==110617==    by 0x411BE4: lgar_move_wetting_fronts(double, double*, int, double, int, double*, double*, int*, double*, wetting_front**, wetting_front*, soil_properties_*) (lgar.cxx:1227)
==110617==    by 0x408932: BmiLGAR::Update() (bmi_lgar.cxx:357)
==110617==    by 0x403F9D: main (bmi_main_lgar.cxx:129)
==110617==
==110617== LEAK SUMMARY:
==110617==    definitely lost: 6,701,376 bytes in 279,224 blocks
==110617==    indirectly lost: 0 bytes in 0 blocks
==110617==      possibly lost: 48 bytes in 2 blocks
==110617==    still reachable: 2,544 bytes in 57 blocks
==110617==         suppressed: 0 bytes in 0 blocks
==110617== Reachable blocks (those to which a pointer was found) are not shown.
==110617== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==110617==
==110617== For counts of detected and suppressed errors, rerun with: -v
==110617== Use --track-origins=yes to see where uninitialised values come from
==110617== ERROR SUMMARY: 96 errors from 11 contexts (suppressed: 0 from 0)

@hellkite500
Copy link
Contributor Author

@stcui007 That config triggered a slightly different code path which had some temporary allocations that weren't free'd correctly. I ran the config and was able to fix up that leak as well in e652bb2.

@stcui007
Copy link
Contributor

stcui007 commented Jun 13, 2024 via email

src/bmi_lgar.cxx Outdated Show resolved Hide resolved
src/bmi_lgar.cxx Outdated Show resolved Hide resolved
src/linked_list.cxx Outdated Show resolved Hide resolved
src/main.cxx Outdated Show resolved Hide resolved
Copy link
Contributor

@peterlafollette peterlafollette left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work everyone! Great that the memory leak issues have been so thoroughly identified / addressed. In future developments, we'll make sure we free dynamically allocated memory.

I've tested the changed code in standalone and in ngen, running multiple catchments from the same realization, and for the most part things seem to be good. The one change I would suggest is removing global_mass_balance from bmi_main_lgar.cxx because it is also called in Finalize (see my comment), so the simulation results are currently printed twice (at least in standalone mode). Please also note that there has been an accepted PR since this one, so I think rebasing might be necessary. Aside from these minor things, this looks great.

@stcui007
Copy link
Contributor

stcui007 commented Jul 3, 2024 via email

@peterlafollette
Copy link
Contributor

peterlafollette commented Jul 3, 2024 via email

@stcui007
Copy link
Contributor

stcui007 commented Jul 3, 2024 via email

hellkite500 and others added 4 commits July 10, 2024 13:16
Co-authored-by: Phil Miller - NOAA <[email protected]>
* adding the capability for an adaptive time step. If not provided in the config file, it defaults to false.

* updating readme

* updating readme

* updating giuh fxn with adaptive time step

* giuh now uses a fixed time step that is hardcoded to be the the same as the smallest adaptive time step for the rest of the model. If the adaptive time step is off, the giuh time step will be the same as the timestep input in the config file.

* updating readme in configs

* cleaned up code that was used for debugging

* updating Phillipsburg config file

* cleaned up code that was used for debugging

* After chatting with Ahmad, we determined that it is best if the adaptive time step strategy does not alter the function in giuh.c because this function is used in multiple models. So, the adaptive time step method has been altered such that the giuh is computed at the hourly time step rather than at the sub timestep. This allows for the giuh fxn to be unaltered, and has the added bonus of not reqiring resampling for the giuh based on time step.

* updating in response to comments, mostly focused on removing hardcoding from min and max time step values

* more changes in response to comments, including formatting of giuh.c, removing hardcoding for internal time steps and replacing with user specified minimum time step, and establishing a maximum time step equal to the forcing resolution

* removing code that was commented out

---------

Co-authored-by: Peter La Follette <[email protected]>
Co-authored-by: Peter La Follette <[email protected]>
@hellkite500
Copy link
Contributor Author

This reminds me a similar thing. It seems that in lgar.cxx, it prints our "Mass balance" info every time step. For multi-processing jobs it does this for every process. This becomes an issue for large scale processing, on CONUS, for example, not only slows down the computation but also write out a big file if you use nohup. Just wonder if this part can be optionally made inactive when running in the framework.

This should be tracked in a separate issue.

Copy link
Contributor

@peterlafollette peterlafollette left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again great work all. I've tested the updated code in ngen mode (running 1 or multiple catchments from one realization file) and in standalone mode, and the results are all as expected. From my perspective, the branch can be merged to master.

@hellkite500 hellkite500 merged commit a004431 into NOAA-OWP:master Jul 10, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Potential LGAR-C memory leak detected by Valgrind
5 participants