
Concatenate CICE daily output #31

Closed

Conversation

anton-seaice (Contributor)

Add script to concatenate daily CICE output into one file per month (still containing daily data) and delete the individual daily files. This relies on adding nco to the payu environment per ACCESS-NRI/payu-condaenv#24.

Following Aidan's suggestion, the script is taken from https://github.com/COSIMA/1deg_jra55_ryf/blob/master/sync_data.sh#L87-L108, and on that basis I haven't tested beyond checking that it concatenates data.

The only change I made was to change the netCDF output type to -4 (netcdf4) instead of -7 (netcdf4_classic).

The ncdump of the output looks correct, i.e. it shows the time and time-bounds dimensions with length 31 (one per day) for January.
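
For example, the check was essentially something like this (illustrative path, using the monthly file name the script produces):

ncdump -h archive/output001/ice/OUTPUT/iceh.1901-01-daily.nc | grep "time ="
# expected to show something like: time = UNLIMITED ; // (31 currently)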

Maybe @aekiss would like to review too?


❌ Automated testing cannot be run on this branch ❌
Source and Target branches must be of the form dev-<config> and release-<config> respectively, and <config> must match between them.
Rename the Source branch or check the Target branch, and try again.

@anton-seaice self-assigned this Mar 22, 2024
@aidanheerdegen (Member) left a comment

It would be good to get an idea of how long it takes to concatenate some of the high res data and post the results in the PR or Issue.

If it's time-consuming, the script could have PBS directives added and run as a postscript which is then submitted to the queue.
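
For reference, a minimal sketch of what that might look like (untested; the queue, resource requests and script name are placeholders, not part of this PR):

#!/usr/bin/env bash
#PBS -q copyq
#PBS -l ncpus=1
#PBS -l mem=4GB
#PBS -l walltime=00:30:00
#PBS -l wd
# ... concatenation commands as in this script ...

with something like postscript: ./concat_ice_daily.sh (hypothetical name) added to config.yaml so payu submits it after each run.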

#concatenate sea-ice daily output
#script inspired from https://github.com/COSIMA/1deg_jra55_ryf/blob/master/sync_data.sh#L87-L108

for d in archive/output*/ice/OUTPUT; do
Member

This loops over all the output directories. If we're running this each time we shouldn't have to do that.

I can see two options:

  1. Determine the most recent output directory and just run there
  2. Invoke this with the run userscript hook and do the concatenation in the work/ice/OUTPUT directory before it is archived.

The issue with option 2 is that there is already a run userscript. I honestly have no idea what would happen if you tried to run two scripts in a single line, say with && or separated with a ;.
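
For illustration, the chained form being discussed might look something like this in config.yaml (untested; script names hypothetical):

userscripts:
    run: ./existing_run_script.sh && ./concat_ice_daily.sh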

Another point: apparently there has been a requirement in the past to concatenate 6-hourly data:

https://github.com/COSIMA/01deg_jra55_iaf/blob/01deg_jra55v140_iaf_cycle4/concat_ice_6hourlies.sh#L6

Is there a way we could accommodate that use case as well in a general way, I wonder.

Contributor (Author)

If we assume that the user might configure the output to be saved at any number of hours, this gets hard ...

We could assume there will always be data saved at 12 hours (i.e. every combination of 1/2/3/4/6/12-hourly output would include a 12-hour file), and then we can find the days with something like $output_dir/iceh*.????-??-01-43200.nc. Messy but probably OK.

The other complexity is that CICE timestamps are at the end of the time period, e.g. with hourly data there is a file named for midnight at the end of the month. So for January, an archive/output001/ice/OUTPUT/iceh_03h.1901-02-01-00000.nc file is made, but this contains January data.

So I don't know how, in Bash, to make a list of all files for a month that handles both those conditions? We would probably need to use a calendar tool?
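
For illustration only, one untested sketch (assuming GNU date and the iceh_03h.* naming above; out_dir, year and month are assumed to be set already):

# list one month of sub-daily files, allowing for CICE's end-of-period timestamps
next=$(date -d "${year}-${month}-01 +1 month" +%Y-%m)
files=()
for f in "${out_dir}"/iceh_03h.${year}-${month}-*.nc; do
    # skip the midnight file on the 1st; it holds the previous month's last record
    [[ $f == *".${year}-${month}-01-00000.nc" ]] || files+=("$f")
done
# the end-of-month record sits in the midnight file dated the 1st of the next month
endfile="${out_dir}/iceh_03h.${next}-01-00000.nc"
[[ -f $endfile ]] && files+=("$endfile")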

Member

Thanks for the thoughtful engagement. It sounds like we should probably shoot for the common use case to begin with and make an issue to update to a more general form at a later date.

Comment on lines 6 to 23
for f in $d/iceh.????-??-01.nc; do
    if [[ ! -f ${f/-01.nc/-IN-PROGRESS} ]] && [[ ! -f ${f/-01.nc/-daily.nc} ]];
    then
        touch ${f/-01.nc/-IN-PROGRESS}
        echo "doing ncrcat -O -L 5 -4 ${f/-01.nc/-??.nc} ${f/-01.nc/-daily.nc}"
        ${PAYU_PATH}/ncrcat -O -L 5 -4 ${f/-01.nc/-??.nc} ${f/-01.nc/-daily.nc} && chmod g+r ${f/-01.nc/-daily.nc} && rm ${f/-01.nc/-IN-PROGRESS}
        if [[ ! -f ${f/-01.nc/-IN-PROGRESS} ]] && [[ -f ${f/-01.nc/-daily.nc} ]];
        then
            for daily in ${f/-01.nc/-??.nc}
            do
                # mv $daily $daily-DELETE # rename individual daily files - user to delete
                rm $daily
            done
        else
            rm ${f/-01.nc/-IN-PROGRESS}
        fi
    fi
done
Member

Suggested change
# Don't error if there are no matching patterns
shopt -s nullglob
# Assuming `$d` contains the directory where the data resides
for first_file in $d/iceh.????-??-01.nc
do
    # Make a list of all files we wish to concatenate
    icefiles=(${first_file/-01.nc/-??.nc})
    if [ ${#icefiles[@]} -gt 0 ]
    then
        iceout="${first_file/-*.nc/-daily.nc}"
        ncrcat -O -L 5 -4 "${icefiles[@]}" ${iceout} && rm "${icefiles[@]}"
    fi
done

Personally I prefer to just delete the files if the return status of the ncrcat command is ok. Making temporary files ends up introducing extra logic to deal with them.

Note the above is untested, just a suggestion for how to reduce the complexity of the logic.

Contributor (Author)

Making temporary files ends up introducing extra logic to deal with them.

This is copied from the COSIMA scripts. I assume the temporary files were needed for some edge case? @aekiss - Do you know why the temporary files were used?

Member

Sorry to badger you @aekiss but I'm also curious if there were cases of data loss that prompted the design you implemented?

Contributor (Author)

Andrew said it was just for sanity checking / debugging in case of failure. I am happy to remove it.

@anton-seaice (Contributor, Author)

It would be good to get an idea of how long it takes to concatenate some of the high res data and post the results in the PR or Issue.

How would one do this? Are there ways to log the PBS "CPU Time" between user scripts?
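
One simple option (a sketch, not something from this PR) would be to bracket the step with timestamps inside the userscript so the duration ends up in the job's stdout log:

start=$(date +%s)
# ... ncrcat concatenation commands ...
echo "ice concatenation took $(( $(date +%s) - start )) seconds"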

@anton-seaice (Contributor, Author)

It would be good to get an idea of how long it takes to concatenate some of the high res data and post the results in the PR or Issue.

If it's time-consuming the script could have PBS directives added and run as a postscript which is then submitted to the queue

With one month of 0.1-degree data, this takes ~2.5 minutes to run on the login node, compared to approx. 1.6 hours of walltime for the model run. (Amazingly it turns 3.6GB into 1.5GB too!)

This doesn't parallelise, and that's ~2-3% of the walltime, so I guess it is worth worrying about?

(Reducing compression to level 1 reduces the time to ~1m50sec, but file size goes up ~6%.)

@jo-basevi (Collaborator)

Instead of using an archive userscript, could we instead use the sync userscript? This will run prior to running any sync commands, and it has the benefit of running on a job with fewer resources. The only con I can think of is that it'll only run when sync is enabled, but I guess that'll be similar to how the post-script sync-data.sh script originally worked.
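
For illustration, the sync-userscript approach might be configured like this in config.yaml (untested; script name hypothetical):

sync:
    enable: true
userscripts:
    sync: ./concat_ice_daily.sh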

@aidanheerdegen (Member)

Amazingly it turns 3.6GB into 1.5GB too!

IKR. There is a reason this is worth doing.

Only con I can think of is that it'll only run when sync is enabled, but I guess that'll be similar to how the post-script sync-data.sh script originally worked

That is a good idea thanks @jo-basevi.

I think it does run into the issue that if someone turns sync on part way through their experiment not all of the ice data files will be smooshed together (technical term). But we can be explicit that users should set sync for this reason (and for other reasons like their data evaporating). We could also give instructions on how to run the command directly on output directories that haven't had their ice files smooshed.

I did wonder if we couldn't define a collate option for CICE that did this work. Ultimately it is a nice idea, but not doable in the time frames we have available and we should plough ahead with using a sync userscript.

@anton-seaice (Contributor, Author)

I think moving to a payu postscript is the best plan: as it runs as a separate PBS job, this reduces the resources held waiting for a single-PE job to complete?

I did wonder if we couldn't define a collate option for CICE that did this work. Ultimately it is a nice idea, but not doable in the time frames we have available and we should plough ahead with using a sync userscript.

I think we might get rid of the need for this step in OM3, or at least remove the grid from the CICE output.

Also - we've added nco as a dependency in some cases. Do we need to document this somewhere (for users who don't use vk83)?

@anton-seaice (Contributor, Author)

It looks like setting this as a postscript would stop the sync from running?

https://payu.readthedocs.io/en/latest/config.html#postprocessing


❌ Automated testing cannot be run on this branch ❌
Source and Target branches must be of the form dev-<config> and release-<config> respectively, and <config> must match between them.
Rename the Source branch or check the Target branch, and try again.

@anton-seaice (Contributor, Author)

@aidan - I have updated based on the review comments. Back to you.

I switched to using the system nco module, rather than adding it to payu-env?

I cleaned up the script to remove the unneeded operations and only check the last archive folder.

@jo-basevi (Collaborator)

Yeah, if postscript is used and sync is enabled, then it won't rsync the latest output, as payu has no idea when the postscript job completes or whether it modifies the current output. It'll still rsync outputs prior to the last one, if they haven't already been synced.

Also, if syncing is enabled, a sync job by default runs on the copyq queue. It starts after the main payu run job, or the collate job if that's enabled, has completed. Its resources can be configured similarly to the collate config, so it can be given more or fewer resources if needed.
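
For example, something along these lines in config.yaml (values illustrative; key names assumed to mirror the collate section, as described above):

sync:
    enable: true
    queue: copyq
    ncpus: 1
    mem: 4GB
    walltime: 00:30:00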

#concatenate sea-ice daily output
#script inspired from https://github.com/COSIMA/1deg_jra55_ryf/blob/master/sync_data.sh#L87-L108

out_dir=$(ls -td archive/output??? | head -1)/ice/OUTPUT #latest output dir only
Contributor (Author)

Suggested change
out_dir=$(ls -td archive/output??? | head -1)/ice/OUTPUT #latest output dir only
out_dir=$(ls -dr archive/output??? | head -1)/ice/OUTPUT #latest output dir only

for f in $out_dir/iceh.????-??-01.nc; do
    #concat daily files for this month
    echo "doing ncrcat -O -L 5 -4 ${f/-01.nc/-??.nc} ${f/-01.nc/-daily.nc}"
    ncrcat -O -L 5 -4 ${f/-01.nc/-??.nc} ${f/-01.nc/-daily.nc}
Contributor (Author)

Suggested change
ncrcat -O -L 5 -4 ${f/-01.nc/-??.nc} ${f/-01.nc/-daily.nc}
${PAYU_PATH}/ncrcat -O -L 5 -4 ${f/-01.nc/-??.nc} ${f/-01.nc/-daily.nc}


modules:
    load:
        - nco/5.0.5
Contributor (Author)

Suggested change
- nco/5.0.5

@aidanheerdegen (Member)

Looking down the barrel of adding this to every config, and not being confident we wouldn't have to update it in the future (see the conversation about 6-hourly concatenation), I've made a new repo and moved the code to a PR there:

ACCESS-NRI/om2-scripts#1

When we've got that merged I'll manually pop it in vk83 (like I did with mppnccombine-fast) and we'll work on a longer term solution later.

Sorry for mucking you about @anton-seaice

@anton-seaice (Contributor, Author)

Ok - no worries. I'll put my changes there. Do we still need to update the config.yaml here?

@aidanheerdegen (Member)

Do we still need to update the config.yaml here?

Maybe we'll leave this open and just update the config.yaml as an exemplar when we have a final location of the concat script.

@aidanheerdegen (Member)

This was superseded by a script in a separate repo, in this PR:

ACCESS-NRI/om2-scripts#1
