Add `payu sync` cmd for syncing archive to a remote directory #360
Closes #200
4c5f149 to a05168e
Hello @jo-basevi! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2023-11-02 00:36:33 UTC
Extra Notes/Questions

Syncing and postscript post-processing
Currently, if post-script is configured AND automatic syncing is enabled, the latest output (and latest restarts, if syncing restarts is enabled) will not be synced. To manually sync the latest outputs after the post-script job has completed, run:

Instead of post-script, to automatically ensure post-processing happens before syncing, there's a sync

If post-processing needs to run on a different queue to the sync job (for example, if syncing to a remote machine needs network access and post-processing needs >1 NCPUs), or it requires significant resources, an option could be to call

If the above won't work, an option could be to add to payu a post-processing command that submits a post-processing job, which runs any shell/python post-processing userscripts and then either runs

Local delete options
There is a config option to remove local files once synced. As this leaves behind empty directories and files excluded from rsync, there is another config option to remove local restart/output directories. This also happens only after the directory has been successfully synced. The latest output (and the last "permanently-saved" restart and subsequent restarts) are protected from both local delete options. Any additional configured paths to sync in

A procedure to remove all local files and directories, including last outputs/restarts, could be:
Syncing to remote machine
I've attempted to go through the guide above and rsync small files from gadi to my local desktop, but haven't had any success. So I've left the remote sync options (e.g

Uncollated files
Currently,
Just checking - the current sync script rsyncs all outputs (not just the latest) every time it's called, to handle asynchronous collation. Does
Yes, it rsyncs all the output directories every time it's called. I had looked into rsyncing only the current output (or the current/prior restart) if collation is enabled, but realised there wasn't much point, as rsync can quickly pass over directories in archive that are already synced anyway.
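The point about rsync passing over already-synced directories can be illustrated with a minimal sketch of how such a command might be assembled. The helper name `build_rsync_command`, the `output*` directory pattern, and the choice of flags are assumptions for illustration, not payu's actual implementation; only the standard rsync options `-a` and `--exclude` are used.

```python
from pathlib import Path


def build_rsync_command(archive_dir, dest, excludes=()):
    """Build an rsync command covering every output directory.

    Hypothetical helper, not part of payu.
    """
    archive = Path(archive_dir)
    # All output directories are passed on every call. By default rsync
    # compares file size and modification time, so files that are already
    # at the destination and unchanged are skipped rather than re-sent.
    sources = sorted(p for p in archive.glob("output*") if p.is_dir())

    cmd = ["rsync", "-a"]
    for pattern in excludes:
        cmd += ["--exclude", pattern]
    cmd += [str(p) for p in sources]
    cmd.append(str(dest))
    return cmd
```

The returned list could then be handed to `subprocess.run(cmd, check=True)`; it is built as a list rather than a string so that paths containing spaces need no shell quoting.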
This is quite a complex thing to achieve in a legacy codebase, so good work, thanks.
I do appreciate the thorough testing!
There are a few minor changes which I think should be addressed, but otherwise it looks great.
Thanks @aidanheerdegen for the review! I've just added in the requested changes. I've tested it locally but will have to wait until gadi is back up and running to run through a couple of actual model output tests.
Seemed to want another review, so .... LGTM!
- Extended postprocess() to run `payu sync`, if syncing is enabled
- Automatically check sync path for storage paths

- If syncing restarts is enabled, only sync restarts that will be permanently archived
- Add config options for syncing to remote machine
- Add payu sync cmd options to sync all restarts
- Add tests
- Change logic to automatically sync all output (rather than just the latest output)

- Update storage path check to look for sync path in config.yaml
- Add options for local delete of files/dirs after syncing
- Add protected paths in get_archive_paths_to_sync. This is to protect the last output, and last saved restart (needed for date-based restart pruning), from delete local options
- remove_local_files config flag for removing local files once synced
- remove_local_dirs config flag for removing local restart/output dirs once synced. This will remove any empty dirs after the rsync operation and any files that were excluded from rsync.
- Add excludes options
- Add single or list options to extra paths to sync and exclude
- Add documentation for sync configuration options and usage
- Add runlog option to sync which defaults to True
- Remove hyperthreading in sync command, and explicitly add a default walltime
- Raise error when sync path is not defined
- Remove sync ssh keys
- Add flag for syncing uncollated files which defaults to True when collation is enabled.
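The protection rule described in these commits (keep the last output, plus the last permanently saved restart and any restarts after it) can be sketched as follows. This is an illustration under stated assumptions, not the code in get_archive_paths_to_sync: the helper name `split_protected` and its arguments are hypothetical, directory names are assumed to be zero-padded so that string order matches run order, every listed directory is assumed to have been synced already, and keeping all restarts when none has been permanently saved is a guess at sensible behaviour.

```python
def split_protected(outputs, restarts, permanent_restarts):
    """Return (deletable, protected) directory names.

    Hypothetical helper, not part of payu. `outputs` and `restarts` are
    directory names such as "output003"; `permanent_restarts` is the set
    of restart names that will be kept permanently.
    """
    outputs = sorted(outputs)
    restarts = sorted(restarts)

    # The latest output is never deleted locally.
    protected = set(outputs[-1:])

    # Protect the last permanently saved restart and everything after it.
    saved = [r for r in restarts if r in permanent_restarts]
    if saved:
        protected.update(r for r in restarts if r >= saved[-1])
    else:
        # Assumption: with no saved restart yet, keep every restart.
        protected.update(restarts)

    deletable = [p for p in outputs + restarts if p not in protected]
    return deletable, sorted(protected)
```

For three outputs and five restarts where `restart000` and `restart002` are permanent, the sketch marks `output002` and `restart002` onwards as protected and everything earlier as deletable.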
472b9f9 to 9ac075a
I've tested that it all works on gadi; I just added one line in the tests to disable attempting to sync the runlog. I also tidied up the commit history.
Still LGTM!
Add a new payu command to sync output to a remote directory. I've extended expt.postprocess(), which runs either after collation or, if not collating, at the end of archive, to also call 'payu sync' if syncing is enabled.
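The ordering described above can be summarised in a small sketch. The function name `sync_runs_after` and the stage labels are hypothetical and only illustrate the decision; they are not taken from the payu code.

```python
def sync_runs_after(collate_enabled, sync_enabled):
    """Return the stage after which `payu sync` would be called, or None.

    Hypothetical helper: the stage names "collate" and "archive" are
    labels chosen for this illustration.
    """
    if not sync_enabled:
        return None
    # With collation on, syncing waits for the collate step; otherwise it
    # follows directly at the end of archive.
    return "collate" if collate_enabled else "archive"
```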
An example config currently is
COSIMA's sync_data.sh (https://github.com/COSIMA/1deg_jra55_iaf/blob/master/sync_data.sh) does some extra processing, e.g. it runs a run_summary Python script and concatenates ice daily files.
An example postprocess_archive.sh for 1deg_jra55_iaf could look like
Edited: 2/12/23