-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding experiment uuid, metadata and branch support #384
Adding experiment uuid, metadata and branch support #384
Conversation
Hello @jo-basevi! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2023-12-22 01:00:15 UTC |
Metadata fileI like the idea of using
Makes sense.
Do we need a
To delete all the lab stuff:
So probably need a nuclear option:
with a serious warning/confirmation dialog. Continuation of pre-existing experiments
This is interesting, a "crossing of the rubicon" as I don't believe we've tried to write to Do we want to go here? Could we use the restart manifest for this? Small point, could
I don't have a problem with this, but it does require synchronisation of multiple bits of information so The It might make sense to just check if the corresponding
Again could
We already have
I think this is an important feature to include in the first iteration of the branching model. Maybe that was always the intention, and if so I would agree. Third party libsshortuuidIn your test It was clear that short uuid was a lot better for avoiding collisions with truncated hashes. In terms of what is acceptable, I think 1 in 20 is probably too high (shortuuid length 4), but 1 in 1000 seems fine to me (shortuuid 5) and 1 in 500000 for shortuuid 6 is probably overkill. Using the same criteria then uuid4 would need to be at least length 7 (1 in 500). So the decision to use shortuuid comes down to 5 v 7 characters in the shortened URL, adding another dependency, does it have to be supported? (not sure it matters), and maybe legibility? Will it help users if it is easier to remember? Not sure which is easier TBH: 5 mixed case letters and numbers, or 7 numbers. gitpythonSounds like a good idea. Makes the code much simpler and more robust. ruamel.yamlOnly really necessary if we need to modify So .. I haven't had a chance to have a good look at the code yet, this is mostly just commenting on your comprehensive and well structured summary of the problem and your solution. |
Thanks @aidanheerdegen for those comments!
I've added in extra error messages if the branch already exists, to remove the Let me know if you still think its better to leave
Would that just remove a local branch from git?
This may not be needed as currently if there's a branch-uuid aware experiment name in metadata, payu sweep --hard will remove what is in the lab archive under that experiment name. Is the intention of this command so it can be run from another branch?
Thanks, that's good to know - I'll have a look into that.
Yeah I agree, a lot would go wrong. I somewhat hope that
I am not sure on that as I think the default branch should be the main branch of the forked repo. Also having branch name equal control directory, would make the experiment name start with
Yup, I can change that.
Ideally |
This all sounds very sensible and helps the user a lot, so agree 100%
You've convinced me. Keep it.
I was trying to think of how to undo remove a branch and clean up artefacts, but this doesn't seem like a very well thought through suggestion.
Good point.
My thinking was that there was a high likelihood of creating a bunch of branches, with corresponding
Could we repurpose the Another option is to take
Thanks. You make a good point about it being a pretty unusual case.
Sorry, when I said "infer BRANCH_NAME from the CONTROL_DIR" I should have said "infer BRANCH_NAME from information in the CONTROL_DIR, such as the
Tah. If you don't agree though feel free to push back.
Good point. TBH I've always thought it was an omissions for If you think it is reasonable to add that to
I've already piled so much other stuff into this, this is my concession to trying to simplify it! :) But I do think if we can not make another new command it is probably worth (not) doing so. Perhaps if a user wants a new UUID, running |
Ok I’ve had a good look into using restart manifests and the reproduce framework. If I was able to get a restart manifest created in Currently Anyway, I think no matter how I do this (using
I’m not sure how that would work as the
Ok, yeah I am thinking that option might need a few user prompts of
Yeah, I can have a look into that and I'll remove that extra |
As it is currently implemented that is true. I was thinking about changing that ... but the logic of how to do that does my head in.
Oh dear. I didn't remember (or realise) it was being abused quite so much.
Symlinks is how people have done it in the past, and probably still do so. Ok. It seems too difficult/convoluted/potentially dangerous to do anything other than set Sorry to add noise/confusion.
Good point. It is possible to interrogate a file in another branch without checking out the whole branch (can checkout to stdout and read into a buffer for example). I'm not saying that is a good idea, just that it is possible. But your original point is a good one. This sounds difficult so should be fine to just warn if there are inconsistency and leave it at that.
👍
If you think that sounds do-able and a good idea then sounds good to me. |
Thanks again for all the feedback! As discussed, I will park the work for the feature to add in options to sweep all branches in a local repo. The adding metadata to payu init isn't urgent either I think as metadata setup is run in Experiment initialisation in I think the code is ready for review though I do intend to add documentation and more tests for the metadata class. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly there isn't much that actually needs changing (few typos mostly), so I'll mark as approved, but if you do make changes you'll need another approval I guess.
The access-nri-intake-catalog has a schema for experiments https://github.com/ACCESS-NRI/schema/blob/main/experiment_asset.json {
"$schema": "https://json-schema.org/draft/2020-12/schema",
"type": "object",
"properties": {
"name": {
"type": "string",
"description": "The name of the experiment (string)"
},
"experiment_uuid": {
"type": "string",
"format": "uuid",
"description": "Unique uuid for the experiment (string)"
},
"description": {
"type": "string",
"description": "Short description of the experiment (string, < 150 char)"
},
"long_description": {
"type": "string",
"description": "Long description of the experiment (string)"
},
"model": {
"type": "array",
"items": {"type": ["string", "null"]},
"description": "The name(s) of the model(s) used in the experiment (string)"
},
"realm": {
"type": "array",
"items": {
"oneOf": [
{"type": "null"},
{
"type": "string",
"enum": [
"aerosol",
"atmos",
"atmosChem",
"land",
"landIce",
"none",
"ocean",
"ocnBgchem",
"seaIce",
"unknown"
]
}
]
},
"description": "The realm(s) included in the experiment (string)"
},
"frequency": {
"type": "array",
"items": {
"oneOf": [
{"type": "null"},
{
"type": "string",
"oneOf": [
{
"pattern": "^fx$"
},
{
"pattern": "^subhr$"
},
{
"pattern": "^\\d+hr$"
},
{
"pattern": "^\\d+day$"
},
{
"pattern": "^\\d+mon$"
},
{
"pattern": "^\\d+yr$"
},
{
"pattern": "^\\d+dec$"
}
]
}
]
},
"description": "The frequency(/ies) included in the experiment (string)"
},
"variable": {
"type": "array",
"items": {
"type": ["string", "null"]
},
"description": "The variable(s) included in the experiment (string)"
},
"nominal_resolution": {
"type": "array",
"items": {"type": ["string", "null"]},
"description": "The nominal resolution(s) of model(s) used in the experiment (string)"
},
"version": {
"type": ["number", "string", "null"],
"description": "The version of the experiment (number, string)"
},
"contact": {
"type": ["string", "null"],
"description": "Contact name for the experiment (string)"
},
"email": {
"type": ["string", "null"],
"description": "Email address of the contact for the experiment (string)"
},
"created": {
"type": ["string", "null"],
"description": "Initial creation date of experiment (string)"
},
"reference": {
"type": ["string", "null"],
"description": "Citation or reference information (string)"
},
"license": {
"type": ["string", "null"],
"description": "License of the experiment (string)"
},
"url": {
"type": ["string", "null"],
"description": "Relevant url, e.g. github repo for experiment configuration (string)"
},
"parent_experiment": {
"type": ["string", "null"],
"description": "experiment_uuid for parent experiment if appropriate (string)"
},
"related_experiments": {
"type": "array",
"items": {
"type": ["string", "null"]
},
"description": "experiment_uuids for any related experiment(s) (string)"
},
"notes": {
"type": ["string", "null"],
"description": "Additional notes (string)"
},
"keywords": {
"type": "array",
"items": {
"type": ["string", "null"]
},
"description": "Keywords to associated with experiment (string)"
}
},
"required": [
"name",
"experiment_uuid",
"description",
"long_description"
],
"additionalProperties": false
} We should try and match that as much as possible. So that would mean |
Ok that is really good to know! I'll need to do a few name updates. |
Hi @dougiesquire, I have a few questions on the intake catalogue fields, is that's ok? Firstly, are there any fields that are auto-generated from In this branch, we are intending adding a Is In this branch, if
I could also add in the Should Sorry this has turned into a lot of questions! Just want to make sure I'm not going to screw up the |
Yeah that sounds fine I think. It is certainly more standard and 8 characters gives enough randomness I think. |
Hi @jo-basevi. Thanks for checking in. I'll do my best to answer your questions below.
Yes. Basically if any required metadata is not available within a given datastore, the catalog will look for it in the
Yes. We'll (obviously) just need to be sure that the
In the past, the
Great! The model name is not enforced by any sort of controlled vocab, but it would be good to be consistent where possible for good search functionality. There are already ACCESS-OM2 and ACCESS-ESM1-5 experiments in the catalog with the
Sounds like a good idea. Then people might fill in the fields that aren't auto-populated. It might be best to do this directly from the schema, rather than from
Just the base output directory. Let me know if you have any further questions. |
Thank you @dougiesquire for answering those questions!
I've added that the metadata file will be copied over to base archive directory.
Ok, I'll add in a default name for the metadata.
That is good to know! I won't use
Thanks, I'll look into using the schema instead. Thanks again for answering those questions! I'll be sure to ask more when I have more questions |
@aidanheerdegen since the last review I've added a few changes, with the main ones generating an experiment name on metadata.setup rather than using a set value in metadata.yaml and adding in additional metadata fields. Generating Experiment namesFor new branches or clones created using payu, it will generate a new UUID and branch-uuid experiment name. Then it will copy the metadata file over to a new archive directory. On The For development branches or branches that will be merged, could use Generating additional metadata fieldsIf a new UUID has been generated, and if there was a pre-existing UUID in metadata, it'll record the UUID of the parent experiment and also the current commit hash as suggested in this issue #387. If a new UUID has been generated and if it's a branch-uuid aware experiment, it will add the created date, experiment name, and git URL of the origin remote if it exists. This will run when If new UUID was generated and it called from a
Happy to remove that feature or improve it if there are suggestions! TODO:
|
I've added Thinking of adding the following or similar to properties:
|
I think we might have multiple definitions of "parent experiment". I could be wrong, but I think you are defining the parent experiment as the experiment configuration from which the current experiment is branched using git. I was thinking of parent experiment as the model run that provides the restart(s) for the current experiment (it's common to run a long "control" experiment and then kick off "perturbation" experiments from this run - confusingly, this is also called "branching"). Perhaps we need fields for both of these definitions? But to answer your question, yes a PR is the way to go. Feel free to suggest any other edits that you think would make the schema more generally usable. |
That is good point, thanks for raising that. I’ll bring that up with @aidanheerdegen when he comes into the office tomorrow. |
I like the idea of providing this sort of information by default to remove a (boring) burden from users. Any suggestions for how to provide this model name : intake catalogue model name mapping? |
@jo-basevi and I had a chat and agree that your characterisation is correct, and that yes a parent experiment suggests using restarts from the parent experiment. Jo pointed out we could add logic to only add a parent experiment in the case where a restart directory/experiment is specified.
The normal git commit log is sufficient for tracking the relationships between related configurations, i.e. not experiments. |
Me too. Is the |
It used to be used for some ACCESS-CM2 stuff back in the day.
We could add a specific model for Or (ad probably better), add a specific So default to capitalised |
Changes since last update:
The logic just looks if there's a metadata file and UUID defined in the parent directory containing the When
E.g.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've run out of time to thoroughly check the tests I'm afraid. If there is anything in particular you want me to look at there let me know.
Most of my suggested changes are pretty minor.
Defaults to the model name, or the capitalised model name? Can we do the latter please :) |
Sorry that's bad description on my part - It currently capitalises the model name before adding it to the metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more things ... forgot to push through for the review.
This includes: - payu checkout and clone commands for interacting with branches - added a metadata class that reads/creates/updates metadata files that run in experiment initialisation and in payu checkout and clone commands - added a payu uuid command for creating new metadata files on existing branches - added git utils file with wrappers around simple git functions - added tests for added code - added type hints to branch and metadata classes
- payu branch -v, or payu branch --verbose, displays all contents of metadata for each branch vs just uuid
- Add test for listing branch metadata (payu branch) - Remove payu uuid subcommand - Add dependencies and entries points to conda meta.yaml - Add usage documentation - Abort adding metadata in checkout if branch does not have config.yaml, add error message and test - Extend payu branch to list if branch has a config file - Extend payu branch to list remote branches, added —remote flag to cmd, added test - Move chdir into a try-finally block when cloning. - Tidy test_branch.py, added functions for common tests, add tests for previous uuid
… on metadata setup. - Change metadata field names to experiment_uuid, parent_experiment - Change UUID to use built-in uuid.uuid4 - Copy metadata file over to archive direcotry - Commit metadata automatically if it's a git repository
- Add git utility methods to class so they share the same repository objects (to avoid creating multiple git repo objects) - Add created date, url and name to new experiments (unless they are pre-existing legacy experiments) - Add template values of metadata fields to new experiments created by payu checkout/clone - Update metadata and payu clone/branch usage documentation - Add parent_experiment metadata logic to use UUID of prior restart file (if defined) - If restart is passed to clone/checkout, also check if there's pre-existing restarts in archive - Add model name field to metadata - Add user and contact to metadata only for on new experiments
- sync metadata.yaml when syncing to remote archive - only update metadata when UUID has changed - add config option to disable generating UUID and metadata files - add check for staged git changes before running checkout in `payu checkout` - add metadata config for model name (if it is different from model driver name) - add documentation for metadata and runlog configuration - refactor payu branch code to return metadata dictionary when parsing commit tree
- add suggested edits to usage and config documentation - refactor list_archive_dirs function out of experiment to fsops - add parent experiment UUID as argument to payu clone - add test for fsops.list_archive_dirs - Add parent_experiment flag as an option to payu checkout
a8c0466
to
db2d0f1
Compare
@aidanheerdegen I've added a test for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work!
This is a big change, with a lot of potential pitfalls and backwards compatibility to worry about. I think the end result is really good. The design is good and the code looks very clean and well written.
I forced merge. There were unresolved conversations, but the commits no longer existed as the commit history had been cleared up, so I couldn't mark them as resolved. |
As discussed in this issue (#330), this adds in support for branches in a control repository and storing metadata (experiment name, uuid, contact/email) in a
metadata.yaml
file. This should close #330 and #191The proposed workflow is explained by @aidan here
Experiment names
Experiment name: Name used for work and archive directories in laboratory (usually in
/scratch/
)Legacy experiment name: Control directory name or
experiment
value inconfig.yaml
Proposed new experiment names:
{CONTROL_DIR}-{UUID-SHORT}
{CONTROL_DIR}-{BRANCH_NAME}-{UUID-SHORT}
Note:
{CONTROL_DIR}
would be replaced with theexperiment
value inconfig.yaml
if it's defined.
Metadata file
Metadata file is checked or created/updated at experiment initialisation as the experiment name is used for archive/work paths in the laboratory directory. The experiment name is added to/read from metadata file: this will either be in the legacy or new name format.
A simple
metadata.yaml
could look like this:Note: If user name and email are set in git config, these will also be added to the metadata file. It won't replace existing name and email unless they are filler values like "Your Name Here" etc - This is so they will work with existing metadata.yaml templates.
Setting up metadata uuid and experiment name
If
payu setup
is run with nouuid
in metadata, it will:metadata.yaml
. This is so it'll work with pre-existing experiments.If
payu setup
is run with a pre-existing metadata file, it'll run the following checks:experiment
config has changedexperiment
config has changed.payu setup
will only commit the metadata file if runlog is set totrue
. This is to support experiments that aren't git repositories, or development branches. If usingpayu checkout
,payu clone
orpayu uuid
commands, they will commit changes to the metadata file. This is to keep track of whenuuid
has changed.New Workflow: Create new experiments from either of the following:
payu clone
without the--keep-uuid
flagpayu checkout -b NEW_BRANCH_NAME
, where-b
flag is for create a new branch- If created control directory from scratch, to generate newuuid
and experiment names, runpayu uuid
New Workflow: Continuation of pre-existing experiments - keep
uuid
unchanged:payu checkout BRANCH
payu checkout BRANCH --start-from-restart RESTART_PATH
.payu clone --keep-uuid --branch BRANCH_NAME FORKED_REPO CONTROL_DIR
.payu clone --keep-uuid --branch BRANCH_NAME --start-from-restart RESTART_PATH FORKED_REPO CONTROL_DIR
Note:
--keep-uuid
in clone command will leave metadata file unchanged ifuuid
exists. If thename of the control directory has changed, this will raise an error on checks on the experiment name.
Legacy Workflow with
git clone/branch
instead ofpayu clone/branch
:To start a new experiment (create a new uuid) but opt of branch-id-aware experiment names,run
payu uuid --legacy
.Command breakdown
`payu checkout --help`
`payu clone --help`
`payu branch --help`
Third party libs
In the code I've added dependencies on a couple third-party libraries:
Short(ish) Summary:
Only thing for older workflows that'll change is the addition of a metadata file. This will be added in any of the
payu setup/run/collate/sweep
commands. Runningpayu setup
beforepayu run
, would still be useful to run prior topayu run
catch any errors or warnings.To create and switch to a new branch - run
payu checkout -b NEW_BRANCH_NAME
To switch to an existing branch - run
payu checkout BRANCH_NAME
To list branches and metadata, run
payu branch
orpayu branch --verbose
Currently the code uses the
experiment
name that is stored in metadata file, for creating archive and work paths. The alternatives to having an experiment name in metadata and instead to auto-generate them, could be to add a flag in config to use, or not use, branch-uuid aware names. A specific version of payu could also be used with the new version of payu only using branch-uuid aware experiment names but easy access to older versions of payu isn't currently set up.TODO:
test_branch.py
but want to add some smaller unit tests)