Workflows-as-programs: nextstrain run #407

Open
wants to merge 5 commits into
base: trs/grab-bag

tsibley
Member

@tsibley tsibley commented Nov 5, 2024

based on #419

Adds a new command, nextstrain run, to run (compatible) pathogen workflows in a more managed way with easier update paths, without the need for user-facing Git, with support for multiple versions, and with support for concurrent-but-separate analyses via the same workflow.

Supported by changes to

Try out setting up a pathogen and running it yourself. First, install nextstrain built from this PR:

Linux/macOS:

curl -fsSL --proto '=https' https://nextstrain.org/cli/installer/"$(uname -s)" | bash -s pr-build/407

Windows:

Invoke-Expression "& { $(Invoke-RestMethod https://nextstrain.org/cli/installer/windows) } pr-build/407"

At the moment, the only compatible pathogen is measles at my not-yet-finished demo/prototype branch. Avian flu should not be far behind, though.

Some commands to try:

$ nextstrain setup measles@trs/workflows-as-programs
$ nextstrain version --pathogens
$ nextstrain version --pathogens --verbose
$ nextstrain run measles ingest /tmp/measles-ingest
$ tree -l /tmp/measles-ingest
$ nextstrain run measles phylogenetic /tmp/measles-phylo
$ tree -l /tmp/measles-phylo
$ nextstrain view /tmp/measles-phylo

Note that on any of the updated command documentation pages linked above you can press d to see a colorized diff against the latest non-PR version.

There's a lot of functionality (and polish) here and elsewhere still to do to fully realize the sweeping goals of workflows-as-programs, but this is a fully-usable first piece of the puzzle that can stand on its own for now.

Checklist

  • Checks pass

@huddlej
Contributor

huddlej commented Nov 5, 2024

@tsibley It was really cool to see a working demo of this idea today! A couple of thoughts I had from your lab meeting presentation:

Providing --help documentation with the workflow

You mentioned that workflow authors will need to provide documentation for the workflow interface (config parameters, files, etc.) and (implicitly?) referenced the ncov workflow reference docs as an example of this. One pattern I liked from Snk's implementation of workflows-as-programs is the auto-generation of --help outputs for each workflow based on the config file. Snk generates an argument for each config YAML option in a workflow and the --help flag prints the list of arguments and their types as a way to expose the workflow's interface to users.

Ideally, Nextstrain users could have a similar option like nextstrain run measles --help or nextstrain help measles, etc. that exposes the workflow interface in the same way any other CLI does. I don't know where all of the help details would live in each workflow, but maybe we could discuss this general pattern?

I thought maybe you did some work with this repo to allow you to document nextstrain's interface in one place that can be exposed through --help and through the RTD website. That approach could increase the availability of that documentation without increasing the number of copies we have to maintain.

Maybe avoid overloading the setup command name?

The current prototype in your slides showed nextstrain setup measles as a way to download a copy of that Nextstrain workflow locally for use by the nextstrain run command. This syntax overloads the existing meaning of setup for runtimes (nextstrain setup conda). Since we're thinking about runtimes and workflows as separate entities (for now?), would it be clearer to provide a nextstrain workflow-setup measles interface or something like that?

If we had nextstrain workflow setup measles (this is getting verbose, though), we could also have nextstrain workflow list as a way for users to list available workflows.

Re: providing the nextstrain run interface implies a need to better version workflows

Following up on @genehack's comment at lab meeting, I don't agree that providing this run interface requires us to do a better job of versioning our workflows. Your description of the interfaces suggested that users can set up workflows by commit or tag. This feels like it's just removing the manual git clone step from the user's experience. They still need to run nextstrain update measles to "upgrade" it to the latest commit, so they opt-in to new versions like they would by running git pull.

I do agree that we should do a better job of versioning our workflows independent of how users run them, though!

@tsibley
Member Author

tsibley commented Nov 8, 2024

@huddlej

Snk generates an argument for each config YAML option in a workflow and the --help flag prints the list of arguments and their types as a way to expose the workflow's interface to users.

Yeah, I saw that in Snk. I'm not sure about auto-generating CLI options based on the workflow's config. It seems to me to be a more complicated way of providing config.yaml and one that's prone to all the issues that come from duplicating an interface into a different context. And our configs are substantial enough that I'm inclined to push users towards using config files as a best practice rather than lengthy sets of CLI overrides. I'd be interested in discussing it more, though. I can see how we'd use config schemas, which we want anyway, to generate the CLI options.

Maybe avoid overloading the setup command name?

Yeah, I've waffled on this a bit in my thinking. My old notes are full of nextstrain workflow {setup,update,run,…} hypotheticals, but my thinking changed recently. In my notes last week, I wrote:

Is it weird to have nextstrain setup and nextstrain update work for pathogens as well as runtimes? If so, we can introduce separation by nouns, e.g.

nextstrain workflow {setup,update}
nextstrain runtime {setup,update}

but if we're going to preserve backwards compat (and I really think we should), then we'd still have nextstrain {setup,update} aliased to nextstrain runtime {setup,update} for a while (maybe indefinitely) and I wonder if that sort of defeats the point.

My thinking of overloading setup and update to work for pathogens as well as runtimes comes out of

  1. preserving the existing subcommands mostly being verbs (taking nouns as arguments) as opposed to nouns (taking verbs and more nouns as arguments), and
  2. realizing it doesn't seem to add ambiguity (at least with only runtimes and pathogens as the kind of nouns): for the computer, the list of recognized runtimes is short and fixed, everything else is a pathogen; for the human, measles is obviously a pathogen, docker is obviously something else.
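
To make point 2 concrete, here is a minimal Python sketch of how a shared setup/update command could tell the two kinds of nouns apart. It is an illustration only, not the CLI's actual code: the function name classify_setup_target is hypothetical, and the runtime set lists only the runtimes named in this thread, not the full set the CLI recognizes.

```python
# Hypothetical sketch: only the runtimes mentioned in this thread are listed,
# and the real CLI's set of recognized runtimes may differ.
KNOWN_RUNTIMES = frozenset({"docker", "conda", "singularity"})


def classify_setup_target(target: str) -> str:
    """Return "runtime" for a recognized runtime name, otherwise "pathogen"."""
    # The runtime list is short and fixed, so anything else is taken to be
    # a pathogen name (possibly with a version or URL attached).
    return "runtime" if target in KNOWN_RUNTIMES else "pathogen"


if __name__ == "__main__":
    for target in ("docker", "measles", "measles@v3"):
        print(target, "->", classify_setup_target(target))
```

The design cost is that a pathogen could never be given a local name that collides with a runtime name, which seems acceptable while the runtime list stays short.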

An exception to the verb-nouns command pattern is the remote family of commands which uses the noun-verb-nouns pattern (remote, upload, groups/foo, …). That might give some support to a workflow or pathogen subcommand with its own verbal subcommands, but it seems to me to push things too deep. Going that way, I'd also feel compelled to move setup, update, and check-setup under nextstrain runtime {setup,update,check-setup}, though that felt pointless for the reasons I wrote last week. And I'd still feel the need to have nextstrain run instead of (or in addition to) nextstrain workflow run so as to better center the core activity and match existing workflow-centric commands (view, build, etc). It all felt too messy.

So I figured trying to implement pathogen setup/update into the existing commands would reveal if it was actually feasible or not and give more information to help make the choice.

I do think we want a way to list workflows. Haven't quite gotten there yet. I think we can do something other than nextstrain workflow ls if we don't go the nextstrain workflow route.

I don't agree that providing this run interface requires us to do a better job of versioning our workflows. […] I do agree that we should do a better job of versioning our workflows independent of how users run them, though!

+1!

@huddlej
Contributor

huddlej commented Nov 8, 2024

Thanks, @tsibley!

I'm not sure about auto-generating CLI options based on the workflow's config.

That's fair. My main hope was that we could include the documentation for the config in the interface somehow, thinking that it's nicer for users if they don't have to leave the command line for a website to figure out how the program works. The autogenerated CLI is just an example of one approach, but I'm open to whatever achieves the end goal.

I figured trying to implement pathogen setup/update into the existing commands would reveal if it was actually feasible or not and give more information to help make the choice.

Yeah, I can see how the interface gets complicated fast when you start to bifurcate on workflow vs. runtime. As you test this with people, I'd be most interested in whether external users experience any confusion when trying to update a runtime vs. a workflow. It's obvious to us that these are separate nouns, but if users rely on the verb to signal the noun that's being operated on, it could get confusing.

@genehack
Contributor

genehack commented Nov 8, 2024

Thinking out loud about implementation edge cases around fetching workflows:

  • For the workflow fetching — whether that's setup or something else¹ — is the idea that a "bare" noun would always get https://github.com/nextstrain/ prepended?
  • What does it look like if somebody wants to fork measles into their own org and then fetch that? (Options I see are "if there's a slash in the noun, only prepend https://github.com" and "if it's not a 'bare' noun it has to be a fully specified URL that will work with git clone")
  • What does the on-disk storage look like if somebody forks measles into their own organization but wants to nextstrain setup measles AND nextstrain setup my-org/measles?

¹ registering my (probably minority) vote for "something else"; having one verb do two completely different things seems like a bad idea to me…

@tsibley
Member Author

tsibley commented Nov 13, 2024

@genehack— Good questions, I've thought thru these previously (though not articulated them outside of my notes yet) and so have some answers. That said, I welcome others to think thru them too!

For the workflow fetching — whether that's setup or something else¹ — is the idea that a "bare" noun would always get https://github.com/nextstrain/ prepended?

I anticipate the constructed source URL would not be exactly that (see below), but in concept, yes.

What does it look like if somebody wants to fork measles into their own org and then fetch that? (Options I see are "if there's a slash in the noun, only prepend https://github.com" and "if it's not a 'bare' noun it has to be a fully specified URL that will work with git clone")

If it's not a bare word ("no slashes"), I'm inclined to require it be a URL for maximum explicitness/self-explanation to a reader, but I'd consider the "single slash" case of user/repo to mean GitHub. I haven't been sure about it and would welcome thoughts on it. Beyond favoring explicitness, I also don't like to favor a single commercial service like that. Comparisons to consider are how nextstrain remote works (explicit URLs are required; actually even for nextstrain.org when it's not groups/…) and nextclade dataset (canonical name is always scope/name, e.g. nextstrain/measles or community/my-org/measles, but aliases allow just measles to mean ours).

Notably, I don't plan to run git because I don't want to require that it be installed separately. And I don't think there's a no-deps git reimplementation available for Python that I could pick up today and use. I know of a couple projects (in Rust, in Go) that I could pick up and use for that purpose, but then I'd also have to do the work to bundle them correctly into Nextstrain CLI. I don't really feel like doing that right now. I guess I could use our own runtimes to run git, but that feels… weird? Also, requiring the URL be a Git repo feels wrong in and of itself; too "techy".

Instead, my plan is to have the fully-resolved URL return a ZIP, while leaving it open to support more container formats (e.g. tarballs, Git, etc.) in the future.
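
As a rough illustration of the ZIP approach, the sketch below unpacks an already-downloaded archive using only Python's standard library, so no separate Git install is needed. It assumes the archive wraps everything in a single top-level directory (as GitHub zipballs do) and drops that directory. The function name extract_zipball is hypothetical, and this is not the PR's actual implementation.

```python
import io
import zipfile
from pathlib import Path, PurePosixPath


def extract_zipball(data: bytes, dest: Path) -> None:
    """Extract ZIP bytes into dest, dropping the single top-level directory."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for member in archive.infolist():
            # Strip the leading "<owner>-<repo>-<commit>/" style component.
            parts = PurePosixPath(member.filename).parts[1:]
            if not parts or member.is_dir():
                continue
            if ".." in parts:
                raise ValueError(f"unsafe path in archive: {member.filename!r}")
            target = dest.joinpath(*parts)
            target.parent.mkdir(parents=True, exist_ok=True)
            # Note: this simple copy does not preserve file permissions.
            target.write_bytes(archive.read(member))
```

Other container formats (tarballs, etc.) could later sit behind the same function signature.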

What does the on-disk storage look like if somebody forks measles into their own organization but wants to nextstrain setup measles AND nextstrain setup my-org/measles?

My thinking is that nextstrain setup measles expands behind the scenes to something like

nextstrain setup measles=https://api.github.com/repos/nextstrain/measles/zipball
nextstrain setup measles --using https://api.github.com/repos/nextstrain/measles/zipball
nextstrain setup https://api.github.com/repos/nextstrain/measles/zipball --as measles

That is, nextstrain setup not only downloads stuff but is also the conceptual point where you assign a local name to the pathogen/collection of workflows. So in your example, you'd assign a name to each, e.g. maybe

nextstrain setup measles-nextstrain=measles
nextstrain setup measles-my-org=my-org/measles

Trying to assign both the same name would be an error.

In most common usage, of course, the name will be defaulted (i.e. measles alone will serve as both the name and implicit URL).
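
A minimal Python sketch of that name/URL expansion, under stated assumptions: the name[@version][=source] syntax follows the examples above, a bare word expands to the api.github.com zipball URL shown earlier, anything containing a slash must be a full URL (the stricter of the options discussed), and appending the version as a ref to the zipball URL is a guess. PathogenSpec, default_url, and parse_spec are hypothetical names, not the CLI's API.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit


class PathogenSpec(NamedTuple):
    name: str                # local name assigned at setup time
    version: Optional[str]   # None means "whatever the default is"
    url: str                 # fully-resolved source URL


def default_url(bare_name: str, version: Optional[str] = None) -> str:
    # Assumption: bare names resolve to the nextstrain org's GitHub zipball.
    url = f"https://api.github.com/repos/nextstrain/{bare_name}/zipball"
    return f"{url}/{version}" if version else url


def parse_spec(spec: str) -> PathogenSpec:
    left, has_source, source = spec.partition("=")
    name, _, version = left.partition("@")
    if not name:
        raise ValueError(f"missing pathogen name in {spec!r}")
    if not has_source:
        source = name  # the name doubles as the implicit source
    if "/" not in source:
        url = default_url(source, version or None)
    elif urlsplit(source).scheme:
        url = source   # explicit URL; the version is only a local label here
    else:
        raise ValueError(f"{source!r} is not a bare name, so it must be a full URL")
    return PathogenSpec(name, version or None, url)
```

Under these rules `measles-nextstrain=measles` resolves to the default URL, while `measles-my-org=my-org/measles` is rejected until the user/repo shorthand question is settled.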

Current on-disk storage plan is something like:

~/.nextstrain/pathogen/<name>@<versionX>/
~/.nextstrain/pathogen/<name>@<versionY>/

The "current"/default version to use will be stored in ~/.nextstrain/config, à la runtimes. Unlike runtimes (at least as it stands now), you'll have the option to keep old versions around instead of pruning them on nextstrain update.

Names and versions and full URLs of installed pathogens will be reported in nextstrain version --verbose.
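
For illustration, the planned layout could be computed like this in Python. This is a sketch, not the implementation: it assumes the NEXTSTRAIN_HOME environment variable overrides ~/.nextstrain when set, and nextstrain_home and pathogen_dir are hypothetical helper names.

```python
import os
from pathlib import Path


def nextstrain_home() -> Path:
    """Return $NEXTSTRAIN_HOME if set, else ~/.nextstrain."""
    return Path(os.environ.get("NEXTSTRAIN_HOME") or Path.home() / ".nextstrain")


def pathogen_dir(name: str, version: str) -> Path:
    """Directory holding one installed version of one named pathogen."""
    # Caveat: a version containing "/" (e.g. a branch name) would create
    # nested directories here; a real implementation may need to encode it.
    return nextstrain_home() / "pathogen" / f"{name}@{version}"


if __name__ == "__main__":
    print(pathogen_dir("measles", "v3"))
```

Keeping the version in the directory name is what lets several versions of the same pathogen sit side by side.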

You could set up and use specific versions like:

nextstrain setup measles@v3
nextstrain run measles@v3 …

This probably means you could handle the nextstrain/measles and my-org/measles case a different way if you wanted:

nextstrain setup measles@v3
nextstrain setup measles@my-org=my-org/measles

¹ registering my (probably minority) vote for "something else"; having one verb do two completely different things seems like a bad idea to me…

Nod. The way I see it is that it does two completely different things under the hood, but to users who don't know/care about the implementation, it does the same thing: sets up X for use. I don't think users will care if X is a "runtime" or a "pathogen"; it's all just how they get bits of Nextstrain ready for use on their computer. (And at some point, we may blur the lines between "runtime" and "pathogen" if we start bundling the runtimes into the pathogen to support pathogen-specific arbitrary deps. This is the other half of "workflows as programs", though it may not come to pass.)

And if we do find it's confusing to users, we can do "something else" later, but if we do "something else" now, it'd be hard to put the horse back in the barn later.

(And also, it still may be that I end up finding it too confusing to explain once I spend more time making it a reality.)

⁂ ⁂ ⁂

There's much more to write about that I've noodled on, but this feels like a good stopping point to let some discussion happen.

@tsibley
Member Author

tsibley commented Nov 13, 2024

@huddlej

…thinking that it's nicer for users if they don't have to leave the command line for a website to figure out how the program works.

I'm very sympathetic to this, but I suspect that while you or I would like to not have to leave the command line to figure out how a program works, we a) are in the minority these days by far, especially among our users and b) already leave the command line for the web browser all the gd time because getting high-quality information (beyond basic reference material) available on the command line is Hard.

(Don't tempt me with the good time of arranging to render our rST docs to nroff for use with man. Or rendering to Python's rich/textual libraries. It's high-effort high-polish I'd love, but that I'm highly dubious will be appreciated more broadly. Maybe you want to convince me otherwise‽)

…but if users rely on the verb to signal the noun that's being operated on, it could get confusing.

Totally. My thinking is that many (most?) users will do the one-time setup of a runtime and a pathogen or two by closely following docs, where understanding the nuance between the nouns is not crucial nor important, and then periodically run nextstrain update without qualification to update the runtime and all the pathogens together without thinking much of it.

@genehack
Contributor

Heavily redacting to just comment on the bits I want to comment on 😁

If it's not a bare word ("no slashes"), I'm inclined to require it be a URL for maximum explicitness/self-explanation to a reader, but I'd consider the "single slash" case of user/repo to mean GitHub. I haven't been sure about it and would welcome thoughts on it. Beyond favoring explicitness, I also don't like to favor a single commercial service like that.

Yep, I would be in favor of requiring a full URL for exactly that reason.

Instead, my plan is to have the fully-resolved URL return a ZIP, while leaving it open to support more container formats (e.g. tarballs, Git, etc.) in the future.

I hear you about the lack of a decent Git implementation to do the work, but I just tested the zipball download from Github and the thing that gets downloaded does not expand into a Git repo, and there's a tiny voice in the back of my head that says we're gonna end up regretting that.

Also means that the nextstrain setup usage pattern would not be something we ourselves would use all that often, maybe not at all.

The tiny voice is pretty quiet so maybe it's okay to ignore it.

Current on-disk storage plan is something like:

~/.nextstrain/pathogen/<name>@<versionX>/
~/.nextstrain/pathogen/<name>@<versionY>/

...and by ~/.nextstrain you mean $NEXTSTRAIN_HOME, right? 😁 (says the guy that hates things buried in dot-directories and has export NEXTSTRAIN_HOME=/opt/nextstrain-env set...)

¹ registering my (probably minority) vote for "something else"; having one verb do two completely different things seems like a bad idea to me…

Nod. The way I see it is that it does two completely different things under the hood, but to users who don't know/care about the implementation, it does the same thing: sets up X for use.

Sure, I get it — but there's a point where "ease of use" crosses into "obscuring actual functional differences", and for me this is on the wrong side of that line.

This is probably the same tiny voice as before though...

There's much more to write about that I've noodled on, but this feels like a good stopping point to let some discussion happen.

++1

@tsibley
Member Author

tsibley commented Nov 14, 2024

Heavily redacting to just comment on the bits I want to comment on 😁

Sounds ████████, thanks.

Yep, I would be in favor of requiring a full URL for exactly that reason.

👍

the thing that gets downloaded does not expand into a Git repo, and there's a tiny voice in the back of my head that says we're gonna end up regretting that.

Hmm. Can you say more about why you think we'd regret that? I get the inclination, but I've found myself unable to identify a concrete reason we'd want/need these copies to actually be Git repos.

Also means that the nextstrain setup usage pattern would not be something we ourselves would use all that often, maybe not at all.

Why not? I don't follow. There's nothing preventing the use of a Git repo at $NEXTSTRAIN_HOME/pathogen/<name>@<versionX>/ (i.e. the presence of .git/ there won't break anything), so we could manually symlink/git clone/git worktree things into there if we wanted.

And one usage pattern I didn't mention here yet (though did in convos with Jover) would be potentially supporting that symlink/git clone/git worktree approach for local development, e.g. nextstrain setup measles=. from within measles/, akin to pip install -e ., or nextstrain setup measles=~/nextstrain/measles/ from elsewhere. This is totally doable. I have reservations about doing it though, since ln -s is sitting right there. But if setup involves a few other small steps or it really is more convenient/explainable, it might make sense to officially support it. (Esp. for external workflow authors.)

...and by ~/.nextstrain you mean $NEXTSTRAIN_HOME, right? 😁

o(f)c. :-P

Nod. The way I see it is that it does two completely different things under the hood, but to users who don't know/care about the implementation, it does the same thing: sets up X for use.

Sure, I get it — but there's a point where "ease of use" crosses into "obscuring actual functional differences", and for me this is on the wrong side of that line.

Nod. Hmm. The functional differences being "makes --docker or --conda usable" vs. "makes nextstrain run usable"? Or something else?

I'm sympathetic to not wanting to use nextstrain setup for both, but given the alternatives I've thought thru and their implications, it's the best choice IMO. If you can suggest some other alternative (or reasoning that attenuates the downsides of existing identified alternatives) that's compelling, I'd be interested and open to it!

@genehack
Contributor

the thing that gets downloaded does not expand into a Git repo, and there's a tiny voice in the back of my head that says we're gonna end up regretting that.

Hmm. Can you say more about why you think we'd regret that? I get the inclination, but I've found myself unable to identify a concrete reason we'd want/need these copies to actually be Git repos.

As soon as somebody wants to collaborate with us on something, we're gonna want the data in Git. I agree that it might not be a requirement but there are a lot of troubleshooting things that are going to be way harder, I suspect. (I.e., somebody contacts us, they nextstrain setup'd something, made some changes/added some data, it was working for a while, now it's broken: how is our first question NOT "what's the diff between the original and what you've got now?", or something similar, which will be much easier to ask even if they haven't been checking their changes in.)

Also means that the nextstrain setup usage pattern would not be something we ourselves would use all that often, maybe not at all.

Why not? I don't follow. There's nothing preventing the use of a Git repo at $NEXTSTRAIN_HOME/pathogen/<name>@<versionX>/ (i.e. the presence of .git/ there won't break anything), so we could manually symlink/git clone/git worktree things into there if we wanted.

But if you're manually symlinking/git clone/whatevering, aren't you by definition not using nextstrain setup? You're just sticking the data under $NEXTSTRAIN_HOME yourself.

Sure, I get it — but there's a point where "ease of use" crosses into "obscuring actual functional differences", and for me this is on the wrong side of that line.

Nod. Hmm. The functional differences being "makes --docker or --conda usable" vs. "makes nextstrain run usable"? Or something else?

The difference being runtimes aren't datasets, and I don't see any gain we get out of obscuring that distinction.

I'm sympathetic to not wanting to use nextstrain setup for both, but given the alternatives I've thought thru and their implications, it's the best choice IMO. If you can suggest some other alternative (or reasoning that attenuates the downsides of existing identified alternatives) that's compelling, I'd be interested and open to it!

Yeah, I don't think I have anything better than everything you've already decided against. 🤷 nextstrain install-dataset, maybe?

@tsibley
Member Author

tsibley commented Nov 21, 2024

they nextstrain setup'd something, made some changes/added some data, it was working for a while, now it's broken: how is our first question NOT "what's the diff between the original and what you've got now?"

Ah! I think I see our parity mismatch now: you're expecting nextstrain setup to be how someone gets a pathogen to hack on. I'm expecting it to be how someone gets a pathogen to use with nextstrain run (and nextstrain update). A "managed" pathogen; sort of a "no user-serviceable parts inside" thing (although, of course, we wouldn't prevent tinkering). This is why nextstrain setup puts things "hidden" inside of $NEXTSTRAIN_HOME instead of a chosen location like git clone does. It'd be akin to the Conda or Singularity runtimes, which maintain data in $NEXTSTRAIN_HOME that they expect to have full management/control over (but which nothing stops you from tinkering with).

If someone's at the "hacking on the source" stage of usage, I fully expect them to be managing their own git clone outside of nextstrain setup.

The difference being runtimes aren't datasets, and I don't see any gain we get out of obscuring that distinction.

In my thinking, the user-visible distinction of runtimes vs. pathogens/workflows exists solely as an accidental/incidental implementation detail. Ideally, the whole runtime thing would be subsumed into pathogens/workflows and while it wouldn't be hidden from users, it'd need not be so visible.

I guess put in your terms, I see it the other way: what do we gain from highlighting the (current) distinction?

(Also, I think we're talking about "Nextstrain pathogens", not "datasets".)

aren't you by definition not using nextstrain setup?

Yes, sorry, I elided some thinking there where nextstrain setup stood in for the larger thing it enables: nextstrain run. I think it's more important for us to be routinely using nextstrain run, compared to nextstrain setup, and so the thinking is that we'd manually imitate nextstrain setup sometimes in order to use nextstrain run.

I expect that for development (incl. probably CI), we'd be using either a) nextstrain build or b) nextstrain run + manually-imitated-nextstrain setup. However, I also expect that we'd still routinely use nextstrain run + nextstrain setup for ad-hoc usage of pathogen workflows outside of development.

And like I said, if we didn't want to manually-imitate nextstrain setup, we could have it explicitly support the symlink/git clone/git worktree approach for local development. (Yes, this is not the exact same as exercising the "download a ZIP file" code path, but I don't think it would go completely unexercised amongst our use.)

Yeah, I don't think I have anything better than everything you've already decided against.

How about any reasons why the downsides I described for those aren't actually as downside-y as I think they are?

nextstrain install-dataset, maybe?

I'd considered nextstrain install, but rejected it because having both a setup and install command felt like having an upgrade and update command. Too nuanced a difference, just confusing.

Qualifying it with -dataset (or really, -pathogen?) helps mitigate that, but it feels a) too much like splitting hairs while b) sharing the problems of nextstrain pathogen setup.

@victorlin
Member

Adding another option for command naming: nextstrain pull. Analogous to docker pull, it could be used for both setup and update/downgrade.

@tsibley
Member Author

tsibley commented Nov 22, 2024

Adding another option for command naming: nextstrain pull. Analogous to docker pull, it could be used for both setup and update/downgrade.

Hmm. Intriguing. I think pull only works slightly more than install because it feels semantically further from setup. But I doubt the meaning of "pull" is as widely understood as "install" or "setup".

Note that I do want to support not just a single version that you can freely upgrade/downgrade, but multiple concurrent versions.

@victorlin
Member

victorlin commented Dec 4, 2024

Re: install, our docs are currently phrased so that "installing Nextstrain" encompasses installing the CLI as well as installing other tools through setup of a runtime. I don't think install/setup/pull of a pathogen repo should be considered part of "installing Nextstrain". If we choose nextstrain install or something similar to represent installation of a pathogen repo, we would need to reword docs to clarify what "install" means. I can see this causing confusion for users.

We already use other technical terms that are not widely understood (e.g. "runtime", "build"), so I don't see the new terminology "pull" as much of a negative.

@tsibley
Member Author

tsibley commented Dec 4, 2024

Yes, agreed we shouldn't use nextstrain install.

I don't think install/setup/pull of a pathogen repo should be considered part of "installing Nextstrain".

Ah, I don't think I agree. I think for many people "installing Nextstrain" would naturally include a pathogen of interest to them.

We have a nuanced understanding of all the bits and bobs that are "Nextstrain"¹, but experience suggests to me that most users do not. For example, we very often see folks who (understandably) conflate "Nextstrain" with some specific part of it, e.g. Nextclade. IMO, they shouldn't have to care much about it a lot of the time.

¹ and try to communicate some of that in parts of a whole

We already use other technical terms that are not widely understood (e.g. "runtime", "build"), so I don't see the new terminology "pull" as much of a negative.

Those existing terms are often sources of confusion and require explanation. It'd be nice not to add to that if we have other options.

@victorlin
Member

Interesting points. I see what you mean and it makes sense under the "workflows as programs" phrasing, but it's a new perspective for me. I'll let the thoughts marinate...

@tsibley
Member Author

tsibley commented Dec 5, 2024

but it's a new perspective for me. I'll let the thoughts marinate...

👍 🧑‍🍳

To follow up on my own statement:

Yes, agreed we shouldn't use nextstrain install.

At least, shouldn't use both nextstrain install and nextstrain setup as separate and different commands. I'd be fine with some sort of migration from the latter to the former if we think that would better match user intuitions and the way we talk about it in documentation elsewhere.

@tsibley tsibley force-pushed the trs/workflows-as-programs branch 2 times, most recently from c2cd0f6 to 5873eb4 Compare December 19, 2024 20:34
@tsibley tsibley force-pushed the trs/workflows-as-programs branch from 5873eb4 to c8d3b82 Compare January 16, 2025 07:31
@tsibley tsibley force-pushed the trs/workflows-as-programs branch 2 times, most recently from fd602f9 to 16c42f1 Compare February 4, 2025 04:38
@tsibley tsibley force-pushed the trs/workflows-as-programs branch from 16c42f1 to b529975 Compare February 11, 2025 06:39
@tsibley tsibley force-pushed the trs/workflows-as-programs branch from b529975 to 315895f Compare February 13, 2025 21:30
@tsibley tsibley force-pushed the trs/workflows-as-programs branch 3 times, most recently from 06067b9 to 92dfab9 Compare March 4, 2025 18:48
@tsibley tsibley force-pushed the trs/workflows-as-programs branch 3 times, most recently from e4ff937 to c6e8a47 Compare March 5, 2025 23:16
@tsibley tsibley changed the base branch from master to trs/grab-bag March 6, 2025 18:28
@tsibley tsibley changed the title wip! workflows as programs prototyping Workflows-as-programs: nextstrain run Mar 6, 2025
@tsibley tsibley force-pushed the trs/workflows-as-programs branch from c6e8a47 to e788489 Compare March 6, 2025 20:21
@tsibley tsibley marked this pull request as ready for review March 6, 2025 20:27
@tsibley tsibley requested a review from a team March 6, 2025 20:28
@tsibley
Member Author

tsibley commented Mar 6, 2025

Ready for review! I updated the PR description with information for review. The latest CI run is still in progress, but only because of minor changes, so I expect it to pass.

@tsibley
Member Author

tsibley commented Mar 6, 2025

Never a dull moment: I'm seeing the standalone installation process I describe in the PR description above error out like this

$ cd "$(mktemp -d)"
$ curl -fsSL --proto '=https' https://nextstrain.org/cli/installer/"$(uname -s)" | NEXTSTRAIN_HOME=$PWD bash -s pr-build/407
[…]
--> Checking installation
Traceback (most recent call last):
  File "runpy", line 196, in _run_module_as_main
  File "runpy", line 86, in _run_code
  File "nextstrain.cli.__main__", line 43, in <module>
  File "nextstrain.cli.__main__", line 19, in main
  File "nextstrain.cli", line 34, in run
  File "argparse", line 1826, in parse_args
  File "argparse", line 1859, in parse_known_args
  File "argparse", line 2072, in _parse_known_args
  File "argparse", line 2012, in consume_optional
  File "argparse", line 1936, in take_action
  File "nextstrain.cli", line 91, in __call__
  File "nextstrain.cli.command.version", line 42, in run
AttributeError: 'types.SimpleNamespace' object has no attribute 'runtimes'
ERROR: installation check failed: unable to run `"/tmp/tmp.LNam0JJ0bw/cli-standalone"/nextstrain --version`; look for error message above.

which is, uh, a bit baffling right now. The installed files are there, however, and seem to run just fine

$ ./cli-standalone/nextstrain version
Nextstrain CLI 8.5.4+git.8f23baa (standalone)

adding to my confusion about what the heck is happening.

@tsibley
Member Author

tsibley commented Mar 6, 2025

OH! It's running nextstrain --version not nextstrain version. The former fails, the latter works. 🤦 I'll fix this.
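For readers following the traceback, here is a minimal sketch of how this kind of failure can arise. It assumes (based only on the traceback, not on the actual CLI source) that a global `--version` flag is handled by an argparse action which calls the `version` command with a hand-built options object. All names below (`run_version`, `build_parser`, `stale_opts`, `complete_opts`) are hypothetical.

```python
import argparse
from types import SimpleNamespace


def run_version(opts):
    """Stand-in for a `version` command that reads several options."""
    lines = ["demo 1.0"]
    if opts.verbose:
        lines.append("verbose details")
    if opts.runtimes:  # an option the flag handler below may not know about
        lines.append("runtime versions")
    return "\n".join(lines)


def stale_opts():
    # Hand-built options that predate the `runtimes` option.
    return SimpleNamespace(verbose=False)


def complete_opts():
    # Hand-built options kept in sync with the command's options.
    return SimpleNamespace(verbose=False, runtimes=False)


def build_parser(make_opts):
    class ShowVersion(argparse.Action):
        """Handle a global --version flag by delegating to run_version()."""

        def __init__(self, option_strings, dest, **kwargs):
            super().__init__(option_strings, dest, nargs=0, **kwargs)

        def __call__(self, parser, namespace, values, option_string=None):
            print(run_version(make_opts()))
            parser.exit()

    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("--version", action=ShowVersion)

    subparsers = parser.add_subparsers(dest="command")
    version = subparsers.add_parser("version")
    version.add_argument("--verbose", action="store_true")
    version.add_argument("--runtimes", action="store_true")
    version.set_defaults(run=run_version)
    return parser
```

With `build_parser(stale_opts)`, parsing `["--version"]` raises `AttributeError` because the hand-built namespace lacks `runtimes`, while parsing `["version"]` works because argparse fills in every default itself. That mirrors the "former fails, latter works" behaviour described above.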

@tsibley tsibley force-pushed the trs/workflows-as-programs branch from e788489 to fa895c7 Compare March 6, 2025 20:54
@tsibley
Member Author

tsibley commented Mar 6, 2025

Fixed. There's no harm in installing the PR build before this round of CI completes though; it's just that nextstrain --version won't work.

Member

@jameshadfield jameshadfield left a comment

Only tested out nextstrain run which works nicely for my (off the happy-path) purposes using the master branch of avian-flu and a from-source pip install of nextstrain-cli:

mkdir -p ~/.nextstrain/pathogens/avian-flu
ln -sv ~/github/nextstrain/avian-flu ~/.nextstrain/pathogens/avian-flu/local=NRXWGYLM
cd <some analysis directory>
# add a `config.yaml` overlay (optional)
envdir ... nextstrain run --aws-batch avian-flu@local segment-focused . auspice/avian-flu_h5n1_pb2_2y.json

Using --augur ~/github/nextstrain/augur (as per #419 (comment)) works but it takes 20min to upload (because an in-use augur repo size balloons to 675MB). #295 should help here, or excluding certain paths (our docs, .mypy_cache etc).
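An aside on the `local=NRXWGYLM` name used for the symlink above: the suffix matches the Base32 encoding of the string `local`. The snippet below checks that. Whether the CLI builds directory names this way in general is an assumption here, and `encode_version_dir` is a hypothetical helper, not a function from nextstrain-cli.

```python
import base64


def encode_version_dir(version: str) -> str:
    """Return "<version>=<BASE32(version)>" with any "=" padding stripped.

    Hypothetical helper; stripping the padding is an assumption.
    """
    encoded = base64.b32encode(version.encode("utf-8")).decode("ascii").rstrip("=")
    return f"{version}={encoded}"


print(encode_version_dir("local"))  # prints: local=NRXWGYLM
```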

Contributor

@genehack genehack left a comment

I focused more on the documentation and comments than on the code; others are (hopefully!) better positioned to review the code.

My big comments:

  • I think overloading the word "pathogen" to mean "check out of a pathogen repo" causes unnecessary cognitive load when reading the documentation, particularly for folks outside of Nextstrain — using something like pathogen install, pathogen installation, pathogen instance, or some other variation, would be easier to understand
  • Similarly, the documentation leans pretty heavily into "set up" (verb) and "setup" (noun), which I think is potentially a source of confusion, especially for folks less familiar with the subtleties of English — picking a different word for one of them (e.g., "install" for the verb or "installation" for the noun, or BYOword) would require less careful parsing of the docs.

That said, I realize this ship has probably sailed, but I needed to point those out as probable issues.

@@ -63,11 +67,11 @@ commands

.. option:: update

-Update a runtime. See :doc:`/commands/update`.
+Update a pathogen or runtime. See :doc:`/commands/update`.
Contributor

nit; suggested for consistency with other phrasing in the doc

Suggested change
-Update a pathogen or runtime. See :doc:`/commands/update`.
+Update a pathogen workflow or runtime. See :doc:`/commands/update`.


.. option:: setup

-Set up a runtime. See :doc:`/commands/setup`.
+Set up a pathogen or runtime. See :doc:`/commands/setup`.
Contributor

nit; suggested for consistency with other phrasing in the doc

Suggested change
-Set up a pathogen or runtime. See :doc:`/commands/setup`.
+Set up a pathogen workflow or runtime. See :doc:`/commands/setup`.

(mainly provided by Nextstrain) using your own configuration, data, and other
supported customizations. Pathogens are initially set up using `nextstrain
setup` and can be updated over time as desired using `nextstrain update`.
Multiple versions of a pathogen may be set up and run independently without
Contributor

Suggested change
-Multiple versions of a pathogen may be set up and run independently without
+Multiple versions of a pathogen workflow may be set up and run independently without


.. option:: <pathogen-name>[@<version>]

The name (and optionally, version) of a previously set up pathogen.
Contributor

I'm going to stop making this correction/suggestion now, but in general I think, if you want a bare, generic, un-adjective-d noun, "workflow" would be a better choice than "pathogen"; otherwise, use "pathogen repo" or "pathogen workflow".

Member

I try not to get into naming discussions, but I think "(pathogen) workflow" is going to be used to refer to the Snakemake workflow within a pathogen (repo).

Member Author

See my recent comment below. These could all become "pathogen repository".

(Note that "pathogen workflow" is not quite right. There is an important distinction between the pathogen repo and the workflows it contains.)

Member Author

@tsibley tsibley Mar 10, 2025

(There may be places where that distinction isn't made clearly enough, and those are my mistake and should be fixed.)

Comment on lines +51 to +54
A pathogen may also be fully-specified as ``<name>@<version>=<url>``
where ``<name>`` and ``<version>`` in this case are (mostly)
arbitrary and ``<url>`` points to a ZIP file containing the
pathogen workflow contents.
Contributor

Having an example of what url points to a ZIP file looks like might be helpful here?
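To make the `<name>@<version>=<url>` form under discussion concrete, here is a rough sketch of how such a spec string could be split into its parts. It is an illustration only, not the CLI's parser; `PathogenSpec`, `parse_spec`, and the `example.com` URL are hypothetical.

```python
from typing import NamedTuple, Optional


class PathogenSpec(NamedTuple):
    name: str
    version: Optional[str]
    url: Optional[str]


def parse_spec(spec: str) -> PathogenSpec:
    """Split "<name>[@<version>[=<url>]]" into its parts (hypothetical helper)."""
    # Split on the first "=" only, since the URL may itself contain "=" or "@".
    head, _, url = spec.partition("=")
    name, _, version = head.partition("@")
    if not name:
        raise ValueError(f"missing pathogen name in {spec!r}")
    return PathogenSpec(name, version or None, url or None)


# A name and version only, as used in the PR description:
print(parse_spec("measles@trs/workflows-as-programs"))

# A fully-specified form with a placeholder URL (not a real location):
print(parse_spec("measles@my-test=https://example.com/measles.zip"))
```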

@tsibley tsibley force-pushed the trs/workflows-as-programs branch 2 times, most recently from 86758ed to 79a25f6 Compare March 10, 2025 19:09
@tsibley tsibley force-pushed the trs/grab-bag branch 2 times, most recently from 1a27583 to 93ae929 Compare March 10, 2025 19:10
@tsibley tsibley force-pushed the trs/workflows-as-programs branch 2 times, most recently from a6dac89 to 100ecec Compare March 10, 2025 19:28
@tsibley
Member Author

tsibley commented Mar 10, 2025

I really appreciate the 👀 and 🧠 on this so far. Thank you!

@jameshadfield wrote:

Only tested out nextstrain run which works nicely for my (off the happy-path) purposes using the master branch of avian-flu and a from-source pip install of nextstrain-cli:

Glad to hear it!

Using --augur ~/github/nextstrain/augur (as per #419 (comment)) works but it takes 20min to upload (because an in-use augur repo size balloons to 675MB). #295 should help here, or excluding certain paths (our docs, .mypy_cache etc).

Ugh. Compressing archive members during upload (#295) will only go so far here, and there's other considerations to address with that too. The best thing would be to, like you say, avoid uploading the cruft in the first place. I don't want to hardcode paths here, though. Perhaps runner.aws_batch.s3.upload_workdir needs to be extended to a) use .gitignore files for the overlay volumes (but not the workdir itself) and/or b) support a Nextstrain-specific ignore file (e.g. .nextstrain-ignore) which could be applied everywhere (as suggested in 6d465f0).
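As a rough illustration of option (b), the sketch below filters a directory tree against patterns read from an ignore file before upload. It is not how `runner.aws_batch.s3.upload_workdir` works today. The file name `.nextstrain-ignore` comes from the suggestion above, while `read_ignore_patterns` and `files_to_upload` are hypothetical. The matching is deliberately simplified: patterns are compared against file and directory base names only, which is far short of full `.gitignore` semantics.

```python
import fnmatch
import os
from pathlib import Path
from typing import Iterator, List


def read_ignore_patterns(ignore_file: Path) -> List[str]:
    """Read glob patterns from an ignore file, skipping blanks and # comments."""
    if not ignore_file.is_file():
        return []
    patterns = []
    for line in ignore_file.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line.rstrip("/"))  # "docs/" and "docs" are treated alike
    return patterns


def _ignored(name: str, patterns: List[str]) -> bool:
    return any(fnmatch.fnmatch(name, pattern) for pattern in patterns)


def files_to_upload(root: Path, patterns: List[str]) -> Iterator[Path]:
    """Yield paths (relative to root) of files not excluded by the patterns."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune ignored directories in place so os.walk never descends into them.
        dirnames[:] = sorted(d for d in dirnames if not _ignored(d, patterns))
        for filename in sorted(filenames):
            if not _ignored(filename, patterns):
                yield Path(dirpath, filename).relative_to(root)
```

For example, an ignore file containing `docs/` and `.mypy_cache` would keep both directories out of the upload list while leaving the rest of the tree untouched.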

@genehack wrote:

I focused more on the documentation and comments than on the code; others are (hopefully!) better positioned to review the code.

Frankly, I'm less concerned about the code (although still appreciate 👀) and more about the interface and docs and QA aspects, so this is great. 🙌

My big comments:

  • I think overloading the word "pathogen" to mean "check out of a pathogen repo" causes unnecessary cognitive load when reading the documentation, particularly for folks outside of Nextstrain — using something like pathogen install, pathogen installation, pathogen instance, or some other variation, would be easier to understand

Yeah, that's fair. "Pathogen" in the docs is short for "pathogen repository", but that felt too much of a mouthful in most instances so I ended up shortening it. That said, "pathogen repository" is the terminology we use in the rest of our documentation, so it should probably also be used here as much as possible. If that sounds good, I can make the updates.

(I shy away from "pathogen install" because this project doesn't use that terminology, as you well know; see more commentary below.)

  • Similarly, the documentation leans pretty heavily into "set up" (verb) and "setup" (noun), which I think is potentially a source of confusion, especially for folks less familiar with the subtleties of English — picking a different word for one of them (e.g., "install" for the verb or "installation" for the noun, or BYOword) would require less careful parsing of the docs.

Nod. This is me sticking with the existing convention in this project. And yes, that convention was established by me 🙃 but not with this PR. Certainly this PR adds several more usages though, and maybe that tips the scales towards doing something about it sooner than later.

I agree the distinction is a subtlety of English, but I'm not sure I see how "set up" vs. "setup" is all that confusing in practice as a doc reader. Can you point to places where it seems likely to cause specific confusion?

It seems to me like a distinction that will be glossed over by most, which is fine. Perhaps that points to not making the distinction English does and using "setup" everywhere?

Or, as I mentioned earlier in this PR, I'd be ok with some sort of migration from nextstrain setup → nextstrain install and the corresponding prose shift if we think "install" would better match user intuitions and the way we talk about it in documentation elsewhere.

That said, I realize this ship has probably sailed, but I needed to point those out as probable issues.

It may have sailed… but this isn't the 1700s and we can send it a wireless telegram to change course while it's at sea. ;-) That's to say, I'm open to reconsidering terminology here. I don't think this feature/PR needs to block on "install" vs. "setup", though, since "setup" is already existing terminology.

Discussion on terminology may be most productive synchronously. Maybe we can take it to the weekly dev chat?

@genehack
Contributor

Narrowing the scope of this reply to just the terminology stuff:

  • I think overloading the word "pathogen" to mean "check out of a pathogen repo" causes unnecessary cognitive load when reading the documentation, particularly for folks outside of Nextstrain — using something like pathogen install, pathogen installation, pathogen instance, or some other variation, would be easier to understand

Yeah, that's fair. "Pathogen" in the docs is short for "pathogen repository", but that felt too much of a mouthful in most instances so I ended up shortening it.

TBH, the reason "pathogen repository" (and/or "pathogen repo") wasn't in my list of suggested alternatives was because I felt like that phrasing runs counter to the "this tooling is so people don't need to know about Git" imperative that seems to be partially driving the work.

Perhaps we can abbreviate most usages of it as "PR", surely that won't cause confusion… 😛

That said, "pathogen repository" is the terminology we use in the rest of our documentation, so it should probably also be used here as much as possible. If that sounds good, I can make the updates.

That would probably assuage my concerns, assuming it doesn't make the docco feel too …something.

  • Similarly, the documentation leans pretty heavily into "set up" (verb) and "setup" (noun)

[snip]

I agree the distinction is a subtlety of English, but I'm not sure I see how "set up" vs. "setup" is all that confusing in practice as a doc reader. Can you point to places where it seems likely to cause specific confusion?

Aside from the one place where I suggested correcting something that was actually a correct usage, because I'd momentarily lost track of which one was which, I can't point to any problematic places. I do note it makes reading the docs slower for me (maybe not a bad thing?) because I have to stop at each use of "set up" or "setup" and briefly figure out whether it's a thing or an action. (That might be specific to my brain, I dunno.)

Discussion on terminology may be most productive synchronously. Maybe we can take it to the weekly dev chat?

Happy to do that if we need to keep talking about it! (But also sometimes, when dealing with English language usage issues, it's maybe more clear textually; spoken forms of "set up" and "setup" are even more confusable…)

@tsibley
Member Author

tsibley commented Mar 11, 2025

TBH, the reason "pathogen repository" (and/or "pathogen repo") wasn't in my list of suggested alternatives was because I felt like that phrasing runs counter to the "this tooling is so people don't need to know about Git" imperative that seems to be partially driving the work.

Heh. Yeah, that was also part of my "well, I'll just shorten it to pathogen" thinking. Especially since we talk about "Nextstrain pathogens" and otherwise drop the "repository" part all the time, even with external folks.

That said, I think "repository" is common enough in science (e.g. sample repository, data repositories) to have enough relatable meaning outside of "version control repository". So I don't feel opposed to using it, esp. since I see we do use it more consistently in documentation.

Aside from the one place where I suggested correcting something that was actually a correct usage, because I'd momentarily lost track of which one was which, I can't point to any problematic places. I do note it makes reading the docs slower for me (maybe not a bad thing?) because I have to stop at each use of "set up" or "setup" and briefly figure out whether it's a thing or an action. (That might be specific to my brain, I dunno.)

Nod. I'm ok with slower doc reading! (But I also often find myself iteratively re-reading bits of doc to make sure I'm grasping the details/nuances/omissions/etc… and I'm ok with that, I guess? kind of a basic textual analysis.)

For now, I'm inclined to keep "setup" vs. "set up" terminology as it stands (minus an error or two I spotted), with the understanding that changing it would be part of the larger "setup" vs. "install" (vs. ??????) discussion that's been simmering.

Happy to do that if we need to keep talking about it! (But also sometimes, when dealing with English language usage issues, it's maybe more clear textually; spoken forms of "set up" and "setup" are even more confusable…)

Fair! I'm ok with text too. Whatever ~~floats~~ fills people's ~~boats~~ sails here re: continued discussion and in what form and on what timeline.

tsibley added 5 commits March 10, 2025 22:20
…version_lax()

The lax implementation is always successful at producing a Version
object suitable for comparisons, even if the version isn't a valid
Python package version.  This makes it useful for non-Python ecosystems,
such as Conda packages, and is what I want to use for pathogen versions
too.

Renames the strict parse_version() previously present in util to
parse_version_strict() because it raises an exception on invalid
versions.  It is still only used internally in util to parse versions
that we expect to be valid.
…ormation

Preserves 1) the original version string and 2) whether
parse_version_lax() found it PEP-440-compliant or not.

The original version string is useful even for compliant versions, as
stringifying a Version object will return a normalized representation
that may not string compare identically.
They're going to be used for not just runners but also pathogens, so
remove "runner" from their names.
…ners themselves

Anticipates introduction of pathogens as a thing that can be set up too,
which will handle their own default setting.
Adds a new command, `nextstrain run`, to run (compatible) pathogen
workflows in a more managed way with easier update paths, without the
need for user-facing Git, with support for multiple versions, and with
support for concurrent-but-separate analyses via the same workflow.

Supported by changes to

  - `nextstrain setup` to obtain and set up specific versions of pathogens
  - `nextstrain update` to keep pathogens up-to-date
  - `nextstrain version` to report on pathogen versions available locally

At the moment, the only compatible pathogen is measles at my
not-yet-finished demo/prototype branch.¹  Avian flu should not be far
behind, though.

There's a lot of functionality (and polish) here and elsewhere still
todo to fully realize the sweeping goals of workflows-as-programs², but
this is a fully-usable first piece of the puzzle that can stand on its
own for now.

¹ <nextstrain/measles#55>
² <nextstrain/public#1>
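
The first two commit messages above describe a lax version parser that always yields a comparable object and preserves both the original string and whether it was compliant. The sketch below illustrates that idea only; it is not the code from these commits. As a loudly labelled simplification, "compliant" here means plain dotted integers (for example `1.10.0`) rather than full PEP 440, and `LaxVersion` and `parse_lax_sketch` are hypothetical names.

```python
import re
from typing import NamedTuple, Tuple


class LaxVersion(NamedTuple):
    key: Tuple[int, Tuple[int, ...], str]  # sort key; compared first
    original: str                          # the version string exactly as given
    compliant: bool                        # did the simplified strict check pass?


# Simplified stand-in for a strict check: dotted integers only (NOT full PEP 440).
_DOTTED_INTEGERS = re.compile(r"\d+(\.\d+)*")


def parse_lax_sketch(version: str) -> LaxVersion:
    """Always return a comparable LaxVersion, never raise on odd input."""
    if _DOTTED_INTEGERS.fullmatch(version):
        release = tuple(int(part) for part in version.split("."))
        return LaxVersion((1, release, ""), version, True)
    # Non-compliant versions sort before compliant ones, ordered as plain strings.
    return LaxVersion((0, (), version), version, False)
```

Keeping `original` alongside the sort key reflects the point made in the second commit message: a normalized form (here, `01.2` parsed as `(1, 2)`) may not compare equal to the input as a string.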
@tsibley tsibley force-pushed the trs/workflows-as-programs branch from 100ecec to 228fd6c Compare March 11, 2025 05:32