-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Workflows-as-programs: nextstrain run
#407
base: trs/grab-bag
Are you sure you want to change the base?
Conversation
@tsibley It was really cool to see a working demo of this idea today! A couple of thoughts I had from your lab meeting presentation: Providing
|
Yeah, I saw that in Snk. I'm not sure about auto-generating CLI options based on the workflow's config. It seems to me to be a more complicated way of providing
Yeah, I've waffled on this a bit in my thinking. My old notes are full of
My thinking of overloading
An exception to the verb-nouns command pattern is the So I figured trying to implementing pathogen setup/update into the existing commands would reveal if it was actually feasible or not and give more information to help make the choice. I do think we want a way to list workflows. Haven't quite gotten there yet. I think we can do something other than
+1! |
Thanks, @tsibley!
That's fair. My main hope was that we could include the documentation for the config in the interface somehow, thinking that it's nicer for users if they don't have to leave the command line for a website to figure out how the program works. The autogenerated CLI is just an example of one approach, but I'm open to whatever achieves the end goal.
Yeah, I can see how the interface gets complicated fast when you start to bifurcate on workflow vs. runtime. As you test this with people, I'd be most interested in whether external users experience any confusion when trying to update a runtime vs. a workflow. It's obvious to us that these are separate nouns, but if users rely on the verb to signal the noun that's being operated on, it could get confusing. |
Thinking out loud about implementation edge cases around fetching workflows:
¹ registering my (probably minority) vote for "something else"; having one verb do two completely different things seems like a bad idea to me… |
@genehack— Good questions, I've thought thru these previously (though not articulated them outside of my notes yet) and so have some answers. That said, I welcome others to think thru them too!
I anticipate the constructed source URL would not be exactly that (see below), but in concept, yes.
If it's not a bare word ("no slashes"), I'm inclined to require it be a URL for maximum explicitness/self-explanation to a reader, but I'd consider the "single slash" case of Notably, I don't plan to run Instead, my plan is to have the fully-resolved URL return a ZIP, while leaving it open to support more container formats (e.g. tarballs, Git, etc.) in the future.
My thinking is that
That is,
Trying to assign both the same name would be an error. In most common usage, of course, the name will be defaulted (i.e. Current on-disk storage plan is something like:
The "current"/default version to use will be stored in Names and versions and full URLs of installed pathogens will be reported in You could set up and use specific versions like:
This probably means you could handle the
Nod. The way I see it is that it does two completely different things under the hood, but to users who don't know/care about the implementation, it does the same thing: sets up X for use. I don't think users will care if X is a "runtime" or a "pathogen"; it's all just how they get bits of Nextstrain ready for use on their computer. (And at some point, we may blur the lines between "runtime" and "pathogen" if we start bundling the runtimes into the pathogen to support pathogen-specific arbitrary deps. This is the other half of "workflows as programs", though it may not come to pass.) And if we do find it's confusing to users, we can do "something else" later, but if we do "something else" now, it'd be hard to put the horse back in the barn later. (And also, it still may be that I end up finding it too confusing to explain once I spend more time making it a reality.) ⁂ ⁂ ⁂ There's much more to write about that I've noodled on, but this feels like a good stopping point to let some discussion happen. |
I'm very sympathetic to this, but I suspect that while you or I would like to not have to leave the command line to figure out how a program works, we a) are in the minority these days by far, especially among our users and b) already leave the command line for the web browser all the gd time because getting high-quality information (beyond basic reference material) available on the command line is Hard. (Don't tempt me with the good time of arranging to render our rST docs to nroff for use with
Totally. My thinking is that many (most?) users will do the one-time setup of a runtime and a pathogen or two by closely following docs, where understanding the nuance between the nouns is not crucial nor important, and then periodically run |
Heavily redacting to just comment on the bits I want to comment on 😁
Yep, I would be in favor of requiring a full URL for exactly that reason.
I hear you about the lack of a decent Git implementation to do the work, but I just tested the zipball download from Github and the thing that gets downloaded does not expand into a Git repo, and there's a tiny voice in the back of my head that says we're gonna end up regretting that. Also means that the The tiny voice is pretty quiet so maybe it's okay to ignore it.
...and by
Sure, I get it — but there's a point where "ease of use" crosses into "obscuring actual functional differences", and for me this is on the wrong side of that line. This is probably the same tiny voice as before though...
++1 |
Sounds ████████, thanks.
👍
Hmm. Can you say more about why you think we'd regret that? I get the inclination, but I've found myself unable to identify a concrete reason we'd want/need these copies to actually be Git repos.
Why not? I don't follow. There's nothing preventing the use of a Git repo at And one usage pattern I didn't mention here yet (though did in convos with Jover) would be potentially supporting that symlink/
o(f)c. :-P
Nod. Hmm. The functional differences being "makes I'm sympathetic to not wanting to use |
As soon as somebody wants to collaborate with us on something, we're gonna want the data in Git. I agree that it might not be a requirement but there are a lot of troubleshooting things that are going to be way harder, I suspect. (I.e., somebody contacts us, they
But if you're manually symlinking/
The difference being runtimes aren't datasets, and I don't see any gain we get out of obscuring that distinction.
Yeah, I don't think I have anything better than everything you've already decided against. 🤷 |
Ah! I think I see our parity mismatch now: you're expecting If someone's at the "hacking on the source" stage of usage, I fully expect them to be managing their own
In my thinking, the user-visible distinction of runtimes vs. pathogens/workflows exists solely as an accidental/incidental implementation detail. Ideally, the whole runtime thing would be subsumed into pathogens/workflows and while it wouldn't be hidden from users, it'd need not be so visible. I guess put in your terms, I see it the other way: what do we gain from highlighting the (current) distinction? (Also, I think we're talking about "Nextstrain pathogens", not "datasets".)
Yes, sorry, I elided some thinking there where I expect that for development (incl. probably CI), we'd be using either a) And like I said, if we didn't want to manually-imitate
How about any reasons why the downsides I described for those aren't actually as downside-y as I think they are?
I'd considered Qualifying it with |
Adding another option for command naming: |
Hmm. Intriguing. I think Note that I do want to support not just a single version that you can freely upgrade/downgrade, but multiple concurrent versions. |
Re: We already use other technical terms that are not widely understood (e.g. "runtime", "build"), so I don't see the new terminology "pull" as much of a negative. |
Yes, agreed we shouldn't use
Ah, I don't think I agree. I think for many people "installing Nextstrain" would naturally include a pathogen of interest to them. We have a nuanced understanding of all the bits and bobs that are "Nextstrain"¹, but experience suggests to me that most users do not. For example, we very often see folks who (understandably) conflate "Nextstrain" with some specific part of it, e.g. Nextclade. IMO, they shouldn't have to care much about it a lot of the time. ¹ and try to communicate some of that in parts of a whole
Those existing terms are often sources of confusion and require explanation. It'd be nice not to add to that if we have other options. |
Interesting points. I see what you mean and it makes sense under the "workflows as programs" phrasing, but it's a new perspective for me. I'll let the thoughts marinate... |
👍 🧑🍳 ⁂ To followup on my own statement:
At least, shouldn't use both |
c2cd0f6
to
5873eb4
Compare
5873eb4
to
c8d3b82
Compare
fd602f9
to
16c42f1
Compare
16c42f1
to
b529975
Compare
b529975
to
315895f
Compare
06067b9
to
92dfab9
Compare
e4ff937
to
c6e8a47
Compare
nextstrain run
c6e8a47
to
e788489
Compare
Ready for review! I updated the PR description with information for review. The latest CI run is still in-progress but only because of minor changes so I expect it to pass. |
Never a dull moment: I'm seeing the standalone installation process I describe in the PR description above error out like this $ cd "$(mktemp -d)"
$ curl -fsSL --proto '=https' https://nextstrain.org/cli/installer/"$(uname -s)" | NEXTSTRAIN_HOME=$PWD bash -s pr-build/407
[…]
--> Checking installation
Traceback (most recent call last):
File "runpy", line 196, in _run_module_as_main
File "runpy", line 86, in _run_code
File "nextstrain.cli.__main__", line 43, in <module>
File "nextstrain.cli.__main__", line 19, in main
File "nextstrain.cli", line 34, in run
File "argparse", line 1826, in parse_args
File "argparse", line 1859, in parse_known_args
File "argparse", line 2072, in _parse_known_args
File "argparse", line 2012, in consume_optional
File "argparse", line 1936, in take_action
File "nextstrain.cli", line 91, in __call__
File "nextstrain.cli.command.version", line 42, in run
AttributeError: 'types.SimpleNamespace' object has no attribute 'runtimes'
ERROR: installation check failed: unable to run `"/tmp/tmp.LNam0JJ0bw/cli-standalone"/nextstrain --version`; look for error message above. which is, uh, a bit baffling right now. The installed files are there, however, and seem to run just fine $ ./cli-standalone/nextstrain version
Nextstrain CLI 8.5.4+git.8f23baa (standalone) adding to my confusion about what the heck is happening. |
OH! It's running |
e788489
to
fa895c7
Compare
Fixed. There's no harm in installing the PR build before this round of CI completes though; it's just that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only tested out nextstrain run
which works nicely for my (off the happy-path) purposes using the master branch of avian-flu and a from-source pip install of nextstrain-cli:
mkdir -p ~/.nextstrain/pathogens/avian-flu
ln -sv ~/github/nextstrain/avian-flu ~/.nextstrain/pathogens/avian-flu/local=NRXWGYLM
cd <some analysis directory>
# add a `config.yaml` overlay (optional)
envdir ... nextstrain run --aws-batch avian-flu@local segment-focused . auspice/avian-flu_h5n1_pb2_2y.json
Using --augur ~/github/nextstrain/augur
(as per #419 (comment)) works but it takes 20min to upload (because an in-use augur repo size baloons to 675MB). #295 should help here, or excluding certain paths (our docs, .mypy_cache
etc).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I focused more on the documentation and comments than on the code; others are (hopefully!) better positioned to review the code.
My big comments:
- I think overloading the word "pathogen" to mean "check out of a pathogen repo" causes unnecessary cognitive load when reading the documentation, particularly for folks outside of Nextstrain — using something like
pathogen install
,pathogen installation
,pathogen instance
, or some other variation, would be easier to understand - Similarly, the documentation leans pretty heavily into "set up" (verb) and "setup" (noun), which I think is potentially a source of confusion, especially for folks less familiar with the subtleties of English — picking a different word for one of them (e.g., "install" for the verb or "installation" for the noun, or BYOword) would require less careful parsing of the docs.
That said, I realize this ship has probably sailed, but I needed to point those out as probable issues.
@@ -63,11 +67,11 @@ commands | |||
|
|||
.. option:: update | |||
|
|||
Update a runtime. See :doc:`/commands/update`. | |||
Update a pathogen or runtime. See :doc:`/commands/update`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit; suggested for consistency with other phrasing in the doc
Update a pathogen or runtime. See :doc:`/commands/update`. | |
Update a pathogen workflow or runtime. See :doc:`/commands/update`. |
|
||
.. option:: setup | ||
|
||
Set up a runtime. See :doc:`/commands/setup`. | ||
Set up a pathogen or runtime. See :doc:`/commands/setup`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit; suggested for consistency with other phrasing in the doc
Set up a pathogen or runtime. See :doc:`/commands/setup`. | |
Set up a pathogen workflow or runtime. See :doc:`/commands/setup`. |
(mainly provided by Nextstrain) using your own configuration, data, and other | ||
supported customizations. Pathogens are initially set up using `nextstrain | ||
setup` and can be updated over time as desired using `nextstrain update`. | ||
Multiple versions of a pathogen may be set up and run independently without |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Multiple versions of a pathogen may be set up and run independently without | |
Multiple versions of a pathogen workflow may be set up and run independently without |
|
||
.. option:: <pathogen-name>[@<version>] | ||
|
||
The name (and optionally, version) of a previously set up pathogen. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to stop making this correction/suggestion now, but in general I think, if you want a bare, generic, un-adjective-d noun, "workflow" would be a better choice than "pathogen"; otherwise, use "pathogen repo" or "pathogen workflow".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I try not to get into naming discussions, but I think "(pathogen) workflow" is going to be used to refer to the Snakemake workflow within a pathogen (repo).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my recent comment below. These could all become "pathogen repository".
(Note that "pathogen workflow" is not quite right. There is an important distinction between the pathogen repo and the workflows it contains.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(There may be places where that distinction isn't made clearly enough, and those are my mistake and should be fixed.)
A pathogen may also be fully-specified as ``<name>@<version>=<url>`` | ||
where ``<name>`` and ``<version>`` in this case are (mostly) | ||
arbitrary and ``<url>`` points to a ZIP file containing the | ||
pathogen workflow contents. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having an example of what url points to a ZIP file
looks like might be helpful here?
86758ed
to
79a25f6
Compare
1a27583
to
93ae929
Compare
a6dac89
to
100ecec
Compare
I really appreciate the 👀 and 🧠 on this so far. Thank you! ⁂
Glad to hear it!
Ugh. Compressing archive members during upload (#295) will only go so far here, and there's other considerations to address with that too. The best thing would be to, like you say, avoid uploading the cruft in the first place. I don't want to hardcode paths here, though. Perhaps ⁂
Frankly, I'm less concerned about the code (although still appreciate 👀) and more about the interface and docs and QA aspects, so this is great. 🙌
Yeah, that's fair. "Pathogen" in the docs is short for "pathogen repository", but that felt too much of a mouthful in most instances so I ended up shortening it. That said, "pathogen repository" is the terminology we use in the rest of our documentation, so it should probably also be used here as much as possible. If that sounds good, I can make the updates. (I shy away from "pathogen install" because this project doesn't use that terminology, as you well know; see more commentary below.)
Nod. This is me sticking with the existing convention in this project. And yes, that convention was established by me 🙃 but not with this PR. Certainly this PR adds several more usages though, and maybe that tips the scales towards doing something about it sooner than later. I agree the distinction is a subtlety of English, but I'm not sure I see how "set up" vs. "setup" is all that confusing in practice as a doc reader. Can you point to places where it seems likely to cause specific confusion? It seems to me like a distinction that will be glossed over by most, which is fine. Perhaps that points to not making the distinction English does and using "setup" everywhere? Or, as I mentioned earlier in this PR, I'd be ok with some sort of migration from
It may have sailed… but this isn't the 1700s and we can send it a wireless telegram to change course while it's at sea. ;-) That's to say, I'm open to reconsidering terminology here. I don't think this feature/PR needs to block on "install" vs. "setup", though, since "setup" is already existing terminology. Discussion on terminology may be most productive synchronously. Maybe we can take it to the weekly dev chat? |
Narrowing the scope of this reply to just the terminology stuff:
TBH, the reason "pathogen repository" (and/or "pathogen repo") wasn't in my list of suggested alternatives was because I felt like that phrasing runs counter to the "this tooling is so people don't need to know about Git" imperative that seems to be partially driving the work. Perhaps we can abbreviate most usages of it as "PR", surely that won't cause confusion… 😛
That would probably assuage my concerns, assuming it doesn't make the docco feel too …something.
[snip]
Aside from the one place where I suggested correcting something that was actually a correct usage, because I'd momentarily lost track of which one was which, I can't point to any problematic places. I do note it makes reading the docs slower for me (maybe not a bad thing?) because I have to stop at each use of "set up" or "setup" and briefly figure out whether it's a thing or an action. (That might be specific to my brain, I dunno.)
Happy to do that if we need to keep talking about it! (But also sometimes, when dealing with English language usage issues, it's maybe more clear textually; spoken forms of "set up" and "setup" are even more confusable…) |
Heh. Yeah, that was also part of my "well, I'll just shorten it to pathogen" thinking. Especially since we talk about "Nextstrain pathogens" and otherwise drop the "repository" part all the time, even with external folks. That said, I think "repository" is common enough in science (e.g. sample repository, data repositories) to have enough relatable meaning outside of "version control repository". So I don't feel opposed to using it, esp. since I see we do use it more consistently in documentation.
Nod. I'm ok with slower doc reading! (But I also often find myself iteratively re-reading bits of doc to make sure I'm grasping the details/nuances/omissions/etc… and I'm ok with that, I guess? kind of a basic textual analysis.) For now, I'm inclined to keep "setup" vs. "set up" terminology as it stands (minus an error or two I spotted), with the understanding that changing it would be part of the larger "setup" vs. "install" (vs. ??????) discussion that's been simmering.
Fair! I'm ok with text too. Whatever |
…version_lax() The lax implementation is always successful at producing a Version object suitable for comparisons, even if the version isn't a valid Python package version. This makes it useful for non-Python ecosystems, such as Conda packages, and is what I want to use for pathogen versions too. Renames the strict parse_version() previously present in util to parse_version_strict() because it raises an exception on invalid versions. It is still only used internally in util to parse versions that we expect to be valid.
…ormation Preserves 1) the original version string and 2) whether parse_version_lax() found it PEP-440-compliant or not. The original version string is useful even for compliant versions, as stringifying a Version object will return a normalized representation that may not string compare identically.
They're going to be used for not just runners but also pathogens, so remove "runner" from their names.
…ners themselves Anticipates introduction of pathogens as a thing that can be setup too, which will handle their own default setting.
Adds a new command, `nextstrain run`, to run (compatible) pathogen workflows in a more managed way with easier update paths, without the need for user-facing Git, with support for multiple versions, and with support for concurrent-but-separate analyses via the same workflow. Supported by changes to - `nextstrain setup` to obtain and set up specific versions of pathogens - `nextstrain update` to keep pathogens up-to-date - `nextstrain version` to report on pathogen versions available locally At the moment, the only compatible pathogen is measles at my not-yet-finished demo/prototype branch.¹ Avian flu should not be far behind, though. There's a lot of functionality (and polish) here and elsewhere still todo to fully realize the sweeping goals of workflows-as-programs², but this is a fully-usable first piece of the puzzle that can stand on its own for now. ¹ <nextstrain/measles#55> ² <nextstrain/public#1>
100ecec
to
228fd6c
Compare
based on #419
Adds a new command,
nextstrain run
, to run (compatible) pathogen workflows workflows in a more managed way with easier update paths, without the need for user-facing Git, with support for multiple versions, and with support for concurrent-but-separate analyses via the same workflow.Supported by changes to
nextstrain setup
to obtain and set up specific versions of pathogensnextstrain update
to keep pathogens up-to-datenextstrain version
to report on pathogen versions set upTry out setting up a pathogen and running it yourself. First, install
nextstrain
built from this PR:Linux/macOS:
Windows:
At the moment, the only compatible pathogen is measles at my not-yet-finished demo/prototype branch. Avian flu should not be far behind, though.
Some commands to try:
Note that on any of the updated command documentation pages linked above you can press
d
to see a colorized diff against the latest non-PR version.There's a lot of functionality (and polish) here and elsewhere still todo to fully realize the sweeping goals of workflows-as-programs, but this is a fully-usable first piece of the puzzle that can stand on its own for now.
Checklist