You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
For obscure historic reasons we need the majority of these returned file paths to be Path objects, otherwise their existence won't be checked, and the pipeline scheduling won't occur (See here).
To use any of these file paths in Hail (reading input, writing output) we usually need to make them a String, else Hail will complain. Code examples of us doing this here, Relevant mitigation change upstream here, though it's not up for inclusion/deployment any time soon. Even if that change were adopted, we still need to explicitly make the value a String in any PythonJobs and jobs using query_comand - when a Path is printed as a Bash argument, it's printed as the full repr GSPath("gs://...), instead of gs://..., which is not a valid GCP URI.
That's all a long way of saying that the Stage outputs still need to be Paths, otherwise our existence checking won't work, but many times when we engage with them in code we need them to be Strings.
What I think it should do: Take the output of the indicated Stage, and extract a requested value as a string
What it actually does: Take the output of the indicated Stage, if it's not already a str, throw an exception. If it is a str, re-cast it as a str, then return that.
What I want is a simple helper method to request a value from a previous Stage as a String, whether it was originally a String or not. Likewise it makes sense to have an equivalent Path method.
I'm sure the reason for this current implementation is to clarify an exact return type from the various types in ExpectedResultT for linting/code inspection purposes, but a method that goes one step further and actually does type transformation would not break anything, and would improve code legibility.
Error checking, like ensuring I don't request a Dictionary as a String, makes perfect sense. What we currently have does no meaningful type transformation (if it had to, it would crash the pipeline), and just returns Strings as Strings, and Paths as Paths. IMO this is a silly holdover from the original cpg-flow/prod-pipes codebase, and we should improve it.
The logical name for a method that returns a String representation, even if the original variable was a Path would be StageOutput.as_str, so not only is our current implementation a little misleading, it's also taking up the name that a better version of the same method should have.
I'm re-making this issue here, as once we start using cpg-flow any change to this core function would be a menacing breaking change. Now is a good time to make what would be a really trivial change and have it validated at the same time as we validate the codebase as a whole.
The text was updated successfully, but these errors were encountered:
This Issue references populationgenomics/production-pipelines#611, which is almost reaching its first birthday, and still outstanding as a valid complaint.
Context
Each Stage in this framework has an expected_outputs method, which can return a number of different data types. See here.
ExpectedResultT = Union[Path, dict[str, Path], dict[str, str], str, None]
For obscure historic reasons we need the majority of these returned file paths to be
Path
objects, otherwise their existence won't be checked, and the pipeline scheduling won't occur (See here).To use any of these file paths in Hail (reading input, writing output) we usually need to make them a String, else Hail will complain. Code examples of us doing this here, Relevant mitigation change upstream here, though it's not up for inclusion/deployment any time soon. Even if that change were adopted, we still need to explicitly make the value a String in any PythonJobs and jobs using query_comand - when a Path is printed as a Bash argument, it's printed as the full repr
GSPath("gs://...)
, instead ofgs://...
, which is not a valid GCP URI.That's all a long way of saying that the Stage outputs still need to be Paths, otherwise our existence checking won't work, but many times when we engage with them in code we need them to be Strings.
Change Request
As mentioned on the previous Issue, concerning the StageOutput.as_str method:
What I want is a simple helper method to
request a value from a previous Stage as a String, whether it was originally a String or not
. Likewise it makes sense to have an equivalent Path method.I'm sure the reason for this current implementation is to clarify an exact return type from the various types in
ExpectedResultT
for linting/code inspection purposes, but a method that goes one step further and actually does type transformation would not break anything, and would improve code legibility.Error checking, like ensuring I don't request a Dictionary as a String, makes perfect sense. What we currently have does no meaningful type transformation (if it had to, it would crash the pipeline), and just returns Strings as Strings, and Paths as Paths. IMO this is a silly holdover from the original cpg-flow/prod-pipes codebase, and we should improve it.
The logical name for a method that
returns a String representation, even if the original variable was a Path
would beStageOutput.as_str
, so not only is our current implementation a little misleading, it's also taking up the name that a better version of the same method should have.Example
https://github.com/search?q=repo%3Apopulationgenomics%2Fproduction-pipelines+str%28inputs.as_path&type=code
becomes
I'm re-making this issue here, as once we start using cpg-flow any change to this core function would be a menacing breaking change. Now is a good time to make what would be a really trivial change and have it validated at the same time as we validate the codebase as a whole.
The text was updated successfully, but these errors were encountered: