Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Target.as_X methods could be improved #62

Open
MattWellie opened this issue Jan 24, 2025 · 0 comments
Open

Target.as_X methods could be improved #62

MattWellie opened this issue Jan 24, 2025 · 0 comments
Assignees
Labels
enhancement New feature or request

Comments

@MattWellie
Copy link
Contributor

MattWellie commented Jan 24, 2025

This Issue references populationgenomics/production-pipelines#611, which is almost reaching its first birthday, and still outstanding as a valid complaint.

Context

Each Stage in this framework has an expected_outputs method, which can return a number of different data types. See here.

ExpectedResultT = Union[Path, dict[str, Path], dict[str, str], str, None]

For obscure historic reasons we need the majority of these returned file paths to be Path objects, otherwise their existence won't be checked, and the pipeline scheduling won't occur (See here).

To use any of these file paths in Hail (reading input, writing output) we usually need to make them a String, else Hail will complain. Code examples of us doing this here, Relevant mitigation change upstream here, though it's not up for inclusion/deployment any time soon. Even if that change were adopted, we still need to explicitly make the value a String in any PythonJobs and jobs using query_comand - when a Path is printed as a Bash argument, it's printed as the full repr GSPath("gs://...), instead of gs://..., which is not a valid GCP URI.

That's all a long way of saying that the Stage outputs still need to be Paths, otherwise our existence checking won't work, but many times when we engage with them in code we need them to be Strings.

Change Request

As mentioned on the previous Issue, concerning the StageOutput.as_str method:

What I think it should do: Take the output of the indicated Stage, and extract a requested value as a string

What it actually does: Take the output of the indicated Stage, if it's not already a str, throw an exception. If it is a str, re-cast it as a str, then return that.

What I want is a simple helper method to request a value from a previous Stage as a String, whether it was originally a String or not. Likewise it makes sense to have an equivalent Path method.

I'm sure the reason for this current implementation is to clarify an exact return type from the various types in ExpectedResultT for linting/code inspection purposes, but a method that goes one step further and actually does type transformation would not break anything, and would improve code legibility.

Error checking, like ensuring I don't request a Dictionary as a String, makes perfect sense. What we currently have does no meaningful type transformation (if it had to, it would crash the pipeline), and just returns Strings as Strings, and Paths as Paths. IMO this is a silly holdover from the original cpg-flow/prod-pipes codebase, and we should improve it.

The logical name for a method that returns a String representation, even if the original variable was a Path would be StageOutput.as_str, so not only is our current implementation a little misleading, it's also taking up the name that a better version of the same method should have.

Example

https://github.com/search?q=repo%3Apopulationgenomics%2Fproduction-pipelines+str%28inputs.as_path&type=code

var_file = get_batch().read_input(str(inputs.as_path(mc, CopyLatestClinvarFiles, 'variant_file')))
sub_file = get_batch().read_input(str(inputs.as_path(mc, CopyLatestClinvarFiles, 'submission_file')))

becomes

var_file = get_batch().read_input(inputs.as_str(mc, CopyLatestClinvarFiles, 'variant_file'))
sub_file = get_batch().read_input(inputs.as_str(mc, CopyLatestClinvarFiles, 'submission_file'))

I'm re-making this issue here, as once we start using cpg-flow any change to this core function would be a menacing breaking change. Now is a good time to make what would be a really trivial change and have it validated at the same time as we validate the codebase as a whole.

@MattWellie MattWellie added the enhancement New feature or request label Jan 24, 2025
@MattWellie MattWellie changed the title Target.as_X methods are rubbish Target.as_X methods could be improved Jan 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants