Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add subworkflow for functional enrichment analysis #7254

Open
wants to merge 74 commits into
base: master
Choose a base branch
from

Conversation

suzannejin
Copy link
Contributor

@suzannejin suzannejin commented Dec 20, 2024

PR checklist

Closes #XXX

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the module conventions in the contribution docs
  • If necessary, include test data in your PR.
  • Remove all TODO statements.
  • Emit the versions.yml file.
  • Follow the naming conventions.
  • Follow the parameters requirements.
  • Follow the input/output options guidelines.
  • Add a resource label
  • Use BioConda and BioContainers if possible to fulfil software requirements.
  • Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
    • For modules:
      • nf-core modules test <MODULE> --profile docker
      • nf-core modules test <MODULE> --profile singularity
      • nf-core modules test <MODULE> --profile conda
    • For subworkflows:
      • nf-core subworkflows test <SUBWORKFLOW> --profile docker
      • nf-core subworkflows test <SUBWORKFLOW> --profile singularity
      • nf-core subworkflows test <SUBWORKFLOW> --profile conda

suzannejin and others added 19 commits December 11, 2024 16:47
…ow locally implemented in differentialabundance.

The original code was the result of the effort from multiple collaborators:

Co-authored-by: Björn Langer <[email protected]>
Co-authored-by: Cristina Araiz <[email protected]>
Co-authored-by: Thomas Tams <[email protected]>
Co-authored-by: Breeshey Roskams-Hieter <[email protected]>
…profiler2 that works when empty channels are given as Channel.of([], []), but stop working when empty channels are given as null, as combine/join methods cannot work on null
…work. Currently it is not running through GSEA modules -> check input channel
@suzannejin
Copy link
Contributor Author

suzannejin commented Jan 20, 2025

@pinin4fjords @grst @nschcolnicov @mirpedrol @JoseEspinosa Hello! Here you have the functional analysis subworkflow! Let me know what you think, specially in terms of optional input/output that would be needed given the fact that different functional analysis tools might need different inputs and produce different outputs.

On the other hand, the gprofiler2 test always fails in the CI here when using conda, as it produces png files with different md5. However, when I run it using Gitpod with conda, it always passes. Any idea of why? @pinin4fjords @nschcolnicov

@pinin4fjords
Copy link
Member

Thanks @suzannejin ! I'll have a look closely soon.

PNGs are not snapshotable, they change across systems, and you've noted. The best you can do with those is check the name.

// here we define the input channel for the GSEA section

def criteria = multiMapCriteria { meta_input, input, meta_exp, samplesheet, featuresheet, features_id, features_symbol, meta_contrasts, variable, reference, target ->
def analysis_method = meta_input.method
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this needed if we merge meta_input which already contains the method key?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right! this was left from a previous implementation that need it explicit, but I can remove it now

Comment on lines 76 to 77
ch_gene_sets.collect(),
ch_background.collect()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is using collect() correct? won't it put everything (meta + files) into one single list?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the reason for using .collect() is that when ch_input has more than one element, but we only provide one ch_gene_sets elements, the module would still run as many times as ch_input elements.

I assume here that we are gonna only provide one gene_sets, this is likely the behavior for gprofiler2 and grea. For GSEA, the differentialabundance pipeline can take more than one gene_sets. Is there any reason for this difference? And how would you deal with this? @pinin4fjords

Copy link
Contributor Author

@suzannejin suzannejin Jan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have just replaced collect() by combining input with gene_sets before hand. This should work for either scenario :)

@suzannejin suzannejin marked this pull request as ready for review January 21, 2025 11:29
ch_input // [ meta_input, input file, method to run ]

// gene sets and background
ch_gene_sets // [ meta_gmt, gmt file ]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this necessarily need to be a GMT file?
E.g. decoupler uses weighted gene sets for PROGENy and collecTRI analyses that are typically provided as a long-form data frame. If we included deconvolution as functional analysis tool, it would typically use a signature matrix.

It really depends a bit on the scope of this subworkflow. But if the plan is to support a wide range of functional analysis tools as suggested in nf-core/differentialabundance#367, it would be good to keep this generic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, but I'm always wary of premature over-engineering. I don't object to gmt in the first instance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically my suggestion comes down to making the gene sets + background method-specific. Already the current workflow logic doesn't really care about whether it's a gmt file or not. It's only specified in the comment.

Copy link
Contributor Author

@suzannejin suzannejin Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it is possible to standardize the gene set input format for all methods, and then each module deals with the reformatting to the proper format specifically needed for the method?

Otherwise I would imagine it become confusing from the pipeline's user perspective to have to provide the input gene set with certain format depending on the method chosen, etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • The "long-form dataframe" format used by decoupler is quite universal. It can cover signature matrices, weighted and signed gene sets.
  • This is also the format that can be obtained from omnipathdb via API. Omnipathdb contains most of the commonly used signatures, such as MSigDB, GO, Dorothea, Progeny. Like that users wouldn't need to obtain the genesets themselves, but just specify the name.
  • I agree that a common format makes sense, but I might still want to couple certain signatures to certain methods and not necessarily run all-vs-all. E.g. With PROGENy signatures, I'd typically use the recommended MLM algorithm in decoupler, while with MSigDB signatures, I'd rather use GSEA.
  • Not entirely sure yet if deconvolution is in the scope of this subworkflow, but here the methods are often shipped together with a signature matrix, so they wouldn't require any input signature at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point. It would be really nice to include omnipathdb, then the coupling between signature and method can be automatized based on name.
For now, we could have a method specific gene sets channel just as input channel.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not thrilled by the prospect of a multiplicity of gene_set inputs per method.

@suzannejin Maybe the gene sets should actually be part of ch_input. That way, there could be method-specific gene set files in that channel, and we wouldn't need multiple input channels

ch_background // [ meta_background, background file ]

// other - for the moment these files are only needed for GSEA
ch_contrasts // [ meta_contrast, contrast_variable, reference, target ]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(1) When providing results from differential expression analysis, a contrast would not be needed.
(2) When providing gene expression as input, differential testing is necessary that is not unlike DE analysis (it needs a model and a contrast definition).

IMO it would make sense to keep differential testing entirely out of this workflow. In case (1), it's not needed, and in case (2), a sample x signature matrix is produced. This matrix could just be fed into a differential analysis workflow (e.g. limma) again.

Like that we can keep the complexity of this subworkflow low.

Copy link
Member

@grst grst Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm aware that the gsea module does support providing a contrast directly when using gene expression data. However, I believe that there's no point in using this mode.

In this case, GSEA anyway just computes a metric (signal2noise, t-statistic, ...) based on these variables (see docs), so we can as well provide a fold change or DE-statistic directly, while having the advantage that we can provide a full model definition including covariates to the DE method.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm aware that the gsea module does support providing a contrast directly when using gene expression data. However, I believe that there's no point in using this mode.

Fair point, but it's what diff. ab. does right now. I'd recognised the possbility of a switch, but hadn't got round to it: nf-core/differentialabundance#36

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point that we could model the current matrix-driven GSEA as part of the differential subworkflow, alongside LIMMA et al though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted the current subworkflow to be able to produce the exact behaviour of how the modules are used in the DA pipeline, hence GSEA is taking gene expression data instead of DE. Not sure why this is mode is chosen for the pipeline, maybe @pinin4fjords can step in here?

But yes, I do agree that it would be conceptually cleaner if the subworkflow just takes in DE output, and so for GSEA.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be in favor of taking this chance of streamlining the workflow now that we are anyway changing quite a few things. We can help with implementing a module for preranked GSEA if required1.

(alternatively, the decoupler module is kind of ready, and it comes with a very fast GSEA implementation -- it should generate the same results in terms of scores and p-values, but it doesn't produce all outputs such as the "leading edge" plots).

Footnotes

  1. In a few weeks. Waiting for the contract extension with our external developers to be signed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not to be a pain, but I do like those leading edge plots, they are useful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not against a preranked GSEA module. But what I really would like to get rid of is the contrast specification in this subworkflow.

Copy link
Contributor Author

@suzannejin suzannejin Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would actually like to start integrating the subworkflows in the current DA pipeline hopefully next week.
If the switch to preranked GSEA would take some time, maybe we could first agree on the current subworkflow version? I think it could serve as a nice starting point, and you are welcome to add modifications above it afterwards

Copy link
Member

@pinin4fjords pinin4fjords left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few minor comments, it's going the right way.

IMO having GSEA in here is fine, I can imagine other methods in here in future might use them, they can be optional, and 'pre-ranked' methods can ignore them.

I do think the gene_sets should be part of the iterable input if they're going to be method-specific, to save us adding method-wise get set channels to the interface.

- meta_input:
type: map
description: Metadata map
- input:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you expand on what is meant by 'input' here please?

ch_input // [ meta_input, input file, method to run ]

// gene sets and background
ch_gene_sets // [ meta_gmt, gmt file ]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not thrilled by the prospect of a multiplicity of gene_set inputs per method.

@suzannejin Maybe the gene sets should actually be part of ch_input. That way, there could be method-specific gene set files in that channel, and we wouldn't need multiple input channels


// gsea-specific outputs
gsea_report = GSEA_GSEA.out.report_tsvs_ref
.join(GSEA_GSEA.out.report_tsvs_target)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strange indent


// tool versions
versions = ch_versions
.mix(GPROFILER2_GOST.out.versions)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strange indents

.mix(CUSTOM_TABULARTOGSEACLS.out.versions)
.mix(CUSTOM_TABULARTOGSEACHIP.out.versions)
.mix(GSEA_GSEA.out.versions)
.mix(PROPR_GREA.out.versions)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are there no emissions?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants