feat(Summary workflow): Utils script/module to filter and convert ground truth TPM data to be compatible with relative quantification benchmark #301

Open
faricazjj opened this issue Jun 12, 2022 · 3 comments · Fixed by #414
@faricazjj
Collaborator

We're adding a relative expression output to the quantification challenge. Because the quantification challenge originally output TPM values, the ground truth data is also in TPM. We therefore need to convert the ground truth data from TPM to relative expression so that it is compatible with the new quantification challenge output.

Depends on relative expression implementation: #277

Estimate: 4h

@ninsch3000
Collaborator

Might this be a task for @mrgazzara when he's already busy with filtering the ground truth?

@SamBryce-Smith SamBryce-Smith self-assigned this Aug 26, 2022
@SamBryce-Smith SamBryce-Smith changed the title Convert ground truth TPM data to be compatible with new relative expression quantification metric Summary workflow: Utils script/module to filter and convert ground truth TPM data to be compatible with relative quantification benchmark Aug 26, 2022
@SamBryce-Smith
Collaborator

SamBryce-Smith commented Aug 26, 2022

I'm co-opting this issue to prevent duplication. Its purpose is now to track the implementation of a utils script/module that prepares ground truth files for the relative quantification benchmark/summary workflow.

Related to proof-of-concept implementation - #399

The plan is to generally follow Joseph's blueprint (TODO: add link) for filtering the ground truth to two representative sites overlapping terminal exons. On top of that, I propose some further details to address challenges I encountered in the proof-of-concept #399 and in discussions with @mrgazzara.

Partially blocked by #413 - defining a 'terminal exon ID'. For now we're going to assume we're following my proposed definition.

This script/module will take as input:

  • Reference GTF of gene/transcript models
  • Ground truth BED file of polyA sites
  • Parameter 'min_total_expr_frac': the minimum fraction of the total expression of all sites on a terminal exon that the two highest expressed PAS must account for in order for the pair to be retained. Joseph suggested 80 % / 90 % as examples.
  • Parameter 'min_frac_site': the minimum fractional expression of an individual site, i.e. a site must have a usage of >= x % (e.g. 5 %).
  • Parameter 'window_size': the window size used to match ground truth and predicted polyA sites.
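As a rough illustration of how these inputs could be exposed on the command line, here is a minimal `argparse` sketch. The flag names, the `fraction` helper and the default values (taken from the 80 % and 5 % examples above) are assumptions for illustration only, not the interface of the actual script.

```python
import argparse


def fraction(value: str) -> float:
    """Parse a command-line value as a fraction in the closed interval [0, 1]."""
    number = float(value)
    if not 0.0 <= number <= 1.0:
        raise argparse.ArgumentTypeError(
            f"expected a value between 0 and 1, got {value}"
        )
    return number


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the five inputs listed above (hypothetical flag names)."""
    parser = argparse.ArgumentParser(
        description="Filter ground truth polyA sites for the relative quantification benchmark."
    )
    parser.add_argument("--gtf", required=True,
                        help="Reference GTF of gene/transcript models")
    parser.add_argument("--ground-truth-bed", required=True,
                        help="Ground truth BED file of polyA sites")
    # Defaults below are illustrative only; they mirror the examples in this issue.
    parser.add_argument("--min-total-expr-frac", type=fraction, default=0.8,
                        help="Minimum fraction of terminal exon expression covered by the top two PAS")
    parser.add_argument("--min-frac-site", type=fraction, default=0.05,
                        help="Minimum fractional usage of an individual site")
    parser.add_argument("--window-size", type=int, required=True,
                        help="Window size (nt) used to match ground truth and predicted PAS")
    return parser
```

Expressing the thresholds as fractions rather than percentages keeps them directly comparable to the fractional usage values computed later in the workflow.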

The workflow is more or less as follows:

  • Read in GTF file and extract terminal exons for every transcript
  • Merge overlapping terminal exons of each gene into non-redundant, union terminal exons. Keep track of gene_id and transcript IDs contributing to union terminal exon.
  • Find terminal exons with at least two overlapping ground truth polyA sites
  • Overlap ground truth polyA sites with terminal exons from the previous step
    • Report count and/or fraction of ground truth PAS excluded due to this filter
  • Compute sum of TPMs of polyA sites overlapping each terminal exon.
  • Select the two highest expressed polyA sites for each terminal exon.
  • Filter for terminal exons where the two highest expressed sites together account for at least 'min_total_expr_frac' of total PAS expression on their respective terminal exon
    • Report count and/or fraction of terminal exons excluded due to this filter
  • Filter for terminal exons where the two selected PAS are more than 'window_size' nt from one another
    • Report count and/or fraction of terminal exons excluded due to this filter
  • Calculate the fractional relative usage of each selected site, using the summed expression of the two selected (proximal and distal) PAS as the total expression
  • Check that the fractional relative usage of the minor (lower expressed) PAS is >= 'min_frac_site'
  • Annotate selected sites as proximal (1) / distal (2) according to their relative genomic position on each terminal exon
  • Construct terminal exon IDs for the ground truth (GT) and terminal exon (TE) files, and write both out as BED files
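The selection and filtering steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the code from the proof of concept: the `Site` record, the `select_site_pairs` function and the `te_id` field are hypothetical names, sites are assumed to be already assigned to a union terminal exon (the GTF parsing and exon merging steps are left out), and proximal/distal is assigned in a strand-aware way, which is one possible reading of "relative genomic position".

```python
from collections import defaultdict
from typing import NamedTuple


class Site(NamedTuple):
    te_id: str   # hypothetical ID of the union terminal exon the site overlaps
    pos: int     # genomic coordinate of the polyA site
    strand: str  # "+" or "-"
    tpm: float   # ground truth expression in TPM


def select_site_pairs(sites, min_total_expr_frac, min_frac_site, window_size):
    """Pick two representative PAS per terminal exon and report what was dropped."""
    by_exon = defaultdict(list)
    for site in sites:
        by_exon[site.te_id].append(site)

    kept = {}
    dropped = defaultdict(int)  # filter name -> number of terminal exons excluded
    for te_id, group in by_exon.items():
        if len(group) < 2:
            dropped["fewer_than_two_sites"] += 1
            continue
        total = sum(s.tpm for s in group)
        top_two = sorted(group, key=lambda s: s.tpm, reverse=True)[:2]
        pair_total = top_two[0].tpm + top_two[1].tpm
        # The top two sites must cover enough of the exon's total PAS expression.
        if total <= 0 or pair_total / total < min_total_expr_frac:
            dropped["min_total_expr_frac"] += 1
            continue
        # Sites closer than the matching window cannot be told apart in predictions.
        if abs(top_two[0].pos - top_two[1].pos) <= window_size:
            dropped["window_size"] += 1
            continue
        # The minor site must still have a meaningful share of the pair's expression.
        if min(s.tpm for s in top_two) / pair_total < min_frac_site:
            dropped["min_frac_site"] += 1
            continue
        # Assumption: proximal (1) is the upstream site in transcript orientation.
        ordered = sorted(top_two, key=lambda s: s.pos,
                         reverse=(top_two[0].strand == "-"))
        kept[te_id] = [(site, rank, site.tpm / pair_total)
                       for rank, site in enumerate(ordered, start=1)]
    return kept, dict(dropped)
```

Each retained terminal exon maps to a list of `(site, rank, relative_usage)` tuples, where rank 1 is proximal and rank 2 is distal, and the two relative usage values sum to 1. The `dropped` counts correspond to the "report count and/or fraction" sub-steps.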

@yuukiiwa & I will be working on adapting my proof of concept into this module. I have code for most of the above steps, but it will need to be tidied up a little into functions / a script.

@SamBryce-Smith
Collaborator

See branch utils_filterPAS for a WIP implementation.

@ninsch3000 ninsch3000 changed the title Summary workflow: Utils script/module to filter and convert ground truth TPM data to be compatible with relative quantification benchmark feat(Summary workflow): Utils script/module to filter and convert ground truth TPM data to be compatible with relative quantification benchmark Aug 9, 2023