Use median instead of average values if possible #119

Open
mmokrejs opened this issue Jan 18, 2022 · 3 comments · Fixed by #120

Comments

@mmokrejs

mmokrejs commented Jan 18, 2022

Hi,
I came across your project by chance and wonder why you keep track of Average read depth and Observed insert size. The former would be better replaced with Median read depth, and the latter should probably be called Outer mate median distance instead? That depends on the tool used to analyze the data. The model seems too Illumina-oriented. How will this work for PacBio and Oxford Nanopore sequencing projects? And for older Roche/454 and IonTorrent-based projects, which used totally different library preparation protocols (RF vs. FR read orientations, etc.)? Likewise, Sanger-based genome sequencing?

@joerivandervelde
Collaborator

Hello and thanks so much for your feedback! You are right, Median read depth is a more robust quality metric than Average read depth because the latter may be inflated by extreme outliers. The model will be updated soon. There is indeed an Illumina bias in the model, because Illumina is currently the predominant vendor, at least in The Netherlands. Your help in resolving this is most appreciated. Is Outer mate median distance a more generic term than Observed insert size (i.e. can it be used for the same situations and more, rather than for different ones)? If so, we could replace the term. If not, we should probably introduce a new ontology term. Could you perhaps provide a definition for Outer mate median distance, similar to the one we have for Observed insert size, which is:

In paired-end sequencing, the DNA between the adapter sequences is the insert. The length of this sequence is known as the insert size, not to be confused with the inner distance between reads. So, fragment length equals read adapter length (2x) plus insert size, and insert size equals read length (2x) plus inner distance.
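The two relations in this definition can be written out as a small worked example. This is only a sketch: the function names and the numbers (150 bp reads, 100 bp inner distance, 60 bp adapters) are made up for illustration and do not come from the model.

```python
def insert_size(read_length: int, inner_distance: int) -> int:
    """Insert size = 2 x read length + inner distance."""
    return 2 * read_length + inner_distance


def fragment_length(adapter_length: int, insert: int) -> int:
    """Fragment length = 2 x adapter length + insert size."""
    return 2 * adapter_length + insert


# Hypothetical library: 2 x 150 bp reads, 100 bp between the reads,
# and 60 bp of adapter on each end.
ins = insert_size(read_length=150, inner_distance=100)
frag = fragment_length(adapter_length=60, insert=ins)
print(ins, frag)

# When the two reads overlap, the inner distance is negative and the
# insert is shorter than twice the read length.
overlapping = insert_size(read_length=150, inner_distance=-50)
print(overlapping)
```

Under these assumed numbers the insert is 400 bp and the fragment 520 bp, which shows why a reported "insert size" is ambiguous unless it is clear which of the three lengths a tool measured.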

joerivandervelde added a commit to joerivandervelde/fairgenomes-semantic-model that referenced this issue Jan 24, 2022
joerivandervelde added the Modeling (An issue related to modeling the schema) and Enhancement (New feature or request) labels Jan 24, 2022
@mmokrejs
Author

mmokrejs commented Jan 28, 2022

Hi @joerivandervelde, I am sorry, things are stacking up in my mailbox ...

OK, so you fixed the first part already, the read depth-related calculation.
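As a quick illustration of why the median is the safer summary here, consider the sketch below. The per-position depths are invented for the example (a single collapsed-repeat-like spike among otherwise uniform coverage), not taken from any real run.

```python
from statistics import mean, median

# Hypothetical per-position read depths: nine ordinary positions
# and one extreme outlier.
depths = [30, 31, 29, 32, 30, 28, 31, 30, 29, 5000]

avg_depth = mean(depths)    # pulled far upward by the single outlier
med_depth = median(depths)  # stays at the typical coverage

print(avg_depth, med_depth)
```

With these made-up values the average is 527 while the median stays at 30, so one outlier position is enough to make the average useless as a quality metric.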

https://thegenomefactory.blogspot.com/2013/08/paired-end-read-confusion-library.html
https://www.biostars.org/p/411012/
https://www.researchgate.net/publication/260170597_Assessment_of_Insert_Sizes_and_Adapter_Content_in_Fastq_Data_from_NexteraXT_Libraries/figures?lo=1

The field you keep in the ontology should describe how the sequencing library was prepared and how long the DNA fragments were, on average or, better, as a median. Unfortunately, people tend to distinguish fragment size from insert size, depending on whether adapters have already been added or not.

In practice, different software tools calculate either the outer or the inner distance. I assume the goal of the catalogue is either to collect either of the two values or to gently push users to recalculate a single, intended value using the right software.

See https://broadinstitute.github.io/picard/picard-metric-definitions.html#InsertSizeMetrics
https://gatk.broadinstitute.org/hc/en-us/articles/360037225252-CollectInsertSizeMetrics-Picard-

In other words, this annotation term is supposedly about samtools' 0x100 SAM_TLEN flag, which shows up in column 9 (TLEN) of SAM-formatted output.
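To make that concrete, here is a minimal sketch of deriving a median template length from column 9 of SAM text, using only the Python standard library. The SAM records and the helper name `median_tlen` are hypothetical, and the filtering is deliberately simplistic: it keeps reads flagged as properly paired (FLAG bit 0x2) and counts each pair once by taking only the mate with a positive TLEN. A real pipeline would also exclude secondary and supplementary alignments, or simply use Picard CollectInsertSizeMetrics.

```python
from statistics import median

# Hypothetical SAM records (tab-separated); SEQ and QUAL are "*" for brevity.
SAM_TEXT = (
    "@HD\tVN:1.6\tSO:unsorted\n"
    "r1\t99\tchr1\t100\t60\t100M\t=\t300\t300\t*\t*\n"
    "r1\t147\tchr1\t300\t60\t100M\t=\t100\t-300\t*\t*\n"
    "r2\t99\tchr1\t500\t60\t100M\t=\t720\t320\t*\t*\n"
    "r2\t147\tchr1\t720\t60\t100M\t=\t500\t-320\t*\t*\n"
    "r3\t99\tchr1\t900\t60\t100M\t=\t1110\t310\t*\t*\n"
    "r3\t147\tchr1\t1110\t60\t100M\t=\t900\t-310\t*\t*\n"
    "r4\t99\tchr1\t2000\t60\t100M\t=\t6900\t5000\t*\t*\n"
    "r4\t147\tchr1\t6900\t60\t100M\t=\t2000\t-5000\t*\t*\n"
    "r5\t65\tchr1\t8000\t60\t100M\tchr2\t50\t0\t*\t*\n"
)

FLAG_PROPER_PAIR = 0x2  # bit in the FLAG column (column 2)


def median_tlen(sam_text: str):
    """Median of positive TLEN values (column 9) over properly paired reads."""
    tlens = []
    for line in sam_text.splitlines():
        if not line or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        flag = int(fields[1])
        tlen = int(fields[8])  # column 9, zero-based index 8
        if not flag & FLAG_PROPER_PAIR:
            continue
        if tlen > 0:  # the leftmost mate carries the positive value
            tlens.append(tlen)
    return median(tlens) if tlens else None


result = median_tlen(SAM_TEXT)
print(result)
```

Note that TLEN spans the outer coordinates of the mapped pair, so this is an outer distance; the single 5000 bp pair in the toy data barely moves the median.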

While at it, you probably also want to add terms for https://broadinstitute.github.io/picard/picard-metric-definitions.html#JumpingLibraryMetrics .

@joerivandervelde
Collaborator

Hello @mmokrejs, we've been having internal discussions on how to tackle this but haven't quite sorted it out. Could you perhaps clarify the change you are proposing? If metrics are not generic across all sequencing platforms, we might also consider modeling it something like this. Would that make sense?

  • ReadQuality (A broadly applicable metric for all platforms) options: [worst / below avg / avg / above avg / best]
  • ReadQualityDefinition (describe how the options map to your quality cutoffs, e.g. insert size 400-500 is best)
  • NGSMetricAvailable (A long option list of all NGS metrics that are measured and stored locally)
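
As a rough sketch of how these three fields could fit together, written in Python purely for illustration: the class name `SequencingQualityRecord`, the snake_case attribute names, and the example values are hypothetical and not part of the actual model.

```python
from dataclasses import dataclass, field
from enum import Enum


class ReadQuality(Enum):
    """The five proposed options, applicable to any platform."""
    WORST = "worst"
    BELOW_AVG = "below avg"
    AVG = "avg"
    ABOVE_AVG = "above avg"
    BEST = "best"


@dataclass
class SequencingQualityRecord:
    # Platform-independent quality category.
    read_quality: ReadQuality
    # Free text describing how local cutoffs map to the categories.
    read_quality_definition: str
    # Names of the NGS metrics that are measured and stored locally.
    ngs_metrics_available: list[str] = field(default_factory=list)


record = SequencingQualityRecord(
    read_quality=ReadQuality.BEST,
    read_quality_definition="insert size 400-500 is best",
    ngs_metrics_available=["Median read depth", "Observed insert size"],
)
print(record.read_quality.value, record.ngs_metrics_available)
```

The idea would be that the category stays comparable across platforms, while the platform-specific numbers remain with the data holder and are only advertised by name.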
