Use median instead of average values if possible #119

mmokrejs · 2022-01-18T17:09:53Z

Hi,
I happened to look through your project and I wonder why you keep track of Average read depth and of Observed insert size. The former would be better replaced with Median read depth, and the latter should probably be called Outer mate median distance instead, although that depends on the tool used to analyze the data. The model seems too Illumina-oriented. How will this work for PacBio and Oxford Nanopore sequencing projects? And for older Roche/454 and IonTorrent-based projects, which used totally different library prep protocols (RF vs. FR read orientations, etc.)? Likewise for Sanger-based genome sequencing?
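To illustrate why the median is the safer summary here, a minimal Python sketch follows. The depth values are made up for illustration (two positions mimic a collapsed repeat that attracts far too many reads); they do not come from any real dataset.

```python
from statistics import mean, median

# Hypothetical per-position read depths. The last two values mimic a
# collapsed repeat region with extreme coverage.
depths = [28, 30, 31, 29, 32, 30, 27, 950, 1200]

avg_depth = mean(depths)    # pulled far upwards by the two outliers
med_depth = median(depths)  # stays at the typical coverage

print(f"average depth: {avg_depth:.1f}")
print(f"median depth:  {med_depth}")
```

With these numbers the average lands above 250x while the median stays at 30x, which is much closer to what most positions actually see.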

joerivandervelde · 2022-01-24T13:18:46Z

Hello and thanks so much for your feedback! You are right, Median read depth is a more robust quality metric than Average read depth because the latter may be inflated by extreme outliers. The model will be updated soon. There is indeed an Illumina bias in the model, because Illumina is currently the predominant vendor, at least in The Netherlands. Your help in resolving this is most appreciated. So, is Outer mate median distance a more generic term than Observed insert size (i.e. can it be used for the same situations and more, rather than for different ones)? If so, we could replace the term. If not, we should probably introduce a new ontology term. Could you perhaps provide a definition for the Outer mate median distance term, similar to the one that we have for Observed insert size? That definition is:

In paired-end sequencing, the DNA between the adapter sequences is the insert. The length of this sequence is known as the insert size, not to be confused with the inner distance between reads. So, fragment length equals adapter length (2x) plus insert size, and insert size equals read length (2x) plus inner distance.
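The two relations in this definition can be written out as a small Python sketch. The function names and the example lengths (150 bp reads, 100 bp inner distance, 60 bp adapters) are illustrative assumptions, not values taken from the model.

```python
def insert_size(read_length: int, inner_distance: int) -> int:
    """Insert size = 2 x read length + inner distance (per the definition above)."""
    return 2 * read_length + inner_distance


def fragment_length(adapter_length: int, insert: int) -> int:
    """Fragment length = 2 x adapter length + insert size."""
    return 2 * adapter_length + insert


# Hypothetical library: 2 x 150 bp reads separated by 100 bp,
# with an assumed 60 bp adapter on each end.
insert = insert_size(read_length=150, inner_distance=100)
fragment = fragment_length(adapter_length=60, insert=insert)
print(insert, fragment)

# When the two reads overlap, the inner distance becomes negative and the
# insert is shorter than the two read lengths combined.
overlapping = insert_size(read_length=150, inner_distance=-50)
print(overlapping)
```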

mmokrejs · 2022-01-28T17:20:06Z

Hi @joerivandervelde, I am sorry, things are stacking up in my mailbox ...

OK, so you fixed the first part already, the read depth-related calculation.

https://thegenomefactory.blogspot.com/2013/08/paired-end-read-confusion-library.html
https://www.biostars.org/p/411012/
https://www.researchgate.net/publication/260170597_Assessment_of_Insert_Sizes_and_Adapter_Content_in_Fastq_Data_from_NexteraXT_Libraries/figures?lo=1

The field you keep in the ontology should describe how the sequencing library was prepared and how long the DNA fragments were, on average or, better, as a median. Unfortunately, people tend to distinguish between fragment size and insert size, depending on whether the adapters have already been added or not.

In practice, different software tools calculate either the outer or the inner distance. I assume the goal of the catalogue is either to collect one of the two values or to gently push users to recalculate a single, intended value using the right software.

See https://broadinstitute.github.io/picard/picard-metric-definitions.html#InsertSizeMetrics
https://gatk.broadinstitute.org/hc/en-us/articles/360037225252-CollectInsertSizeMetrics-Picard-

In other words, this annotation term is presumably about the TLEN field (SAM_TLEN, 0x100 in the samtools/htslib field constants), which shows up in column 9 of SAM-formatted output.
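As a rough illustration of a median over column 9, here is a minimal Python sketch that reads SAM text lines directly. It is not how Picard computes its metrics; the function name, the filtering choices (properly paired, primary alignments, positive TLEN so each pair is counted once) and the example records are all assumptions made for this sketch.

```python
from statistics import median


def median_template_length(sam_lines):
    """Median of positive TLEN values (SAM column 9) over proper pairs."""
    lengths = []
    for line in sam_lines:
        if line.startswith("@"):           # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])              # column 2: FLAG
        tlen = int(fields[8])              # column 9: TLEN
        if not (flag & 0x2):               # keep properly paired reads only
            continue
        if flag & 0x900:                   # drop secondary/supplementary records
            continue
        if tlen > 0:                       # leftmost mate only: one value per pair
            lengths.append(tlen)
    return median(lengths) if lengths else None


# Hypothetical records: three proper pairs plus one secondary alignment.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t99\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*",
    "r1\t147\tchr1\t300\t60\t50M\t=\t100\t-250\t*\t*",
    "r2\t99\tchr1\t500\t60\t50M\t=\t760\t310\t*\t*",
    "r2\t147\tchr1\t760\t60\t50M\t=\t500\t-310\t*\t*",
    "r3\t99\tchr1\t900\t60\t50M\t=\t5850\t5000\t*\t*",
    "r3\t147\tchr1\t5850\t60\t50M\t=\t900\t-5000\t*\t*",
    "r4\t355\tchr1\t100\t0\t50M\t=\t10049\t9999\t*\t*",
]
result = median_template_length(example)
print(result)
```

The one very long pair (5000) barely moves the median, whereas it would dominate an average over the same three pairs.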

While at it, you probably also want to add terms for the jumping library metrics: https://broadinstitute.github.io/picard/picard-metric-definitions.html#JumpingLibraryMetrics

joerivandervelde · 2022-02-07T08:04:54Z

Hello @mmokrejs, we've been having internal discussions on how to tackle this but haven't quite sorted it out. Could you perhaps clarify the change you are proposing? If the metrics are not generic across all sequencing platforms, we might also consider modeling it something like this. Would that make sense?

ReadQuality (A broadly applicable metric for all platforms) options: [worst / below avg / avg / above avg / best]
ReadQualityDefinition (describe how the options map to your quality cutoffs, e.g. insert size 400-500 is best)
NGSMetricAvailable (A long option list of all NGS metrics that are measured and stored locally)

joerivandervelde added a commit to joerivandervelde/fairgenomes-semantic-model that referenced this issue Jan 24, 2022

Fixes fairgenomes#119

de715f6

joerivandervelde added the Modeling (An issue related to modeling the schema) and Enhancement (New feature or request) labels Jan 24, 2022

joerivandervelde mentioned this issue Jan 28, 2022

Fix and units #120

Merged

joerivandervelde closed this as completed in #120 Jan 28, 2022

joerivandervelde reopened this Jan 29, 2022

joerivandervelde assigned joerivandervelde and ljohansson Jan 31, 2022
