Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How can I use paired-end reads with singlesketch? #571

Open
ccbaumler opened this issue Jan 6, 2025 · 22 comments
Open

How can I use paired-end reads with singlesketch? #571

ccbaumler opened this issue Jan 6, 2025 · 22 comments

Comments

@ccbaumler
Copy link

I am attempting to use sourmash sketch for thousands of pair-end fastq files, but it is currently the bottleneck of the workflow. I tried doing something like:

fastp --in1 {input.r1} --in2 {input.r2} --stdout | sourmash scripts singlesketch - -p "{params.moltype},{params.k_list},scaled={params.scaled}" -o {output.sig} --name '{wildcards.sample}'

This has resulted in sketches of 1 hash:

== This is sourmash version 4.8.11. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

** loading from 'sigs/SRR5665160.trim.zip'
path filetype: ZipFileLinearIndex
location: /group/ctbrowngrp4/2024-ccbaumler-crc/sigs/SRR5665160.trim.zip
is database? yes
has manifest? yes
num signatures: 3
** examining manifest...
total hashes: 3
summary of sketches:
   1 sketches with DNA, k=21, scaled=1000, abund      1 total hashes
   1 sketches with DNA, k=31, scaled=1000, abund      1 total hashes
   1 sketches with DNA, k=51, scaled=1000, abund      1 total hashes

Can I use singlesketch with paired end reads in another way?

@ctb
Copy link
Collaborator

ctb commented Jan 6, 2025

I am skeptical that singlesketch takes stdin - @bluegenes?

What happens when you put the reads in an intermediate file and then sketch that file?

@ctb
Copy link
Collaborator

ctb commented Jan 6, 2025

feels like #535 might help out here.

A short term fix if you really want to avoid creating an intermediate file might be to use a named pipe - https://stackoverflow.com/questions/4113986/example-of-using-named-pipes-in-linux-shell-bash

@ctb
Copy link
Collaborator

ctb commented Jan 6, 2025

manysketch can also explicitly take in paired end, and you can do it one pair at a time.

@ctb
Copy link
Collaborator

ctb commented Jan 6, 2025

(that is, you can create a single-row CSV for each pair of files and it will work just fine)

@ccbaumler
Copy link
Author

I am only concerned about the amount of total diskspace the workflow would require to run manysketch in that fashion.

@bluegenes
Copy link
Contributor

I am skeptical that singlesketch takes stdin - @bluegenes?

What happens when you put the reads in an intermediate file and then sketch that file?

The docs say yes -- @mr-eyes?

It's possible #523 broke something, but all tests are passing...

@ctb
Copy link
Collaborator

ctb commented Jan 6, 2025

if you run fastp --in1 {input.r1} --in2 {input.r2} --stdout | head what does that look like?

@bluegenes
Copy link
Contributor

Ok looking into it a little more, stdin should work, but there are no tests in place to check it. Best to confirm the file is complete with @ctb's strategy above first, then we can look into the potential issue

@ccbaumler
Copy link
Author

if you run fastp --in1 {input.r1} --in2 {input.r2} --stdout | head what does that look like?

Everything looks right to me for this command.

fastp --in1 SRR16124209_1.trim.fastq --in2 SRR16124209_2.trim.fastq --stdout | head
Streaming uncompressed interleaved reads to STDOUT...
Enable interleaved output mode for paired-end input.

@SRR16124209.1 1 length=138
AATAAGGGCTGTCCTGGTTTTCCAAACGGACGATCTTCTCCAACAAACGAAGGACATCCTCCCTCTCAGGAACATTCACATCCGCTCTCACCATCGTTATAAGTCCGTTCCAGTCGCTTCCCGAAAAATGAAAATCCC
+SRR16124209.1 1 length=138
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@SRR16124209.1 1 length=138
GGGATTTTCATTTTTCGGGAAGCGACTGGAACGGACTTATAACGATGGTGAGAGCGGATGTGAATGTTCCTGAGAGGGAGGATGTCCTTCGTTTGTTGGAGAAGATCGTCCGTTTGGAAAACCAGGACAGCCCTTATT
+SRR16124209.1 1 length=138
FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFF
@SRR16124209.2 2 length=150
GAAACATCTCTGCTCCTATGCTTTTGTTGATTACGTGAGAAAATAATGATTAATCTTCATCTTCTATGTCGACCATAATCGGCATTTGCCGCTTAATGGCTTCATGGAAGGAGATTAATGTTTCGGTACGTGCTAAGCCCACTGGTTGCA

@bluegenes
Copy link
Contributor

Ok I think needletail has parse_fastx_stdin, which should probably be used instead of parse_fastx_reader, will see about dropping in and testing

@ccbaumler
Copy link
Author

I am currently waiting on this command to finish using cat instead of fastp:

$ cat SRR16124209* | sourmash scripts singlesketch - -p "dna,abund,k=21,k=31,k=51,scaled=1000" -o SRR16124209.test.zip --name 'SRR16124209'

== This is sourmash version 4.8.8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

=> sourmash_plugin_branchwater 0.9.12; cite Irber et al., doi: 10.1101/2022.11.02.514947

params: ['dna,abund,k=21,k=31,k=51,scaled=1000']
sketching file '-' (DNA) with params 'dna,abund,k=21,k=31,k=51,scaled=1000' and name 'SRR16124209' using a single thread
Building 3 sketch types:
    DNA,k=31,scaled=1000,num=0,abund=true
    DNA,k=21,scaled=1000,num=0,abund=true
    DNA,k=51,scaled=1000,num=0,abund=true

@ccbaumler
Copy link
Author

I am finding the same result with cat outside of the workflow.

Attempt 1:

$ cat SRR16124209* | sourmash scripts singlesketch - -p "dna,abund,k=21,k=31,k=51,scaled=1000" -o SRR16124209.test.zip --name 'SRR16124209'

== This is sourmash version 4.8.8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

=> sourmash_plugin_branchwater 0.9.12; cite Irber et al., doi: 10.1101/2022.11.02.514947

params: ['dna,abund,k=21,k=31,k=51,scaled=1000']
sketching file '-' (DNA) with params 'dna,abund,k=21,k=31,k=51,scaled=1000' and name 'SRR16124209' using a single thread
Building 3 sketch types:
    DNA,k=31,scaled=1000,num=0,abund=true
    DNA,k=21,scaled=1000,num=0,abund=true
    DNA,k=51,scaled=1000,num=0,abund=true
calculated 3 signatures for 79454486 sequences in -
Writing manifest
...singlesketch is done! results in 'SRR16124209.test.zip'
$ sourmash sig summarize SRR16124209.test.zip

== This is sourmash version 4.8.8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'SRR16124209.test.zip'
path filetype: ZipFileLinearIndex
location: /group/ctbrowngrp4/2024-ccbaumler-crc/SRR16124209.test.zip
is database? yes
has manifest? yes
num signatures: 3
** examining manifest...
total hashes: 3
summary of sketches:
   1 sketches with DNA, k=21, scaled=1000, abund      1 total hashes
   1 sketches with DNA, k=31, scaled=1000, abund      1 total hashes
   1 sketches with DNA, k=51, scaled=1000, abund      1 total hashes

attempt 2:

cat fastq/SRR5665174* | sourmash scripts singlesketch - -p "dna,abund,k=21,k=31,k=51,scaled=1000" -o SRR5665174.test.zip --name 'SRR5665174'

== This is sourmash version 4.8.8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

=> sourmash_plugin_branchwater 0.9.12; cite Irber et al., doi: 10.1101/2022.11.02.514947

params: ['dna,abund,k=21,k=31,k=51,scaled=1000']
sketching file '-' (DNA) with params 'dna,abund,k=21,k=31,k=51,scaled=1000' and name 'SRR5665174' using a single thread
Building 3 sketch types:
    DNA,k=21,scaled=1000,num=0,abund=true
    DNA,k=51,scaled=1000,num=0,abund=true
    DNA,k=31,scaled=1000,num=0,abund=true
calculated 3 signatures for 6474324 sequences in -
Writing manifest
...singlesketch is done! results in 'SRR5665174.test.zip'
(branchwater) baumlerc@farm:/group/ctbrowngrp4/2024-ccbaumler-crc$ ls
SRR16124209.test.zip  SRR16124209.trim.fastq  SRR16124209_1.trim.fastq  SRR16124209_2.trim.fastq  SRR5665174.test.zip  fastp  fastq  fastqc  sigs  sra  stats  test-count.csv  validate
$ sourmash sig summarize SRR5665174.test.zip

== This is sourmash version 4.8.8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'SRR5665174.test.zip'
path filetype: ZipFileLinearIndex
location: /group/ctbrowngrp4/2024-ccbaumler-crc/SRR5665174.test.zip
is database? yes
has manifest? yes
num signatures: 3
** examining manifest...
total hashes: 3
summary of sketches:
   1 sketches with DNA, k=21, scaled=1000, abund      1 total hashes
   1 sketches with DNA, k=31, scaled=1000, abund      1 total hashes
   1 sketches with DNA, k=51, scaled=1000, abund      1 total hashes

@ccbaumler
Copy link
Author

I just tried combining all the fastq files into a single intermediate:

 cat fastq/SRR5665174* > SRR5665174.test.cat

And I am still finding only 1 hash:

$ sourmash scripts singlesketch SRR5665174.test.cat -p "abund,k=21,k=31,k=51,scaled=1000" -o SRR5665174.test.zip --name 'SRR5665174'

== This is sourmash version 4.8.8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

=> sourmash_plugin_branchwater 0.9.12; cite Irber et al., doi: 10.1101/2022.11.02.514947

params: ['abund,k=21,k=31,k=51,scaled=1000,dna']
sketching file 'SRR5665174.test.cat' (DNA) with params 'abund,k=21,k=31,k=51,scaled=1000,dna' and name 'SRR5665174' using a single thread
Building 3 sketch types:
    DNA,k=21,scaled=1000,num=0,abund=true
    DNA,k=31,scaled=1000,num=0,abund=true
    DNA,k=51,scaled=1000,num=0,abund=true
calculated 3 signatures for 6474324 sequences in SRR5665174.test.cat
Writing manifest
...singlesketch is done! results in 'SRR5665174.test.zip'
$ sourmash sig summarize SRR5665174.test.zip

== This is sourmash version 4.8.8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'SRR5665174.test.zip'
path filetype: ZipFileLinearIndex
location: /group/ctbrowngrp4/2024-ccbaumler-crc/SRR5665174.test.zip
is database? yes
has manifest? yes
num signatures: 3
** examining manifest...
total hashes: 3
summary of sketches:
   1 sketches with DNA, k=21, scaled=1000, abund      1 total hashes
   1 sketches with DNA, k=31, scaled=1000, abund      1 total hashes
   1 sketches with DNA, k=51, scaled=1000, abund      1 total hashes

@ccbaumler
Copy link
Author

When using a standard fastq file I am still getting a single hash back:

head fastq/SRR5665174_1.trim.fastq
@SRR5665174.1.1 1 length=126
TGGGATACCACCCTTGTAGTACTGGGATTCTAACCAGAGGCCATGAATCTGGTCTTGGGACACTGTCAGGTGGACAGTTTGACTGGGGCGGTCGCCTCCCAAAAAGTAACGGAGGCGCTCAAAGGT
+SRR5665174.1.1 1 length=126
::<@AEGGGEEGGGGGGG1EGGGGCGBCGGGGGFBGGGGGGGGGGGGGGGGGGDGGGFGGGGGEGGGFGGGGGGGGGGGGGGGGGEGGGGGGGGGGGG@DGGG<.CDGGEGGBGGDGGGGC=@DGG
@SRR5665174.2.1 2 length=126
GTTTAATGGTAGTTACTGATTTAATTGCAAGAAGTATATTCGCACCAACTGAGTTGCCAGTAGGTGCAGTTACAGCTCTTATAGGGGCTCCATTTTTTGCATATGTTTATTTTAAAAAGGCGAGTT
+SRR5665174.2.1 2 length=126
3<>BAFFBGGGFFCFEGGGGGGGGGGGGGGGGGGGGGCGGGGGGGGGGFEGGGCFGGGGGGGCGGGEGGGGGGGGGFFGGGEGGGGGCGGGGGG0DCBCCC0DGGED>GED>F0C=FGF<GGGF>C
@SRR5665174.3.1 3 length=126
GCATTTCGGACATTTCTTCCAGGCGGACTTCCCGGCTTCATCGCCACCGACATGGATATATTCGGACGGGAAAAGATCGATCACCTCCGCC
$ sourmash scripts singlesketch fastq/SRR5665174_1.trim.fastq -p "abund,k=21,k=31,k=51,scaled=1000" -o SRR5665174.test.zip --name 'SRR5665174'

== This is sourmash version 4.8.8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

=> sourmash_plugin_branchwater 0.9.12; cite Irber et al., doi: 10.1101/2022.11.02.514947

params: ['abund,k=21,k=31,k=51,scaled=1000,dna']
sketching file 'fastq/SRR5665174_1.trim.fastq' (DNA) with params 'abund,k=21,k=31,k=51,scaled=1000,dna' and name 'SRR5665174' using a single thread
Building 3 sketch types:
    DNA,k=51,scaled=1000,num=0,abund=true
    DNA,k=31,scaled=1000,num=0,abund=true
    DNA,k=21,scaled=1000,num=0,abund=true
calculated 3 signatures for 3237162 sequences in fastq/SRR5665174_1.trim.fastq
Writing manifest
...singlesketch is done! results in 'SRR5665174.test.zip'
$ sourmash sig summarize SRR5665174.test.zip

== This is sourmash version 4.8.8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'SRR5665174.test.zip'
path filetype: ZipFileLinearIndex
location: /group/ctbrowngrp4/2024-ccbaumler-crc/SRR5665174.test.zip
is database? yes
has manifest? yes
num signatures: 3
** examining manifest...
total hashes: 3
summary of sketches:
   1 sketches with DNA, k=21, scaled=1000, abund      1 total hashes
   1 sketches with DNA, k=31, scaled=1000, abund      1 total hashes
   1 sketches with DNA, k=51, scaled=1000, abund      1 total hashes

@bluegenes
Copy link
Contributor

@ccbaumler can you run sourmash sig describe as well on the files? I think there may be a bug with branchwater zip files and the hash count

@ccbaumler
Copy link
Author

Nice, I should have thought of that!

For the SRR5665174_1.trim.fastq file:

$ sourmash sig describe SRR5665174.test.zip

== This is sourmash version 4.8.8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

---
signature filename: /group/ctbrowngrp4/2024-ccbaumler-crc/SRR5665174.test.zip
signature: SRR5665174
source file: fastq/SRR5665174_1.trim.fastq
md5: 5bbff9a7db0ba20f601e6d861631a653
k=21 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=1
size: 29736
sum hashes: 250385
signature license: CC0

---
signature filename: /group/ctbrowngrp4/2024-ccbaumler-crc/SRR5665174.test.zip
signature: SRR5665174
source file: fastq/SRR5665174_1.trim.fastq
md5: 16bf920c440b00fa5d6df7dcdd7406e2
k=31 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=1
size: 29154
sum hashes: 212849
signature license: CC0

---
signature filename: /group/ctbrowngrp4/2024-ccbaumler-crc/SRR5665174.test.zip
signature: SRR5665174
source file: fastq/SRR5665174_1.trim.fastq
md5: d361556af1005d9abfdaaa70447adcbb
k=51 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=1
size: 25289
sum hashes: 159743
signature license: CC0

loaded 3 signatures total, from 1 files

And for the cat SRR16124209* | pipe:

$ sourmash sig describe SRR16124209.test.zip

== This is sourmash version 4.8.8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

---
signature filename: /group/ctbrowngrp4/2024-ccbaumler-crc/SRR16124209.test.zip
signature: SRR16124209
source file: -
md5: 5ebe8b3446f0889014736edbb1ed22b4
k=21 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=1
size: 456128
sum hashes: 9561725
signature license: CC0

---
signature filename: /group/ctbrowngrp4/2024-ccbaumler-crc/SRR16124209.test.zip
signature: SRR16124209
source file: -
md5: fdb00da0da0dbae93fd4ca8b0672837a
k=31 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=1
size: 486709
sum hashes: 8803512
signature license: CC0

---
signature filename: /group/ctbrowngrp4/2024-ccbaumler-crc/SRR16124209.test.zip
signature: SRR16124209
source file: -
md5: 6de1d3de23c51ac12da89ac235eef1b5
k=51 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=1
size: 506030
sum hashes: 7249279
signature license: CC0

loaded 3 signatures total, from 1 files

@bluegenes
Copy link
Contributor

yep -- those look like fine signatures, this is an issue with the manifest from singlesketch/manysketch that impacts sig summarize.

@ccbaumler
Copy link
Author

Thanks for helping me figure that out. Is there anyway of manually updating a sourmash signature manifest?

@bluegenes
Copy link
Contributor

can you unzip one of those files and use less or cat to put the SOURMASH-MANIFEST.csv lines here please so I can verify the column issues?

If you're only doing one or two zips, you could edit that file and then rezip all the files together. Not an easy fix for many

@ccbaumler
Copy link
Author

$ unzip SRR5665174.test.zip
Archive:  SRR5665174.test.zip
 extracting: signatures/5bbff9a7db0ba20f601e6d861631a653.sig.gz
 extracting: signatures/16bf920c440b00fa5d6df7dcdd7406e2.sig.gz
 extracting: signatures/d361556af1005d9abfdaaa70447adcbb.sig.gz
 extracting: SOURMASH-MANIFEST.csv
$ cat SOURMASH-MANIFEST.csv
# SOURMASH-MANIFEST-VERSION: 1.0
internal_location,md5,md5short,ksize,moltype,num,scaled,n_hashes,with_abundance,name,filename
signatures/5bbff9a7db0ba20f601e6d861631a653.sig.gz,5bbff9a7db0ba20f601e6d861631a653,5bbff9a7,21,DNA,0,1000,1,1,SRR5665174,fastq/SRR5665174_1.trim.fastq
signatures/16bf920c440b00fa5d6df7dcdd7406e2.sig.gz,16bf920c440b00fa5d6df7dcdd7406e2,16bf920c,31,DNA,0,1000,1,1,SRR5665174,fastq/SRR5665174_1.trim.fastq
signatures/d361556af1005d9abfdaaa70447adcbb.sig.gz,d361556af1005d9abfdaaa70447adcbb,d361556a,51,DNA,0,1000,1,1,SRR5665174,fastq/SRR5665174_1.trim.fastq

@bluegenes
Copy link
Contributor

ok, thinking about it, this is probably an issue for all our recent databases. @ctb it's probably not too hard to update the manifests using sourmash python layer zip foo, right?

@ctb
Copy link
Collaborator

ctb commented Jan 6, 2025

no problem at all!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants