seqtk cutN tolerates non-N bases in its output #18

Open
charles-plessy opened this issue Oct 9, 2024 · 0 comments

Labels
bug Something isn't working

charles-plessy commented Oct 9, 2024

Description of the bug

I just figured out that seqtk cutN tolerates non-N bases in the ranges it reports. In the example below, it reports the range 229-610, which is not made exclusively of Ns. Worse, it reports overlapping ranges (229-610 and 229-665), which confuses downstream tools. Setting a high penalty with the -p option apparently solves the problem.

Because I misunderstood how seqtk cutN works, the regions marked in pink in the dotplots may be overly broad. To resolve this issue, I need to either:

  • set -p to a value that I know is always high enough, or
  • replace seqtk with an awk script.
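
A minimal sketch of the second option, assuming that only literal N or n characters count as gaps and that each sequence fits in memory. The function name n_runs_bed is hypothetical. It prints 0-based, half-open ranges, which is what the seqtk cutN -g output in this report appears to use, and by construction every range contains nothing but Ns.

```shell
# Hypothetical helper (not part of seqtk): report maximal runs of N/n
# in a FASTA file as tab-separated "name start end" lines.
n_runs_bed() {
    awk '
        # Scan the accumulated sequence and print one line per N run.
        # pos and rest are local variables (extra parameters).
        function report_runs(pos, rest) {
            if (name == "") return
            pos = 0
            rest = seq
            while (match(rest, /[Nn]+/)) {
                printf "%s\t%d\t%d\n", name, pos + RSTART - 1, pos + RSTART - 1 + RLENGTH
                pos += RSTART - 1 + RLENGTH
                rest = substr(rest, RSTART + RLENGTH)
            }
        }
        # Header line: flush the previous record, keep the first word as name.
        /^>/ { report_runs(); name = substr($1, 2); seq = ""; next }
        # Sequence line: join wrapped lines so runs can span line breaks.
             { seq = seq $0 }
        END  { report_runs() }
    ' "$@"
}
```

It would be called as n_runs_bed test.fa. The repeated substr() calls can become slow on very long sequences with many gaps, so this is an illustration rather than a tuned replacement.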

Command used and terminal output



cat > test.fa <<__END__
>test
TNNNNNNNNNNNNNNNNTTATTTAAGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGCACTTTTAATTNN
NNNNNNNNNNNCTATTTAATCCTTCTTTTTCTTTAATCTTAAAATTATCNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNTTATANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTNNNNNNNNNNNNNNNNTAAGATT
TATANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNTTATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNATNNNNNNNNNNNNNNNNNNNATTNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNAGCTCTTTTTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNTAAAANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAATA
ATTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
__END__
$ seqtk cutN -g -n 1 test.fa 
test	1	17
test	26	166
test	178	191
test	229	610
test	612	631
test	229	665
test	675	740
test	745	776
test	783	831
$ seqtk cutN -g -n 1 -p 10000 test.fa 
test	1	17
test	26	166
test	178	191
test	229	304
test	309	396
test	397	413
test	424	495
test	499	610
test	612	631
test	634	665
test	675	740
test	745	776
test	783	831
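
The overlap in the first output can be detected mechanically. The sketch below, with a hypothetical helper name bed_overlaps, prints each line whose start lies before the furthest end seen so far on the same sequence. It assumes tab-separated three-column input as printed above and that all lines for one sequence are contiguous.

```shell
# Hypothetical helper: print every range that starts before the furthest
# end already seen on the same sequence, i.e. that overlaps an earlier range.
bed_overlaps() {
    awk -F '\t' '
        # Same sequence and start inside an earlier range: report the line.
        $1 == prev_name && ($2 + 0) < prev_end { print }
        {
            # Track the furthest end per sequence; reset on a new name.
            if ($1 != prev_name || ($3 + 0) > prev_end) prev_end = $3 + 0
            prev_name = $1
        }
    ' "$@"
}
```

Applied to the first output above, it prints the single line for 229-665. Applied to the -p 10000 output, it prints nothing. With no file argument it reads standard input, so seqtk cutN output can be piped straight into it.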
charles-plessy added the bug (Something isn't working) label Oct 9, 2024
charles-plessy added a commit to oist/LuscombeU_stlrepeatmask that referenced this issue Oct 9, 2024
See <nf-core/pairgenomealign#18> for details.

After this change, sorting the ranges should no longer be needed.
(The need to sort was a symptom of this issue, which the sorting
command was sweeping under the carpet without my realising it.)
charles-plessy added a commit that referenced this issue Oct 15, 2024
Added a 'target' parameter
@charles-plessy charles-plessy added this to the 2.0.0 milestone Dec 19, 2024