Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dev #18

Merged
merged 9 commits into from
Nov 18, 2020
Merged

Dev #18

merged 9 commits into from
Nov 18, 2020

Conversation

bricoletc
Copy link
Member

This PR has two main goals:

Added

  • IntervalPartitioner object and associated classes, in interval_partitioner.py
    This object replaces the partition_alignment_into_intervals function, which was harder to understand,
    also makes it more easily to modify in order to address Stop producing empty alleles #17 and in future.
    Also added more unit tests for interval partitioning.
  • This object, when partitioning the consensus string, now spots if a non_match interval induces any empty sequence, and pads any leading match sequence to it. Unit tests to validate the behaviour, and I ran the following toy example:
>seq1
AATAATAATAAATTTTGTATAAACTTTACCCTAG
>seq2
AATAATAAT----------------TTACCCTAG
>seq3
AATAATAATAAATTTTATATAAACTTTACCCTAG

Before this PR, results in: AATAATAAT 5 AAATTTTGTATAAACT 6 AAATTTTATATAAACT 6 5 TTACCCTAG (see the empty third allele`

After this PR, results in : AATAATAA 5 TAAATTTTGTATAAACT 6 TAAATTTTATATAAACT 6 T 5 TTACCCTAG ( the T got prepended)
(max_nesting 5, min_match_len 7)

This does not mean #17 is fully resolved, i'll now run on real datasets

Modified

  • An alignment column of all '-' is no longer counted as a match. I cannot find a good reason for it to be but may be missing something?

  • Moved subcommand parsing to subcommands/ dir, results in less cluttered main()

  • get_consensus function shorter and easier to understand + add unit tests for it

  • refactored interval partition checks (checking each position of alignment is in one and one only interval,
    and switching non-match to match intervals where needed) into own functions and unit tested them.

  • -v for verbose switch added at top-level command line

  • Overall I strove to make the main module make_prg_from_msa smaller and terser, for easier reading/modifying later: went from 499 to 385 lines.

* Use generic MSA name for MSA object, defined in __init__
* Much shorter consensus string generation code, with added unit tests
  to make sure nothing broken
* Move functions to test interval consistency out of class
* Unit test them
* New object-oriented interval partitioning code
* Add unit test for small consensus string which led to a bugfix
* Make unit tests work for new interval partitioning code
* Use new interval partitioning code in make_prg code
Copy link
Member

@mbhall88 mbhall88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I can't provide a thorough review right now. I should be able to get around to it next week.

@bricoletc
Copy link
Member Author

Hi @mbhall88 @rmcolq , I think this should be merged to master. I can certify that it worked for my gramtools paper analysis, producing PRGs where I found, genotyping on them with gramtools, good concordance with PacBio truth assemblies on two different datasets (TB, Pfalciparum)

Copy link
Member

@mbhall88 mbhall88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't say I understand all the changes, but sounds like your testing has been thorough.

Nice changes from a software engineering point of view. Nice and readable. Should make any future work much easier to do!

self.stop += right_delta

def contains(self, position: int):
return self.start <= position <= self.stop
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be nice to add a docstring mentioning what sort of intervals this class stores. i.e. half-open, open, closed. It seems like this interval is a closed interval?
Also, just mentioning this incase you aren't aware of it: https://github.com/chaimleib/intervaltree
I use this for all interval work and it is super nice. I'm not suggesting you replace the class here or anything. Just putting it on your radar if it isn't already in case there is need for it down the line.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, i added a docstring

make_prg/interval_partition.py Outdated Show resolved Hide resolved
return IntervalType.NonMatch
else:
return IntervalType.Match
@classmethod
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants