Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ambiguous path production #15

Open
bricoletc opened this issue Jun 14, 2020 · 55 comments
Open

Ambiguous path production #15

bricoletc opened this issue Jun 14, 2020 · 55 comments
Assignees

Comments

@bricoletc
Copy link
Member

I produced nested PRGs (--min_match_length=7, --max_nesting=5) from multiple sequence alignments of plasmodium genes and found that make_prg can produce ambiguous PRGs.

Here's an example I drew out by hand to understand

o

_o and _c mean site opening and closing nodes, with a numbered ID before. 7T means 7 consecutive Ts.

The path CA9T can be produced going through either sites 33/34/35 or 33/36. This can be horrible for genotyping as if the dataset has that sequence then you have to either say you can't genotype because of ambiguity, or randomly choose one of the two possibilities. The latter could cause comparison issues across datasets. For now in gramtools I've decided to bail out in these cases and null genotype, with a FILTER flag to AMBIG to signal this occurred.

@iqbal-lab
Copy link
Contributor

Is it possible to define a total order on alternate paths that spell the same sequence? God help us

@bricoletc bricoletc self-assigned this Oct 6, 2020
@bricoletc
Copy link
Member Author

Here's another example which occurs when building PRG from m.tuberculosis, sequences from 1,017 samples, variation called with cortex in genomic positions: NC_000962.3:336600-337630 and this region starts at POS 337368

157.pdf

The following sequence: GCTCCGCCGGTCCCGCCGGTCC can be spelled by following two paths:

  • 157/1, 160/1, 161/0
  • 158/0, 159/0

@bricoletc
Copy link
Member Author

At some level the problem seems to boil to excessive clustering, ie putting sequences that are very similar in different clusters.

Another example where this has clearly occurred:
209.pdf
From Pfalciparum gene DBLMSP, 2,500 input sequences

This should potentially be a single variant site enumerating all the similar sequences.

@iqbal-lab
Copy link
Contributor

These must always be horizontally offset, right?
Is it possible for this to happen with just one leve of nesting?
I guess the answer is yes, you can delete an A and then add two As, or add an A and then add an A

@iqbal-lab
Copy link
Contributor

is it possible without indels? i guess not?

@bricoletc
Copy link
Member Author

bricoletc commented Oct 6, 2020

The kmer counting underlying the clustering ignores alignment gaps- if that's what you're referring to?

Ie, gaps don't change anything: AAA-AAA and AA-AAAA are the same for clustering (at the moment)

@iqbal-lab
Copy link
Contributor

Doh, of course

@iqbal-lab
Copy link
Contributor

Rachel says couyld you please put the MSAs that generated these examples in?

@bricoletc
Copy link
Member Author

Sure, i will need to extract them, and reproduce the graphs- the images are subgraphs from larger PRGs.

I will do this later this week, and also think we can add tests and some work to start alleviating this, if we can clearly define what's desirable and undesirable in these situations

@iqbal-lab
Copy link
Contributor

I will propose that the only thing you need to avoid is having two different source/sink paths which spell the same sequence.

@leoisl
Copy link
Collaborator

leoisl commented Oct 19, 2020

I took a closer look at the first example, and I think this is hard to solve for all cases. We might make some improvements here and there, based on some edge cases we find by hand, and it might work for these 3 examples, but this is just a very small number of cases, we won't be sure if it improved overall. I think this is due to a misclustering of sequences, but this is unavoidable as the k-means clustering process is very heuristic, and it might undercluster or overcluster (it is really hard to set the correct inertia parameter for every cluster, if there is one). So, I think that PRGs having ambiguous paths will unavoidably happen. Having this heuristic component also makes it hard to have any algorithm that will work on all situations (I think that whatever algorithm we think about, we can't have a proof of correctness, and we will always be able to develop an edge case in such a way that a misclustering of sequence happens, and this induces ambiguous paths in the PRG).

So, we could think of post-processing algorithms to remove some edges/nodes from the PRG in order to keep a sort of path-spelling uniqueness property. That is not possible neither, I am afraid. In the first graph, it is clear we can remove the CA9T node from the PRG, as it is a redundant node: if we remove it, we can still generate all sequences that we could generate before, and without duplication. So we lose no information, and now we have path-spelling uniqueness. However, it is easy to modify this example in such a way that the CA9T node can't be removed without changing the set of paths that the PRG spells (i.e. add a sequence that can only be generated if going through the CA9T node), and this is a counter-example showing that if a PRG has ambiguous paths, we can't remove nodes or edges in order to break this ambiguity (I don't think breaking ambiguity on the cost of losing paths is worth it).

It seems to me that in some edge cases, we can have path ambiguity in the PRG, and it will be responsibility of the downstream algorithms to be aware and take care of this.

Of course I could be wrong on any of the statements above, and as in bioinformatics, maybe we don't need a correct algorithm, but one that works on our cases (and hope they work on other)...

@leoisl
Copy link
Collaborator

leoisl commented Oct 19, 2020

PS: I am not advocating we replace k-means clustering. I think it makes total sense in make_prg, and we should use it as we are using now

@bricoletc
Copy link
Member Author

Thanks for pitching in @leoisl , I have two thoughts:
i) whatever the upstream construction algorithm, downstream application needs some kind of awareness of path ambiguity if we know it can happen. in gramtools if i detect it at genotyping, I null all the calls in the path and set a "AMBIG" filter
ii) i'm not sure we can rule out being able to guarantee that all paths in produced graph spell different sequences. even if we cannot make that guarantee, i think the current clustering can be improved so as to reduce the average similarity of sequences in different branches

@iqbal-lab
Copy link
Contributor

i think it's an interaction of

  • the MSA (can we rely on the MSA algorithm to align the same sequence the same way on two different lines?)
  • the kmer-size for collapse k1
  • the kmer size for clustering k2
  • indels and local small scale indels shorter than k2.
  • nesting
    So i wonder if
    a) it might be possible to detect that there is >=1 indel and local sequence identity and decide to stop nesting further.
    b) i think i need to see a bunch of examples
    for your initial example @bricoletc i think there must be some error. you say

The path CA9T can be produced going through either sites 33/34/35 or 33/36.

I don't see how you can get CAT from 33/34/35, you can get CATTTC

@leoisl
Copy link
Collaborator

leoisl commented Oct 19, 2020

i) whatever the upstream construction algorithm, downstream application needs some kind of awareness of path ambiguity if we know it can happen. in gramtools if i detect it at genotyping, I null all the calls in the path and set a "AMBIG" filter

That is interesting, we have no such things for pandora;

ii) i'm not sure we can rule out being able to guarantee that all paths in produced graph spell different sequences. even if we cannot make that guarantee, i think the current clustering can be improved so as to reduce the average similarity of sequences in different branches

Yeah, I think if I introduce some more formalisation, I can express better what I meant before. Consider the following problem:

Given a MSA M, a max nesting n, a min_match_length l, build a PRG that contains no ambiguous path

I have a feeling that this problem is NP-hard. But feeling does not mean much, a proof would be better... It might help to think about a related, but way simpler decision problem:

Given a labelled DAG G, does G have ambiguous paths?

I also have the feeling this previous problem is hard (i.e. NP-complete), so it is very unlikely we have a polynomial-time algorithm for it. But still no proof...

So I don't have much :p, we need to invest more time to come up with a proof if we want...

But in the end, we can look more at the practical side instead of the theoretical one... unless we have an idea for a polynomial-time algorithm, then the theoretical side is important. We have a trivial algorithm for the last problem if we go exponential time (i.e. list all paths, and check if we have duplicated paths), but I don't think it is a good idea... It will probably be very slow for gramtools PRGs (because they are whole genome, and we will have an exponentially large number of paths), and pandora has also some complex local graphs...

I guess the best direction would be the one you just mentioned, where we can't make the guarantee, but we make a best effort to reduce the average similarity of sequences in different branches... Zam's last reply also lean towards a more heuristic/data-oriented solution

@iqbal-lab
Copy link
Contributor

i actually think you might fix it by using smaller kmers in the clustering

@leoisl
Copy link
Collaborator

leoisl commented Oct 19, 2020

The kmer size on clustering is 7, by default... Reducing might solve some cases, but then it might bug some other cases that were not bugged, I guess...

@iqbal-lab
Copy link
Contributor

yes, all changes may cause bugs. but smaller kmers for clustering would detect small repeats.

@iqbal-lab
Copy link
Contributor

but i think we need a list of 5 or 10 examples of where it happens, with their MSAs

@leoisl
Copy link
Collaborator

leoisl commented Oct 19, 2020

yeah, smaller kmers also cause a lot more repeated kmers (e.g. we have 16384 7-mers, but 256 4-mers)... but the k-means clustering algorithm do take the k-mer multiplicity into account, but I have no idea what is the impact of reducing/increase k-mer size (although it seems to me k=7 is already very small, IDK anything using such small k, but for this particular problem, it can't also be large)... We could try variable k-mer size...

As we are at it, we could also contemplate other ways to measure distance between sequences, or to obtain a description of sequences. Things like doing all pairs edit distance, or some clustering based on edit distance/sequence similarity, so that we are not so sensitive to indels/SNPs (which kmers are a lot), but this might change a lot the PRGs...

@iqbal-lab
Copy link
Contributor

We need examples before we speculate - intuition from data and then move to possible solutions.
one issue is that we drop a vertical slice down a matching kmer, marked KKKKK, but because of the msa and indels, we might have

record 1: aaa KKKK bbbb
record 2. aa KKKK abbbb

then clustering on either side of the K's is missing the issue that the same sequence is split on either side of the K's.

@bricoletc
Copy link
Member Author

Interesting- though i think the problem of creating same seq in two paths that cross multiple var sites is different from path repeats (lets call them something? like path repeats) occurring within a single (nested) var site. The latter would seem more workable by modifying the clustering only

@iqbal-lab
Copy link
Contributor

You are right, but at the level of make-prg there is no concept of sites, apart from where you drop vertical slices of matches, separating regions within which there may be multiple sites.

@bricoletc
Copy link
Member Author

Here's another example which occurs when building PRG from m.tuberculosis, sequences from 1,017 samples, variation called with cortex in genomic positions: NC_000962.3:336600-337630 and this region starts at POS 337368

157.pdf

The following sequence: GCTCCGCCGGTCCCGCCGGTCC can be spelled by following two paths:

  • 157/1, 160/1, 161/0
  • 158/0, 159/0

Here's the MSA causing this path repeat containing graph (as per @rmcolq request):

(PS: this thread is becoming really long and unreadable, dont know if there's a solution to this)

>ref
gctccgccggtcccgccggtcc
>sample_48
gctccgccgggcccgccggtcc
>sample_84
tctccgccggtcccgccggtcc
>sample_331
gctcagccggtcccgccggtcc
>sample_347
gctccgccggtcccaccggtcc
>sample_385
gctccgccggtaccgccggtcc
>sample_450
gctccgctggtcccgccggtcc
>sample_577
gctccgccggtcccgctggtcc
>sample_796
gctccgccggtcccgccggtct
>sample_870
gctccgccggtcccgcctgtcc
>sample_952
gctccgccggtcctgccggtcc

@leoisl @iqbal-lab can we establish what we would like this to be constructed as?
No indels, kmer count matrix is i)Sparse (lots of 0 counts) ii)pretty 'uniform' (sequences are really quite similar, looks like on average one SNP apart)

Intuitively I might say we should not cluster and have one allele per sequence.

@iqbal-lab
Copy link
Contributor

Er..wow. @bricoletc what k size did you use for collapsing?

@bricoletc
Copy link
Member Author

bricoletc commented Oct 23, 2020

The default of 7; currently min_match_length is also the kmer size used for clustering. Could be using a kmer a bit smaller than that param would be advantageous 🤷

@iqbal-lab
Copy link
Contributor

I agree about not clustering here "ideally". Still trying to understand. On phone while bloody laptop installs updates

@bricoletc
Copy link
Member Author

😆
I think the key point here, is that the current algorithm will always find a grouping of sequences even if all sequences are extremely similar and there is no strong basis for it. This is because it only requires the inertia (sum of squared distances of data points to their centroids, ie assigned cluster centres) be halved; there is always a way of doing this and i think it mostly does not reach putting each seq in own cluster.

Here's a barebones example i've added as a failing test:

        sequences = ["CATATAAAATA", "CATATAACATA", "CATATAAGATA", "CATATAATATA"]
        seq_size = len(sequences[0])
        alignment = make_alignment(sequences)
        expected_clustering = [[record.id] for record in alignment]
        for kmer_size in range(4, 7):
            result = kmeans_cluster_seqs_in_interval(
                [0, seq_size - 1], alignment, kmer_size
            )
            self.assertEqual(expected_clustering, result)

The sequences all differ at one SNP, we probably shouldnt cluster anything right? The clustering currently returns three groups, one of which has two sequences.

@iqbal-lab
Copy link
Contributor

i think you're 100% right.

@iqbal-lab
Copy link
Contributor

iqbal-lab commented Oct 23, 2020

in case helpful, this is the data @bricoletc posted above here: #15 (comment)

IMG_20201023_091718

@iqbal-lab
Copy link
Contributor

i was still a bit confused how it is possible to get ambiguous paths in that example.

so basically what happened was 3 full alleles became their own separate paths. these ones
IMG_20201023_120304

and we got the graph you attached above:
Screenshot 2020-10-23 at 12 02 29

and as you say, because of this clustering and not collapsing directly between snps, you end up with 2 paths.

if we had not clustered, then pandora would do it's minimizer thing and i think would all be clean, and gramtools would be happy with its multiple alternate alleles.

@rmcolq
Copy link
Collaborator

rmcolq commented Oct 23, 2020 via email

@bricoletc
Copy link
Member Author

bricoletc commented Oct 23, 2020

That's exactly what I'd like to try @rmcolq ; however I think inertia is a relative measure, so exploring using edit distances (scaled by seq length). I'd be interested if you can see an 'absolute' inertia threshold being computable

@iqbal-lab
Copy link
Contributor

i think the thing i am missing, which i will periodically ponder on maybe over half-term holiday, is what the formal necessary and sufficient conditions are for there to be (or not be) two paths with the same sequence.

If there are no indels, and if the clustering guarantees that identical sequences cluster together, then i cannot see how it can happen.

Also, in the screenshot just above, you can see just below 157 that GCTCCGC is there at the start of all paths, and yet somehow that was not collapsed into a match interval.

@iqbal-lab
Copy link
Contributor

my mistake - they dont all start with GCTCCG

@iqbal-lab
Copy link
Contributor

Quick comments:

  1. Just a different perspective which i had not explicitly thought. suppose we have 10 alleles which we make an MSA from and pass into make-prg and get a graph. Fine. Now suppose we have a new allele, an 11th one, which is a recombinant of the first 10. If we add it - how should the graph change? The spiritual answer is it shouldn't change, right? We can all imagine circumstances where it might not work perfectly, but i think it's an enlightening perspective.

  2. If we are assuming we have an MSA already, then working out if all current alleles are ~the same as each other is easier, given that. Take a look at this picture - at the top of the msa i've calculated the number of bases that are ==the majority base. It's pretty clear from this all the rows are ~same. The bit in the middle , where you have 10 10 11 10 10 11 10 10 11, that's the most diverse bit, and it v much depends if all those 10s are due to the same or different samples. So you could count how often each row differs from the majority.

IMG_20201024_192113

@mbhall88
Copy link
Member

Is point 2 starting to close in on a variant of MEC? I wonder if there are some ideas we could use from WhatsHap's segregation/partitioning methods that would help think about this? We would obviously need to abstract to polyploid instead of diploid though.

@iqbal-lab
Copy link
Contributor

  1. We could conceivably reuse some data structures, but our first question (should we abandon clustering for this block /region) isn't something they think about
  2. I've been unhappy with how it's possible for this to happen without indels. Without indels, as long as two records are different, they must lead to different final alleles after traversing. So the main question is, can you get the same sequence in two different clusters? I think the answer might be yes in the following circumstance. Suppose we are at the start. We find two match intervals. Between them we have a 6bp interval which we want to cluster. So we write down the kmers and counts inside. If we use a kmer bigger than 6,then it is influenced by the bases outside. And this can lead to records which are identical in the non-match-interval being clustered apart. This is true even if the 6bp is much bigger, say 30bp,and our clustering is based on info about say 15mers, because the same thing can happen at the ends

@bricoletc
Copy link
Member Author

bricoletc commented Oct 26, 2020

@mbhall88 : as far as I'm aware, MEC in WhatsHap is used to convert a matrix of (reads x variant status) into haplotypes; here we have a set of haplotypes and want to convert to variant status. Is there a MEC formulation that does this?

@iqbal-lab: I'm glad you made this point about the majority, thats exactly what I've implemented as an attempt to fix!

I made the following algorithm:

  • Start with a set of >2 sequences to cluster and n_clusters = 2
  • For each cluster (initially, there is one with all sequences in it), compute the majority string and compute the hamming distance of each seq in the cluster to its majority string. If any seq has a distance of more than 20% its length, it is considered 'clusterable' (in the code, I say the cluster is not 'one-reference like'.
  • If any cluster is not 'one-ref like', cluster with kmeans with n_clusters, and increment n_clusters. Repeat point above.
  • If you reach too many clusters (10), stop clustering, and put each seq in its cluster (to counter pathological case: input is all possible 10-mers of DNA, avoiding algorithm clustering many many times)

Now we can make more statements about what clusters to expect:
i) In the ambiguous graph we've been looking at, no sequence is more than 0.2 away from the majority seq. No clustering
ii) If there are three groups of sequences, like this

     sequences = [
         "CCCCCCAACCT",
         "CCCCCCAATCT",
         "GGGGCGGGCCC",
         "GGGGCGGGCGC",
         "TTTAATTTTAA",
         "TTTAAGTTTAA",
     ]

The inertia-halving criterion is happy with making two clusters, because that reduces inertia sufficiently. This 'one-ref-like' is not, and only stops at three clusters; before that, there is at least one sequence > 0.2 to majority string in a cluster.

This has obvious heuristics: i) The threshold distance ii) The max num of clusterings.
I took the values of 0.2 and 10 for now, based on what homologous sequences often get thresholded as (>=80% identity i think) and how many forms of a gene we think can reasonably occur in a homologous stretch

@bricoletc
Copy link
Member Author

This is just an idea 🤷 . I thought it would make sense, after zam talked about how in the graph sequence clusters should look like they belong to different 'local references' (and what weve seen in an example in Pf). It solves all three failing cases I found when trying to trip up the inertia criterion, and seems reasonably fast to compute; something like O(nlk), for n sequences of size l in k final clusters (k up to 10)

@leoisl
Copy link
Collaborator

leoisl commented Oct 26, 2020

I like this algorithm, although I have some questions/points:

I made the following algorithm:

  • Start with a set of >2 sequences to cluster and n_clusters = 2
  • For each cluster (initially, there is one with all sequences in it), compute the majority string and compute the hamming distance of each seq in the cluster to its majority string. If any seq has a distance of more than 20% its length, it is considered 'clusterable' (in the code, I say the cluster is not 'one-reference like'.

I think eventually we will have to replace hamming by edit distance to cope with indels, but I guess we can keep the majority string even with indels. I don't think we need to think hard on how to get a closer ref than the majority string, as this problem is NP-hard (e.g. https://link.springer.com/chapter/10.1007/978-3-642-14031-0_48 - it has a PTAS based on the hamming distance though, if the hamming distance is small, we can actually use this approximation scheme, but I guess it is better to just keep it simple for now, as I think the majority string will work well in practice)

  • If any cluster is not 'one-ref like', cluster with kmeans with n_clusters, and increment n_clusters. Repeat point above.

I think we can do sth more consistent/better here. We evaluate if we should cluster or not based on hamming distance with respect to the majority string. If we decide we should cluster, then we do this clustering based on kmer counts and k-means. I can see some situation where the hamming distance evaluation says we should cluster, and the k-means clustering produce clusters that is not what the evaluation expects (e.g. keep clustering sequences that are not very close together or put sequences that are close in different clusters). If this happens, we will have the same issue: we will either under- or over-cluster, as our evaluation and clustering schemes differ... and we can't really tell why. If we have some weird clustering, all I can tell you is: the k-means clustering did this, and I really don't know how to fix the clustering. It is really hard to understand what k-means does as it has one dimension per kmer (the space is thus large, the only way to maybe debug is doing PCA, but then we just look at 2 kmers...), the inertia parameter is hard to set, we can't normalise the inertia, etc...it is like a black box where we provide kmer counts and it clusters, but it is hard to grasp, in practice, how it will cluster. I really think we should simply do a recursive clustering based on the hamming distance WRT majority string:

  1. Compute majority string m_s and the hamming distance of each sequence s_i to m_s;
  2. Cluster sequences close to m_s in cluster 1, and far from m_s in cluster 2;
  3. Recursively cluster cluster 1 and cluster 2;

I think a hamming-distance-based clustering makes more sense with this new way of evaluating clusters, and what we actually want to do is to cluster sequences having low distance together. With this, we open this clustering black box, as we understand completely how it is done, and it is way easier to debug/fix. We can replace hamming by edit distance, and then we are not so sensitive to indels anymore (as hamming or kmer-based distances are).

  • If you reach too many clusters (10), stop clustering, and put each seq in its cluster (to counter pathological case: input is all possible 10-mers of DNA, avoiding algorithm clustering many many times)

I think we should remove this too-many-clusters threshold. We should cluster as much as we can, and add similar sequences to the same cluster. Let's think, for example, on the 10k E. coli Gal's data. If we need 30 clusters to partition these 10k genomes, we should have these 30 clusters. I wouldn't like 10 clusters, where in some of them we have sequences that are very different because we merged clusters to keep the limit 10. At the same time, I wouldn't like 10k clusters (each seq in its cluster), as there are a bunch of sequences that are very similar and can be clustered together. Making PRGs is extremely parallelisable, and I think c++/cython reimplementation will give us huge speed ups. We should think about thresholds to lower the runtime (like this one, to avoid clustering many times) only if performance is an issue, which I hardly think it will be. And I think the hamming/edit-distance based clustering will be very fast, this no need to limit the number of clusters.

So, to sum up, I really like this algorithm, and I think we should use it as the basis of a new version of make_prg. But I also think we need to change our clustering algorithm, do an efficient and multi-threaded implementation, and add parameters that limit our clustering only if we have performance issues.

@leoisl
Copy link
Collaborator

leoisl commented Oct 26, 2020

PS: just emphasising sth Zam said earlier. This is very heuristic. We can come up with hundreds of ways to cluster sequences together, and many of them might seem good... I think we should do a data-oriented development, by having a set of MSAs and how the PRGs should be built from these MSAs. I don't know if you are keen to do this, as I think building this ground set will take much more time than developing, but it will be a really good set of end-to-end tests

@iqbal-lab
Copy link
Contributor

Will comment more when i come back but this is all v interesting and nice you took this forward @bricoletc !

  1. @leoisl we can have other reasons why we might want to limit clustering (eg we want to limit the number of nesting levels)
  2. i hadn't thought about multithreadability
  3. Leandro i don't think we will be able to define a dataset and set of criteria very easily - at some level we don't care too much about the graph in detail provided the downstream stuff is ok. i think removing ambiguous paths , or reducing, is a good criterion though.
    data oriented is good, but i'd start with looking at reduction of ambiguous paths, and indeed you should probsbly look at the effect on the e coli graphs via the 4way roc.

@leoisl
Copy link
Collaborator

leoisl commented Oct 26, 2020

Yeah, it is interesting that we actually have an easy way to reproduce 4- and 20-way pipelines. We could simply apply changes to make_prg, and measure how it improves the precision/recall (and several other data we produce in the plots). 4-way pipeline should run in half a day now, but I think after make_prg reimplementation, it should take a couple of hours. I think that using the 4- and 20-way pipelines as a way to verify if new changes to make_prg are good or not is valid, so I guess we should proceed with the plan that the reimplementation should be as close as possible to the current code, and then we make changes and evaluate it under the 4- or 20-way pipelines? OFC @bricoletc can already implement changes and evaluate under his gramtools pipeline

@bricoletc
Copy link
Member Author

@leoisl thanks for going into detail, it would be nice to make slightly shorter comments though (eg in small bullet points), and have longer conversation on Slack for eg.

  • I believe hamming dist vs ed dist point is not an issue; I compute the hamming distance of the MSA substrings, ie they are already aligned. If I'm not mistaken, the hamming distance of that is the edit distance (provided the MSA worked well). It does include indels.
  • Could you please provide an example that you can see where the Hamming/edit-distance bsaed criterion say we should cluster and the kmer-based clustering would fail, eg group dissimilar sequences? I don't see a good reason why it should but I might be missing something. Also, there is no consideration of clustering inertia in this algorithm anymore.
  • I'm not sure I see how clustering based on distance to one majority string will work well. For eg, if there are four groups of sequences, the criterion will say no sequence looks like the majority string. Then there is no 'sequence close to m_s' to group together I believe. Am I missing something there?
  • I believe there should be a limit to number of clusters because of i) performance (I made a Python test clustering all 5mers and stopping at 10 already takes 10 seconds, ~1 second per k) and ii) I don't think biologically it makes sense to have for eg 100 clusters of sequences. If we think about homologous sequences under balancing selection, how likely is it that they fall in many many groups unbroken by recombination? In my Pf analysis, we're looking at 2 or maybe 3 clusters. This is just for debate, I really don't know 🤷

@leoisl
Copy link
Collaborator

leoisl commented Oct 26, 2020

  • I believe hamming dist vs ed dist point is not an issue; I compute the hamming distance of the MSA substrings, ie they are already aligned. If I'm not mistaken, the hamming distance of that is the edit distance (provided the MSA worked well). It does include indels.

Yeah, you are right!

  • Could you please provide an example that you can see where the Hamming/edit-distance bsaed criterion say we should cluster and the kmer-based clustering would fail, eg group dissimilar sequences? I don't see a good reason why it should but I might be missing something. Also, there is no consideration of clustering inertia in this algorithm anymore.

Do you have the implementation in one of your branches? I think it is simpler to make a test example where we still produce bad clusters. It is hard to do the example by hand, as the k-means clustering is like a black box, and it is hard to visualise/predict how the clustering will go, even though we know how the algorithm works. It is just easier to implement, and write test cases. Although I am not sure it is worth doing this... I think we will always have corner cases where we produce bad clusters, the question is how often this happens in real data, and this is hard to answer... But I can try if you have the implementation

  • I'm not sure I see how clustering based on distance to one majority string will work well. For eg, if there are four groups of sequences, the criterion will say no sequence looks like the majority string. Then there is no 'sequence close to m_s' to group together I believe. Am I missing something there?

That is true, for a group of sequences that are diverge enough, and well balanced, we won't have a good single representative. This surely needs more work, I was just thinking if there is sth that is a substitute or better than k-means clustering (or at least a way to reduce its dimension). If you give me a set of strings, and ask me how the k-means will cluster this, I actually don't know how to answer. I would like to replace this black box by sth we understand better how will work. For this, I was also assuming that in the large majority of cases we would have a good representative sequence, that is hopefully close to the ancestral sequence that all the current sequences evolved from... but this might not be a good hypothesis?

  • I believe there should be a limit to number of clusters because of i) performance (I made a Python test clustering all 5mers and stopping at 10 already takes 10 seconds, ~1 second per k) and ii) I don't think biologically it makes sense to have for eg 100 clusters of sequences. If we think about homologous sequences under balancing selection, how likely is it that they fall in many many groups unbroken by recombination? In my Pf analysis, we're looking at 2 or maybe 3 clusters. This is just for debate, I really don't know shrug

Yeah, I think by what you and Zam said, we should indeed limit the number of clusters, it is not easy to set the correct value though... So, we make it a parameter, as you said, and leave the decision to the user.

But I still think for this new version of make_prg, we should direct the implementation towards getting better results/PRGs than before, and not worry yet about performance. I would not take python performance as a parameter, as we can do much better. I fell into premature optimisation trap before, and I try to avoid it at all costs now. I'd prefer us to make it work and make it right first, and then make it fast...

@iqbal-lab
Copy link
Contributor

@leoisl - might be a good idea for you to evaluate the current e coli prgs and count how many have ambiguous paths

@leoisl
Copy link
Collaborator

leoisl commented Oct 26, 2020

@leoisl - might be a good idea for you to evaluate the current e coli prgs and count how many have ambiguous paths

I will try to do that, but can't be sure I will be able to do it. I think checking if a DAG has ambiguous paths is NP-Complete... OFC if our graphs are small enough, an exponential-time algorithm might be enough... I will be very practical here, I will just enumerate paths and check if there are duplicates. However, I am afraid that the PRGs where these things happen are usually complex, and thus this naive algorithm might take a lot of time to finish... well, it does not hurt to try though...

@leoisl
Copy link
Collaborator

leoisl commented Oct 26, 2020

I am also willing to try the most pragmatic and easy evaluation approach: I get the branch where Brice is implementing his changes, rerun the 4-way pipeline and see the results. It might be that things improve a lot, or results get worst, or no noticeable changes. But I think it is worth to try, it is very easy to do this, and it might give us some directions already (e.g. if results improve, then we at least have some evidence we should keep in this path)

@iqbal-lab
Copy link
Contributor

agree with both - brute force enumerate paths and see is good, and v parallelisable. and seeign what brice's code does to the 4way.

@iqbal-lab
Copy link
Contributor

(forget about np completeness)

@mbhall88
Copy link
Member

@mbhall88 : as far as I'm aware, MEC in WhatsHap is used to convert a matrix of (reads x variant status) into haplotypes; here we have a set of haplotypes and want to convert to variant status. Is there a MEC formulation that does this?

Not that I am aware of. I was just noticing some similarities. Anyways, sorry for the distraction.

@leoisl thanks for going into detail, it would be nice to make slightly shorter comments though (eg in small bullet points), and have longer conversation on Slack for eg.

Longer conversations definitely shouldn't be on Slack! Having everything in a github issue related to a topic is much easier to follow - especially in the future when we want to revisit discussions. And for one, I personally prefer longer form discussions.

@bricoletc
Copy link
Member Author

I personally struggle with the amount of information for example in this thread and think it could be condensed to much less than its current size; however, it was only a suggestion 😸

@leoisl first implementation here: https://github.com/bricoletc/make_prg/tree/dev
I would also hope to try it on a few small examples, check that we can still build dimorphic Pf gene graph, etc... I will not do this before a few days.

@iqbal-lab
Copy link
Contributor

leandro i'd wait til Brice has done his fuller testing before you launch the 4way, and anyway we need to know if there are ambiguous paths in the existing prgs
i'll stop stalking this issue/github, back in a few days

@bricoletc
Copy link
Member Author

Testing shows:

  • Improvements to TB benchmark (data in gramtools paper)
    --> Reduction in 'AMBIG' calls: these are sites where gramtools found two paths with the same sequence:
# Pre change vs post change to make_prg
N0054	N0054
{'AMBIG': 196}	{'AMBIG': 64}
N0155	N0155
{'AMBIG': 129}	{'AMBIG': 0}
N0004	N0004
{'AMBIG': 168}	{'AMBIG': 64}
N0072	N0072
{'AMBIG': 157}	{'AMBIG': 12}
N0153	N0153
{'AMBIG': 152}	{'AMBIG': 76}
N0052	N0052
{'AMBIG': 163}	{'AMBIG': 0}
N1216	N1216
{'AMBIG': 196}	{'AMBIG': 0}
N0136	N0136
{'AMBIG': 124}	{'AMBIG': 76}
N0157	N0157
{'AMBIG': 117}	{'AMBIG': 12}
N1177	N1177
{'AMBIG': 158}	{'AMBIG': 76}
N0145	N0145
{'AMBIG': 117}	{'AMBIG': 64}
N0031	N0031
{'AMBIG': 110}	{'AMBIG': 0}
N1283	N1283
{'AMBIG': 183}	{'AMBIG': 12}
N1202	N1202
{'AMBIG': 177}	{'AMBIG': 12}
N0091	N0091
{'AMBIG': 168}	{'AMBIG': 12}
N1272	N1272
{'AMBIG': 158}	{'AMBIG': 0}
N1176	N1176
{'AMBIG': 175}	{'AMBIG': 0}
  • Still find dimorphism in Pf surface antigen (data in gramtools paper)

  • Reduction in AMBIG calls when genotyping 706 Pf samples in a graph of four surface antigens:

# Pre change vs post change to make_prg
{'AMBIG': 111756}    {'AMBIG': 52930}

Annoyingly it doesn't remove all AMBIG calls. This could be a gramtools bug, or that make_prg still produces ambiguous paths. Either way, the proposed changes are alleviating the problem !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants