-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ambiguous path production #15
Comments
Is it possible to define a total order on alternate paths that spell the same sequence? God help us |
Here's another example which occurs when building PRG from m.tuberculosis, sequences from 1,017 samples, variation called with The following sequence:
|
At some level the problem seems to boil to excessive clustering, ie putting sequences that are very similar in different clusters. Another example where this has clearly occurred: This should potentially be a single variant site enumerating all the similar sequences. |
These must always be horizontally offset, right? |
is it possible without indels? i guess not? |
The kmer counting underlying the clustering ignores alignment gaps- if that's what you're referring to? Ie, gaps don't change anything: AAA-AAA and AA-AAAA are the same for clustering (at the moment) |
Doh, of course |
Rachel says couyld you please put the MSAs that generated these examples in? |
Sure, i will need to extract them, and reproduce the graphs- the images are subgraphs from larger PRGs. I will do this later this week, and also think we can add tests and some work to start alleviating this, if we can clearly define what's desirable and undesirable in these situations |
I will propose that the only thing you need to avoid is having two different source/sink paths which spell the same sequence. |
I took a closer look at the first example, and I think this is hard to solve for all cases. We might make some improvements here and there, based on some edge cases we find by hand, and it might work for these 3 examples, but this is just a very small number of cases, we won't be sure if it improved overall. I think this is due to a misclustering of sequences, but this is unavoidable as the k-means clustering process is very heuristic, and it might undercluster or overcluster (it is really hard to set the correct inertia parameter for every cluster, if there is one). So, I think that PRGs having ambiguous paths will unavoidably happen. Having this heuristic component also makes it hard to have any algorithm that will work on all situations (I think that whatever algorithm we think about, we can't have a proof of correctness, and we will always be able to develop an edge case in such a way that a misclustering of sequence happens, and this induces ambiguous paths in the PRG). So, we could think of post-processing algorithms to remove some edges/nodes from the PRG in order to keep a sort of path-spelling uniqueness property. That is not possible neither, I am afraid. In the first graph, it is clear we can remove the It seems to me that in some edge cases, we can have path ambiguity in the PRG, and it will be responsibility of the downstream algorithms to be aware and take care of this. Of course I could be wrong on any of the statements above, and as in bioinformatics, maybe we don't need a correct algorithm, but one that works on our cases (and hope they work on other)... |
PS: I am not advocating we replace k-means clustering. I think it makes total sense in make_prg, and we should use it as we are using now |
Thanks for pitching in @leoisl , I have two thoughts: |
i think it's an interaction of
I don't see how you can get CAT from 33/34/35, you can get CATTTC |
That is interesting, we have no such things for pandora;
Yeah, I think if I introduce some more formalisation, I can express better what I meant before. Consider the following problem:
I have a feeling that this problem is NP-hard. But feeling does not mean much, a proof would be better... It might help to think about a related, but way simpler decision problem:
I also have the feeling this previous problem is hard (i.e. NP-complete), so it is very unlikely we have a polynomial-time algorithm for it. But still no proof... So I don't have much :p, we need to invest more time to come up with a proof if we want... But in the end, we can look more at the practical side instead of the theoretical one... unless we have an idea for a polynomial-time algorithm, then the theoretical side is important. We have a trivial algorithm for the last problem if we go exponential time (i.e. list all paths, and check if we have duplicated paths), but I don't think it is a good idea... It will probably be very slow for gramtools PRGs (because they are whole genome, and we will have an exponentially large number of paths), and pandora has also some complex local graphs... I guess the best direction would be the one you just mentioned, where we can't make the guarantee, but we make a best effort to reduce the average similarity of sequences in different branches... Zam's last reply also lean towards a more heuristic/data-oriented solution |
i actually think you might fix it by using smaller kmers in the clustering |
The kmer size on clustering is 7, by default... Reducing might solve some cases, but then it might bug some other cases that were not bugged, I guess... |
yes, all changes may cause bugs. but smaller kmers for clustering would detect small repeats. |
but i think we need a list of 5 or 10 examples of where it happens, with their MSAs |
yeah, smaller kmers also cause a lot more repeated kmers (e.g. we have 16384 7-mers, but 256 4-mers)... but the k-means clustering algorithm do take the k-mer multiplicity into account, but I have no idea what is the impact of reducing/increase k-mer size (although it seems to me k=7 is already very small, IDK anything using such small k, but for this particular problem, it can't also be large)... We could try variable k-mer size... As we are at it, we could also contemplate other ways to measure distance between sequences, or to obtain a description of sequences. Things like doing all pairs edit distance, or some clustering based on edit distance/sequence similarity, so that we are not so sensitive to indels/SNPs (which kmers are a lot), but this might change a lot the PRGs... |
We need examples before we speculate - intuition from data and then move to possible solutions. record 1: aaa KKKK bbbb then clustering on either side of the K's is missing the issue that the same sequence is split on either side of the K's. |
Interesting- though i think the problem of creating same seq in two paths that cross multiple var sites is different from path repeats (lets call them something? like path repeats) occurring within a single (nested) var site. The latter would seem more workable by modifying the clustering only |
You are right, but at the level of make-prg there is no concept of sites, apart from where you drop vertical slices of matches, separating regions within which there may be multiple sites. |
Here's the MSA causing this path repeat containing graph (as per @rmcolq request): (PS: this thread is becoming really long and unreadable, dont know if there's a solution to this)
@leoisl @iqbal-lab can we establish what we would like this to be constructed as? Intuitively I might say we should not cluster and have one allele per sequence. |
Er..wow. @bricoletc what k size did you use for collapsing? |
The default of 7; currently |
I agree about not clustering here "ideally". Still trying to understand. On phone while bloody laptop installs updates |
😆 Here's a barebones example i've added as a failing test:
The sequences all differ at one SNP, we probably shouldnt cluster anything right? The clustering currently returns three groups, one of which has two sequences. |
i think you're 100% right. |
in case helpful, this is the data @bricoletc posted above here: #15 (comment) |
Is there a threshold inertia (perhaps proportional to the length of the sequences) where we could say it is silly to cluster?
Sent from my Samsung Galaxy smartphone.
…-------- Original message --------
From: Brice Letcher <[email protected]>
Date: 23/10/2020 09:30 (GMT+00:00)
To: rmcolq/make_prg <[email protected]>
Cc: COLQUHOUN Rachel <[email protected]>, Mention <[email protected]>
Subject: Re: [rmcolq/make_prg] Ambiguous path production (#15)
😆
I think the key point here, is that the current algorithm will always find a grouping of sequences even if all sequences are extremely similar and there is no strong basis for it. This is because it only requires the inertia (sum of squared distances of data points to their centroids, ie assigned cluster centres) be halved; there is always a way of doing this and i think it mostly does not reach putting each seq in own cluster.
Here's a barebones example i've added as a failing test:
sequences = ["CATATAAAATA", "CATATAACATA", "CATATAAGATA", "CATATAATATA"]
seq_size = len(sequences[0])
alignment = make_alignment(sequences)
expected_clustering = [[record.id] for record in alignment]
for kmer_size in range(4, 7):
result = kmeans_cluster_seqs_in_interval(
[0, seq_size - 1], alignment, kmer_size
)
self.assertEqual(expected_clustering, result)
The sequences all differ at one SNP, we probably shouldnt cluster anything right? The clustering currently returns three groups, one of which has two sequences.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#15 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/ACLIWO7YSMLHPAR4OAWK3VTSME5IHANCNFSM4N5MPMRA>.
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
|
That's exactly what I'd like to try @rmcolq ; however I think inertia is a relative measure, so exploring using edit distances (scaled by seq length). I'd be interested if you can see an 'absolute' inertia threshold being computable |
i think the thing i am missing, which i will periodically ponder on maybe over half-term holiday, is what the formal necessary and sufficient conditions are for there to be (or not be) two paths with the same sequence. If there are no indels, and if the clustering guarantees that identical sequences cluster together, then i cannot see how it can happen. Also, in the screenshot just above, you can see just below 157 that GCTCCGC is there at the start of all paths, and yet somehow that was not collapsed into a match interval. |
my mistake - they dont all start with GCTCCG |
Quick comments:
|
Is point 2 starting to close in on a variant of MEC? I wonder if there are some ideas we could use from WhatsHap's segregation/partitioning methods that would help think about this? We would obviously need to abstract to polyploid instead of diploid though. |
|
@mbhall88 : as far as I'm aware, MEC in WhatsHap is used to convert a matrix of (reads x variant status) into haplotypes; here we have a set of haplotypes and want to convert to variant status. Is there a MEC formulation that does this? @iqbal-lab: I'm glad you made this point about the majority, thats exactly what I've implemented as an attempt to fix! I made the following algorithm:
Now we can make more statements about what clusters to expect:
The inertia-halving criterion is happy with making two clusters, because that reduces inertia sufficiently. This 'one-ref-like' is not, and only stops at three clusters; before that, there is at least one sequence > 0.2 to majority string in a cluster. This has obvious heuristics: i) The threshold distance ii) The max num of clusterings. |
This is just an idea 🤷 . I thought it would make sense, after zam talked about how in the graph sequence clusters should look like they belong to different 'local references' (and what weve seen in an example in Pf). It solves all three failing cases I found when trying to trip up the inertia criterion, and seems reasonably fast to compute; something like |
I like this algorithm, although I have some questions/points:
I think eventually we will have to replace hamming by edit distance to cope with indels, but I guess we can keep the majority string even with indels. I don't think we need to think hard on how to get a closer ref than the majority string, as this problem is NP-hard (e.g. https://link.springer.com/chapter/10.1007/978-3-642-14031-0_48 - it has a PTAS based on the hamming distance though, if the hamming distance is small, we can actually use this approximation scheme, but I guess it is better to just keep it simple for now, as I think the majority string will work well in practice)
I think we can do sth more consistent/better here. We evaluate if we should cluster or not based on hamming distance with respect to the majority string. If we decide we should cluster, then we do this clustering based on kmer counts and k-means. I can see some situation where the hamming distance evaluation says we should cluster, and the k-means clustering produce clusters that is not what the evaluation expects (e.g. keep clustering sequences that are not very close together or put sequences that are close in different clusters). If this happens, we will have the same issue: we will either under- or over-cluster, as our evaluation and clustering schemes differ... and we can't really tell why. If we have some weird clustering, all I can tell you is: the k-means clustering did this, and I really don't know how to fix the clustering. It is really hard to understand what k-means does as it has one dimension per kmer (the space is thus large, the only way to maybe debug is doing PCA, but then we just look at 2 kmers...), the inertia parameter is hard to set, we can't normalise the inertia, etc...it is like a black box where we provide kmer counts and it clusters, but it is hard to grasp, in practice, how it will cluster. I really think we should simply do a recursive clustering based on the hamming distance WRT majority string:
I think a hamming-distance-based clustering makes more sense with this new way of evaluating clusters, and what we actually want to do is to cluster sequences having low distance together. With this, we open this clustering black box, as we understand completely how it is done, and it is way easier to debug/fix. We can replace hamming by edit distance, and then we are not so sensitive to indels anymore (as hamming or kmer-based distances are).
I think we should remove this too-many-clusters threshold. We should cluster as much as we can, and add similar sequences to the same cluster. Let's think, for example, on the 10k E. coli Gal's data. If we need 30 clusters to partition these 10k genomes, we should have these 30 clusters. I wouldn't like 10 clusters, where in some of them we have sequences that are very different because we merged clusters to keep the limit 10. At the same time, I wouldn't like 10k clusters (each seq in its cluster), as there are a bunch of sequences that are very similar and can be clustered together. Making PRGs is extremely parallelisable, and I think c++/cython reimplementation will give us huge speed ups. We should think about thresholds to lower the runtime (like this one, to avoid clustering many times) only if performance is an issue, which I hardly think it will be. And I think the hamming/edit-distance based clustering will be very fast, this no need to limit the number of clusters. So, to sum up, I really like this algorithm, and I think we should use it as the basis of a new version of make_prg. But I also think we need to change our clustering algorithm, do an efficient and multi-threaded implementation, and add parameters that limit our clustering only if we have performance issues. |
PS: just emphasising sth Zam said earlier. This is very heuristic. We can come up with hundreds of ways to cluster sequences together, and many of them might seem good... I think we should do a data-oriented development, by having a set of MSAs and how the PRGs should be built from these MSAs. I don't know if you are keen to do this, as I think building this ground set will take much more time than developing, but it will be a really good set of end-to-end tests |
Will comment more when i come back but this is all v interesting and nice you took this forward @bricoletc !
|
Yeah, it is interesting that we actually have an easy way to reproduce 4- and 20-way pipelines. We could simply apply changes to make_prg, and measure how it improves the precision/recall (and several other data we produce in the plots). 4-way pipeline should run in half a day now, but I think after make_prg reimplementation, it should take a couple of hours. I think that using the 4- and 20-way pipelines as a way to verify if new changes to make_prg are good or not is valid, so I guess we should proceed with the plan that the reimplementation should be as close as possible to the current code, and then we make changes and evaluate it under the 4- or 20-way pipelines? OFC @bricoletc can already implement changes and evaluate under his gramtools pipeline |
@leoisl thanks for going into detail, it would be nice to make slightly shorter comments though (eg in small bullet points), and have longer conversation on Slack for eg.
|
Yeah, you are right!
Do you have the implementation in one of your branches? I think it is simpler to make a test example where we still produce bad clusters. It is hard to do the example by hand, as the k-means clustering is like a black box, and it is hard to visualise/predict how the clustering will go, even though we know how the algorithm works. It is just easier to implement, and write test cases. Although I am not sure it is worth doing this... I think we will always have corner cases where we produce bad clusters, the question is how often this happens in real data, and this is hard to answer... But I can try if you have the implementation
That is true, for a group of sequences that are diverge enough, and well balanced, we won't have a good single representative. This surely needs more work, I was just thinking if there is sth that is a substitute or better than k-means clustering (or at least a way to reduce its dimension). If you give me a set of strings, and ask me how the k-means will cluster this, I actually don't know how to answer. I would like to replace this black box by sth we understand better how will work. For this, I was also assuming that in the large majority of cases we would have a good representative sequence, that is hopefully close to the ancestral sequence that all the current sequences evolved from... but this might not be a good hypothesis?
Yeah, I think by what you and Zam said, we should indeed limit the number of clusters, it is not easy to set the correct value though... So, we make it a parameter, as you said, and leave the decision to the user. But I still think for this new version of make_prg, we should direct the implementation towards getting better results/PRGs than before, and not worry yet about performance. I would not take python performance as a parameter, as we can do much better. I fell into premature optimisation trap before, and I try to avoid it at all costs now. I'd prefer us to make it work and make it right first, and then make it fast... |
@leoisl - might be a good idea for you to evaluate the current e coli prgs and count how many have ambiguous paths |
I will try to do that, but can't be sure I will be able to do it. I think checking if a DAG has ambiguous paths is NP-Complete... OFC if our graphs are small enough, an exponential-time algorithm might be enough... I will be very practical here, I will just enumerate paths and check if there are duplicates. However, I am afraid that the PRGs where these things happen are usually complex, and thus this naive algorithm might take a lot of time to finish... well, it does not hurt to try though... |
I am also willing to try the most pragmatic and easy evaluation approach: I get the branch where Brice is implementing his changes, rerun the 4-way pipeline and see the results. It might be that things improve a lot, or results get worst, or no noticeable changes. But I think it is worth to try, it is very easy to do this, and it might give us some directions already (e.g. if results improve, then we at least have some evidence we should keep in this path) |
agree with both - brute force enumerate paths and see is good, and v parallelisable. and seeign what brice's code does to the 4way. |
(forget about np completeness) |
Not that I am aware of. I was just noticing some similarities. Anyways, sorry for the distraction.
Longer conversations definitely shouldn't be on Slack! Having everything in a github issue related to a topic is much easier to follow - especially in the future when we want to revisit discussions. And for one, I personally prefer longer form discussions. |
I personally struggle with the amount of information for example in this thread and think it could be condensed to much less than its current size; however, it was only a suggestion 😸 @leoisl first implementation here: https://github.com/bricoletc/make_prg/tree/dev |
leandro i'd wait til Brice has done his fuller testing before you launch the 4way, and anyway we need to know if there are ambiguous paths in the existing prgs |
Testing shows:
Annoyingly it doesn't remove all AMBIG calls. This could be a gramtools bug, or that make_prg still produces ambiguous paths. Either way, the proposed changes are alleviating the problem ! |
I produced nested PRGs (
--min_match_length
=7,--max_nesting
=5) from multiple sequence alignments of plasmodium genes and found thatmake_prg
can produce ambiguous PRGs.Here's an example I drew out by hand to understand
_o
and_c
mean site opening and closing nodes, with a numbered ID before.7T
means 7 consecutive Ts.The path
CA9T
can be produced going through either sites 33/34/35 or 33/36. This can be horrible for genotyping as if the dataset has that sequence then you have to either say you can't genotype because of ambiguity, or randomly choose one of the two possibilities. The latter could cause comparison issues across datasets. For now ingramtools
I've decided to bail out in these cases and null genotype, with a FILTER flag to AMBIG to signal this occurred.The text was updated successfully, but these errors were encountered: