Submat-batches can lead to non-stable cluster IDs #10

Open
denisbeslic opened this issue Jun 3, 2022 · 0 comments
Labels
bug Something isn't working

denisbeslic commented Jun 3, 2022

This issue came up with feature/speedup-submat-batches-tweaked.
The mutation length loop assigned cluster IDs in order of mutation profile length, from the shortest profile to the longest. As a result, cluster IDs were not stable when sequences were added to the input file. To counter that, I added a small loop that assigns the cluster IDs according to the order of the input file. However, when mutation profiles are cached and a cached sequence is modified before the next run, breakfast still assigns a different cluster ID than in the first run. The contents of the cluster do not change, only its number.
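
To illustrate the input-order renumbering described above, here is a minimal sketch. It is not the actual breakfast code: the function name `relabel_by_input_order` and the raw labels `"a"`, `"b"`, `"c"` are hypothetical, and the sketch only assumes that each sequence already carries some arbitrary cluster label, listed in input-file order.

```python
def relabel_by_input_order(raw_labels):
    """Renumber arbitrary cluster labels as 1, 2, 3, ... by first appearance.

    raw_labels: one label per sequence, in input-file order (hypothetical input).
    """
    mapping = {}
    cluster_ids = []
    for label in raw_labels:
        if label not in mapping:
            # First time this cluster is seen in the input: give it the next ID.
            mapping[label] = len(mapping) + 1
        cluster_ids.append(mapping[label])
    return cluster_ids


# First run: five sequences in one cluster, two in another.
first_run = relabel_by_input_order(["a", "a", "a", "a", "a", "c", "c"])

# Second run: the fifth sequence now belongs to a new cluster "b",
# which appears in the input before cluster "c".
second_run = relabel_by_input_order(["a", "a", "a", "a", "b", "c", "c", "b", "b"])
```

With these inputs, cluster `"c"` is numbered 2 in the first run and 3 in the second, even though its members are unchanged. This is the renumbering effect the example below shows.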

Example

First file: testfile.tsv
Second file: testfile_caching06_ModifiedSequences.tsv
In the second file, example2-2 was modified and is no longer part of the same cluster.

Output of first run

id cluster_id
example1-1 1
example1-2 1
example1-3 1
example2-1 1
example2-2 1
exampledel1 2
exampledel2 2

Output of second run

id cluster_id
example1-1 1
example1-2 1
example1-3 1
example2-1 1
example2-2 2
exampledel1 3
exampledel2 3
example3-1 2
example3-2 2

Expected output of second run

id cluster_id
example1-1 1
example1-2 1
example1-3 1
example2-1 1
example2-2 3
exampledel1 2
exampledel2 2
example3-1 3
example3-2 3
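
One possible way to get the expected output is to reuse the IDs from the previous run. The following is only a sketch under assumptions, not a proposed patch for breakfast: `stable_cluster_ids`, its parameters, and the overlap rule (each old ID goes to the new cluster that shares the most members with it, everything else gets a fresh ID) are all hypothetical.

```python
from collections import Counter


def stable_cluster_ids(ids, raw_labels, previous):
    """Assign cluster IDs that stay stable across runs (hypothetical helper).

    ids:        sequence IDs in input-file order
    raw_labels: arbitrary cluster label per sequence for the current run
    previous:   dict mapping sequence ID -> cluster ID from the earlier run
    """
    # Collect the members of each current cluster, in input order.
    members = {}
    for seq_id, label in zip(ids, raw_labels):
        members.setdefault(label, []).append(seq_id)

    # Count how many members each current cluster shares with each old ID.
    candidates = []
    for label, seqs in members.items():
        overlap = Counter(previous[s] for s in seqs if s in previous)
        for old_id, count in overlap.items():
            candidates.append((count, old_id, label))

    # Largest overlap claims an old ID first; ties go to the smaller old ID.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    assigned, taken = {}, set()
    for _count, old_id, label in candidates:
        if label not in assigned and old_id not in taken:
            assigned[label] = old_id
            taken.add(old_id)

    # Remaining clusters get fresh IDs above every previously used ID.
    next_id = max(previous.values(), default=0) + 1
    for label in members:
        if label not in assigned:
            assigned[label] = next_id
            next_id += 1

    return [assigned[label] for label in raw_labels]


previous = {
    "example1-1": 1, "example1-2": 1, "example1-3": 1,
    "example2-1": 1, "example2-2": 1,
    "exampledel1": 2, "exampledel2": 2,
}
ids = ["example1-1", "example1-2", "example1-3", "example2-1", "example2-2",
       "exampledel1", "exampledel2", "example3-1", "example3-2"]
raw_labels = ["a", "a", "a", "a", "b", "c", "c", "b", "b"]
result = stable_cluster_ids(ids, raw_labels, previous)
```

For this input, `result` is `[1, 1, 1, 1, 3, 2, 2, 3, 3]`, which matches the expected table above: the exampledel cluster keeps ID 2, and the new cluster containing example2-2 gets the next unused ID.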
denisbeslic added the bug label on Jun 3, 2022
denisbeslic changed the title from "Non-Stable cluster IDs" to "Submat-batches can lead to non-stable cluster IDs" on Jun 3, 2022