Submat-batches can lead to non-stable cluster IDs #10

denisbeslic · 2022-06-03T10:01:12Z

New issue that came up with feature/speedup-submat-batches-tweaked.
The mutation length loop assigns cluster IDs in order from the shortest mutation profile to the longest. As a result, cluster IDs are not stable when sequences are added to the input file. To counter that, I added a small loop that assigns the cluster IDs according to the order of the input file. However, when we cache the mutation profiles and a cached sequence is modified in the next run, breakfast assigns a different cluster ID than in the first run. The contents of the cluster do not change, only the number.
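For illustration, numbering clusters by the order of the input file can be sketched as below. This is a minimal sketch, not the code from the branch: the function name `renumber_by_input_order` and the raw labels `"a"`, `"b"`, `"c"` are hypothetical, and it assumes the rows of the output are listed in input-file order. With those assumptions, it reproduces the numbering shown under "Output of second run".

```python
def renumber_by_input_order(raw_labels):
    """Relabel clusters as 1, 2, 3, ... in order of first appearance.

    raw_labels: one arbitrary cluster label per sequence, in input-file order
    (hypothetical representation, chosen only for this sketch).
    """
    mapping = {}
    result = []
    for label in raw_labels:
        if label not in mapping:
            # First time this cluster is seen: give it the next free number.
            mapping[label] = len(mapping) + 1
        result.append(mapping[label])
    return result


# First run: example2-2 still clusters with example1-* and example2-1.
first_run = ["a", "a", "a", "a", "a", "c", "c"]
print(renumber_by_input_order(first_run))   # [1, 1, 1, 1, 1, 2, 2]

# Second run: example2-2 (5th row) now starts its own cluster "b",
# which is seen before the exampledel cluster "c".
second_run = ["a", "a", "a", "a", "b", "c", "c", "b", "b"]
print(renumber_by_input_order(second_run))  # [1, 1, 1, 1, 2, 3, 3, 2, 2]
```

Because the number depends only on where a cluster first appears, a modified sequence that leaves its cluster early in the file shifts the numbers of every cluster after it.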

Example

First file: testfile.tsv
Second file: testfile_caching06_ModifiedSequences.tsv
example2-2 was modified and is no longer part of the same cluster.

Output of first run

id	cluster_id
example1-1	1
example1-2	1
example1-3	1
example2-1	1
example2-2	1
exampledel1	2
exampledel2	2

Output of second run

id	cluster_id
example1-1	1
example1-2	1
example1-3	1
example2-1	1
example2-2	2
exampledel1	3
exampledel2	3
example3-1	2
example3-2	2

Expected output of second run

id	cluster_id
example1-1	1
example1-2	1
example1-3	1
example2-1	1
example2-2	3
exampledel1	2
exampledel2	2
example3-1	3
example3-2	3
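One possible way to arrive at the expected numbering is to reuse the cluster IDs from the previous run and hand out fresh numbers only to clusters that cannot keep an old one. The sketch below is an assumption about how this could be done, not breakfast's implementation: the function `assign_stable_ids`, the `previous` mapping from sequence ID to old cluster ID, and the "largest overlap keeps the old ID" rule are all hypothetical.

```python
from collections import Counter


def assign_stable_ids(ids, raw_labels, previous):
    """Assign cluster IDs that stay stable across runs (hypothetical sketch).

    ids:        sequence IDs in input-file order
    raw_labels: arbitrary cluster label per sequence, same order as ids
    previous:   dict mapping sequence ID -> cluster ID from the previous run
    """
    # Collect the members of each cluster, keeping input-file order.
    members = {}
    for seq_id, label in zip(ids, raw_labels):
        members.setdefault(label, []).append(seq_id)

    # Each cluster claims the old ID that most of its members had before.
    claims = []
    for label, seqs in members.items():
        counts = Counter(previous[s] for s in seqs if s in previous)
        if counts:
            old_id, overlap = counts.most_common(1)[0]
            claims.append((overlap, label, old_id))

    # The cluster with the largest overlap keeps the old ID (stable sort,
    # so ties are resolved in favor of the cluster seen first in the input).
    claims.sort(key=lambda claim: -claim[0])
    assigned, used = {}, set()
    for _, label, old_id in claims:
        if old_id not in used:
            assigned[label] = old_id
            used.add(old_id)

    # Remaining clusters get fresh IDs above every previously used ID.
    next_id = max(previous.values(), default=0) + 1
    for label in members:
        if label not in assigned:
            assigned[label] = next_id
            next_id += 1

    return {seq_id: assigned[label] for seq_id, label in zip(ids, raw_labels)}


ids = ["example1-1", "example1-2", "example1-3", "example2-1", "example2-2",
       "exampledel1", "exampledel2", "example3-1", "example3-2"]
raw = ["a", "a", "a", "a", "b", "c", "c", "b", "b"]
previous = {"example1-1": 1, "example1-2": 1, "example1-3": 1,
            "example2-1": 1, "example2-2": 1,
            "exampledel1": 2, "exampledel2": 2}

for seq_id, cluster_id in assign_stable_ids(ids, raw, previous).items():
    print(f"{seq_id}\t{cluster_id}")
```

With the data from this example, the cluster containing example1-1 keeps ID 1, the exampledel cluster keeps ID 2, and the cluster formed by example2-2, example3-1 and example3-2 receives the new ID 3. The trade-off is that the previous assignment has to be stored alongside the cached mutation profiles.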

denisbeslic added the bug label Jun 3, 2022

denisbeslic changed the title ~~Non-Stable cluster IDs~~ Submat-batches can lead to non-stable cluster IDs Jun 3, 2022
