-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Big methods comparison table #322
Comments
Looking at sc2ts versus RecombinHunt first, as it's mostly done but needs some polishing. Will then look at RIPPLES (probably the most important comparator here, but the results are the trickiest to compare with) and CoVRecomb. |
In terms of ground truth from Pango X lineages, right? |
Yes, oops, neglected to mention above. |
First pass of these comparisons will be on the simple recombinants which have one breakpoint and two parents. The more complex recombinants will be analysed later. In the alias key of commit The initial table includes:
|
Okay, here's my first pass to compare (1) sc2ts and the ground truth and (2) RecombinHunt-GISAID with the ground truth, looking at the number of recombinants concordant in terms of the Pango labels of the left and right parents. Of the 51 Pango X labels, 9 were not in Viridian v04 or had too many missing sites. I think it is fair to exclude them in the comparisons. Both sc2ts and RH identified concordant recombinants for 29 Pango X (overlapping by 25). I'm not sure how to handle XAS, which has messy origins in our ARG. It is not included in the counts for now. Also, XBH is a bit tricky. sc2ts identified it as a one-breakpoint recombinant, whereas RH identified it as a two-breakpoint recombinant. It is not included in the counts for now. EDIT: I wavered a bit about XBE. The ground truth says that its right Pango parent is BE.4.1, but RH says it's BE.4, so at lower resolution. XBE is not included in the counts for now. |
I manually set some rules for mapping Pango labels and determining concordance between Pango labels. def compare_pangos(pango_1, pango_2):
def _remap(x):
if x.endswith("*"):
# B.1.617.2*, BA.1*, BA.1.1*, BA.2*, BA.5.2*
return x.replace("*", "")
elif x.startswith("AY"):
return x.replace("AY", "B.1.617.2")
#Unsure
#elif x == "BE.4.1":
# return "BE.4"
elif x == "BF.3":
return "BA.5" # Truncated
elif x == "BF.5.2":
return "BA.5.2" # Truncated
elif x == "BJ.1":
return "BA.2" # Truncated
elif x == "BN.3.1":
return "BA.2.75" # Truncated
elif x == "Q.1":
return "B.1.1.7.1"
elif x.startswith("BM"):
return "BA.2.75" # Truncated
elif x.startswith("CJ"):
return "BA.2.75" # Truncated
else:
return x
pango_1 = _remap(pango_1)
pango_2 = _remap(pango_2)
if (pango_2 == pango_1) or (pango_2.startswith(pango_1 + ".")):
return True
else:
return False |
So they are both concordant at 29/51 of the pango X lineages? |
Yes, up to the Pango resolution levels defined in the rules above (hard to set them exactly across all the cases). |
We'll need to spell pango remapping and comparison thing out a bit so will need to be clear. Naively, this is how I would have worked:
That should take care of all the prefix matching etc? Can we automate step 2 using the synonyms file from pango? |
There are some cases where I think the Pango parents agree at a lower resolution level, but they differ at higher levels. So, for such cases, I truncated some levels, for example:
The left Pango parents are concordant at BA.2. |
Hmm, that's messy all right. I don't think it's correct to say that it's concordant if the ground truth is B.1.1.529.2.10.1.1 and we say it's B.1.1.529.2.10 - that's quite a lot less specific. We could reductio ad absurdum say that the parents have pango B and be corcordant 100% of the time. |
If we don't truncate like I was doing, then sc2ts got 27 concordant recombinants whereas RH-GISAID got 30 (above, I made a mistake somewhere when manually deciding which Pango X to count or not). Replacing the above def compare_pangos(pango_1, pango_2):
def _remap(x):
if x.endswith("*"):
x = x.replace("*", "")
first_level = x.split(".")[0]
if first_level in alias_key:
if alias_key[first_level] != "":
# Not A or B.
x = x.replace(first_level, alias_key[first_level])
return x
pango_1 = _remap(pango_1)
pango_2 = _remap(pango_2)
if (pango_2 == pango_1) or (pango_2.startswith(pango_1 + ".")):
return True
else:
return False |
I don't get this bit:
why not
I think it would help to explicitly identify the query and ground_truth values so that the asymmetry is explicit. We are concondant if the query is at least or more specific than the ground_truth and not otherwise. It's fine if we're less specific than ground truth here in a few cases, we can look at those individually. |
Oops, because so far, I've always set |
So, something like def is_concordant(ground_truth, query):
if ground_truth.endswith(".*")
ground_truth = ground_truth[:-2]
split = query.split(".")
alias = alias_key.get(split[0], "")
if alias != "":
query = alias + ".".join(split[1:])
return query.starts_with(ground_truth) |
Yes, but we have to replace the alias for both the ground truth and the comparator Pango labels. |
From discussion with @jeromekelleher, we should add some metrics to compare breakpoint intervals:
|
In Viridian v04, these Pango Xs have only the sublineage labels.
We have the Pango parents and breakpoint intervals for XBF and XBK, but not XBJ. Need to check the samples with XBJ.2 and XBJ.4. |
Summary of comparison of breakpoint intervals so far, considering only Pango X identified as simple recombinants in ground truth and comparator method.
For sc2ts, of the 10 Pango X with non-overlapping intervals, 6 (XBB, XZ, XAP, XAF, XC, and XAD) are within the ballpark, I think. The other 4 (XAM, XAA, XAG, and XU) do not have overlapping intervals, but the intervals occur on the 5' end in ORF1ab. Interestingly, these four are nested as per sc2ts, with the breakpoint interval 4322-5386; as per ground truth, the breakpoints are within ~6500 and ~9300. |
After discussion with @jeromekelleher, we want to have separate issues tracking different Pango Xs (or sets of them). |
Curate a table with the recombinant detection results for the Pango X lineages of sc2ts and other methods alongside the "ground-truth" information from
pango-designation
(i.e. the Pango identities of the parents and the breakpoint intervals).Methods
The text was updated successfully, but these errors were encountered: