Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Big methods comparison table #322

Open
szhan opened this issue Feb 12, 2025 · 20 comments
Open

Big methods comparison table #322

szhan opened this issue Feb 12, 2025 · 20 comments

Comments

@szhan
Copy link
Collaborator

szhan commented Feb 12, 2025

Curate a table with the recombinant detection results for the Pango X lineages of sc2ts and other methods alongside the "ground-truth" information from pango-designation (i.e. the Pango identities of the parents and the breakpoint intervals).

Methods

@szhan szhan added this to the First submission milestone Feb 12, 2025
@szhan
Copy link
Collaborator Author

szhan commented Feb 12, 2025

Looking at sc2ts versus RecombinHunt first, as it's mostly done but needs some polishing. Will then look at RIPPLES (probably the most important comparator here, but the results are the trickiest to compare with) and CoVRecomb.

@jeromekelleher
Copy link
Owner

In terms of ground truth from Pango X lineages, right?

@szhan
Copy link
Collaborator Author

szhan commented Feb 12, 2025

In terms of ground truth from Pango X lineages, right?

Yes, oops, neglected to mention above.

@szhan
Copy link
Collaborator Author

szhan commented Feb 13, 2025

First pass of these comparisons will be on the simple recombinants which have one breakpoint and two parents. The more complex recombinants will be analysed later.

In the alias key of commit e81e441, there are:
Total Pango X (not considering all the XBB sublineages as distinct): 144
One breakpoint: 123
More breakpoints: 21

The initial table includes:

  • Ground truth (info from pango-designation curated by the RecombinHunt authors).
  • RecombinHunt results from GISAID.
  • RecombinHunt results from Nextstrain.

@szhan
Copy link
Collaborator Author

szhan commented Feb 16, 2025

Okay, here's my first pass to compare (1) sc2ts and the ground truth and (2) RecombinHunt-GISAID with the ground truth, looking at the number of recombinants concordant in terms of the Pango labels of the left and right parents.

Of the 51 Pango X labels, 9 were not in Viridian v04 or had too many missing sites. I think it is fair to exclude them in the comparisons.

Both sc2ts and RH identified concordant recombinants for 29 Pango X (overlapping by 25).

I'm not sure how to handle XAS, which has messy origins in our ARG. It is not included in the counts for now.

Also, XBH is a bit tricky. sc2ts identified it as a one-breakpoint recombinant, whereas RH identified it as a two-breakpoint recombinant. It is not included in the counts for now.

EDIT: I wavered a bit about XBE. The ground truth says that its right Pango parent is BE.4.1, but RH says it's BE.4, so at lower resolution. XBE is not included in the counts for now.

@szhan
Copy link
Collaborator Author

szhan commented Feb 17, 2025

I manually set some rules for mapping Pango labels and determining concordance between Pango labels.

def compare_pangos(pango_1, pango_2):
    def _remap(x):
        if x.endswith("*"):
            # B.1.617.2*, BA.1*, BA.1.1*, BA.2*, BA.5.2*
            return x.replace("*", "")
        elif x.startswith("AY"):
            return x.replace("AY", "B.1.617.2")
        #Unsure
        #elif x == "BE.4.1":
        #    return "BE.4"
        elif x == "BF.3":
            return "BA.5"    # Truncated
        elif x == "BF.5.2":
            return "BA.5.2"    # Truncated
        elif x == "BJ.1":
            return "BA.2"    # Truncated
        elif x == "BN.3.1":
            return "BA.2.75"    # Truncated
        elif x == "Q.1":
            return "B.1.1.7.1"
        elif x.startswith("BM"):
            return "BA.2.75"    # Truncated
        elif x.startswith("CJ"):
            return "BA.2.75"    # Truncated
        else:
            return x
    pango_1 = _remap(pango_1)
    pango_2 = _remap(pango_2)
    if (pango_2 == pango_1) or (pango_2.startswith(pango_1 + ".")):
        return True
    else:
        return False

@jeromekelleher
Copy link
Owner

So they are both concordant at 29/51 of the pango X lineages?

@szhan
Copy link
Collaborator Author

szhan commented Feb 17, 2025

So they are both concordant at 29/51 of the pango X lineages?

Yes, up to the Pango resolution levels defined in the rules above (hard to set them exactly across all the cases).

@jeromekelleher
Copy link
Owner

We'll need to spell pango remapping and comparison thing out a bit so will need to be clear. Naively, this is how I would have worked:

  1. Strip off any ".*" values
  2. Fully expand all Pango references, so that e.g., AY.x -> "B.1.617.2.x"
  3. Given a truth value A and query B, we say that B is concordant with A if B.startswith(A).

That should take care of all the prefix matching etc?

Can we automate step 2 using the synonyms file from pango?

@szhan
Copy link
Collaborator Author

szhan commented Feb 17, 2025

There are some cases where I think the Pango parents agree at a lower resolution level, but they differ at higher levels. So, for such cases, I truncated some levels, for example:

method pango left_pango_parent right_pango_parent breakpoint_interval
groundtruth XBB BJ.1 (B.1.1.529.2.10.1.1) BM.1.1.1 22891-22935
recombinhunt_gisaid XBB BJ.1 (B.1.1.529.2.10.1.1) BM.1.1.1 22897-22898
recombinhunt_nextstrain XBB BA.2.9 (B.1.1.529.2.9) BM.1.1.1 22331-22332
sc2ts XBB BA.2.10 (B.1.1.529.2.10) BM.1.1.1 22332-22577
  • BA = B.1.1.529
  • BJ = B.1.1.529.2.10.1

The left Pango parents are concordant at BA.2.

@jeromekelleher
Copy link
Owner

jeromekelleher commented Feb 17, 2025

Hmm, that's messy all right. I don't think it's correct to say that it's concordant if the ground truth is B.1.1.529.2.10.1.1 and we say it's B.1.1.529.2.10 - that's quite a lot less specific. We could reductio ad absurdum say that the parents have pango B and be corcordant 100% of the time.

@szhan
Copy link
Collaborator Author

szhan commented Feb 17, 2025

If we don't truncate like I was doing, then sc2ts got 27 concordant recombinants whereas RH-GISAID got 30 (above, I made a mistake somewhere when manually deciding which Pango X to count or not).

Replacing the above compare_pangos with the following:

def compare_pangos(pango_1, pango_2):
    def _remap(x):
        if x.endswith("*"):
            x = x.replace("*", "")
        first_level = x.split(".")[0]
        if first_level in alias_key:
            if alias_key[first_level] != "":
                # Not A or B.
                x = x.replace(first_level, alias_key[first_level])
        return x
    pango_1 = _remap(pango_1)
    pango_2 = _remap(pango_2)
    if (pango_2 == pango_1) or (pango_2.startswith(pango_1 + ".")):
        return True
    else:
        return False

@jeromekelleher
Copy link
Owner

I don't get this bit:

    if (pango_2 == pango_1) or (pango_2.startswith(pango_1 + ".")):

why not

return query.starts_with(ground_truth)

I think it would help to explicitly identify the query and ground_truth values so that the asymmetry is explicit. We are concondant if the query is at least or more specific than the ground_truth and not otherwise.

It's fine if we're less specific than ground truth here in a few cases, we can look at those individually.

@szhan
Copy link
Collaborator Author

szhan commented Feb 17, 2025

Oops, because so far, I've always set pango_1 to be the ground truth.

@jeromekelleher
Copy link
Owner

So, something like

def is_concordant(ground_truth, query):
    if ground_truth.endswith(".*")
          ground_truth = ground_truth[:-2]
    split = query.split(".") 
    alias = alias_key.get(split[0], "")
    if  alias != "":
         query = alias + ".".join(split[1:])
   return query.starts_with(ground_truth)

@szhan
Copy link
Collaborator Author

szhan commented Feb 17, 2025

Yes, but we have to replace the alias for both the ground truth and the comparator Pango labels.

@szhan
Copy link
Collaborator Author

szhan commented Feb 17, 2025

From discussion with @jeromekelleher, we should add some metrics to compare breakpoint intervals:

  • Whether the comparator's interval intersects with the ground truth interval (True or False).
  • Size of the overlapping region / size of the ground truth interval.
  • Size of the comparator's interval / size of the ground truth interval.

@szhan
Copy link
Collaborator Author

szhan commented Feb 23, 2025

In Viridian v04, these Pango Xs have only the sublineage labels.

  • XBF = XBF.1;XBF.2;XBF.3;XBF.4;XBF.5;XBF.6;XBF.7;XBF.7.1;XBF.9;XBF.10
  • XBK = XBK.1
  • XBJ = XBJ.2;XBJ.4

We have the Pango parents and breakpoint intervals for XBF and XBK, but not XBJ. Need to check the samples with XBJ.2 and XBJ.4.

@szhan
Copy link
Collaborator Author

szhan commented Feb 24, 2025

Summary of comparison of breakpoint intervals so far, considering only Pango X identified as simple recombinants in ground truth and comparator method.

category sc2ts RecombinHunt-GISAID
total 29 31
overlap 19 19

For sc2ts, of the 10 Pango X with non-overlapping intervals, 6 (XBB, XZ, XAP, XAF, XC, and XAD) are within the ballpark, I think. The other 4 (XAM, XAA, XAG, and XU) do not have overlapping intervals, but the intervals occur on the 5' end in ORF1ab. Interestingly, these four are nested as per sc2ts, with the breakpoint interval 4322-5386; as per ground truth, the breakpoints are within ~6500 and ~9300.

@szhan
Copy link
Collaborator Author

szhan commented Feb 24, 2025

After discussion with @jeromekelleher, we want to have separate issues tracking different Pango Xs (or sets of them).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants