Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reproducibility of benchmark evaluation #280

Open
Bae-SungHan opened this issue Jan 13, 2025 · 5 comments
Open

Reproducibility of benchmark evaluation #280

Bae-SungHan opened this issue Jan 13, 2025 · 5 comments

Comments

@Bae-SungHan
Copy link

Bae-SungHan commented Jan 13, 2025

Hi
I've been trying to reproduce PoseBusters benchmark score you reported, but unfortunately I haven't been able to do so yet. So I would like to ask if I could get help with this.
Here's my process to infer PoseBusters dataset locally;

  1. Directly parsed protein sequence, modified residue info, cofactor list, and number of ligand of interest from predicted pdb file provided in https://chaiassets.com/chai-1/paper/assets/posebusters_predictions.zip.
    Below is the code I used;
def parse_infos(pdb_file, ligand_ccd, ligand_smiles):
    parser = PDBParser(QUIET=True)
    structure = parser.get_structure("structure", pdb_file)

    chain_to_seq = {}
    modified_residue_dict = {}
    cofactors = []
    ligand_count = 0
    for model in structure:
        for chain in model:
            sequence = ""
            modified_residues = dict()
            for residue_id, residue in enumerate(chain):
                resname = residue.get_resname()
                if resname == ligand_ccd:
                    ligand_count += 1
                elif is_aa(residue, standard=False):
                    if resname in standard_aa_names:
                        aa = index_to_one(three_to_index(resname))
                        sequence += aa
                    elif resname in modified_residue_to_parent:
                        # Known modified residue mapped to standard amino acid
                        aa = index_to_one(
                            three_to_index(modified_residue_to_parent[resname])
                        )
                        sequence += aa
                        modified_residues[residue_id] = resname
                    else:
                        # Unknown modified residue
                        sequence += "X"
                        modified_residues[residue_id] = resname
                else:
                    if resname in ccd:
                        cofactors.append(resname)
                    else:
                        cofactors.append("UNK")
            if len(sequence):
                chain_to_seq[chain.id] = sequence
            if len(modified_residues):
                modified_residue_dict[chain.id] = modified_residues
    ligand_smiles_list = [ligand_smiles for _ in range(ligand_count)]
    return chain_to_seq, modified_residue_dict, cofactors, ligand_smiles_list 

For the case of 8DP2_UMA, the output of above is like;

({'B': 'HHHHHMFRIVVGLGKSGMSLVRYLARRGLPFAVVDTRENPPELATLRAQYPQVEVRCGELDAEFLCSARELYVSPGLSLRTPALVQAAAKGVRISGDIDLFAREAKAPIVAITGSNAKSTVTTLVGEMAVAADKRVAVGGNLGTPALDLLADDIELYVLELSSFQLETCDRLNAEVATVLNVSEDHMDRYDGMADYHLAKHRIFRGARQVVVNRADALTRPLIADTVPCWSFGLNKPDFKAFGLIEEDGQKWLAFQFDKLLPVGELKIRGAHNYSNALAALALGHAVGLPFDAMLGALKAFSGLAHRCQWVRERQGVSYYDDSKATNVGAALAAIEGLGADIDGKLVLLAGGDGKGADFHDLREPVARFCRAVVLLGRDAGLIAQALGNAVPLVRVATLDEAVRQAAELAREGDAVLLSPACASLDMFKNFEERGRLFAKAVEELA'},
 {'B': {199: 'KCX'}},
 ['MG', 'PE8', 'SIN'],
 ['CC(=O)N[C@H]1[C@@H](O[P@@](=O)(O)O[P@](=O)(O)OC[C@H]2O[C@@H](n3ccc(=O)[nH]c3=O)[C@H](O)[C@@H]2O)O[C@H](CO)[C@@H](O)[C@@H]1O[C@H](C)C(=O)N[C@@H](C)C(=O)O'])
  1. Wrote input fasta from the result of 1.
    I wrote CCD code for the cofactor and SMILES for ligand of interest.
    For the case of 8DP2_UMA, the resulting input fasta is like;
>protein|name=A
HHHHHMFRIVVGLGKSGMSLVRYLARRGLPFAVVDTRENPPELATLRAQYPQVEVRCGELDAEFLCSARELYVSPGLSLRTPALVQAAAKGVRISGDIDLFAREAKAPIVAITGSNAKSTVTTLVGEMAVAADKRVAVGGNLGTPALDLLADDIELYVLELSSFQLETCDRLNAEVATVLNVSEDHMDRYDGMADYHLA(KCX)HRIFRGARQVVVNRADALTRPLIADTVPCWSFGLNKPDFKAFGLIEEDGQKWLAFQFDKLLPVGELKIRGAHNYSNALAALALGHAVGLPFDAMLGALKAFSGLAHRCQWVRERQGVSYYDDSKATNVGAALAAIEGLGADIDGKLVLLAGGDGKGADFHDLREPVARFCRAVVLLGRDAGLIAQALGNAVPLVRVATLDEAVRQAAELAREGDAVLLSPACASLDMFKNFEERGRLFAKAVEELA
>ligand|name=B
MG
>ligand|name=C
PE8
>ligand|name=D
SIN
>ligand|name=E
CC(=O)N[C@H]1[C@@H](O[P@@](=O)(O)O[P@](=O)(O)OC[C@H]2O[C@@H](n3ccc(=O)[nH]c3=O)[C@H](O)[C@@H]2O)O[C@H](CO)[C@@H](O)[C@@H]1O[C@H](C)C(=O)N[C@@H](C)C(=O)O
  1. Got JackHmmer MSA on the latest version of UniProt, UniRef90, and MGnify DB and JackHmmer argument -N 1 -E 0.0001 --incE 0.0001 --F1 0.0005 --F2 0.00005 --F3 0.0000005
    Overall JackHmmer MSA process referred to the code in alphafold3
    After getting a3m files for each DB, I used merge_a3m_in_directory method, to finalize the MSA to the Chai input

  2. Infered with chai fold command.
    I modified --num-trunk-recycles arguments to 4, as you mentioned in the technical report.
    I got only 5 sample outputs with single seed.

  3. Calculated ligand RMSD as I described in the Reproduce PoseBusters score #210 (comment) ranked with confidence score.

With steps as I described above, the resulting success rate (RSMD <=2.0) was only about 0.69, much lower than the score you reported, 0.77.
Even considering that I only inferred with single seed, not 5, there's quite big gap so I think there must be problem with my inference process.
Since I successfully reproduced the success rate with calculating ligand RMSD of results in https://chaiassets.com/chai-1/paper/assets/posebusters_predictions.zip, I don't think it's a problem with how to measure ligand rmsd.

Reasons I suspect is three

  1. Wrong input fasta construction
  2. Wrong MSA construction
  3. Model checkpoint you provided as open source is inferior one of the one you used for getting the score reported.

Unfortunately, I couldn't get any more information about these.
If you can give some advises for PoseBusters reproducing steps, it will be great help for me.

Thank you!

@arogozhnikov
Copy link
Contributor

arogozhnikov commented Jan 14, 2025

Hi @Bae-SungHan

…was only about 0.69, …
Even considering that I only inferred with single seed, not 5…

We don’t have a number for a single seed, but 0.69 does not sound completely off the range.

Infered with chai fold command.

CLI is not for automation. My comment is not about quality of prediction, it is about healthy engineering practices - you’re given a nice pythonic API with detailed examples so please use the right tool for the job.

Calculated ligand RMSD as I described in the …

I guess since numbers on published predictions match, you have working code. Sanity check: run against some other baseline with known performance, like RF2AA. If you want someone to validate code - please approach posebusters’ authors. Closing that issue.

For the case of 8DP2_UMA, the resulting input fasta is like …

This example looks wrong - chai-1 accepts ligands as SMILEs; CCDs are accepted as modified AAs in proteins.
Another remark: if there are any bonds in the complex (not sure if this happens in posebusters), you need to parse those and provide them as an input.

After getting a3m files for each DB, I used merge_a3m_in_directory method, to finalize the MSA

Simple sanity check: check that uniprot sequences are paired, and for other DBs there is no pairing.

Another remark is to look at per-complex predictions, and split into bins: number of ligands, max number of copies for a ligand, max number of copies for protein, max sequence length, total number of tokens - it may reveal a helpful pattern on where the discrepancies may come from.

@Bae-SungHan
Copy link
Author

@arogozhnikov Thank you for kind response!

@Bae-SungHan
Copy link
Author

Hi.
I'd like to ask you how to prepare protein/ligand input sequences for PoseBusters.
For the example of 7SUC_COM in PoseBusters benchmark, when I parsed the sequences from cif file of biological assembly 1, I got

6 protein chains
sum of protein chains = 2472
two ligand(ccd=COM)s
However, when I parsed the sequences from pdb file you provided, I got

5 protein chains (I thought they're subset of 6 protein chains from original cif, but they're not)
sum of protein chains = 1794
only one ligand
I think you cropped or filtered chain and ligands on your rule, but I don't know what it is.
Can you give me a guide?

Thank you

@jacquesboitreaud
Copy link
Contributor

Hi @Bae-SungHan,
this case is covered in our paper https://www.biorxiv.org/content/10.1101/2024.10.10.615955v2.full.pdf:

For 4/428 structures that have more than 2048 tokens, we crop the assembly to the 2048 tokens closest to the ligand binding site

you can also see this in the prediction below (orange is our prediction of crop that you linked to vs grey - ground truth assembly 1 from PDB). The second COM ligand is not part of the crop since it is quite far from the one that was arbitrarily selected.

If you want to re-predict a crop using chai-1, you should not use interface for fasta inputs, but directly interact with run_folding_on_context, where the context should be subsampled (from full context) to 2048 closest tokens. There is some guidance in #289 if you want to implement this

Image

@Bae-SungHan
Copy link
Author

Bae-SungHan commented Jan 31, 2025

Thank you for your kind explanation!
In addition to this, I'd like to get some help about the benchmark.

  • Input sequence construction.
    I think the method I used for construct input sequences for the benchmark is quite different from yours (And it guess that's why I couldn't reproduce the score in the paper) even though I edited my method following your guide in PoseBusters inputs #254 (comment)
    Below is the how I constructed inputs for the benchmark;

    1. get first biological assembly mmcif file from RSCB PDB
    2. get protein sequences from the mmcif file w/wo modified residue through the key _entity_poly.pdbx_seq_one_letter_code and _entity_poly.pdbx_seq_one_letter_code_can for each.
      I used sequences from pdbx_seq_one_letter_code_can as JackHmmer MSA inputs, and pdbx_seq_one_letter_code as fasta input.
    3. count non-polymer entities (cofactors and target ligands) with _pdbx_nonpoly_scheme key in the mmcif file. I excluded crystallization aids ['SO4', 'GOL', 'EDO', 'PO4', 'ACT', 'PEG', 'DMS', 'TRS', 'PGE', 'PG4', 'FMT', 'EPE', 'MPD', 'MES', 'CD', 'IOD'] as you mentioned in your paper. For the cofactor (ones who're not corresponding to the target ligand) I parsed canonical SMILES from RCSB PDB API. For the ligand, I converted ligand sdf file provided by PoseBusters authors to the rdkit Mol, and converted the Mol to SMILES. I put all the cofactors/ligands in the input fasta. For example, in the case of 7SUC_COM, I parsed 7 [Mg+2] ion in the mmcif file following the method described above, so I write 7 [Mg+2] SMILES in the input fasta.

    When I followed above, I got total 10 samples whose token number is over 2048 (in the paper, there're only 4) as below;

7ECR_SIN	Too many tokens in input: 3048 > 2048. Please limit the length of the input sequence.
8CI0_8EL	Too many tokens in input: 2168 > 2048. Please limit the length of the input sequence.
8FV9_80J	Too many tokens in input: 2352 > 2048. Please limit the length of the input sequence.
7WJB_BGC	Too many tokens in input: 4698 > 2048. Please limit the length of the input sequence.
7PT3_3KK	Too many tokens in input: 2774 > 2048. Please limit the length of the input sequence.
7B2C_TP7	Too many tokens in input: 3026 > 2048. Please limit the length of the input sequence.
7M31_TDR	Too many tokens in input: 2396 > 2048. Please limit the length of the input sequence.
8G0V_YHT	Too many tokens in input: 2160 > 2048. Please limit the length of the input sequence.
7QSW_CAP	Too many tokens in input: 4912 > 2048. Please limit the length of the input sequence.

Is there anything wrong with the method I mentioned above?

  • Success score
    I inferred the benchmark with JackHmmer MSA whose -N flag is 1 & 3 each and the resulting success rate score (ligand RMSD <= 2.0) is 0.70 and 0.69 each. This result looks weird for me not only because it's far below the score in the paper (0.77) but also because the score with -N flag 3 is slightly lower than the one with -N flag 1 even though MSA with -N 3 is more rigorous one. I executed run_inference with num_recycles argument as 4, made total 25 samples with 5 random seeds and then sorted predictions with resulting confidence scores as described in the paper. Currently I guess below three factors as the cause of benchmark reproduction failure;

    1. Error in the Input sequence construction process described above (actually, I don't think it's minor reason)
    2. Error in the constructing JackHmmer MSA (I think it's also minor reason since I got the similar result between the one I locally folded and the one I got from your web server with same input fasta and MSA setting)
    3. Empty template inputs (since constructing template context is not implemented yet, I run run_inference without template inputs. (I suspect it's the main reason. But you mentioned as below in the paper, so I'm confused about this part.)

We spot-checked the difference in results between the server and our internal inference pipeline on a set of examples from the Posebusters evaluation set, finding that they differ by < 1Å RMSD in all cases (µ = 0.345Å, σ = 0.150Å), even though
the accelerated server version was run without templates

Can you give me some advice for proper benchmark test to get a similar score in the paper?

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants