Reproducibility of benchmark evaluation #280

Bae-SungHan · 2025-01-13T05:04:22Z

Hi
I've been trying to reproduce PoseBusters benchmark score you reported, but unfortunately I haven't been able to do so yet. So I would like to ask if I could get help with this.
Here's my process to infer PoseBusters dataset locally;

Directly parsed protein sequence, modified residue info, cofactor list, and number of ligand of interest from predicted pdb file provided in https://chaiassets.com/chai-1/paper/assets/posebusters_predictions.zip.
Below is the code I used;

def parse_infos(pdb_file, ligand_ccd, ligand_smiles):
    parser = PDBParser(QUIET=True)
    structure = parser.get_structure("structure", pdb_file)

    chain_to_seq = {}
    modified_residue_dict = {}
    cofactors = []
    ligand_count = 0
    for model in structure:
        for chain in model:
            sequence = ""
            modified_residues = dict()
            for residue_id, residue in enumerate(chain):
                resname = residue.get_resname()
                if resname == ligand_ccd:
                    ligand_count += 1
                elif is_aa(residue, standard=False):
                    if resname in standard_aa_names:
                        aa = index_to_one(three_to_index(resname))
                        sequence += aa
                    elif resname in modified_residue_to_parent:
                        # Known modified residue mapped to standard amino acid
                        aa = index_to_one(
                            three_to_index(modified_residue_to_parent[resname])
                        )
                        sequence += aa
                        modified_residues[residue_id] = resname
                    else:
                        # Unknown modified residue
                        sequence += "X"
                        modified_residues[residue_id] = resname
                else:
                    if resname in ccd:
                        cofactors.append(resname)
                    else:
                        cofactors.append("UNK")
            if len(sequence):
                chain_to_seq[chain.id] = sequence
            if len(modified_residues):
                modified_residue_dict[chain.id] = modified_residues
    ligand_smiles_list = [ligand_smiles for _ in range(ligand_count)]
    return chain_to_seq, modified_residue_dict, cofactors, ligand_smiles_list

For the case of 8DP2_UMA, the output of above is like;

({'B': 'HHHHHMFRIVVGLGKSGMSLVRYLARRGLPFAVVDTRENPPELATLRAQYPQVEVRCGELDAEFLCSARELYVSPGLSLRTPALVQAAAKGVRISGDIDLFAREAKAPIVAITGSNAKSTVTTLVGEMAVAADKRVAVGGNLGTPALDLLADDIELYVLELSSFQLETCDRLNAEVATVLNVSEDHMDRYDGMADYHLAKHRIFRGARQVVVNRADALTRPLIADTVPCWSFGLNKPDFKAFGLIEEDGQKWLAFQFDKLLPVGELKIRGAHNYSNALAALALGHAVGLPFDAMLGALKAFSGLAHRCQWVRERQGVSYYDDSKATNVGAALAAIEGLGADIDGKLVLLAGGDGKGADFHDLREPVARFCRAVVLLGRDAGLIAQALGNAVPLVRVATLDEAVRQAAELAREGDAVLLSPACASLDMFKNFEERGRLFAKAVEELA'},
 {'B': {199: 'KCX'}},
 ['MG', 'PE8', 'SIN'],
 ['CC(=O)N[C@H]1[C@@H](O[P@@](=O)(O)O[P@](=O)(O)OC[C@H]2O[C@@H](n3ccc(=O)[nH]c3=O)[C@H](O)[C@@H]2O)O[C@H](CO)[C@@H](O)[C@@H]1O[C@H](C)C(=O)N[C@@H](C)C(=O)O'])

Wrote input fasta from the result of 1.
I wrote CCD code for the cofactor and SMILES for ligand of interest.
For the case of 8DP2_UMA, the resulting input fasta is like;

>protein|name=A
HHHHHMFRIVVGLGKSGMSLVRYLARRGLPFAVVDTRENPPELATLRAQYPQVEVRCGELDAEFLCSARELYVSPGLSLRTPALVQAAAKGVRISGDIDLFAREAKAPIVAITGSNAKSTVTTLVGEMAVAADKRVAVGGNLGTPALDLLADDIELYVLELSSFQLETCDRLNAEVATVLNVSEDHMDRYDGMADYHLA(KCX)HRIFRGARQVVVNRADALTRPLIADTVPCWSFGLNKPDFKAFGLIEEDGQKWLAFQFDKLLPVGELKIRGAHNYSNALAALALGHAVGLPFDAMLGALKAFSGLAHRCQWVRERQGVSYYDDSKATNVGAALAAIEGLGADIDGKLVLLAGGDGKGADFHDLREPVARFCRAVVLLGRDAGLIAQALGNAVPLVRVATLDEAVRQAAELAREGDAVLLSPACASLDMFKNFEERGRLFAKAVEELA
>ligand|name=B
MG
>ligand|name=C
PE8
>ligand|name=D
SIN
>ligand|name=E
CC(=O)N[C@H]1[C@@H](O[P@@](=O)(O)O[P@](=O)(O)OC[C@H]2O[C@@H](n3ccc(=O)[nH]c3=O)[C@H](O)[C@@H]2O)O[C@H](CO)[C@@H](O)[C@@H]1O[C@H](C)C(=O)N[C@@H](C)C(=O)O

Got JackHmmer MSA on the latest version of UniProt, UniRef90, and MGnify DB and JackHmmer argument -N 1 -E 0.0001 --incE 0.0001 --F1 0.0005 --F2 0.00005 --F3 0.0000005
Overall JackHmmer MSA process referred to the code in alphafold3
After getting a3m files for each DB, I used merge_a3m_in_directory method, to finalize the MSA to the Chai input
Infered with chai fold command.
I modified --num-trunk-recycles arguments to 4, as you mentioned in the technical report.
I got only 5 sample outputs with single seed.
Calculated ligand RMSD as I described in the Reproduce PoseBusters score #210 (comment) ranked with confidence score.

With steps as I described above, the resulting success rate (RSMD <=2.0) was only about 0.69, much lower than the score you reported, 0.77.
Even considering that I only inferred with single seed, not 5, there's quite big gap so I think there must be problem with my inference process.
Since I successfully reproduced the success rate with calculating ligand RMSD of results in https://chaiassets.com/chai-1/paper/assets/posebusters_predictions.zip, I don't think it's a problem with how to measure ligand rmsd.

Reasons I suspect is three

Wrong input fasta construction
Wrong MSA construction
Model checkpoint you provided as open source is inferior one of the one you used for getting the score reported.

Unfortunately, I couldn't get any more information about these.
If you can give some advises for PoseBusters reproducing steps, it will be great help for me.

Thank you!

The text was updated successfully, but these errors were encountered:

arogozhnikov · 2025-01-14T03:18:14Z

Hi @Bae-SungHan

…was only about 0.69, …
Even considering that I only inferred with single seed, not 5…

We don’t have a number for a single seed, but 0.69 does not sound completely off the range.

Infered with chai fold command.

CLI is not for automation. My comment is not about quality of prediction, it is about healthy engineering practices - you’re given a nice pythonic API with detailed examples so please use the right tool for the job.

Calculated ligand RMSD as I described in the …

I guess since numbers on published predictions match, you have working code. Sanity check: run against some other baseline with known performance, like RF2AA. If you want someone to validate code - please approach posebusters’ authors. Closing that issue.

For the case of 8DP2_UMA, the resulting input fasta is like …

This example looks wrong - chai-1 accepts ligands as SMILEs; CCDs are accepted as modified AAs in proteins.
Another remark: if there are any bonds in the complex (not sure if this happens in posebusters), you need to parse those and provide them as an input.

After getting a3m files for each DB, I used merge_a3m_in_directory method, to finalize the MSA

Simple sanity check: check that uniprot sequences are paired, and for other DBs there is no pairing.

Another remark is to look at per-complex predictions, and split into bins: number of ligands, max number of copies for a ligand, max number of copies for protein, max sequence length, total number of tokens - it may reveal a helpful pattern on where the discrepancies may come from.

Bae-SungHan · 2025-01-14T04:47:34Z

@arogozhnikov Thank you for kind response!

Bae-SungHan · 2025-01-24T09:00:04Z

Hi.
I'd like to ask you how to prepare protein/ligand input sequences for PoseBusters.
For the example of 7SUC_COM in PoseBusters benchmark, when I parsed the sequences from cif file of biological assembly 1, I got

6 protein chains
sum of protein chains = 2472
two ligand(ccd=COM)s
However, when I parsed the sequences from pdb file you provided, I got

5 protein chains (I thought they're subset of 6 protein chains from original cif, but they're not)
sum of protein chains = 1794
only one ligand
I think you cropped or filtered chain and ligands on your rule, but I don't know what it is.
Can you give me a guide?

Thank you

jacquesboitreaud · 2025-01-24T19:38:10Z

Hi @Bae-SungHan,
this case is covered in our paper https://www.biorxiv.org/content/10.1101/2024.10.10.615955v2.full.pdf:

For 4/428 structures that have more than 2048 tokens, we crop the assembly to the 2048 tokens closest to the ligand binding site

you can also see this in the prediction below (orange is our prediction of crop that you linked to vs grey - ground truth assembly 1 from PDB). The second COM ligand is not part of the crop since it is quite far from the one that was arbitrarily selected.

If you want to re-predict a crop using chai-1, you should not use interface for fasta inputs, but directly interact with run_folding_on_context, where the context should be subsampled (from full context) to 2048 closest tokens. There is some guidance in #289 if you want to implement this

Bae-SungHan · 2025-01-31T07:03:59Z

Thank you for your kind explanation!
In addition to this, I'd like to get some help about the benchmark.

Input sequence construction.
I think the method I used for construct input sequences for the benchmark is quite different from yours (And it guess that's why I couldn't reproduce the score in the paper) even though I edited my method following your guide in PoseBusters inputs #254 (comment)
Below is the how I constructed inputs for the benchmark;
1. get first biological assembly mmcif file from RSCB PDB
2. get protein sequences from the mmcif file w/wo modified residue through the key _entity_poly.pdbx_seq_one_letter_code and _entity_poly.pdbx_seq_one_letter_code_can for each.
  I used sequences from pdbx_seq_one_letter_code_can as JackHmmer MSA inputs, and pdbx_seq_one_letter_code as fasta input.
3. count non-polymer entities (cofactors and target ligands) with _pdbx_nonpoly_scheme key in the mmcif file. I excluded crystallization aids ['SO4', 'GOL', 'EDO', 'PO4', 'ACT', 'PEG', 'DMS', 'TRS', 'PGE', 'PG4', 'FMT', 'EPE', 'MPD', 'MES', 'CD', 'IOD'] as you mentioned in your paper. For the cofactor (ones who're not corresponding to the target ligand) I parsed canonical SMILES from RCSB PDB API. For the ligand, I converted ligand sdf file provided by PoseBusters authors to the rdkit Mol, and converted the Mol to SMILES. I put all the cofactors/ligands in the input fasta. For example, in the case of 7SUC_COM, I parsed 7 [Mg+2] ion in the mmcif file following the method described above, so I write 7 [Mg+2] SMILES in the input fasta.
When I followed above, I got total 10 samples whose token number is over 2048 (in the paper, there're only 4) as below;

7ECR_SIN	Too many tokens in input: 3048 > 2048. Please limit the length of the input sequence.
8CI0_8EL	Too many tokens in input: 2168 > 2048. Please limit the length of the input sequence.
8FV9_80J	Too many tokens in input: 2352 > 2048. Please limit the length of the input sequence.
7WJB_BGC	Too many tokens in input: 4698 > 2048. Please limit the length of the input sequence.
7PT3_3KK	Too many tokens in input: 2774 > 2048. Please limit the length of the input sequence.
7B2C_TP7	Too many tokens in input: 3026 > 2048. Please limit the length of the input sequence.
7M31_TDR	Too many tokens in input: 2396 > 2048. Please limit the length of the input sequence.
8G0V_YHT	Too many tokens in input: 2160 > 2048. Please limit the length of the input sequence.
7QSW_CAP	Too many tokens in input: 4912 > 2048. Please limit the length of the input sequence.

Is there anything wrong with the method I mentioned above?

Success score
I inferred the benchmark with JackHmmer MSA whose -N flag is 1 & 3 each and the resulting success rate score (ligand RMSD <= 2.0) is 0.70 and 0.69 each. This result looks weird for me not only because it's far below the score in the paper (0.77) but also because the score with -N flag 3 is slightly lower than the one with -N flag 1 even though MSA with -N 3 is more rigorous one. I executed run_inference with num_recycles argument as 4, made total 25 samples with 5 random seeds and then sorted predictions with resulting confidence scores as described in the paper. Currently I guess below three factors as the cause of benchmark reproduction failure;
1. Error in the Input sequence construction process described above (actually, I don't think it's minor reason)
2. Error in the constructing JackHmmer MSA (I think it's also minor reason since I got the similar result between the one I locally folded and the one I got from your web server with same input fasta and MSA setting)
3. Empty template inputs (since constructing template context is not implemented yet, I run run_inference without template inputs. (I suspect it's the main reason. But you mentioned as below in the paper, so I'm confused about this part.)

We spot-checked the difference in results between the server and our internal inference pipeline on a set of examples from the Posebusters evaluation set, finding that they differ by < 1Å RMSD in all cases (µ = 0.345Å, σ = 0.150Å), even though
the accelerated server version was run without templates

Can you give me some advice for proper benchmark test to get a similar score in the paper?

Thank you!

arogozhnikov mentioned this issue Jan 14, 2025

Reproduce PoseBusters score #210

Closed

arogozhnikov mentioned this issue Jan 24, 2025

filtering chain & ligand inputs #293

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reproducibility of benchmark evaluation #280

Reproducibility of benchmark evaluation #280

Bae-SungHan commented Jan 13, 2025 •

edited

Loading

arogozhnikov commented Jan 14, 2025 •

edited

Loading

Bae-SungHan commented Jan 14, 2025

Bae-SungHan commented Jan 24, 2025

jacquesboitreaud commented Jan 24, 2025

Bae-SungHan commented Jan 31, 2025 •

edited

Loading

Reproducibility of benchmark evaluation #280

Reproducibility of benchmark evaluation #280

Comments

Bae-SungHan commented Jan 13, 2025 • edited Loading

arogozhnikov commented Jan 14, 2025 • edited Loading

Bae-SungHan commented Jan 14, 2025

Bae-SungHan commented Jan 24, 2025

jacquesboitreaud commented Jan 24, 2025

Bae-SungHan commented Jan 31, 2025 • edited Loading

Bae-SungHan commented Jan 13, 2025 •

edited

Loading

arogozhnikov commented Jan 14, 2025 •

edited

Loading

Bae-SungHan commented Jan 31, 2025 •

edited

Loading