Large ligands, and how do I specify bond restrains for ligands? #284

VoyageHSSS · 2025-01-18T06:03:33Z

Ligand size: When I work with larger ligands, chai-1 usually takes 2-3 hours before inference starts, and then I encounter an error in the molecular conformation search. The run then finishes normally, but the output contains only the protein structure, without the ligand. How can I handle this situation? Could you provide some guidance on steering the molecular conformation search code? The SMILES I input is for QS-21, PubChem: https://pubchem.ncbi.nlm.nih.gov/compound/73652135.
Ligand constraints: To set a distance restraint between the ligand and the protein, I converted the predicted CIF file to PDB using Pybel. The atoms in the ligand are numbered C_1, C_2, ... O_1, O_2, ... My prior assumption is that O_1 forms a hydrogen bond with NE2 of a histidine residue in the protein. However, following the example file (https://github.com/chaidiscovery/chai-lab/blob/main/examples/covalent_bonds/8cyo.restraints), I am unable to set the restraint for my system. A possible reason is that I cannot reliably locate O_1 in the SMILES string, since the example uses @ notation for the atoms. I then forced the restraint onto @O_1 of chain B, but chai-1 raised an error.
Random seed: I can currently compute the distance between O_1 and NE2 (as in question 2) in each PDB structure predicted by chai-1. If the distance satisfies the hydrogen-bond criterion, I consider the pose reactive. However, with the default seed of 42, none of the 5 generated structures meets the requirement. I therefore followed https://github.com/chaidiscovery/chai-lab/pull/2/files and wrote a loop over random seeds. This works, but I often need to run chai-1 many times to get the desired structure. Besides the random seed, are there other parameters that influence the ligand's pose in the pocket?

I highly recommend adding a parameter to generate multiple predicted structures (optional for users).
Finally, I look forward to your reply!

arogozhnikov · 2025-01-18T11:12:27Z

Hi @VoyageHSSS

> The SMILES I input is QS-21, PubChem: https://pubchem.ncbi.nlm.nih.gov/compound/73652135.

The initial step for ligands is generating a reference conformer, but RDKit can't generate one for this molecule after 10k attempts, so the ligand is excluded from the model inputs. The PubChem page doesn't provide a reference conformer either, likely because of the same issue.

Please open a discussion in the RDKit repository to ask whether there is a way to generate some reference conformer for this molecule.

> The atoms in the ligand are numbered ...

We use the atom numbering from RDKit, which can differ from the numbering Pybel produces.

> and wrote a loop for the random seed

Yup, that's how we expect you to deal with it.

> can influence the ligand's pose in the pocket

All parameters matter. I recommend tweaking in this order: restraints, then a larger number of diffusion samples, then a larger number of trunk recycles.
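The seed loop discussed above can be sketched as follows. This is a minimal illustration, not chai-lab code: `predict` is a hypothetical callable that you would write around your own chai-1 invocation and structure parsing, returning for a given seed the coordinates of the two atoms of interest in each generated sample. The 3.5 Å cutoff is an assumed hydrogen-bond criterion.

```python
import math
from typing import Callable, Iterable, Optional, Sequence, Tuple

Point = Tuple[float, float, float]
# One (ligand_atom_xyz, protein_atom_xyz) pair per predicted sample.
AtomPairs = Sequence[Tuple[Point, Point]]


def first_passing_seed(
    predict: Callable[[int], AtomPairs],  # hypothetical wrapper around chai-1
    seeds: Iterable[int],
    max_distance: float = 3.5,  # assumed H-bond cutoff, in angstroms
) -> Optional[Tuple[int, int, float]]:
    """Return (seed, sample_index, distance) for the first sample that passes."""
    for seed in seeds:
        for sample_index, (atom_a, atom_b) in enumerate(predict(seed)):
            distance = math.dist(atom_a, atom_b)
            if distance <= max_distance:
                return seed, sample_index, distance
    return None  # no sample satisfied the criterion for any seed
```

Stopping at the first passing sample keeps the number of chai-1 runs low; collecting all passing samples instead would only require appending to a list rather than returning early.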

VoyageHSSS · 2025-01-18T11:55:28Z

Thank you so much for your patient reply. I will try more diffusion samples and trunk recycles!

Regarding the 'restraints', I still have some questions.

After generating the mol object with rdkit.Chem.MolFromSmiles(smiles), the atom to restrain in the first ligand is numbered C18, and the atom to restrain in the second ligand is numbered O26.

In the restraints file, I tried each of the following as the second line:
B,@18,C,@26,covalent,1.0,3.0,6.0,protein-ligand,h1
B,@c18,C,@o26,covalent,1.0,3.0,6.0,protein-ligand,h1
When submitted to chai-1 for prediction, either one throws an error: 'AssertionError: Expect single atoms, got 0, 0'. I still haven't found a solution to this.

arogozhnikov · 2025-01-19T06:57:51Z

Something like the line below should work:

B,@C18,C,@O26,covalent,1.0,3.0,6.0,protein-ligand,restraint-1

If it doesn't, we'll need an example to debug.
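For reference, a restraints file containing such a line could be generated as in the sketch below. It assumes the ten column names from the restraints header quoted later in this thread; `restraints_csv` is a hypothetical helper, not part of chai-lab. The atom names are copied from the line above and still need to match RDKit's enumeration for your ligand.

```python
import csv
import io

# Column names as they appear in the restraints header quoted in this thread.
COLUMNS = [
    "chainA", "res_idxA", "chainB", "res_idxB", "connection_type",
    "confidence", "min_distance_angstrom", "max_distance_angstrom",
    "comment", "restraint_id",
]


def restraints_csv(rows):
    """Render a list of dicts (one per restraint) as restraints-file text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


example = restraints_csv([
    {
        "chainA": "B", "res_idxA": "@C18",
        "chainB": "C", "res_idxB": "@O26",
        "connection_type": "covalent", "confidence": 1.0,
        "min_distance_angstrom": 3.0, "max_distance_angstrom": 6.0,
        "comment": "protein-ligand", "restraint_id": "restraint-1",
    },
])
print(example)
```

Writing the file through `csv.DictWriter` makes a misspelled column name fail loudly instead of silently producing a malformed row.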

VoyageHSSS · 2025-01-19T07:37:41Z

Here is my input FASTA:

>protein|PgUGT74AE2
MLSKTHIMFIPFPAQGHMSPMMQFAKRLAWKGVRITIVLPAQIRDSMQITNSLINTECISFDFDKDDGMPYSMQAYMGVVKLKVTNKLSDLLEKQKTNGYPVNLLVVDSLYPSRVEMCHQLGVKGAPFFTHSCAVGAIYYNAHLGKLKIPPEEGLTSVSLPSIPLLGRDDLPIIRTGTFPDLFEHLGNQFSDLDKADWIFFNTFDKLENEEAKWLSSQWPITSIGPLIPSMYLDKQLPNDKGNGINLYKADVGSCIKWLDAKDPGSVVYASFGSVKHNFGDDYMDEVAWGLLHSKYNFIWVVIEPERTKLSSDFLAEAEEKGLIVSWCPQLEVLSHKSIGSFMTHCGWNSTVEALSLGVPMVAVPQQFDQPVNAKYIVDVWQIGVRVPIGEDGVVLRGEVANCIKDVMEGEIGDELRGNALKWKGLAVEAMEKGGSSDKNIDEFISKLVSS
>ligand|UDP-Glu
O=C(N1)N([C@H]2[C@H](O)[C@H](O)[C@@H](COP(O)(OP(O)(O[C@@H]3[C@H](O)[C@@H](O)[C@H](O)[C@@H](CO)O3)=O)=O)O2)C=CC1=O
>ligand|PPT
CC(=CCC[C@@](C)([C@H]1CC[C@@]2([C@@H]1[C@@H](C[C@H]3[C@]2(C[C@@H]([C@@H]4[C@@]3(CC[C@@H](C4(C)C)O)C)O)C)O)C)O)C

Thank you very much, this has been bothering me for a long time.

arogozhnikov · 2025-01-19T09:41:05Z

Your indices are off: for example, the first molecule doesn't have 18 carbons, and the second doesn't have 26 oxygens. Again, see how RDKit enumerates atoms.

When I run your example with something more reasonable, like this:

A,@C10,B,@O3,contact,1.0,0.0,5.5,protein-heavy,restraint_1

the inputs are processed normally.
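A quick sanity check for atom names can be sketched without RDKit. This is an approximation under stated assumptions: atoms are named per element with counters starting at 1 (as described later in this thread), and atom order follows the order of atoms in the SMILES string. `atom_names` and `missing_atoms` are hypothetical helpers, and the tokenizer only handles ordinary SMILES (no wildcards, and explicit `[H]` atoms are counted as written), so confirm the names with RDKit before relying on them.

```python
import re
from collections import defaultdict

# Bracket atoms, then organic-subset atoms, then aromatic (lowercase) atoms.
_ATOM = re.compile(r"\[([^\]]*)\]|(Br|Cl|[BCNOPSFI])|([bcnops])")
_BRACKET_SYMBOL = re.compile(r"\d*([A-Z][a-z]?|[a-z]{1,2})")

# Ligand SMILES strings taken from the FASTA input in this thread.
UDP_GLU = ("O=C(N1)N([C@H]2[C@H](O)[C@H](O)[C@@H](COP(O)(OP(O)(O[C@@H]3[C@H](O)"
           "[C@@H](O)[C@H](O)[C@@H](CO)O3)=O)=O)O2)C=CC1=O")
PPT = ("CC(=CCC[C@@](C)([C@H]1CC[C@@]2([C@@H]1[C@@H](C[C@H]3[C@]2(C[C@@H]"
       "([C@@H]4[C@@]3(CC[C@@H](C4(C)C)O)C)O)C)O)C)O)C")


def atom_names(smiles):
    """Name atoms per element in SMILES order: C1, C2, ..., O1, O2, ..."""
    counters = defaultdict(int)
    names = []
    for match in _ATOM.finditer(smiles):
        bracket, plain, aromatic = match.groups()
        if bracket is not None:
            found = _BRACKET_SYMBOL.match(bracket)
            if found is None:
                continue  # e.g. wildcard atoms; not handled in this sketch
            symbol = found.group(1)
        else:
            symbol = plain or aromatic
        symbol = symbol.capitalize()  # aromatic 'c' is named like 'C'
        counters[symbol] += 1
        names.append(f"{symbol}{counters[symbol]}")
    return names


def missing_atoms(smiles, requested):
    """Return requested names (e.g. '@C18') that the ligand does not contain."""
    available = set(atom_names(smiles))
    return [name for name in requested if name.lstrip("@") not in available]


print(missing_atoms(UDP_GLU, ["@C18", "@C10"]))  # → ['@C18']
print(missing_atoms(PPT, ["@O26", "@O3"]))       # → ['@O26']
```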

VoyageHSSS · 2025-01-19T11:54:46Z

My code for getting the atom IDs:

from rdkit import Chem
from rdkit.Chem import Draw
from matplotlib import pyplot as plt

mol = Chem.MolFromSmiles('O=C(N1)N([C@H]2[C@H](O)[C@H](O)[C@@H](COP(O)(OP(O)(O[C@@H]3[C@H](O)[C@@H](O)[C@H](O)[C@@H](CO)O3)=O)=O)O2)C=CC1=O')
for atom in mol.GetAtoms():
    # Label every atom with its global RDKit index
    atom.SetProp("atomNote", str(atom.GetIdx()))
    # print(atom.GetIdx(), atom.GetSymbol(), atom.GetProp("atomNote"))
img_substrate = Draw.MolToImage(mol, size=(1200, 1200))
plt.imshow(img_substrate)

the output:

The chai-1 command:
chai fold AmGT1-UDP-Glu-20\(S\)-PPT.fa ./output --no-use-msa-server --num-trunk-recycles 3 --num-diffn-timesteps 200 --seed 42 --constraint-path restraints

I thought it was a chain ID issue, but even after changing the chains, the problem persists.

Restraints files (both variants tried):

chainA,res_idxA,chainB,res_idxB,connection_type,confidence,min_distance_angstrom,max_distance_angstrom,comment,restraint_id
A,@C10,B,@O3,covalent,1.0,0.0,6.0,protein-ligand,h1

chainA,res_idxA,chainB,res_idxB,connection_type,confidence,min_distance_angstrom,max_distance_angstrom,comment,restraint_id
B,@C10,C,@O3,covalent,1.0,0.0,6.0,protein-ligand,h1

Output:

My chai-1 version is 0.5.1. Could it be a version issue?


arogozhnikov · 2025-01-19T19:39:23Z

This is the code we use for indexing:

chai-lab/chai_lab/data/sources/rdkit.py, lines 162 to 166 in c813769

    element_counter: dict = defaultdict(int)
    for atom in mol_with_hs.GetAtoms():
        elem = atom.GetSymbol()
        element_counter[elem] += 1  # Start each counter at 1
        atom.SetProp("name", elem + str(element_counter[elem]))

It enumerates each type of atom independently, starting from one: e.g. O1, O2, O3, or C1, C2, C3.

Here is the code for enumeration:

from collections import defaultdict
from rdkit import Chem
from rdkit.Chem import Draw
from matplotlib import pyplot as plt

mol = Chem.MolFromSmiles('O=C(N1)N([C@H]2[C@H](O)[C@H](O)[C@@H](COP(O)(OP(O)(O[C@@H]3[C@H](O)[C@@H](O)[C@H](O)[C@@H](CO)O3)=O)=O)O2)C=CC1=O') 

element_counter: dict = defaultdict(int)
for atom in mol.GetAtoms():
    elem = atom.GetSymbol()
    element_counter[elem] += 1  # Start each counter at 1
    name = elem + str(element_counter[elem])
    atom.SetProp("name", name)     # will be used by downstream code
    atom.SetProp("atomNote", name) # for plotting

img_substrate = Draw.MolToImage(mol, size=(1200, 1200), ) 
plt.imshow(img_substrate)

and result for your molecule:

VoyageHSSS · 2025-01-20T02:38:49Z

Thank you very much, I will try this code!


DomML · 2025-02-24T10:53:03Z

from collections import defaultdict
from rdkit import Chem
from rdkit.Chem import Draw
from matplotlib import pyplot as plt

mol = Chem.MolFromSmiles('O=C(N1)N([C@H]2[C@H](O)[C@H](O)[C@@H](COP(O)(OP(O)(O[C@@H]3[C@H](O)[C@@H](O)[C@H](O)[C@@H](CO)O3)=O)=O)O2)C=CC1=O')

element_counter: dict = defaultdict(int)
for atom in mol.GetAtoms():
    elem = atom.GetSymbol()
    element_counter[elem] += 1  # Start each counter at 1
    name = elem + str(element_counter[elem])
    atom.SetProp("name", name)     # will be used by downstream code
    atom.SetProp("atomNote", name) # for plotting

img_substrate = Draw.MolToImage(mol, size=(1200, 1200), )
plt.imshow(img_substrate)

This code should be in the docs; it is too useful to be hidden in the issues ^^

arogozhnikov changed the title ~~Issues related to ligand size, ligand constraints, and random seed~~ Large ligands, and how do I specify bond restrains for ligands? Jan 19, 2025
