Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large ligands, and how do I specify bond restrains for ligands? #284

Open
VoyageHSSS opened this issue Jan 18, 2025 · 9 comments
Open

Large ligands, and how do I specify bond restrains for ligands? #284

VoyageHSSS opened this issue Jan 18, 2025 · 9 comments

Comments

@VoyageHSSS
Copy link

  1. ligand size:When I work with larger ligands, it usually takes 2-3 hours before running chai-1, and then I encounter an error in the molecular conformation search. Eventually, chai-1 runs normally, but it only produces the structure of the empty protein. How can I handle this situation? Could you provide some guidance on directing the molecular conformation search code? The SMILES I input is QS-21, PubChem: https://pubchem.ncbi.nlm.nih.gov/compound/73652135.
  2. ligand constraints: Regarding the distance restriction between the ligand and the protein, I converted the predicted CIF file to PDB using Pybel. The atoms in the ligand are numbered as C_1, C_2, ... O_1, O_2, ... I have a prior assumption that O_1 forms a hydrogen bond with the NE2 of the histidine residue in the protein. However, according to the configuration in the example file (https://github.com/chaidiscovery/chai-lab/blob/main/examples/covalent_bonds/8cyo.restraints), I am unable to set the constraints for my system. The possible reason is that I cannot accurately find O_1 in the SMILES string, as the example uses @s for the atoms. I then forced the setting of the B-chain @O_1 as a constraint, but encountered an error when running chai-1.
  3. Random seed: Currently, I can accurately calculate the distance between O_1 and NE2 (as shown in question2) in each PDB structure predicted by chai-1. If the distance satisfies the criteria for hydrogen bonding, I consider it reactive. However, using the default seed: 42, none of the 5 generated structures meet the requirements. Therefore, I referred to (https://github.com/chaidiscovery/chai-lab/pull/2/files) and wrote a loop for the random seed. Although this method is useful, I often need to run chai-1 multiple times to get the desired structure. I would like to know, besides the random seed, are there other parameters that can influence the ligand's pose in the pocket?

I highly recommend adding a parameter to generate multiple predicted structures (optional for users).
Finally, I look forward to your reply!

@arogozhnikov
Copy link
Contributor

Hi @VoyageHSSS

The SMILES I input is QS-21, PubChem: https://pubchem.ncbi.nlm.nih.gov/compound/73652135.

initial step for ligands is generation of a reference conformer - but rdkit can't generate one after 10k attempts, so it is excluded from model inputs. PubChem page also doesn't provide a reference conformer, likely because of the same issue.

Please open a discussion in RDkit if there is a way to generate some reference conformer for this molecule.

The atoms in the ligand are numbered ...

we use numbering from rdkit, it can be different.

and wrote a loop for the random seed

yup, that's how we expect you to deal with it.

can influence the ligand's pose in the pocket

All parameters matter. I recommend tweaking in this order: restrains, increase number of diffusion samples, increase number of trunk recycles.

@VoyageHSSS
Copy link
Author

Thank you so much for your patient reply. I will try more diffusion samples and trunk!!

Regarding the 'restraints', I still have some questions.

After generating the mol object using rdkit.MolFromSmiles("smiles"), the first ligand's atom that needs to be restricted is numbered (C18), and the second ligand's atom that needs to be restricted is numbered (O26).

In the restraints file, the second line is set as
(B,@18,C,@26,covalent,1.0,3.0,6.0,protein-ligand,h1) / (B,@c18,C,@o26,covalent,1.0,3.0,6.0,protein-ligand,h1).
When submitted to chai-1 for prediction, it throws an error: 'AssertionError: Expect single atoms, got 0, 0'. I think I still haven't found a solution to this.

@arogozhnikov
Copy link
Contributor

arogozhnikov commented Jan 19, 2025

smth like below should work:

B,@C18,C,@O26,covalent,1.0,3.0,6.0,protein-ligand,restraint-1

if it doesn't, we'll need an example to debug.

@VoyageHSSS
Copy link
Author

smth like below should work:

B,@C18,C,@O26,covalent,1.0,3.0,6.0,protein-ligand,restraint-1

if it doesn't, we'll need an example to debug.

here is my input_fasta:

>protein|PgUGT74AE2
MLSKTHIMFIPFPAQGHMSPMMQFAKRLAWKGVRITIVLPAQIRDSMQITNSLINTECISFDFDKDDGMPYSMQAYMGVVKLKVTNKLSDLLEKQKTNGYPVNLLVVDSLYPSRVEMCHQLGVKGAPFFTHSCAVGAIYYNAHLGKLKIPPEEGLTSVSLPSIPLLGRDDLPIIRTGTFPDLFEHLGNQFSDLDKADWIFFNTFDKLENEEAKWLSSQWPITSIGPLIPSMYLDKQLPNDKGNGINLYKADVGSCIKWLDAKDPGSVVYASFGSVKHNFGDDYMDEVAWGLLHSKYNFIWVVIEPERTKLSSDFLAEAEEKGLIVSWCPQLEVLSHKSIGSFMTHCGWNSTVEALSLGVPMVAVPQQFDQPVNAKYIVDVWQIGVRVPIGEDGVVLRGEVANCIKDVMEGEIGDELRGNALKWKGLAVEAMEKGGSSDKNIDEFISKLVSS
>ligand|UDP-Glu
O=C(N1)N([C@H]2[C@H](O)[C@H](O)[C@@H](COP(O)(OP(O)(O[C@@H]3[C@H](O)[C@@H](O)[C@H](O)[C@@H](CO)O3)=O)=O)O2)C=CC1=O
>ligand|PPT
CC(=CCC[C@@](C)([C@H]1CC[C@@]2([C@@H]1[C@@H](C[C@H]3[C@]2(C[C@@H]([C@@H]4[C@@]3(CC[C@@H](C4(C)C)O)C)O)C)O)C)O)C

Thank you very much, this has been bothering me for a long time.

@arogozhnikov
Copy link
Contributor

Your indices are off, e.g. first molecule doesn't have 18 carbons, second doesn't have 26 oxygens - again, see how rdkit enumerates atoms.

When I run your example with something more reasonable, like this:

A,@C10,B,@O3,contact,1.0,0.0,5.5,protein-heavy,restraint_1

the inputs are processed normally.

@arogozhnikov arogozhnikov changed the title Issues related to ligand size, ligand constraints, and random seed Large ligands, and how do I specify bond restrains for ligands? Jan 19, 2025
@VoyageHSSS
Copy link
Author

my code for get atom_id:
mol = Chem.MolFromSmiles('O=C(N1)N([C@H]2[C@H](O)[C@H](O)[C@@H](COP(O)(OP(O)(O[C@@H]3[C@H](O)[C@@H](O)[C@H](O)[C@@H](CO)O3)=O)=O)O2)C=CC1=O') for atom in mol.GetAtoms(): atom.SetProp("atomNote", str(atom.GetIdx())) # print(atom.GetIdx(), atom.GetSymbol(), atom.GetProp("atomNote")) img_substrate = Draw.MolToImage(mol, size=(1200, 1200)) plt.imshow(img_substrate)

the output:

Image

chai-1 code:
chai fold AmGT1-UDP-Glu-20\(S\)-PPT.fa ./output --no-use-msa-server --num-trunk-recycles 3 --num-diffn-timesteps 200 --seed 42 --constraint-path restraints

I thought it was a chain ID issue, but even after changing the chain, the problem still persists.
restrains
chainA,res_idxA,chainB,res_idxB,connection_type,confidence,min_distance_angstrom,max_distance_angstrom,comment,restraint_id A,@C10,B,@O3,covalent,1.0,0.0,6.0,protein-ligand,h1
chainA,res_idxA,chainB,res_idxB,connection_type,confidence,min_distance_angstrom,max_distance_angstrom,comment,restraint_id B,@C10,C,@O3,covalent,1.0,0.0,6.0,protein-ligand,h1

output:

Image

My chai-1 version is 0.5.1. Could it be a version issue?

Your indices are off, e.g. first molecule doesn't have 18 carbons, second doesn't have 26 oxygens - again, see how rdkit enumerates atoms.

When I run your example with something more reasonable, like this:

A,@C10,B,@O3,contact,1.0,0.0,5.5,protein-heavy,restraint_1

the inputs are processed normally.

@arogozhnikov
Copy link
Contributor

arogozhnikov commented Jan 19, 2025

this is code we use for indexing:

element_counter: dict = defaultdict(int)
for atom in mol_with_hs.GetAtoms():
elem = atom.GetSymbol()
element_counter[elem] += 1 # Start each counter at 1
atom.SetProp("name", elem + str(element_counter[elem]))

it enumerates each type of atom independently starting from one, e.g. O1, O2, O3, or C1, C2, C3.

Here is the code for enumeration:

from collections import defaultdict
from rdkit import Chem
from rdkit.Chem import Draw
from matplotlib import pyplot as plt

mol = Chem.MolFromSmiles('O=C(N1)N([C@H]2[C@H](O)[C@H](O)[C@@H](COP(O)(OP(O)(O[C@@H]3[C@H](O)[C@@H](O)[C@H](O)[C@@H](CO)O3)=O)=O)O2)C=CC1=O') 

element_counter: dict = defaultdict(int)
for atom in mol.GetAtoms():
    elem = atom.GetSymbol()
    element_counter[elem] += 1  # Start each counter at 1
    name = elem + str(element_counter[elem])
    atom.SetProp("name", name)     # will be used by downstream code
    atom.SetProp("atomNote", name) # for plotting

img_substrate = Draw.MolToImage(mol, size=(1200, 1200), ) 
plt.imshow(img_substrate)

and result for your molecule:

Image

@VoyageHSSS
Copy link
Author

VoyageHSSS commented Jan 20, 2025 via email

@DomML
Copy link

DomML commented Feb 24, 2025

from collections import defaultdict
from rdkit import Chem
from rdkit.Chem import Draw
from matplotlib import pyplot as plt

mol = Chem.MolFromSmiles('O=C(N1)N([C@H]2C@HC@HC@@HO2)C=CC1=O')

element_counter: dict = defaultdict(int)
for atom in mol.GetAtoms():
elem = atom.GetSymbol()
element_counter[elem] += 1 # Start each counter at 1
name = elem + str(element_counter[elem])
atom.SetProp("name", name) # will be used by downstream code
atom.SetProp("atomNote", name) # for plotting

img_substrate = Draw.MolToImage(mol, size=(1200, 1200), )
plt.imshow(img_substrate)

This code should be in the doc, too useful to be hiden in the issues ^^

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants
@arogozhnikov @DomML @VoyageHSSS and others