Improve SMILES translation for surface adsorbates #2701
You can use a * to represent a surface site in a SMILES string when reading it via RDKit, and RDKit turns this into a dummy atom with atomic number 0. By doing the reverse (telling RDKit that our surface sites have atomic number 0) we can use RDKit to *generate* SMILES strings in the same format, enabling a round trip. But for other things like InChIs it seems more robust to use an atom like Platinum (78). This allows both, with default 0, but 78 used in InChI conversion.
Using the new syntax with * for a surface site.
Unfortunately, converting a molecule TO a SMILES string uses OpenBabel if the molecule contains nitrogen, and OpenBabel writes [Pt] in place of *. You can still READ SMILES containing both * and N, so those molecules don't get a round trip:
In [9]: Molecule(smiles='CNC*').to_smiles()
Out[9]: 'CNC[Pt]'
Still, this is better than it was. (I think).
This means they can be parsed in a round trip by RDKit (the default SMILES reader). This is handy because OpenBabel is the default SMILES *writer* for things with an N atom, but not everything. Now it's more consistent, outputting a * for a surface site. I added a unit test for round-trip conversion to and from SMILES a few times for various adsorbates including some with N.
@mjohnson541 we were just discussing how RMS uses SMILES for some things and uses either RDKit or RMG. Will this have impacts on RMS that need coordination?
I don't think so. Right now RMS can't do anything with surface SMILES; this should allow RMS to accept SMILES for surface species.
Regression Testing Results
WARNING:root:Initial mole fractions do not sum to one; normalizing.
Detailed regression test results.

Regression test aromatics:
Reference: Execution time (DD:HH:MM:SS): 00:00:01:06
aromatics Passed Core Comparison ✅
Original model has 15 species.
aromatics Failed Edge Comparison ❌
Original model has 106 species.
Non-identical thermo! ❌
Identical thermo comments:
Non-identical thermo! ❌
thermo: Thermo group additivity estimation: group(Cs-(Cds-Cds)(Cds-Cds)(Cds-Cds)H) + group(Cds-Cds(Cds-Cds)(Cds-Cds)) + group(Cds-CdsCsH) + group(Cds-CdsCsH) + group(Cds-Cds(Cds-Cds)H) + group(Cds-Cds(Cds-Cds)H) + group(Cds-CdsCsH) + group(Cdd-CdsCds) + Estimated bicyclic component: polycyclic(s4_6_6_ane) - ring(Cyclohexane) - ring(Cyclohexane) + ring(124cyclohexatriene) + ring(1,4-Cyclohexadiene)
Non-identical kinetics! ❌
kinetics: Non-identical kinetics! ❌
kinetics: Non-identical kinetics! ❌
kinetics: Non-identical kinetics! ❌
kinetics: Non-identical kinetics! ❌
kinetics: Non-identical kinetics! ❌
kinetics: Non-identical kinetics! ❌
kinetics: Non-identical kinetics! ❌
kinetics: Non-identical kinetics! ❌
kinetics: Non-identical kinetics! ❌
kinetics: Non-identical kinetics! ❌
kinetics: Non-identical kinetics! ❌
kinetics: Non-identical kinetics! ❌
kinetics:
Observables Test Case: Aromatics Comparison
✅ All Observables varied by less than 0.500 on average between old model and new model in all conditions!
aromatics Passed Observable Testing ✅

Regression test liquid_oxidation:
Reference: Execution time (DD:HH:MM:SS): 00:00:02:10
liquid_oxidation Failed Core Comparison ❌
Original model has 37 species.
liquid_oxidation Failed Edge Comparison ❌
Original model has 202 species.
Observables Test Case: liquid_oxidation Comparison
✅ All Observables varied by less than 0.100 on average between old model and new model in all conditions!
liquid_oxidation Passed Observable Testing ✅

Regression test nitrogen:
Reference: Execution time (DD:HH:MM:SS): 00:00:01:28
nitrogen Failed Core Comparison ❌
Original model has 41 species.
nitrogen Failed Edge Comparison ❌
Original model has 133 species.
Observables Test Case: NC Comparison
✅ All Observables varied by less than 0.200 on average between old model and new model in all conditions!
nitrogen Passed Observable Testing ✅

Regression test oxidation:
Reference: Execution time (DD:HH:MM:SS): 00:00:02:27
oxidation Passed Core Comparison ✅
Original model has 59 species.
oxidation Passed Edge Comparison ✅
Original model has 230 species.
Observables Test Case: Oxidation Comparison
✅ All Observables varied by less than 0.500 on average between old model and new model in all conditions!
oxidation Passed Observable Testing ✅

Regression test sulfur:
Reference: Execution time (DD:HH:MM:SS): 00:00:00:56
sulfur Passed Core Comparison ✅
Original model has 27 species.
sulfur Failed Edge Comparison ❌
Original model has 89 species.
Observables Test Case: SO2 Comparison
✅ All Observables varied by less than 0.100 on average between old model and new model in all conditions!
sulfur Passed Observable Testing ✅

Regression test superminimal:
Reference: Execution time (DD:HH:MM:SS): 00:00:00:39
superminimal Passed Core Comparison ✅
Original model has 13 species.
superminimal Passed Edge Comparison ✅
Original model has 18 species.

Regression test RMS_constantVIdealGasReactor_superminimal:
Reference: Execution time (DD:HH:MM:SS): 00:00:02:22
RMS_constantVIdealGasReactor_superminimal Passed Core Comparison ✅
Original model has 13 species.
RMS_constantVIdealGasReactor_superminimal Passed Edge Comparison ✅
Original model has 13 species.
Observables Test Case: RMS_constantVIdealGasReactor_superminimal Comparison
✅ All Observables varied by less than 0.100 on average between old model and new model in all conditions!
RMS_constantVIdealGasReactor_superminimal Passed Observable Testing ✅

Regression test RMS_CSTR_liquid_oxidation:
Reference: Execution time (DD:HH:MM:SS): 00:00:05:53
RMS_CSTR_liquid_oxidation Passed Core Comparison ✅
Original model has 37 species.
RMS_CSTR_liquid_oxidation Passed Edge Comparison ✅
Original model has 206 species.
Observables Test Case: RMS_CSTR_liquid_oxidation Comparison
✅ All Observables varied by less than 0.100 on average between old model and new model in all conditions!
RMS_CSTR_liquid_oxidation Passed Observable Testing ✅

Regression test fragment:
Reference: Execution time (DD:HH:MM:SS): 00:00:00:43
fragment Passed Core Comparison ✅
Original model has 10 species.
fragment Passed Edge Comparison ✅
Original model has 33 species.
Observables Test Case: fragment Comparison
✅ All Observables varied by less than 0.100 on average between old model and new model in all conditions!
fragment Passed Observable Testing ✅

Regression test RMS_constantVIdealGasReactor_fragment:
Reference: Execution time (DD:HH:MM:SS): 00:00:03:04
RMS_constantVIdealGasReactor_fragment Passed Core Comparison ✅
Original model has 10 species.
RMS_constantVIdealGasReactor_fragment Passed Edge Comparison ✅
Original model has 27 species.
Observables Test Case: RMS_constantVIdealGasReactor_fragment Comparison
✅ All Observables varied by less than 0.100 on average between old model and new model in all conditions!
RMS_constantVIdealGasReactor_fragment Passed Observable Testing ✅

beep boop this comment was written by a bot 🤖
Chatting with Bjarne, we thought that having Nice to have?
extend the list?
This pull request is being automatically marked as stale because it has not received any interaction in the last 90 days. Please leave a comment if this is still a relevant pull request, otherwise it will automatically be closed in 30 days.
Motivation or Problem
I recently discovered (a happy accident) that you can put `*` in a SMILES string and RDKit will use it as a dummy atom with an atomic number of 0, which RMG happens to interpret as a surface site, so you can easily enter adsorbates as SMILES strings this way. (Was I the last to realize this? Seems super convenient!)
Anyway, the round trip didn't work, because it would output the atom as `[Pt]` when you convert things into a SMILES string.

Description of Changes
This PR changes it so you can do round-trip reading and writing of SMILES with `*` for surface sites.
The default atom for a surface site being turned into an RDKit molecule is now the wildcard (atomic number 0). For InChI conversion it instead uses atomic number 78 (platinum), because otherwise the InChI conversion crashes.
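The choice of stand-in atomic number described above can be sketched as a small lookup. This is only an illustration, not the code in this PR: the function name `export_atomic_number`, the `"X"` label for a surface site, and the tiny element table are all assumptions made for the example.

```python
# Hypothetical sketch: pick the atomic number to hand to the toolkit for each atom.
# "X" as the surface-site label and this function name are assumptions, not RMG's API.
SURFACE_SITE_SYMBOL = "X"
WILDCARD = 0      # what RDKit assigns to '*' when reading SMILES
PLATINUM = 78     # stand-in used when the target is an InChI

# Small illustrative table; a real implementation would use a full periodic table.
ATOMIC_NUMBERS = {"H": 1, "C": 6, "N": 7, "O": 8, "Cl": 17, "Pt": 78}


def export_atomic_number(symbol: str, for_inchi: bool = False) -> int:
    """Return the atomic number to export for an atom with the given symbol."""
    if symbol == SURFACE_SITE_SYMBOL:
        # Wildcard by default so SMILES round-trips with '*';
        # platinum for InChI, where the wildcard is reported to crash.
        return PLATINUM if for_inchi else WILDCARD
    return ATOMIC_NUMBERS[symbol]
```

With this shape, only the InChI code path needs to pass `for_inchi=True`; every other caller gets the wildcard by default.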
Molecules containing an N atom are by default converted to SMILES by OpenBabel, which used the `[Pt]` syntax. That made things look inconsistent and prevented a round-trip conversion for those adsorbates only. So I made the OpenBabel converter replace `[Pt]` with `*` after it has made a SMILES string.
Now you can do round-trip conversion to and from SMILES for at least these adsorbates: `*C`, `*CC`, `*=C`, `*[H]`, `*=O`, `*COC*`, `*CNC`, `*C1CCCC1`, `*N`, `*N(Cl)`.
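The post-processing step for OpenBabel output can be pictured as a plain string substitution. The sketch below is an assumption about the general shape, not the actual change in this PR, and `pt_to_wildcard` is a hypothetical name. It operates on SMILES strings only, so it runs without RDKit or OpenBabel installed.

```python
def pt_to_wildcard(smiles: str) -> str:
    """Rewrite every bare '[Pt]' token in a SMILES string as the wildcard '*'.

    Caveat (assumption of this sketch): a genuine platinum atom written as
    '[Pt]' would be rewritten too. Tokens such as '[Pt+2]' are left alone
    because they do not match the exact '[Pt]' text.
    """
    return smiles.replace("[Pt]", "*")


# The example from the commit message: OpenBabel's output for a molecule with N.
print(pt_to_wildcard("CNC[Pt]"))      # CNC*
print(pt_to_wildcard("[Pt]COC[Pt]"))  # *COC*
```

A string replacement is the simplest option, at the cost of not distinguishing a stand-in platinum from a real one; that trade-off seems acceptable as long as surface sites are the only source of `[Pt]` in the output.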
Testing
I wrote unit tests (that was the lion's share of the work, as usual).
Locally, I'm getting some weird augmented InChI error when I debug in VSCode (`'CH2O2/c2-1-3/h1-2H/u1,3' == 'CH2O2/c2-1-3/h1H,(H,2,3)/u1,2'`) but I can't see why that would have changed. When I run in my console with `make test` I instead get a `TestSoluteDatabase::test_saturation_density` error (`1.93 == 0`).
I'm hoping neither of these are actual problems and the GitHub CI runs smoothly. 🤞
Reviewer Tips
The output, log files, etc. will look different.
Species names are now more likely to have `*` in them (rather than `[Pt]`).
Hopefully filenames with `*` in them aren't a problem.
Maybe there are other unintended consequences we'll have to deal with?
Will this mess up anyone's workflows (in ways where they shouldn't just improve their workflows)?
Try it out and report back.
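If filenames containing `*` do turn out to be a problem in someone's workflow, one possible workaround is to percent-encode species labels before using them as filenames. This is a sketch using only the Python standard library; the helper names `label_to_filename` and `filename_to_label` are hypothetical and nothing like this is part of the PR.

```python
from urllib.parse import quote, unquote


def label_to_filename(label: str) -> str:
    """Percent-encode a species label so it contains no shell or path metacharacters."""
    # safe="" means nothing beyond letters, digits and "_.-~" is left unencoded,
    # so '*', '[', ']', '=', '(' , ')' and '/' are all escaped.
    return quote(label, safe="")


def filename_to_label(name: str) -> str:
    """Invert label_to_filename."""
    return unquote(name)


print(label_to_filename("*COC*"))  # %2ACOC%2A
```

The encoding is reversible, so the original label can be recovered from the filename when reading results back in.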