This repository contains the code and auxiliary data associated with the "Repurposing Quantum Chemical Descriptor Datasets for on-the-Fly Generation of Informative Reaction Representations: Application to Hydrogen Atom Transfer Reactions" project. Code is provided "as-is". Minor edits may be required to tailor the scripts for different computational systems.
To set up the conda environment:
conda env create --name <env-name> --file environment.yml
Install BDE from the official repo.
- BDE: Extract and analyse the radical dataset
Comments on some of the installed libraries:
- Pebble: Multiprocessing
- DRFP: Fingerprint generation
- MORFEUS: Calculation of buried volumes
- autodE: Calculation of reaction profiles
- xyz2mol: Conversion of Cartesian coordinates to RDKit mol objects
Additionally, in order to execute the autodE high-throughput reaction profile computation workflow, Gaussian09/Gaussian16 and xTB need to be accessible.
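If both codes are installed, autodE can be pointed at them along the following lines (a minimal sketch, shown only to illustrate how autodE selects its electronic structure codes; the workflow scripts handle their own configuration):

```python
import autode as ade

# Gaussian16 (or Gaussian09 via "G09") as the high-level code, xTB as the low-level code;
# both executables must be discoverable on the PATH
ade.Config.hcode = "G16"
ade.Config.lcode = "xtb"
```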
The scripts used for the generation of the various datasets can be found in the scripts/gen_datasets directory.
For the generation of the HAT chemical space, the radical database compiled by St. John et al. (included in the data directory) is needed. Generating the chemical space and sampling a subset of representative reactions is done with the help of:
python generation_HAT.py
By default, the script generates several .csv files in the data directory; the main ones are:
- reactions_2k_autodE.csv, input for the reaction profile generation workflow.
- subset_30_g16_xtb_autodE.csv, input for benchmarking.
- reactions_1M.csv, the chemical space.
- reaction_db_wo_duplicates.csv, dataset of unique BDEs.
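These files can be inspected directly, e.g. (a minimal sketch, assuming pandas is available in the environment and the command is run from the repository root):

```python
import pandas as pd

# sampled reactions used as input for the autodE reaction profile workflow
df = pd.read_csv("data/reactions_2k_autodE.csv")
print(df.shape)
print(df.head())
```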
A subset of reactions was extracted from the RMechDB database compiled by Tavakoli et al. The script for curating the original data can be executed as follows:
python clean_data_RMechDB.py
The curated data can be found at data/RMechDB_clean.csv.
A small dataset of 24 activation energies for HAT from organic compounds by the cytochrome P450 enzyme, paper here. The script for curating the original data can be executed as follows:
python clean_data_tantillo.py
The data can be found in the /data/tantillo_data directory.
238 computed reaction profiles for alkoxy radicals abstracting hydrogens from hydrocarbons and heterosubstituted compounds in acetonitrile solution, paper here. The script for curating the original data can be executed as follows:
python clean_data_omega.py
The data can be found in the /data/omega_data directory.
Input files for high-throughput reaction profile computations, based on the reaction SMILES outputted in the previous section, can be generated
with the help of the 0_calculations.py
script in the scripts/autodE
directory as follows:
python 0_calculations.py --final_dir <name of the autodE folder to be generated> --csv_file <path to the input .csv file> [--conda_env <conda environment of autodE package>]
The script batches the reaction SMILES into individual Python files, each containing a linear sequence of autodE reaction profile computations, and places them in the autodE_input directory. Additionally, the script generates a sub-directory for each individual reaction SMILES in autodE_input (the sub-directories are numbered based on the indices in the input file). By default, 24 cores are used per computation and the M06-2X/def2tzvp//M06-2X/def2svp level of theory is used. These default options can be adjusted by modifying the script.
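For orientation, a heavily simplified sketch of what one of the generated per-reaction Python files might look like (the reaction SMILES below is a hypothetical example, and the actual level-of-theory keywords are set inside 0_calculations.py):

```python
import autode as ade

# resources and external codes used per computation (24 cores by default, see above)
ade.Config.n_cores = 24
ade.Config.hcode = "G16"
ade.Config.lcode = "xtb"

# hypothetical HAT reaction SMILES taken from one row of the input .csv file
rxn = ade.Reaction("CC.[OH]>>C[CH2].O", name="rxn_0000")
rxn.calculate_reaction_profile(free_energy=True)
```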
Under the condition that the scripts in the previous section have been run, the 0_calculations.py script can, for example, be executed as follows:
python 0_calculations.py --csv_file ../../data/reactions_2k_autodE.csv --final_dir ../../autodE_input
The autodE directory also contains a script to extract the relevant output from the autodE_folder. This script can be executed as follows:
python aux_script.py
In our workflow, the aux_script.py is copied into each sub-directory and executed once autodE has finished. This script will copy all relevant files (.log, .xyz, .csv) to a new directory (0.autode_resume_dir/<rxn_###>).
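For orientation, the collection step amounts to something like the following (a minimal sketch; the rxn_0000 name is illustrative and the exact logic lives in aux_script.py):

```python
import glob
import os
import shutil

# hypothetical reaction sub-directory and the resume directory it is copied into
src_dir = "rxn_0000"
dst_dir = os.path.join("0.autode_resume_dir", "rxn_0000")
os.makedirs(dst_dir, exist_ok=True)

# keep only the file types needed for the later analysis steps
for pattern in ("*.log", "*.xyz", "*.csv"):
    for path in glob.glob(os.path.join(src_dir, "**", pattern), recursive=True):
        shutil.copy(path, dst_dir)
```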
To summarize all the results in a final .csv file, the 1_summarize_results.py script is provided. It creates a .csv file containing the successfully computed reaction SMILES together with the activation and reaction free energies (autode_results.csv). The workflow can be executed as follows:
python 1_summarize_results.py
Examples of this can be found in the autodE_input
directory.
Note that some reactions require postprocessing to make the geometry of the products compatible with the selected TS conformer or to discard erroneous TSs. The scripts below have been included in this repository to facilitate this postprocessing.
Despite the TS quality checks present in autodE, some erroneous reaction profiles were not recognized as such and consequently had to be filtered out manually. Once autodE generates the conformers of the TS, the connectivity is not revisited, which can lead to erroneous structures. Using the same autodE functions, the graphs of both the reactants and the TS were generated, and the connectivity between every atom was checked. This script can be executed as follows:
python check_connectivity.py
The script will generate a csv-file (data/df_conn_check.csv) with the reactions to inspect. The provided file already contains the results of this inspection.
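The underlying idea can be sketched as follows (a heavily simplified sketch: the file names are hypothetical, and the actual script also accounts for the bonds being formed and broken in the HAT step):

```python
import autode as ade
from autode.mol_graphs import make_graph

# hypothetical geometries taken from a reaction sub-directory
reactant = ade.Molecule("reactant.xyz", charge=0, mult=2)
ts_guess = ade.Molecule("TS.xyz", charge=0, mult=2)

# (re)build the connectivity graphs from the geometries and compare their bonds
make_graph(reactant)
make_graph(ts_guess)
reactant_bonds = {tuple(sorted(e)) for e in reactant.graph.edges}
ts_bonds = {tuple(sorted(e)) for e in ts_guess.graph.edges}
print(f"Bonds in the TS structure absent from the reactant: {ts_bonds - reactant_bonds}")
```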
We subsequently visually inspected the normal modes of all the TSs with the imaginary frequencies listed in data/rxns_freq_500.csv.
Lastly, 2_post_processing.py checks for compatibility between products and TS. In the first step of this script, the coordinates of the subset of atoms in the TS geometry corresponding to the product molecule are extracted. The extracted geometry is then optimized with GFN2-xTB and converted to a SMILES string. If the stereochemistry is not the same for the original product and the SMILES string obtained from the TS geometry, the latter geometry is fully optimized according to the regular autodE workflow, and the reaction profile is updated. The reactions to be checked are saved in a .txt file (data/check_TS_stereochemistry) generated during the first step of this workflow. This script can be executed as follows:
python 2_post_processing.py
As its main output, the script generates a finalized reaction profile folder as well as a .csv file (data/reactivity_database.csv) with all the reaction SMILES and their computed activation and reaction energies.
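The geometry-to-SMILES comparison at the heart of this check can be sketched with xyz2mol and RDKit as follows (a minimal sketch; the file name is hypothetical and the exact xyz2mol call may differ slightly between versions):

```python
from rdkit import Chem
from xyz2mol import read_xyz_file, xyz2mol

# hypothetical GFN2-xTB-optimized product geometry extracted from the TS
atoms, charge, coordinates = read_xyz_file("product_from_TS.xyz")
mols = xyz2mol(atoms, coordinates, charge=charge)

# SMILES (including stereochemistry) to compare against the original product SMILES
print(Chem.MolToSmiles(mols[0]))
```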
The script for the calculation of tunneling corrections, 3_eckart_potential.py, can also be found in the scripts/autodE directory. This script is an adaptation of this repo and can be executed as follows:
python 3_eckart_potential.py
The script takes all the necessary information from the autodE resume directory (autodE_input/0.autode_resume_dir). Its input is the final .csv file of the post-processing step (reactivity_database.csv); a new column labeled dG_act_corrected is added, and the output is written to reactivity_database_corrected.csv. Both files can be found in the data directory.
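For reference, a transmission coefficient kappa obtained from the Eckart potential is commonly folded into an effective barrier as dG_act_corrected = dG_act - RT*ln(kappa). A minimal illustration with hypothetical numbers (this shows the general relation, not a transcription of the script):

```python
import math

R = 1.987e-3   # gas constant in kcal mol-1 K-1
T = 298.15     # temperature in K

kappa = 3.2    # hypothetical Eckart transmission coefficient
dG_act = 12.5  # hypothetical uncorrected activation free energy in kcal/mol

# tunneling lowers the effective activation free energy
dG_act_corrected = dG_act - R * T * math.log(kappa)
print(round(dG_act_corrected, 2))  # ~11.81 kcal/mol
```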
All the files related to the baseline models are included in the scripts/baseline_models directory. The baseline_models.py script, which runs each of the baseline models sequentially, can be executed as follows:
python baseline_models.py [--csv-file <path to the file containing the rxn-smiles>] [--input-file <path to the .pkl input file>] [--split-dir <path to the folder containing the requested splits for the cross validation>] [--n-fold <the number of folds to use during cross validation>] [--features <features for models based on descriptors>]
For models based on fingerprints, only the csv-file is necessary; for the models based on descriptors, the input-file is necessary as well. The fingerprints are generated at runtime, and the DRFP fingerprint is used.
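For reference, the DRFP fingerprints can be generated directly from the reaction SMILES with the drfp package installed above (a minimal sketch; the reaction SMILES is a hypothetical example):

```python
from drfp import DrfpEncoder

# hypothetical HAT reaction SMILES; in baseline_models.py these come from the csv-file
rxn_smiles = ["CC.[OH]>>C[CH2].O"]
fingerprints = DrfpEncoder.encode(rxn_smiles)
print(fingerprints[0].shape)  # one bit-vector (numpy array) per reaction
```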
Examples of both input files are included in the data directory: reactivity_database_mapped.csv and input_ffnn.pkl. To generate an input file from scratch, you should use this repo.
Using the scripts in the scripts/additional_desc_surrogate directory, the buried volume and the frozen bond dissociation energy (fr-BDE) were calculated.
The first step is the calculation of the buried volume. This script can be executed as follows:
python3 buried_vol.py
This script will extract all the xyz-files from the radical database in the xyz_paton directory, create .csv files that map each xyz-file to its SMILES, and calculate the buried volume using the Morfeus package (a minimal sketch of this call is shown after the list below). All .csv files will be stored in the data directory. The following .csv files are created in this manner:
- df_code_smiles_xyz.csv
- df_smile_mol_code.csv
- df_buried_vol.csv
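The underlying Morfeus call looks roughly as follows (a minimal sketch; the file name and the radical-centre index are hypothetical, and Morfeus uses 1-based atom indexing):

```python
from morfeus import BuriedVolume, read_xyz

# hypothetical radical geometry taken from the xyz_paton directory
elements, coordinates = read_xyz("rad_0001.xyz")

# buried volume around the radical centre (assumed here to be atom 1)
bv = BuriedVolume(elements, coordinates, 1)
print(bv.fraction_buried_volume)
```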
Before running the second script, all the closed-shell molecules of the dataset should be optimized at the GFN2-xTB level of theory; we did this with the launch_xtb_molecule.sh bash script that can be found in the same folder. The log-file of the optimization and the xyz-file of the optimized molecule should be placed in the xtb_opt_paton directory.
Once this step is done, the fr-BDE is computed using:
python3 frozen_energies.py
frozen_energies.py will first check whether all the optimizations finished successfully and whether the structures are minima through an inspection of the frequencies. Then it will generate the frozen radicals (by removing a hydrogen) and launch all the single points in a new directory (xyz_frozen_paton_rad). The fr-BDEs will be saved in the paton_rxns_fr.csv file in the data directory. Those calculations can be found here.
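Conceptually, each fr-BDE combines three single-point energies: the frozen radical (the optimized parent geometry with the abstracted hydrogen removed, not re-optimized), a hydrogen atom, and the optimized parent molecule. A minimal illustration with hypothetical values, assuming this standard definition:

```python
# hypothetical single-point energies in Hartree
e_molecule = -155.046        # optimized closed-shell parent molecule
e_frozen_radical = -154.401  # parent geometry with the abstracted H removed, not re-optimized
e_hydrogen = -0.500          # hydrogen atom

HARTREE_TO_KCAL = 627.509
fr_bde = (e_frozen_radical + e_hydrogen - e_molecule) * HARTREE_TO_KCAL
print(round(fr_bde, 1))  # frozen bond dissociation energy in kcal/mol (~91.0 here)
```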
In the scripts/surrogate_model directory, the script used to create the training and test sets for the surrogate model can be found. It can be executed as follows:
python create_data_surrogate_model.py
The generated files can be found here.
If (parts of) this workflow are used as part of a publication, please cite the associated paper:
@article{hat_predictor,
  title = {Repurposing QM Descriptor Datasets for on the Fly Generation of Informative Reaction Representations: Application to Hydrogen Atom Transfer Reactions},
  author = {Javier E. Alfonso-Ramos and Rebecca M. Neeser and Thijs Stuyver},
  journal = {Digital Discovery},
  year = {2024},
  volume = {3},
  issue = {5},
  pages = {919--931},
  doi = {10.1039/D4DD00043A},
  url = {https://doi.org/10.1039/D4DD00043A}
}
Additionally, since the workflow makes heavy use of autodE, please also cite the paper in which this code was originally presented:
@article{autodE,
doi = {10.1002/anie.202011941},
url = {https://doi.org/10.1002/anie.202011941},
year = {2021},
publisher = {Wiley},
volume = {60},
number = {8},
pages = {4266--4274},
author = {Tom A. Young and Joseph J. Silcock and Alistair J. Sterling and Fernanda Duarte},
title = {{autodE}: Automated Calculation of Reaction Energy Profiles -- Application to Organic and Organometallic Reactions},
journal = {Angewandte Chemie International Edition}
}