Hydrogen atom transfer reaction dataset

This repository contains the code and auxiliary data associated with the "Repurposing Quantum Chemical Descriptor Datasets for on-the-Fly Generation of Informative Reaction Representations: Application to Hydrogen Atom Transfer Reactions" project. Code is provided "as-is". Minor edits may be required to tailor the scripts for different computational systems.

Conda environment

To set up the conda environment:

conda env create --name <env-name> --file environment.yml

Requirements

Install BDE from the official repo

  1. BDE: extraction and analysis of the radical dataset

Comments on some of the installed libraries

  1. Pebble: multiprocessing (see the sketch below)
  2. DRFP: fingerprint generation
  3. MORFEUS: calculation of buried volumes
  4. autodE: calculation of reaction profiles
  5. xyz2mol: conversion of Cartesian coordinates to RDKit mol objects

Additionally, in order to execute the autodE high-throughput reaction profile computation workflow, Gaussian09/Gaussian16 and xTB need to be accessible.
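
Pebble is used to run many independent computations in parallel while protecting the workflow from individual tasks that hang. A minimal sketch (the worker function and timeout are illustrative, not taken from these scripts):

```python
from concurrent.futures import TimeoutError

from pebble import ProcessPool

def run_job(rxn_smiles):
    """Placeholder for an expensive per-reaction computation."""
    return len(rxn_smiles)

if __name__ == "__main__":
    jobs = ["C.[OH]>>[CH3].O", "CC.[OH]>>C[CH2].O"]
    with ProcessPool(max_workers=4) as pool:
        # schedule() returns a future; the timeout kills tasks that hang
        futures = [pool.schedule(run_job, args=(j,), timeout=60) for j in jobs]
        for job, fut in zip(jobs, futures):
            try:
                print(job, fut.result())
            except TimeoutError:
                print(job, "timed out")
```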

Generating the search space and reaction SMILES

The scripts used for the generation of the various datasets can be found in the scripts/gen_datasets directory.

In-house HAT reaction dataset

For the generation of the HAT chemical space, the radical database compiled by St. John et al., included in the data directory, is needed. Generating the chemical space and sampling a subset of representative reactions is done with:

python generation_HAT.py

By default, the script generates several .csv files in the data directory; the main ones are:

  1. reactions_2k_autodE.csv, input for the reaction profile generation workflow.
  2. subset_30_g16_xtb_autodE.csv, input for benchmarking.
  3. reactions_1M.csv, the chemical space.
  4. reaction_db_wo_duplicates.csv, the dataset of unique BDEs.
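
Conceptually, each HAT reaction SMILES pairs an H-atom donor with an abstracting radical from the database. A minimal, hypothetical sketch of this pairing step (the actual script adds extensive filtering and sampling logic):

```python
from rdkit import Chem

def hat_reaction_smiles(donor, donor_radical, abstractor_radical, abstractor_h):
    """Combine an H-atom donor and an abstracting radical into a HAT reaction SMILES:

    donor + abstractor_radical >> donor_radical + abstractor_h
    """
    canon = lambda s: Chem.MolToSmiles(Chem.MolFromSmiles(s))
    reactants = ".".join(canon(s) for s in (donor, abstractor_radical))
    products = ".".join(canon(s) for s in (donor_radical, abstractor_h))
    return f"{reactants}>>{products}"

# e.g. methane + hydroxyl radical -> methyl radical + water
print(hat_reaction_smiles("C", "[CH3]", "[OH]", "O"))
# C.[OH]>>[CH3].O
```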

Atmospheric chemistry dataset

A subset of reactions was extracted from the RMechDB database compiled by Tavakoli et al. The script for curating the original data can be executed as follows:

python clean_data_RMechDB.py

The curated data can be found at data/RMechDB_clean.csv.

Tantillo's dataset

A small dataset of 24 activation energies for HAT from organic compounds by the cytochrome P450 enzyme (paper here). The script for curating the original data can be executed as follows:

python clean_data_tantillo.py

The data can be found in the /data/tantillo_data directory.

Alkoxy radicals

A set of 238 computed reaction profiles for alkoxy radicals abstracting hydrogens from hydrocarbons and heterosubstituted compounds in acetonitrile solution (paper here). The script for curating the original data can be executed as follows:

python clean_data_omega.py

The data can be found in the /data/omega_data directory.

High-throughput reaction profile computation

Input files for high-throughput reaction profile computations, based on the reaction SMILES generated in the previous section, can be created with the 0_calculations.py script in the scripts/autodE directory as follows:

python 0_calculations.py --final_dir <name of the autodE folder to be generated> --csv_file <path to the input .csv file> [--conda_env <conda environment of autodE package>]

The script batches the reaction SMILES into individual Python files, each containing a linear sequence of autodE reaction profile computations, and places them in the autodE_input directory. Additionally, the script generates a sub-directory in autodE_input for each individual reaction SMILES (the sub-directories are numbered based on the indices in the input file). By default, 24 cores are used per computation and the M06-2X/def2tzvp//M06-2X/def2svp level of theory is applied. These defaults can be adjusted by modifying the script.
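
Each generated Python file essentially contains a linear sequence of calls like the following. A minimal sketch of a single autodE reaction profile computation under the default settings mentioned above (the exact generated code may differ):

```python
import autode as ade

# defaults used by 0_calculations.py
ade.Config.n_cores = 24
ade.Config.hcode = "g16"  # Gaussian09/Gaussian16 must be available
ade.Config.lcode = "xtb"
# the repository uses M06-2X/def2tzvp//M06-2X/def2svp; in the generated
# scripts the keywords are configured through ade.Config.G16.keywords

# one HAT reaction SMILES taken from the input .csv (illustrative)
rxn = ade.Reaction("C.[OH]>>[CH3].O", name="rxn_0000")
rxn.calculate_reaction_profile(free_energy=True)
```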

Provided that the scripts in the previous section have been run, the 0_calculations.py script can, for example, be executed as follows:

python 0_calculations.py --csv_file ../../data/reactions_2k_autodE.csv --final_dir ../../autodE_input

The autodE directory also contains a script to extract the relevant output from the generated autodE folder. This script can be executed as follows:

python aux_script.py

In our workflow, aux_script.py is copied into each sub-directory and executed once autodE has finished. The script copies all relevant files (.log, .xyz, .csv) to a new directory (0.autode_resume_dir/<rxn_###>).

To summarize all the results in a final .csv file, the 1_summarize_results.py script is provided. It creates a .csv file containing the successfully computed reaction SMILES together with the activation and reaction free energies (autode_results.csv). The workflow can be executed as follows:

python 1_summarize_results.py

Examples of this can be found in the autodE_input directory.
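
A minimal sketch of such an aggregation step, assuming each sub-directory of 0.autode_resume_dir contains a .csv file with the computed energies (the exact file layout may differ):

```python
import glob

import pandas as pd

# collect the per-reaction .csv files copied by aux_script.py
paths = sorted(glob.glob("autodE_input/0.autode_resume_dir/rxn_*/*.csv"))
frames = [pd.read_csv(p) for p in paths]

summary = pd.concat(frames, ignore_index=True)
summary.to_csv("autode_results.csv", index=False)
print(f"aggregated {len(frames)} reactions")
```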

Note that some reactions require postprocessing, either to make the geometry of the products compatible with the selected TS conformer or to discard erroneous TSs. The scripts below have been included in this repository to facilitate this postprocessing.

Post-processing reaction SMILES

Despite the TS quality checks present in autodE, some erroneous reaction profiles were not recognized as such and consequently had to be filtered out manually. Once autodE generates the conformers of the TS, the connectivity is not revisited, which can lead to erroneous structures. Using the same autodE functions, the graphs of both the reactants and the TS were generated and the connectivity between every pair of atoms was checked. This script can be executed as follows:

python check_connectivity.py 

The script generates a .csv file (data/df_conn_check.csv) listing the reactions to inspect. The file provided in this repository already contains the results of this inspection.
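
A minimal sketch of this type of connectivity check using autodE's molecular graphs (file names and multiplicities are illustrative):

```python
import autode as ade
from autode.mol_graphs import make_graph

def bond_set(xyz_path, mult=2):
    """Set of bonded atom-index pairs inferred from a geometry."""
    mol = ade.Molecule(xyz_path, mult=mult)
    make_graph(mol)  # (re)build the connectivity graph from distances
    return set(tuple(sorted(edge)) for edge in mol.graph.edges)

# a valid HAT TS should differ from the reactants only in the bonds
# involving the transferred hydrogen (indices are reaction-specific)
extra = bond_set("TS_rxn_0000.xyz") - bond_set("reactants_rxn_0000.xyz")
print("bonds present in the TS but not in the reactants:", extra)
```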

We subsequently inspected visually the normal modes of all TSs with small imaginary frequencies ($|\nu_{imag}| < 500\ \mathrm{cm^{-1}}$); the results are saved in data/rxns_freq_500.csv.

Lastly, 2_post_processing.py checks the compatibility between products and TS. In the first step of this script, the coordinates of the subset of atoms in the TS geometry corresponding to the product molecule are extracted. The extracted geometry is then optimized with GFN2-xTB and converted to a SMILES string. If the stereochemistry of the original product differs from that of the SMILES string obtained from the TS geometry, the latter geometry is fully optimized according to the regular autodE workflow and the reaction profile is updated. The reactions to be checked are saved in a .txt file (data/check_TS_stereochemistry) generated during the first step of this workflow. This script can be executed as follows:

python 2_post_processing.py

As its main output, the script generates a finalized reaction profile folder as well as a .csv file (data/reactivity_database.csv) containing all the reaction SMILES and their computed activation and reaction energies.
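
A minimal sketch of the geometry-to-SMILES comparison at the heart of this check, using xyz2mol and RDKit (file names and the example SMILES are illustrative):

```python
from rdkit import Chem
from xyz2mol import read_xyz_file, xyz2mol

def smiles_from_xyz(path, charge=0):
    """Perceive a molecule from Cartesian coordinates and return its SMILES."""
    atoms, _, coordinates = read_xyz_file(path)
    mols = xyz2mol(atoms, coordinates, charge=charge)  # returns a list of mols
    return Chem.MolToSmiles(mols[0])

expected = Chem.CanonSmiles("C[C@H](O)CC")            # original product SMILES
ts_derived = smiles_from_xyz("product_from_TS.xyz")   # geometry cut from the TS
if ts_derived != expected:
    print("stereochemistry mismatch -> re-optimize the product geometry")
```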

Tunneling correction

The script for calculating tunneling corrections, 3_eckart_potential.py, can be found in the same scripts/autodE directory; it is an adaptation of this repo. It can be executed as follows:

python 3_eckart_potential.py

The script takes all the necessary information from the autodE resume directory (autodE_input/0.autode_resume_dir). Its input is the final .csv file of the post-processing step (reactivity_database.csv); a new column labeled dG_act_corrected is added, and the result is written to reactivity_database_corrected.csv. Both files can be found in the data directory.
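
For reference, a self-contained sketch of how an Eckart tunneling coefficient can be computed from the forward and reverse barriers and the imaginary frequency, following the standard asymmetric-Eckart formulation (an illustration; not the exact code in 3_eckart_potential.py):

```python
import numpy as np
from scipy.integrate import quad

H = 6.62607015e-34   # Planck constant, J s
KB = 1.380649e-23    # Boltzmann constant, J / K
C = 2.99792458e10    # speed of light, cm / s
NA = 6.02214076e23   # Avogadro constant, 1 / mol

def eckart_kappa(dv1_kcal, dv2_kcal, nu_imag_cm, temp=298.15):
    """Tunneling coefficient kappa(T) for an asymmetric Eckart barrier.

    dv1_kcal / dv2_kcal: forward / reverse barrier heights (kcal/mol)
    nu_imag_cm: magnitude of the TS imaginary frequency (cm^-1)
    """
    kcal = 4184.0 / NA                       # kcal/mol -> J per molecule
    dv1, dv2 = dv1_kcal * kcal, dv2_kcal * kcal
    hnu = H * C * nu_imag_cm                 # h*nu of the imaginary mode, J
    a1, a2 = 2 * np.pi * dv1 / hnu, 2 * np.pi * dv2 / hnu
    denom = 1 / np.sqrt(a1) + 1 / np.sqrt(a2)
    d2 = a1 * a2 - np.pi ** 2 / 4            # can be negative for thin barriers
    cosh_d = np.cosh(2 * np.sqrt(d2)) if d2 > 0 else np.cos(2 * np.sqrt(-d2))

    def p_transmission(xi):                  # xi = E / dv1
        tpa = 2 * np.sqrt(a1 * xi) / denom
        tpb = 2 * np.sqrt(abs((xi - 1) * a1 + a2)) / denom
        return ((np.cosh(tpa + tpb) - np.cosh(tpa - tpb))
                / (np.cosh(tpa + tpb) + cosh_d))

    # Boltzmann-weighted average of the transmission probability
    beta = dv1 / (KB * temp)
    integral, _ = quad(lambda xi: p_transmission(xi) * np.exp(-beta * xi),
                       0.0, 1.0 + 50.0 / beta, limit=200)
    return float(np.exp(beta) * beta * integral)

# e.g. a 10 kcal/mol forward barrier, 15 kcal/mol reverse, 1500i cm^-1
print(eckart_kappa(10.0, 15.0, 1500.0))
```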

Baseline ML models

All the files related to the baseline models are included in the scripts/baseline_models directory. The baseline_models.py script, which runs each of the baseline models sequentially, can be executed as follows:

python baseline_models.py [--csv-file <path to the file containing the rxn-smiles>] [--input-file <path to the .pkl input file>] [--split-dir <path to the folder containing the requested splits for the cross-validation>] [--n-fold <number of folds to use during cross-validation>] [--features <features for the descriptor-based models>]

For the fingerprint-based models, only the csv-file is necessary; for the descriptor-based models, the input-file is required as well. The fingerprints are generated at runtime and the DRFP fingerprint is used.

Examples of both input files are included in the data directory: reactivity_database_mapped.csv and input_ffnn.pkl. To generate an input file from scratch, you should use this repo.
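
A minimal sketch of a fingerprint-based baseline on top of DRFP, assuming the .csv has rxn_smiles and dG_act columns (both column names and the random forest model are illustrative; the actual script runs several baseline models):

```python
import numpy as np
import pandas as pd
from drfp import DrfpEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_csv("data/reactivity_database_mapped.csv")

# encode each reaction SMILES as a DRFP bit vector
X = np.array(DrfpEncoder.encode(df["rxn_smiles"].tolist(), n_folded_length=2048))
y = df["dG_act"]  # assumed column name for the activation free energy

model = RandomForestRegressor(n_estimators=300, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
print("MAE per fold:", -scores)
```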

Computing additional descriptors

Using the scripts in the scripts/additional_desc_surrogate directory, the buried volume and the frozen bond dissociation energy (fr-BDE) are calculated.

The first step is the calculation of the buried volume. This script can be executed as follows:

python3 buried_vol.py

This script extracts all the xyz-files from the radical database into the xyz_paton directory, creates .csv files that map each xyz-file to its SMILES, and calculates the buried volume using the MORFEUS package (see the sketch after the list below). All .csv files are stored in the data directory. The following .csv files are created in this manner:

  1. df_code_smiles_xyz.csv
  2. df_smile_mol_code.csv
  3. df_buried_vol.csv
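
A minimal sketch of the buried-volume calculation with MORFEUS, assuming the (1-indexed) position of the radical atom is known (file name and index are illustrative):

```python
from morfeus import BuriedVolume, read_xyz

# geometry of a radical from the database (illustrative file name)
elements, coordinates = read_xyz("xyz_paton/radical_0001.xyz")

radical_center = 1  # 1-indexed index of the radical atom
bv = BuriedVolume(elements, coordinates, radical_center)
print("fraction buried volume:", bv.fraction_buried_volume)
```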

Before running the second script, all the closed-shell molecules of the dataset should be optimized at the GFN2-xTB level of theory; we did this with the launch_xtb_molecule.sh bash script that can be found in the same folder. The log-file of the optimization and the xyz-file of the optimized molecule should be placed in the xtb_opt_paton directory.

Once this step is done, the fr-BDE is computed using:

python3 frozen_energies.py

frozen_energies.py first checks that all the optimizations finished successfully and that the structures are minima, by inspecting the frequencies. It then generates the frozen radicals (by removing a hydrogen while keeping the remaining geometry frozen) and launches all the single-point calculations in a new directory (xyz_frozen_paton_rad). The fr-BDEs are saved in the paton_rxns_fr.csv file in the data directory. Those calculations can be found here.
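
The fr-BDE itself is simple arithmetic over the resulting electronic energies. A short sketch (all energies in Hartree; the numerical values below, including the GFN2-xTB hydrogen-atom energy, are illustrative):

```python
HARTREE_TO_KCAL = 627.5094740631

def frozen_bde(e_molecule, e_frozen_radical, e_h_atom):
    """fr-BDE = E(frozen radical) + E(H) - E(molecule), in kcal/mol."""
    return (e_frozen_radical + e_h_atom - e_molecule) * HARTREE_TO_KCAL

# illustrative GFN2-xTB energies (Hartree)
e_mol = -42.123456   # optimized closed-shell molecule (hypothetical)
e_rad = -41.601234   # single point on the frozen radical (hypothetical)
e_h = -0.393483      # hydrogen atom (assumed value)
print(f"fr-BDE = {frozen_bde(e_mol, e_rad, e_h):.1f} kcal/mol")
```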

Generating data for the surrogate model

The script to create the training and test sets for the surrogate model can be found in the scripts/surrogate_model directory. It can be executed as follows:

python create_data_surrogate_model.py

The generated files can be found here.
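
A minimal sketch of such a split, using the reactivity data described above as input (file names, split ratio, and seed are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/reactivity_database.csv")
train, test = train_test_split(df, test_size=0.2, random_state=42)
train.to_csv("surrogate_train.csv", index=False)
test.to_csv("surrogate_test.csv", index=False)
```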

References

If (parts of) this workflow are used as part of a publication, please cite the associated paper:

@article{hat_predictor,
  title   = {Repurposing Quantum Chemical Descriptor Datasets for on-the-Fly Generation of Informative Reaction Representations: Application to Hydrogen Atom Transfer Reactions},
  author  = {Javier E. Alfonso-Ramos and Rebecca M. Neeser and Thijs Stuyver},
  journal = {Digital Discovery},
  year    = {2024},
  volume  = {3},
  issue   = {5},
  pages   = {919--931},
  doi     = {10.1039/D4DD00043A},
  url     = {https://doi.org/10.1039/D4DD00043A}
}

Additionally, since the workflow makes heavy use of autodE, please also cite the paper in which this code was originally presented:

@article{autodE,
  doi = {10.1002/anie.202011941},
  url = {https://doi.org/10.1002/anie.202011941},
  year = {2021},
  publisher = {Wiley},
  volume = {60},
  number = {8},
  pages = {4266--4274},
  author = {Tom A. Young and Joseph J. Silcock and Alistair J. Sterling and Fernanda Duarte},
  title = {{autodE}: Automated Calculation of Reaction Energy Profiles -- Application to Organic and Organometallic Reactions},
  journal = {Angewandte Chemie International Edition}
}
