Reproduce BBBP result #2

Open
yuhui-zh15 opened this issue Mar 8, 2021 · 5 comments


yuhui-zh15 commented Mar 8, 2021

Hi, thanks for your great work and clear documentation! I'm trying to reproduce your result on BBBP. However, although I followed the exact settings in the README, there is a large gap between my result (89.4) and the reported number (93.6). I have listed all my steps below so that they are fully reproducible. Could you check whether anything is wrong on my side? Thanks a lot in advance for your help!

  1. Create Conda environment:
git clone git@github.com:tencent-ailab/grover.git
cd grover
conda create --name chem --file requirements.txt
conda activate chem
  2. Download the model:
wget https://ai.tencent.com/ailab/ml/ml-data/grover-models/pretrain/grover_base.tar.gz
tar -xvf grover_base.tar.gz
  3. Feature extraction and fine-tuning:
python scripts/save_features.py --data_path exampledata/finetune/bbbp.csv \
                                --save_path exampledata/finetune/bbbp.npz \
                                --features_generator rdkit_2d_normalized \
                                --restart 

python main.py finetune --data_path exampledata/finetune/bbbp.csv \
                        --features_path exampledata/finetune/bbbp.npz \
                        --save_dir model/finetune/bbbp/ \
                        --checkpoint_path grover_base.pt \
                        --dataset_type classification \
                        --split_type scaffold_balanced \
                        --ensemble_size 1 \
                        --num_folds 3 \
                        --no_features_scaling \
                        --ffn_hidden_size 200 \
                        --batch_size 32 \
                        --epochs 10 \
                        --init_lr 0.00015

The training log (quiet.log) is:

Fold 0
Model 0 best val loss = 0.470996 on epoch 9
Model 0 test auc = 0.887339
Ensemble test auc = 0.887339
Fold 1
Model 0 best val loss = 0.476553 on epoch 7
Model 0 test auc = 0.891758
Ensemble test auc = 0.891758
Fold 2
Model 0 best val loss = 0.488360 on epoch 9
Model 0 test auc = 0.904175
Ensemble test auc = 0.904175
3-fold cross validation
Seed 0 ==> test auc = 0.887339
Seed 1 ==> test auc = 0.891758
Seed 2 ==> test auc = 0.904175
overall_scaffold_balanced_test_auc=0.894424
std=0.007127
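
As a quick sanity check on the summary lines of the log above, here is a minimal sketch that re-derives them from the per-seed lines. The helper names (`parse_seed_aucs`, `summarize`) are hypothetical and not part of the grover codebase, and using the population standard deviation (dividing by the number of folds) is an assumption that happens to reproduce the logged `std`.

```python
import math
import re

# Per-seed lines copied from the quiet.log excerpt above.
LOG = """\
Seed 0 ==> test auc = 0.887339
Seed 1 ==> test auc = 0.891758
Seed 2 ==> test auc = 0.904175
"""

SEED_LINE = re.compile(r"^Seed (\d+) ==> test auc = ([0-9.]+)$")


def parse_seed_aucs(log_text):
    """Return the per-seed test AUCs found in the log text, in order."""
    aucs = []
    for line in log_text.splitlines():
        match = SEED_LINE.match(line.strip())
        if match:
            aucs.append(float(match.group(2)))
    return aucs


def summarize(aucs):
    """Return (mean, population std) of the AUCs."""
    mean = sum(aucs) / len(aucs)
    # Assumption: divide by N (population variance), not N - 1.
    variance = sum((a - mean) ** 2 for a in aucs) / len(aucs)
    return mean, math.sqrt(variance)


mean_auc, std_auc = summarize(parse_seed_aucs(LOG))
print(f"mean={mean_auc:.6f} std={std_auc:.6f}")
```

This prints `mean=0.894424 std=0.007127`, which agrees with the `overall_scaffold_balanced_test_auc` and `std` lines in the log, so the reported average is simply the unweighted mean over the three folds.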

TWRogers commented Mar 9, 2021

Firstly, thanks to the authors for the easy-to-use codebase; it's deeply appreciated!

I can confirm that I see the same issue as @yuhui-zh15 on BBBP. In my case I get:

3-fold cross validation
Seed 0 ==> test auc = 0.901969
Seed 1 ==> test auc = 0.903515
Seed 2 ==> test auc = 0.876906
overall_scaffold_balanced_test_auc=0.894130
std=0.012196

I have done multiple runs and tried a few things, including enabling and disabling args.dense and changing the split type to random just in case, but I can't get an AUC close to the one stated in the paper. Some lucky random folds reach 0.95 AUC, but this disappears in the averaging.

I will experiment with the large model and some of the other endpoints to see if I have any luck reproducing any of the results.


TWRogers commented Mar 9, 2021

P.S. The downloadable fine-tuned models seem to be much larger than the base model and vary in size, so perhaps different hyperparameters, and in some cases even ensembles, were used for each endpoint? Unfortunately, I am having difficulty downloading them to verify.

yuhui-zh15 (Author) commented

I tried to fine-tune the large model, but it seems to be even worse than the base model.

Fold 0
Model 0 best val loss = 0.486441 on epoch 7
Model 0 test auc = 0.893492
Ensemble test auc = 0.893492
Fold 1
Model 0 best val loss = 0.479239 on epoch 8
Model 0 test auc = 0.888364
Ensemble test auc = 0.888364
Fold 2
Model 0 best val loss = 0.490516 on epoch 0
Model 0 test auc = 0.892271
Ensemble test auc = 0.892271
3-fold cross validation
Seed 0 ==> test auc = 0.893492
Seed 1 ==> test auc = 0.888364
Seed 2 ==> test auc = 0.892271
overall_scaffold_balanced_test_auc=0.891375
std=0.002187


WenjinW commented May 13, 2021

Thanks to the authors for providing the source code.
Unfortunately, I get the same results as @yuhui-zh15 and @TWRogers on BBBP. My results are as follows:

Model 0 test auc = 0.895133
Ensemble test auc = 0.895133
1-fold cross validation
Seed 0 ==> test auc = 0.895133
overall_scaffold_balanced_test_auc=0.895133
std=0.000000

The test auc (0.895133) is lower than the value reported in the paper (0.936).
Are there any special tricks that need to be considered?
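
To put the numbers in this thread side by side, here is a small sketch that computes how far each reported mean test AUC falls short of the paper's 0.936. The AUC values are copied from the logs posted above; the run labels and the `gap_to_paper` helper are hypothetical names used only for illustration.

```python
# Value reported in the paper for BBBP, as quoted in this thread.
PAPER_AUC = 0.936

# Mean test AUCs copied from the logs posted in this thread.
# The labels are informal descriptions, not official run names.
REPORTED = {
    "yuhui-zh15, base model, 3 folds": 0.894424,
    "TWRogers, 3 folds": 0.894130,
    "yuhui-zh15, large model, 3 folds": 0.891375,
    "WenjinW, 1 fold": 0.895133,
}


def gap_to_paper(auc, paper_auc=PAPER_AUC):
    """Return the shortfall of a run relative to the paper's number."""
    return paper_auc - auc


for label, auc in REPORTED.items():
    print(f"{label:35s} auc={auc:.4f} gap={gap_to_paper(auc):.4f}")
```

Every run posted so far lands between 0.04 and 0.045 AUC below the paper's figure, which suggests a systematic difference in setup rather than seed-to-seed noise.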


wuhaoxz commented Nov 17, 2023

Hello, I also encountered the same problem as you. Have you solved it? @yuhui-zh15 @TWRogers @WenjinW
