Adapt ZairaChem to regression tasks #31
Comments
@miquelduranfrigola,
Classification might be a temporary workaround.
Hello @HellenNamulinda - thanks for your insightful comments. I completely agree with your points.
We will start by doing some mild tests outside ZairaChem with @JHlozek: select one model we know well and compare a normal regression with a classifier-based surrogate regression. We will then decide which approach to take.
Updates: For proof of concept, my approach is to:
I plan to do this as a 3-fold cross-validation for two models, one with a larger dataset (solubility, plasmodium, etc.) and one with a smaller dataset (maybe caco). The exploration around the solubility endpoint is underway. When those initial results are back, the future roadmap is:
The regression aspects of the ZairaChem code will not be touched until the code refactoring is complete (no rush here as the exploration above will take some time).
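As a rough illustration of the 3-fold cross-validation comparison described above, here is a minimal standard-library sketch. It is not ZairaChem code: `kfold_indices`, `r2_score`, `cross_validate` and the `fit_predict` callback are hypothetical names chosen for this example, and any model (regressor or classifier-based surrogate) would be plugged in through `fit_predict`.

```python
import random
from statistics import mean


def kfold_indices(n_samples, n_splits=3, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices round-robin into n_splits folds.
    folds = [idx[i::n_splits] for i in range(n_splits)]
    for k in range(n_splits):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def cross_validate(fit_predict, X, y, n_splits=3, seed=0):
    """Return the per-fold R^2 of fit_predict(X_train, y_train, X_test).

    `fit_predict` is a hypothetical callback: it trains on the training
    fold and returns one prediction per test sample.
    """
    scores = []
    for train, test in kfold_indices(len(y), n_splits, seed):
        y_pred = fit_predict(
            [X[i] for i in train], [y[i] for i in train], [X[i] for i in test]
        )
        scores.append(r2_score([y[i] for i in test], y_pred))
    return scores
```

Running both candidate approaches through the same `cross_validate` call with the same `seed` keeps the folds identical, so differences in R^2 come from the models and not from the split.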
Hi @JHlozek and @miquelduranfrigola, thanks for the information. I may have missed some important information from the last meetings, sorry. Wasn't interpretability prioritised over regression? The current refactoring of ZairaChem is eliminating most of the unused regression code, and it will take substantial time to include it again 😓. I thought we agreed on classifiers + interpretability + olinda as the first complete deliverable for the new version of ZairaChem, which is already quite a lot! If the priorities are shifting and we are really considering regression, maybe we should discuss this again, as it will substantially change the work I am doing in the refactoring, and the sooner I change strategy the better, if that is what is needed. Sorry I missed this bit from this week's meetings.
I've continued to give some thought to this. If regression is going to be incorporated in the next update, there are some key design questions that need to be decided now:
Let me know your thoughts so I can prioritise what to do next! Thanks
Hi @GemmaTuron, no problem and thanks for flagging this. My understanding was indeed that regression was deprioritized in favour of interpretability, but not that regression was going to be dropped altogether. I mentioned at one of the recent Wed meetings that I was still tinkering with the regression aspect on the side and would share updates in a couple of weeks, especially because it's the one thing that we keep being criticized on with each new citation. I hope it wasn't a Wed that you also happened to not make it to. :/ On Tuesday we discussed the interpretability updates that I've reflected here: #49. The way forward for the substructure highlights is not obvious, so we thought it better to pick it up again in the new year, and instead I could carry on with the regression exploration for the last few days before the holidays. Perhaps let's meet today or tomorrow with @miquelduranfrigola and re-align? On the technical aspects if we carry on with regression:
I have a tight schedule tomorrow for a meeting, unfortunately. I have, however, reworked the setup section of ZairaChem-Docker, which is now:
Hope this helps. I will continue with the descriptors now.
So far I've been comparing a multi-classifier approach with a multi-regressor approach, as well as a combination of both, for modelling the H3D solubility data. The number of classifiers used (n=3/5/7) did not have a major influence here (data not shown). I also tried various label transformations from ZairaChem, with the goal of identifying one transformation to set as the standard in ZairaChem. However, these also didn't have a massive effect on the r^2 values for H3D experimental data based on a single run (see table below).
The way forward is to compare the multi-classifier and multi-regressor approaches for two more H3D models (plasmodium activity and hepatocyte clearance) using the raw data and log-transformed data to see if similar patterns are observed.
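For readers unfamiliar with the multi-classifier idea, the sketch below shows one way a continuous value could be recovered from several binary classifiers. This is an assumption made for illustration, not necessarily the scheme used in these experiments: each of the n classifiers is taken to estimate P(y > t_k) at a quantile cut-off t_k, and the estimate is obtained by integrating that survival curve with the trapezoid rule. The names `make_thresholds`, `binarize` and `surrogate_value` are hypothetical.

```python
from statistics import quantiles


def make_thresholds(y_train, n_classifiers):
    """Return n_classifiers quantile cut-offs of the training labels."""
    return quantiles(y_train, n=n_classifiers + 1)


def binarize(y, threshold):
    """Binary labels for one classifier: 1 if the value exceeds the cut-off."""
    return [1 if value > threshold else 0 for value in y]


def surrogate_value(probs_above, thresholds, y_min, y_max):
    """Turn per-threshold P(y > t_k) into one continuous estimate.

    Assumption of this sketch: for y bounded in [y_min, y_max],
    E[y] = y_min + integral of P(y > t) dt, approximated here with the
    trapezoid rule on the grid [y_min, t_1, ..., t_n, y_max].
    """
    if len(probs_above) != len(thresholds):
        raise ValueError("need one probability per threshold")
    grid = [y_min] + list(thresholds) + [y_max]
    survival = [1.0]  # P(y > y_min) is taken as 1
    for p in probs_above:
        # Independent classifiers can disagree, so force the curve to be
        # non-increasing and keep it inside [0, 1].
        survival.append(max(0.0, min(p, survival[-1])))
    survival.append(0.0)  # P(y > y_max) is taken as 0
    area = sum(
        0.5 * (survival[i] + survival[i + 1]) * (grid[i + 1] - grid[i])
        for i in range(len(grid) - 1)
    )
    return y_min + area
```

With cut-offs at 1, 2 and 3 on a 0 to 4 scale, classifier outputs of `[1.0, 1.0, 0.0]` place the value between the second and third cut-off, and `surrogate_value` returns 2.5. Changing `n_classifiers` (for example 3, 5 or 7) only changes how finely the survival curve is sampled.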
Thanks Jason, super useful and, as always, very comprehensive analysis.
Motivation
At the moment, ZairaChem only works with binary classification tasks. However, in real-world scenarios, we often encounter regression tasks, for example, predicting IC50 or pChEMBL values. We would like to extend ZairaChem to work with regression tasks.
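To make the two kinds of regression target concrete: a pChEMBL-style value is the negative base-10 logarithm of a molar potency such as IC50, so an IC50 given in nM converts as pIC50 = 9 - log10(IC50). The helper names below are hypothetical and are not part of ZairaChem; this is only a small sketch of the conversion.

```python
import math


def ic50_nm_to_pic50(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 9.0 - math.log10(ic50_nm)


def pic50_to_ic50_nm(pic50):
    """Inverse conversion: pIC50 back to an IC50 in nanomolar."""
    return 10.0 ** (9.0 - pic50)
```

For example, an IC50 of 1000 nM (1 µM) corresponds to a pIC50 of 6.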
Suggested approach
We see two possible approaches to the problem:
It is not clear yet which approach is best. I am personally inclined towards the second option, although it may end up being too computationally demanding. In the roadmap below, I assume we take this option.
Roadmap