
Adapt ZairaChem to regression tasks #31

Open
4 tasks
miquelduranfrigola opened this issue Dec 5, 2023 · 10 comments
@miquelduranfrigola
Member

miquelduranfrigola commented Dec 5, 2023

Motivation

At the moment, ZairaChem only works with binary classification tasks. However, in a real-world scenario, we often encounter regression tasks, for example, to predict the IC50 values or pChEMBL values. We would like to extend ZairaChem to work with regression tasks.

Suggested approach

We see two possible approaches to the problem:

  • Extend ZairaChem with AutoML regression modules: The natural approach would be to extend ZairaChem with AutoML regression modules, like those provided by FLAML, AutoGluon, etc. While this sounds very reasonable, it may present additional challenges, such as new validation metrics, harmonization of the y variable, etc.
  • Divide the regression problem into n classification tasks: An alternative solution would be to divide the regression problem into n classification tasks, for example, cutting at different percentiles. Then, for each percentile, we would have a classification problem for which we could use vanilla ZairaChem. At the end of the procedure, we could build a meta-regressor based on the predicted probabilities at each cutoff. This approach would obviously be much slower, but it may be robust and easier to implement.

It is not clear yet which approach is best. I am personally inclined towards the second option, although it may end up being too computationally demanding. In the roadmap below, I assume we take this option.
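For illustration, the percentile-cutting idea in the second option could look like the sketch below. Function and variable names are hypothetical and the data is synthetic; in practice, each binary label vector would be fed to a vanilla ZairaChem classifier.

```python
# Hedged sketch of option 2: turn one regression task into n binary
# classification tasks by cutting y at different percentiles.
import numpy as np

def binarize_at_percentiles(y, percentiles=(25, 50, 75)):
    """Return the cutoff values and one binary label vector per cutoff."""
    y = np.asarray(y, dtype=float)
    cutoffs = np.percentile(y, percentiles)
    # Label 1 means "above this cutoff"; each labels[p] can then be
    # modelled with a vanilla binary classifier.
    labels = {p: (y > c).astype(int) for p, c in zip(percentiles, cutoffs)}
    return cutoffs, labels

y = np.array([0.1, 0.3, 0.5, 0.7, 0.9, 1.2, 1.5, 2.0])
cutoffs, labels = binarize_at_percentiles(y)
```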

Roadmap

  • Harmonize y data for a given regression task. Sometimes, regression values are awkwardly distributed and we need to clean them up prior to training. For example, we may want to log-transform values, or power-transform them, or simply remove outliers. While this has been partially implemented in ZairaChem already, a production-ready module is not available yet.
  • Parallelize or, at least, organize multiple ZairaChem runs (for each binary classification cutoff) in a centralized manner, including a shared folder.
  • Write a meta-regressor that takes the output probabilities at each cutoff as input features and returns a regression value. The architecture of the meta-regressor should be as simple as possible, ideally a linear regression or an SVR.
  • Extend default ZairaChem plots to illustrate performance in a regression scenario.
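As a sketch of the meta-regressor bullet above: a plain linear regression that maps the predicted probabilities at each classification cutoff back to a continuous value. All data here is synthetic; the probability columns are noisy stand-ins for what the per-cutoff ZairaChem classifiers would output.

```python
# Hedged sketch of the proposed meta-regressor (synthetic data only).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
y_true = rng.uniform(4, 9, size=200)            # e.g. pChEMBL-like values
cutoffs = np.percentile(y_true, [25, 50, 75])   # three cutoff points

# Noisy sigmoid of the margin to each cutoff, one column per cutoff,
# standing in for the per-cutoff predicted probabilities.
margin = y_true[:, None] - cutoffs + rng.normal(0, 0.5, size=(200, 3))
probs = 1.0 / (1.0 + np.exp(-margin))

meta = LinearRegression().fit(probs, y_true)
r2 = meta.score(probs, y_true)  # the simple linear map recovers most of the signal
```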
@HellenNamulinda
Collaborator

@miquelduranfrigola,
While the 2nd option might be easier to implement, I doubt it is the optimal solution.

  • Dividing continuous outcomes/values into classes inherently sacrifices precision, potentially leading to less accurate predictions.
  • As you also highlighted, the multiple classification models will potentially increase complexity and computational overhead (we shouldn't increase the computation cost for users).
  • Won't classifying continuous values require task-specific decisions for cutoff points? Will the same cut-offs be adaptable to different regression scenarios?
  • For the meta-regressor, will it be using the same validation metrics for classification, or new metrics?

Classification might be a temporary workaround.
So, directly embracing AutoML regression modules would be the ideal approach to ensure optimal accuracy and alignment with the continuous nature of regression problems.

@miquelduranfrigola
Member Author

Hello @HellenNamulinda - thanks for your insightful comments. I completely agree with your points.
This is certainly something we can discuss and will require some thinking. We are faced here with a cost-benefit problem, i.e. your project has limited time, and we need to do our best to produce an acceptable outcome. Let's evaluate the roadmap together. If we find out that regression tasks are feasible, then I am more than open to trying this avenue. For now, for project proposal purposes, we could mention that both options will be considered. Do you agree with this approach?

@GemmaTuron GemmaTuron added this to AI2050 May 8, 2024
@GemmaTuron GemmaTuron moved this to Todo in AI2050 May 8, 2024
@GemmaTuron
Member

We will start by doing some mild tests outside ZairaChem with @JHlozek. We will select one model we know well, compare a normal regression with a classifier-based surrogate regression, and then decide which approach to take.

@JHlozek
Collaborator

JHlozek commented Dec 11, 2024

Updates:
I'm essentially exploring both options from the original proposed approach simultaneously to identify an optimal solution. My hypothesis is that a combination of approaches might be best. The AutoML regression models continue the original thinking behind ZairaChem, while the multi-classification approach could serve as anchors within the range of y.

For proof of concept, my approach is to:

  1. Split the data into train:val:test in an 80:10:10 ratio.
  2. Train 3 ZairaChem classifiers on the training set where the cutoffs are determined using a quantile-based split of the distribution of y labels.
  3. Train a FLAML regressor on the training set for each descriptor specified in the ZairaChem config file.
  4. Use each model to predict the validation set and use the predictions as feature inputs for AutoGluon.
  5. Use the test set to compare the AutoGluon model performance when trained on: i) only the ZairaChem classifier predictions, ii) only the FLAML regressor predictions, iii) a combination of both the ZairaChem classifier and FLAML regressor predictions. These will also be compared to a baseline FLAML model trained with Morgan fingerprints.
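The five steps above can be sketched end-to-end with stand-in models. Everything below is illustrative: sklearn models replace the ZairaChem classifiers, the per-descriptor FLAML regressors (collapsed to a single regressor here), and the AutoGluon meta-model, and the data is synthetic.

```python
# Hedged sketch of the proof-of-concept stacking protocol (stand-in models).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 16))
y = X[:, 0] * 2 + X[:, 1] + rng.normal(0, 0.3, 500)

# 1. Split into train:val:test in an 80:10:10 ratio
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# 2. Three classifiers at quantile-based cutoffs of the training labels
cutoffs = np.quantile(y_tr, [0.25, 0.5, 0.75])
clfs = [GradientBoostingClassifier().fit(X_tr, (y_tr > c).astype(int)) for c in cutoffs]

# 3. One regressor (standing in for "one FLAML regressor per descriptor")
reg = GradientBoostingRegressor().fit(X_tr, y_tr)

def stack_features(X_):
    """Cutoff probabilities plus the regressor prediction as meta-features."""
    cols = [c.predict_proba(X_)[:, 1] for c in clfs] + [reg.predict(X_)]
    return np.column_stack(cols)

# 4. Fit the meta-model on validation-set predictions
meta = Ridge().fit(stack_features(X_val), y_val)

# 5. Evaluate on the held-out test set
r2 = meta.score(stack_features(X_te), y_te)
```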

I plan to do this as a 3-fold cross validation for two models, one with a larger dataset (solubility, plasmodium, etc) and one with a smaller dataset (maybe caco). The exploration around the solubility end-point is underway.

When those initial results are back, the future roadmap is:

  • Explore the effect of additional classifiers on performance (n=5 or 7).
  • Explore the effect of y transformations in ZairaChem to make the labels more amenable to regression.
  • Update the ZairaChem estimation/pooling steps to do the regression fit/predict - there are many functions in the codebase already that can be used.
  • Update the ZairaChem reporting step for regression metrics/plots.
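The y-transformation exploration mentioned in the roadmap can be prototyped directly with scikit-learn transformers. These are stand-ins for the corresponding ZairaChem transformations, not the actual implementation, and the skewed data is synthetic.

```python
# Hedged sketch of candidate y transformations (log, power, quantile).
import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

# Synthetic right-skewed labels, e.g. raw IC50-like values
y = np.random.default_rng(1).lognormal(mean=1.0, sigma=0.8, size=300).reshape(-1, 1)

y_log = np.log10(y)
y_power = PowerTransformer(method="yeo-johnson").fit_transform(y)
y_quant = QuantileTransformer(n_quantiles=100, output_distribution="normal").fit_transform(y)
# Each transform reshapes the skewed y into something closer to Gaussian,
# which tends to make the downstream regression fit easier.
```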

The regression aspects of the ZairaChem code will not be touched until the code refactoring is complete (no rush here as the exploration above will take some time).

@GemmaTuron
Member

GemmaTuron commented Dec 11, 2024

Hi @JHlozek and @miquelduranfrigola

Thanks for the information. I may have lost some important information from the last meetings, sorry. Wasn't interpretability prioritised over regression? The current refactoring of ZairaChem is eliminating most of the unused regression code and it will take substantial time to include it again 😓. I thought we agreed on classifiers + interpretability + olinda as the first complete deliverable for the new version of ZairaChem, which is already quite a lot!

If the priorities are shifting and we are really considering regression maybe we should discuss this again as it will substantially change the work I am doing in the refactoring, and the sooner I change strategy the better if that is what is needed. Sorry I missed this bit from this week's meetings.

@GemmaTuron
Member

GemmaTuron commented Dec 11, 2024

I've continued to give some thought to this. If regression is going to be incorporated in the next update, there are some key design questions that need to be decided now:

  • The refactoring was in the direction of allowing only classification or only regression. This means that if, for the same dataset, you want both a classifier and a regressor, the pipeline needs to be run again (incl. descriptors). If we want both options to be available there are a few adaptations to do --> this would be important to decide now.
  • I have completed the setup refactoring which now involves substantially less work (and no Mellody) and adapted the files for CLF. I left the REG parts untouched, which means they will need heavy adaptation unless I do that now, but also at the cost of slowing down the rest of the refactoring.
  • @miquelduranfrigola I would still like to really understand the auxiliary cut-off, did we document this somewhere? At the moment it is not clear. It seems to be related to the reference file, which is now eliminated from the pipeline.
  • @JHlozek related to the above, is this reference file something that you are taking into consideration for the regression? Because in principle we are dropping it.

Let me know your thoughts so I can prioritise what to do next! Thanks

@JHlozek
Collaborator

JHlozek commented Dec 12, 2024

Hi @GemmaTuron, no problem and thanks for flagging this.

My understanding was indeed that regression was deprioritized in favour of interpretability, but not that regression was going to be dropped altogether. I mentioned at one of the recent Wed meetings that I was still tinkering with the regression aspect on the side and would share updates in a couple of weeks - especially because it's the one thing we keep being criticized on with each new citation. I hope it wasn't a Wed that you also happened to miss. :/

On Tuesday we discussed the interpretability updates that I've reflected here: #49. The way forward for the substructure highlights is not obvious, so we thought it better to pick it up again in the new year, and instead I could carry on with the regression exploration for the last few days before the holidays. Perhaps let's meet today or tomorrow with @miquelduranfrigola and re-align?

On the technical aspects if we carry on with regression:

  • Wouldn't it be more efficient to only calculate the descriptors once, but have the estimate and pooling steps re-use them for the multiple classifiers and regression learners? It would be good to agree on the way forward especially seeing as you're knee-deep into it already. (I'd also be happy to tackle that aspect if the setup step accounts for the regression possibility if it happens to be helpful).
  • I'm not quite sure which reference file you mean? The regression might need the vars.py file in ZairaChem, but in principle it can be worked around.

@GemmaTuron
Member

My schedule tomorrow is unfortunately too tight for a meeting. I have, however, reworked the setup section of ZairaChem-Docker, which now is:

  • Set in classification by default without a flag to change that (can be easily added)
  • Generates a much simpler file, data.csv, which contains a bin column, as we will not do multitasking.
  • I have re-worked the tasks for regression, and currently they work (smoothen, power transform and quantile transform) but are not used. I would favour choosing only one task, as we have done for the classifier.
  • flaml[automl] is needed for the smoothen task and is commented out in the installation. The regression columns would all be eliminated in the cleaning step; this will need to be changed to keep one of them.

Hope this helps. I will continue with the descriptors now

@JHlozek
Collaborator

JHlozek commented Dec 18, 2024

So far I've been comparing a multi-classifier approach to a multi-regressor approach, as well as a combination of both, for modelling the H3D solubility data. The number of classifiers used (n=3/5/7) did not have a major influence here (data not shown). I also tried various label transformations from ZairaChem with the goal of identifying one transformation to set as standard in ZairaChem. However, these also didn't have a massive effect on the r^2 values for H3D experimental data based on a single run (see table below).

| y_transform | multi_regressor | multi_classifier | combined |
|-------------|-----------------|------------------|----------|
| raw         | 0.52            | 0.47             | 0.52     |
| log         | 0.55            | 0.54             | 0.54     |
| smooth      | 0.50            | 0.46             | 0.48     |
| power       | 0.54            | 0.52             | 0.55     |
| quantile    | 0.43            | 0.42             | 0.43     |
| rank        | 0.54            | 0.52             | 0.54     |

The way forward is to compare the multi-classifier and multi-regressor approaches for two more H3D models (plasmodium activity and hepatocyte clearance) using the raw data and log-transformed data to see if similar patterns are observed.

@miquelduranfrigola
Member Author

Thanks Jason, super useful and, as always, very comprehensive analysis.
