Sample Weight Support #12
Comments
Thank you for your feedback and suggestion. I have to admit that it is currently unclear to me how one can use sample weights for Gaussian process and random effects models, for Gaussian and also non-Gaussian data. But I have not done a thorough analysis of this. With dependent data (i.e. Gaussian process and random effects models), some things are slightly more complicated than with the independent data assumed, e.g., in standard boosting algorithms, where you simply weight the loss / likelihood contribution of every sample. If there exists a sound approach for incorporating sample weights into Gaussian process regression, we can try to add it here.
Thanks @fabsig - I apologize for my lack of familiarity with the underlying math behind Gaussian process modeling, but traditionally I see sample weights implemented by multiplying the loss (or squared loss) per row by the sample weight for that row, and then minimizing the sum of the resulting vector. Is that possible with Gaussian process modeling?
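For readers following along, the per-row weighting described above can be sketched for plain linear regression with independent data. This is only an illustration with NumPy, not GPBoost code; the toy data and the helper name `weighted_squared_loss` are made up for the example.

```python
import numpy as np

# Toy data (hypothetical): intercept plus one feature, with per-row weights.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.3, size=50)
w = rng.uniform(0.5, 3.0, size=50)


def weighted_squared_loss(beta, X, y, w):
    """Sum over rows of weight * squared residual."""
    resid = y - X @ beta
    return np.sum(w * resid**2)


# Minimizing the weighted loss is ordinary least squares on rows
# scaled by sqrt(weight).
sw = np.sqrt(w)
beta_hat, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
print(beta_hat, weighted_squared_loss(beta_hat, X, y, w))
```

This works because the loss is a sum of one term per row, which is exactly the property that dependent-data models lack.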
Yes, you are right. But this is not possible for Gaussian processes / random effects, as there is not one loss per sample but only one "global" loss for all samples together. I am not saying that there is no way of doing it; I just have not had time to think about it and research it in detail (e.g. for Gaussian data, one might weight the error variances accordingly, but this only works for Gaussian data...).
Understood. I took a look at how the MERF package handles this, but the solution does not seem transferable unfortunately. Thanks for all your work on this package - eager to see a solution if one exists.
Thank you for the hint. No, this is not a meaningful option. In my opinion, the option proposed in the issue you mention (allowing the user to provide weights to the random forest function) also makes no sense for the MERF algorithm. With this option, one only considers the weights in the fixed effects estimation step but ignores them in the other steps of the MERF algorithm (estimation of variance parameters, estimation of random effects). And since all these steps are connected to each other, it is unclear what is being done... Technically speaking, we could also implement something similar, but it does not make much sense. This is not a software engineering problem but a statistical problem.
Thanks - that's helpful context. FWIW, I would be interested in a similar solution, even if it is not statistically sound. It may still generate useful results in some contexts, even subject to those limitations. The upside is that I don't think it should be too much work to implement (since all the scikit-learn base estimators already accept sample_weight), but I understand if you don't want to add misleading or half-baked functionality to this package. Either way, thank you for all your efforts on this.
As I understand it, for a random-effects model where the response is assumed to have Gaussian error, the response vector has a multivariate Gaussian likelihood with covariance $Z \Sigma Z^\top + \sigma^2 I$. Could sample weights be implemented by replacing $\sigma^2 I$ with $\sigma^2 W^{-1}$, where $W$ is the diagonal matrix of sample weights? I'm not sure how this can be extended to non-Gaussian responses (I don't quite know how GPBoost implements these), but wanted to check if this might be helpful for the Gaussian case at least.
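A minimal NumPy sketch of this idea, under assumptions not spelled out in the thread: a single grouped random effect with variance `sigma2_b`, a zero mean, and weights entering only by dividing the error variance `sigma2` per observation. The function name and toy data are hypothetical, and this is not how GPBoost computes its likelihood.

```python
import numpy as np


def weighted_re_neg_log_lik(y, Z, sigma2_b, sigma2, weights=None):
    """Negative log-likelihood of y ~ N(0, sigma2_b * Z Z^T + sigma2 * diag(1 / w)).

    With weights=None, all weights are 1 and this is the usual unweighted model.
    """
    y = np.asarray(y, dtype=float)
    n = y.shape[0]
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    # Assumption: weights only rescale the error (nugget) variance.
    cov = sigma2_b * (Z @ Z.T) + np.diag(sigma2 / w)
    chol = np.linalg.cholesky(cov)
    alpha = np.linalg.solve(chol, y)  # chol^{-1} y, so alpha @ alpha = y^T cov^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return 0.5 * (n * np.log(2.0 * np.pi) + log_det + alpha @ alpha)


# Toy example: 6 observations in 3 groups.
rng = np.random.default_rng(0)
groups = np.array([0, 0, 1, 1, 2, 2])
Z = np.eye(3)[groups]  # one-hot group incidence matrix
y = rng.normal(size=6)
w = np.array([1.0, 2.0, 0.5, 1.0, 3.0, 1.0])

nll_unweighted = weighted_re_neg_log_lik(y, Z, 0.5, 1.0)
nll_weighted = weighted_re_neg_log_lik(y, Z, 0.5, 1.0, w)
print(nll_unweighted, nll_weighted)
```

The likelihood stays a single "global" quantity; the weights only change the diagonal of the covariance, so a high-weight observation is treated as less noisy.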
Yes, this seems like a reasonable approach for Gaussian data. That's the same approach I also mentioned in this comment:
I will keep this issue open with the "enhancement" label. But it is not at the top of my to-do list. Contributions are welcome.
This would be very helpful if possible to add. It's unfortunately over my head to work on, both in a math and coding sense.
I have some familiarity with this topic; it's very tricky. First, I'm assuming this discussion is all about 'probability' == 'sampling' == 'scale up' == 'representation' weights? Not to be confused with all the other meanings of 'weight'. If so, this is a hard methodological and computational problem. As alluded to in the idea proposed here, the problem is, as @tslumley puts it, "where do you stick the probability representation weights?" https://notstatschat.rbind.io/2018/03/13/why-pairwise-likelihood/
Yes, using this terminology it's about 'probability' == 'sampling' == 'scale up' == 'representation' weights. Afaik, this is the predominant way weights are used in machine learning. You want to give some observations a higher "weight" (for whatever reason, whether it's really scaling up sampling probabilities to population probabilities or simply based on heuristic arguments...). I think the approach mentioned by @jwdink makes sense for data with a Gaussian likelihood. For independent data with a Gaussian likelihood (OLS regression, tree-boosting / random forest / neural networks for regression, etc.), dividing variances by the weights is equivalent to multiplying every log-likelihood / loss contribution by the corresponding weight. In analogy to this, you can divide the error variance / nugget effect variance by the weights in a mixed effects / GP model. This seems like a reasonable solution for "where to stick the weights". For non-Gaussian data, it is currently unclear to me how to handle weights. @mikejacktzen: the blog article you mention is about the use of pairwise composite likelihoods, which is an arguably related but also slightly different issue. As said, I will keep this issue open with the "enhancement" label. But it is not at the top of my to-do list. Contributions are welcome.
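The equivalence for independent Gaussian data can be checked numerically. The sketch below (plain NumPy, hypothetical function names, a single mean parameter `mu`) compares the weighted negative log-likelihood with the one obtained by dividing the variance by the weights. The two differ only by a term that does not depend on `mu`, so they give the same estimate of the mean; that term does depend on the variance and the weights, so the equivalence shown here is for the mean with the variance held fixed.

```python
import numpy as np


def nll_weighted_loss(mu, y, w, sigma2):
    """Each sample's Gaussian negative log-likelihood multiplied by its weight."""
    per_sample = 0.5 * np.log(2.0 * np.pi * sigma2) + (y - mu) ** 2 / (2.0 * sigma2)
    return np.sum(w * per_sample)


def nll_scaled_variance(mu, y, w, sigma2):
    """Unweighted Gaussian negative log-likelihood with variance sigma2 / w_i."""
    var = sigma2 / w
    return np.sum(0.5 * np.log(2.0 * np.pi * var) + (y - mu) ** 2 / (2.0 * var))


rng = np.random.default_rng(1)
y = rng.normal(loc=3.0, size=20)
w = rng.uniform(0.5, 4.0, size=20)
sigma2 = 1.5

# The gap between the two objectives is the same for every value of mu.
gaps = [nll_weighted_loss(m, y, w, sigma2) - nll_scaled_variance(m, y, w, sigma2)
        for m in (0.0, 1.7, 3.0)]
print(gaps)

# Both objectives are therefore minimized by the weighted mean.
mu_hat = np.sum(w * y) / np.sum(w)
print(mu_hat)
```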
Hello - thanks for the wonderful package. From the writeup and the description, it seems very promising.
I wanted to check if GPBoost supports or will support sample weights? I have tried in both the native API and scikit-learn API, and gotten the following error message:
GPBoostError: Weighted data is currently not supported for the GPBoost algorithm. If this is desired, contact the developer or open a GitHub issue.
It's a bit confusing since the API seemingly has support for sample weights, but it looks like they may just not be implemented yet? If so, are there any plans to implement them? This is a key functionality for some domains, where observations may have radically different weights, and fitting an unweighted set will tend to give misleading results.
Thanks!