Add robust metric #122
base: main
Conversation
Thanks @TimotheeMathieu!
I was wondering if @lorentzenchr has an opinion on this by any chance?
I think we would at least need more tests, for instance checking that cross-validation works with the resulting metric. Also, I think all scikit-learn metrics support sample_weight. Would it make sense to add it here?
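A smoke test along these lines could cover both points. This is only a sketch: `robust_mean_squared_error` is a hypothetical stand-in for the metric added here, with a (weighted) median as a placeholder for the robust aggregation:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

def robust_mean_squared_error(y_true, y_pred, sample_weight=None):
    # Hypothetical stand-in for the PR's metric: a (weighted) median of
    # the squared errors as a placeholder robust aggregation.
    errors = (np.asarray(y_true) - np.asarray(y_pred)) ** 2
    if sample_weight is None:
        return np.median(errors)
    order = np.argsort(errors)
    cum_w = np.cumsum(np.asarray(sample_weight, dtype=float)[order])
    return errors[order][np.searchsorted(cum_w, 0.5 * cum_w[-1])]

X, y = make_regression(n_samples=200, noise=1.0, random_state=0)
# The metric should survive cross-validation (lower error is better).
scorer = make_scorer(robust_mean_squared_error, greater_is_better=False)
assert cross_val_score(Ridge(), X, y, scoring=scorer, cv=5).shape == (5,)
```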
Having a Huber loss available as a metric makes sense for models fitted with a Huber loss. Be aware that the Huber loss elicits something in between the median and the expectation, so it is not really clear what you get/estimate. The omnipresent point about MSE not being robust comes with at least two important caveats:
Last but not least, my all-time favorite reference: https://arxiv.org/abs/0912.0902
Thanks for the comments. @lorentzenchr, what I did is not the Huber loss: it is a robust estimator of the mean applied to the squared errors.
For references, see for instance *Robust Estimation of a Location Parameter* by Huber or, more recently, *Challenging the empirical mean and empirical variance: a deviation study* by Catoni. EDIT: I added an explanation in the user guide that gives some equations for this.
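For concreteness, here is a minimal sketch (not the PR's code) of Huber's location estimator computed by iterative reweighting, then applied to the squared errors to get a robust MSE:

```python
import numpy as np

def huber_mean(x, c=1.35, n_iter=20):
    # Huber's location estimator via iterative reweighting: points far
    # from the current estimate (relative to a robust scale) are
    # down-weighted, so a few huge values barely move the result.
    x = np.asarray(x, dtype=float)
    mu = np.median(x)                   # robust initialization
    scale = np.median(np.abs(x - mu))   # MAD as a robust scale
    if scale == 0:
        scale = 1.0
    for _ in range(n_iter):
        dist = np.maximum(np.abs(x - mu), 1e-12)
        w = np.minimum(1.0, c * scale / dist)   # Huber weights in [0, 1]
        mu = np.sum(w * x) / np.sum(w)
    return mu

def robust_mse(y_true, y_pred):
    # Robust mean of the squared errors instead of their plain mean.
    return huber_mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)
```

When no point lies farther than `c * scale` from the estimate, all weights are 1 and `huber_mean` reduces to the ordinary mean, which is the sense in which this generalizes the MSE.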
@TimotheeMathieu Thanks for the explanation. Now I get it. Something that could be mentioned in the example is the trimmed mean, as a simpler entry point to robust estimation.
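For illustration, the trimmed mean simply drops a fixed fraction of the most extreme values before averaging; `scipy.stats.trim_mean` implements it directly (the numbers below are made up):

```python
import numpy as np
from scipy.stats import trim_mean

errors = np.array([0.9, 1.1, 1.0, 0.8, 1.2, 500.0])  # one corrupted value
print(np.mean(errors))           # ~84.2, dominated by the outlier
print(trim_mean(errors, 0.2))    # cuts 20% from each tail -> 1.05
```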
This PR uses the Huber robust mean estimator to build a robust metric.
Description: one of the big challenges of robust machine learning is that the usual scoring scheme (cross-validation with MSE, for instance) is not robust. If the dataset has some outliers, the test sets in cross-validation may contain outliers as well, and the cross-validated MSE then reports a huge error for a robust algorithm on any corrupted data. This is why, for example, robust methods cannot be competitive in Kaggle regression challenges: the error computation itself is not robust.
This PR proposes a robust metric that makes it possible to compute, for instance, a robust cross-validated MSE.
Example:
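A minimal end-to-end sketch of the intended usage; the helper names `huber_mean` and `robust_mse` are illustrative, not necessarily the PR's API:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import cross_val_score

def huber_mean(x, c=1.35, n_iter=20):
    # Same iteratively reweighted Huber location estimator as sketched
    # in the discussion above.
    x = np.asarray(x, dtype=float)
    mu = np.median(x)
    scale = np.median(np.abs(x - mu))
    if scale == 0:
        scale = 1.0
    for _ in range(n_iter):
        w = np.minimum(1.0, c * scale / np.maximum(np.abs(x - mu), 1e-12))
        mu = np.sum(w * x) / np.sum(w)
    return mu

def robust_mse(y_true, y_pred):
    return huber_mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

rng = np.random.RandomState(42)
X, y = make_regression(n_samples=300, noise=1.0, random_state=42)
y[rng.choice(len(y), size=10, replace=False)] += 1000.0  # corrupt 10 targets

for name, metric in [("mse", mean_squared_error), ("robust_mse", robust_mse)]:
    scores = cross_val_score(Ridge(), X, y, cv=5,
                             scoring=make_scorer(metric, greater_is_better=False))
    print(name, -scores.mean())
```

On a run like this, the plain cross-validated MSE is dominated by the corrupted test points, while the robust score stays orders of magnitude smaller (the fit itself still degrades somewhat because the training folds are corrupted too); the exact numbers depend on the seed and on the tuning constant `c`.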