Add CLARA Clustering algorithm #83
Conversation
Thanks @TimotheeMathieu ! Nice work. We need to add this estimator to https://github.com/scikit-learn-contrib/scikit-learn-extra/blob/master/sklearn_extra/tests/test_common.py#L14 and also investigate why the CI fails and fix it before merging. Do you know someone who might be interested in reviewing this PR? Sorry, I don't have much availability for reviews at the moment. For more minor changes (ones that don't change the API or add new algorithms or solvers, i.e. not this one), if you see there is no review after some number of days and you are confident the change will not break any existing use cases, you should go ahead and merge. We have very few active maintainers in this project, and otherwise PRs will just sit there.
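(For illustration only: registering the new estimator for the common checks might look roughly like the sketch below. The list name and test layout are hypothetical, not copied from the actual test_common.py.)

```python
# sklearn_extra/tests/test_common.py -- hypothetical sketch
from sklearn.utils.estimator_checks import parametrize_with_checks

from sklearn_extra.cluster import KMedoids
from sklearn_extra.cluster import CLARA  # the estimator added by this PR

ALL_ESTIMATORS = [KMedoids, CLARA]  # hypothetical list name


@parametrize_with_checks([Est() for Est in ALL_ESTIMATORS])
def test_common_checks(estimator, check):
    # Run scikit-learn's generic estimator checks on each estimator.
    check(estimator)
```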
For the review of the PR, maybe @kno10 would be interested? I will look into the CI; this algorithm is not really meant to be efficient on small datasets, so some of the tests are not well suited to it. Concerning the minor changes, this seems to contradict the comment in #73 (comment). Is there an exception for minor changes?
About @adrinjalali's comment: yes, ideally we should review each PR before merging. However, all maintainers are busy or unavailable. Personally I don't plan to review each new PR in this repo, but since you are motivated to improve this project, for which I'm very grateful, it would be a shame to prevent you from doing that. Otherwise this project will just die. So I feel a review by contributors and/or external experts would also be very valuable, and that for very simple fixes you should be able to merge your PRs directly if there are no reviews. It's a bit of a compromise between scikit-learn, where each PR needs 2 reviews but there are 8+ active developers who do reviews, and other scikit-learn-contrib projects, where after initial project acceptance maintainers can basically do whatever they want.
I'd be more than happy to engage more contributors from the community in the review process here, but I'd be wary of merging w/o any review (except for trivial minor changes). Especially when it comes to new algorithms, I'd be happy if at least one expert other than the PR creator reviews the PR; that process always makes the code much more maintainable, and in the long run it's going to benefit the project.
Agreed. I meant minor changes; new algorithms should certainly be reviewed #83 (comment)
Just to make things concrete: this PR is indeed not minor, and I think, for instance, that PR #85 is minor.
@rth I noticed that the codecov CI disappeared (which is why the CI became all green here); is that normal?
@TimotheeMathieu it is probably related to this: https://about.codecov.io/security-update/
I see that you want to implement CLARA as a separate class from KMedoids, right? May I suggest implementing CLARA as one of the possible methods of the KMedoids algorithm? In this way you could offer an 'auto' method where KMedoids itself decides which method is best for the problem as a function of the sample size. MATLAB uses a similar approach: https://www.mathworks.com/help/stats/kmedoids.html?s_tid=srchtitle In any case, I am looking forward to using CLARA in the future. I have 200000+ samples to cluster and it is a bit problematic at the moment. Nice work you are doing!
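(For concreteness, a minimal sketch of the suggested 'auto' dispatch. This is purely hypothetical, including the function name and the 3000-sample default threshold, and it is not what this PR ultimately adopts; see below.)

```python
# Hypothetical 'auto' dispatch, not part of this PR.
def resolve_method(method, n_samples, clara_threshold=3000):
    """Pick a k-medoids solver from the sample size, MATLAB-style:
    exact PAM on small data, CLARA subsampling on large data."""
    if method == "auto":
        return "pam" if n_samples <= clara_threshold else "clara"
    return method
```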
@TimotheeMathieu : Thanks for your reply. Yes, I can always use the corresponding git branch for testing, but I intend to integrate this algorithm into a radar software package: https://github.com/MeteoSwiss/pyrad. It is not convenient for end users to install dependencies from source code, hence my interest in having the algorithm included in a release. I will be happy to help if this can speed things up. I don't know of any other Python package that includes the k-medoids algorithm and is as compatible with scikit-learn.
A review & test of the CLARA code would be much appreciated and would help me in the process of merging CLARA into main. In particular, because CLARA is meant to be a faster version of KMedoids, we have to be careful that its code is as fast as possible.
+1. Even just reporting that you tried it on some practical example and providing your feedback on the results would already be helpful. About the release, I don't think we have any specific rules. Usually we would release when there are some significant features / bug fixes. It takes some effort to build wheels for PyPI but it's manageable. Releasing within a month or two sounds reasonable, assuming we merge this and a few of the other PRs. I'll write instructions on how to make a release, and @TimotheeMathieu it would be great if you could make it when we decide to go forward with it, so you would be familiar with the process :)
Also +1 to keeping this a separate class.
What I can do is install the source code and test the CLARA algorithm within my processing framework. My main concern is memory consumption, not speed, but I can report on that as well.
Also, I don't know why the original authors decided to implement k-medoids in scikit-learn-extra, but eventually it may be interesting to include k-medoids in scikit-learn itself, because k-medoids meets the conditions for inclusion.
@TimotheeMathieu , I just wanted to let you know that I cloned your CLARA branch (as you may have noticed) and I am currently testing it within my code. At the moment I can already say that I managed to install it properly :)
@TimotheeMathieu , @rth , I used these specifications: I ran my algorithm on a server at work. I don't know the exact machine specifications but I can ask if needed. If it counts for anything, I fully support the inclusion of the KMedoids and CLARA algorithms in scikit-learn. I think for data with outliers they are a better solution than the k-means algorithm implemented there, and they can be very useful in many applications. If you need further tests or further information, let me know. Any suggestions on how to run the algorithm are most welcome. If you can speed up the new release somehow, we would truly appreciate it.
Thanks @jfigui
PS: A small suggestion: for more in-depth benchmarking you may want to use a proper benchmarking tool, since you may want to test datasets of increasing size and check the memory/time used at each size. For instance, with neurtu:
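(A sketch of what that could look like, assuming neurtu's delayed/Benchmark API and the CLARA class from this PR:)

```python
import numpy as np
import neurtu
from sklearn_extra.cluster import CLARA


def cases():
    rng = np.random.RandomState(42)
    for n_samples in [1000, 10000, 100000]:
        X = rng.rand(n_samples, 10)
        # tags label each run so results come back as a tidy DataFrame
        yield neurtu.delayed(CLARA(n_clusters=3),
                             tags={"n_samples": n_samples}).fit(X)


df = neurtu.Benchmark(wall_time=True, peak_memory=True)(cases())
print(df)
```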
From which you get a dataframe reporting wall time and peak memory as a function of the dataset size.
If you want to know more about what I am doing with the CLARA algorithm, have a look at this paper: https://amt.copernicus.org/articles/9/4425/2016/
Ok, I will try; indeed this would be a good idea to speed things up if the data can stay float16, for instance.
Good, plausible results are already a validation for us. Thanks.
Ok, no worries.
Ok, now we can use float32, and experiments show a factor-2 gain in computation time when using float32!
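(For reference, the usual scikit-learn pattern for accepting float32 input without silently upcasting is to pass a dtype preference list to check_array; a sketch, not necessarily the exact code in this PR:)

```python
import numpy as np
from sklearn.utils import check_array

# float64 and float32 inputs are kept as-is; anything else
# (e.g. float16 or int) is converted to the first entry, float64.
X = check_array(X, dtype=[np.float64, np.float32])
```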
I have not timed it exactly, but I can confirm that it is much faster, and it also greatly decreases memory consumption. Do you think anything else should be done before this pull request gets accepted?
Thanks for testing it and providing feedback @jfigui ! Would you by any chance be interested in reviewing the code in this PR as well? :) Edit: see https://www.youtube.com/watch?v=dyxS9KKCNzA for some ideas on how to review PRs (though the video is quite long).
Thanks @TimotheeMathieu !
Would it be possible to move the changes related to adding 32-bit support in KMedoids to a separate PR, to be merged first?
A few comments below, otherwise LGTM. Please add a changelog entry.
Co-authored-by: Roman Yurchak <[email protected]>
Could you please merge master into this PR, to sync with your merged PR?
Thanks, please add a changelog entry.
Ok, thank you for the review @rth. I'll merge.
Hi @TimotheeMathieu and @rth , I was on holiday so I did not have the occasion to review the pull request, but I see you managed perfectly well without me :). I will install the main branch of scikit-learn-extra and make sure that it can be used within the context of my project. I will report to you if I see anything odd. I am looking forward to a new release of scikit-learn-extra that can be easily installed by our users.
Has this not been released, or am I just not seeing it? I see that this was merged in June 2021, but the latest release is from April 2021. I would like to try the CLARA algorithm, since I have a lot of samples (>100000) to cluster.
It is just not released yet. You can install the dev version, which contains CLARA, from the git source (e.g. `pip install git+https://github.com/scikit-learn-contrib/scikit-learn-extra.git`).
Linked to issue #23 and PR #73
This PR implements the CLARA algorithm.
CLARA (Clustering for LARge Applications) extends the k-medoids approach to large numbers of objects. CLARA applies PAM iteratively on multiple subsamples (each new subsample is composed of the medoids of the previous iteration plus random points), and then keeps the best result (with respect to the inertia computed on the whole dataset).
This can be seen as a clever sub-sampling scheme.
The algorithm is linear in the sample size in both time and space complexity.
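A minimal sketch of this scheme (illustrative only, not the PR's actual code; `pam` stands in for any PAM solver that returns medoid indices into its input):

```python
import numpy as np


def clara(X, n_clusters, n_sampling, n_sampling_iter, pam, rng):
    best_medoids, best_inertia = None, np.inf
    medoids = np.empty(0, dtype=int)  # carried over between iterations
    for _ in range(n_sampling_iter):
        # Subsample = medoids of the previous iteration + random points.
        pool = np.setdiff1d(np.arange(len(X)), medoids)
        random_idx = rng.choice(pool, size=n_sampling - len(medoids),
                                replace=False)
        sample_idx = np.concatenate([medoids, random_idx])
        # Run PAM on the subsample, map medoids back to global indices.
        medoids = sample_idx[pam(X[sample_idx], n_clusters)]
        # Score on the WHOLE dataset: sum of distances to the closest
        # medoid (naive broadcasting, fine for a sketch).
        dist = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=-1)
        inertia = dist.min(axis=1).sum()
        if inertia < best_inertia:
            best_medoids, best_inertia = medoids, inertia
    return best_medoids, best_inertia
```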
Example: even on the (relatively) small digits dataset, I gain a factor of 2 in computation time. However, you will see that the clustering is similar but definitely not the same; there is a complexity/accuracy trade-off at play here.
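(A usage sketch of that comparison, with the class names as they appear in sklearn_extra.cluster once merged:)

```python
from time import perf_counter

from sklearn.datasets import load_digits
from sklearn_extra.cluster import CLARA, KMedoids

X, _ = load_digits(return_X_y=True)
for Model in (KMedoids, CLARA):
    tic = perf_counter()
    model = Model(n_clusters=10, random_state=0).fit(X)
    # Compare wall time and inertia of the two estimators.
    print(f"{Model.__name__}: {perf_counter() - tic:.2f}s, "
          f"inertia={model.inertia_:.1f}")
```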