-
Notifications
You must be signed in to change notification settings - Fork 221
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KMeans Sparse Init Update #2796
KMeans Sparse Init Update #2796
Conversation
/intelci: run |
cpp/oneapi/dal/algo/kmeans_init/backend/cpu/compute_kernel_dense.cpp
Outdated
Show resolved
Hide resolved
cpp/oneapi/dal/algo/kmeans_init/backend/cpu/compute_kernel_dense.cpp
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The modifications look good to me.
It would be better to add a couple of comments though.
std::int64_t trial_count = desc.get_local_trials_count(); | ||
if (trial_count == -1) | ||
{ | ||
const auto additional = std::log(cluster_count); | ||
trial_count = 2 + std::int64_t(additional); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add the comment describing the logic behind this. Maybe the link to Sklearn docs.
@@ -292,15 +292,21 @@ public: | |||
const algorithmFPType * const pLastAddedCenter, const algorithmFPType * const aWeights, | |||
const algorithmFPType * const pDistSqBest) | |||
{ | |||
algorithmFPType sumOfDist2 = algorithmFPType(0); | |||
size_t csrCursor = 0u; | |||
algorithmFPType sumOfDist2 = algorithmFPType(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add the description to updateMinDistForITrials function.
I understand that it was not there before, but I hope that by adding couple of comments at a time we can make oneDAL's code more readable.
//number of trials to pick each centroid from, 2 + int(ln(cluster_count)) works better than vanilla kmeans++ | ||
//https://github.com/scikit-learn/scikit-learn/blob/a63b021310ba13ea39ad3555f550d8aeec3002c5/sklearn/cluster/_kmeans.py#L108 | ||
std::int64_t trial_count = desc.get_local_trials_count(); | ||
if (trial_count == -1) { | ||
const auto additional = std::log(cluster_count); | ||
trial_count = 2 + std::int64_t(additional); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What kind of improvements this give? It seems this changes original behavior. Please also reflect it in the documentation if not already done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually an original behavior in daal and oneDAL distributed, somehow was missed on oneDAL CPU
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@md-shafiul-alam it is labeled like bug. Is it buggy or just fixes the bug?
Unfortunately we don't have good description for some labels, but should be addressed further.
This fixes the bug |
Changes have been incorporated in PR#2815 |
Description
The pull request fixes the difference in centroids between dense and sparse method for KMeans++ init method.
Changes proposed in this pull request: