-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
new functionalities for High Dimensionality problem and improved performance #19
base: master
Are you sure you want to change the base?
Conversation
description of new functionalities for high dimensionality problems and improved performance
new contributors
added new function: HDcentroid()
This file provides functionalities for High dimensionality problems but also for low dimensionality problems
a High dimensionality example
I'm currently at EuroPython. I'll look into it as soon as possible. In case
|
bug en HddistItems: factor_len no era un numero real y no funcionaba.
Update HDdistances.py
This pull requests is only for add a change provided by my colleage (Juan Ramos) it is a pity but i am not in europython :-( |
Sorry for the very late reply. I've had a very crazy year 2017... it's slowly calming down and I will review this in the coming days. |
Michel
Thanks very much for getting to this.
I've upgraded my environment to use version 1.4.1 of cluster and I can
confirm it
behaves identically to my hacked version.
This was very timely for me, as I am just about to package my own work for
publication on PyPi.
Thanks again for a great library
Tim
…On Mon, May 14, 2018 at 2:46 AM, Michel Albert ***@***.***> wrote:
Sorry for the *very* late reply. I've had a very crazy year 2017... it's
slowly calming down and I will review this in the coming days.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#19 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AE5Hp0v6ta5VKk0NIw4gwM4KFchxlVOxks5tyH-GgaJpZM4Fb7x_>
.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I am really happy with these changes. They add useful functionality. However, there are a few general remarks (please don't be shocked by the number of line-comments in this review):
- The biggest issue I see is that there are some details in the code that are strongly related to your problem:
- In the code I saw the hard-coded use of the value
10
and20
on some place, which (I think) are specific to your case. These values should be automatically determined by the input (if possible), or otherwise be passed as function or cluster argument (wherever it makes sense to you). - A similar (but less problematic case), I saw the usage of the comments & variable names like "user" and "profile" in some cases, which are again tightly coupled to your problem. Using more generic names will help users understand the code better.
- In the code I saw the hard-coded use of the value
- I saw the usage of Spanish in some places. As this is an open-source project, this would be better written in English.
- Lastly, and also considering that this is an open-source project (and also for consistency inside the project itself) I saw quite a lot of (minuscule) violations of PEP8 which should be addressed. You can use a tool like
flake8
on your changes to find those lines. The main problems are:- Proper use of white-space around operator
- capitalisation of methods.
I will give you some time to address this, but I understand that this review comes late and that you maybe have other things to do by now. If that is the case, let me know and I will fix those issues myself. I would however prefer that the changes be made by the original author. This would identify the proper person as author of that particular line.
HDexample.py
Outdated
num_users=100 | ||
numsse=0 | ||
numclusters=5 # starts at 5 | ||
max_iteraciones=10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to rename this to max_iterations
to keep the code in English, so other readers have an easier time reading this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
of course, i will rename it
HDexample.py
Outdated
numclusters=5 # starts at 5 | ||
max_iteraciones=10 | ||
ts = time.time() | ||
start_time=datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can simplify these two lines by simply using:
start_time = datetime.now()
There is also no need to use time.time()
first an there is also no need to run strftime
. Python takes care of that when printing (but the format will be slightly different).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right
HDexample.py
Outdated
ts = time.time() | ||
start_time=datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S') | ||
while numclusters<=50: # compute SSE from num_clusters=5 to 50 | ||
supersol=0#supersolucion, distancias entre el clusters y los usuarios. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please translate this comment to English.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
of course, i will do it
HDexample.py
Outdated
while numclusters<=50: # compute SSE from num_clusters=5 to 50 | ||
supersol=0#supersolucion, distancias entre el clusters y los usuarios. | ||
users=[] # users are the items of this example | ||
for i in range(num_users):#en el range el numero de usuarios |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please translate this comment to English
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
of course, i will do it
HDexample.py
Outdated
print " executing...",numclusters | ||
ts = time.time() | ||
st=datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S') | ||
print st |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not Python3 compatible (print
is a function in Python 3). You might want to add this to the beginning of the module (if you are using Python 2):
from __future__ import print_function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, i will do
cluster/method/kmeans.py
Outdated
item, closest_centroid): | ||
closest_cluster = my_centroids[centro] | ||
|
||
if id(closest_cluster) != id(origin): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a specific reason why you used id(
here instead of using the is
operator? If not, this might be a bit more pythonic by writing:
if closest_cluster is origin:
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right
cluster/util.py
Outdated
def HDcentroid(data): | ||
dict_words={} | ||
dict_weight={} | ||
words_per_user=10 #10 words per user. This value is not used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line can be removed. The value is only used inside the scope of the following for
loop, and only introduces "noise" here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i agree
cluster/util.py
Outdated
dict_words={} | ||
dict_weight={} | ||
words_per_user=10 #10 words per user. This value is not used. | ||
num_users_cluster=len(data)# len(data) is the number of users (user=item) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming this code is used with a structure other than "users" this comments does not make much sense and could be rewritten to something more generic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i agree
cluster/util.py
Outdated
dict_weight[word]=data[i][2*j+1] | ||
#l is a ordered list of the keywords, with the sum of the weight of every popular keyword | ||
l=dict_words.items() | ||
l.sort(key=lambda x:10000000-x[1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why exactly has the value 1000000
been used here? This line need a small comment explaining what's happening here, and why 1000000
.
cluster/util.py
Outdated
l=dict_words.items() | ||
l.sort(key=lambda x:10000000-x[1]) | ||
|
||
words_per_centroid=min(10,len(l)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why 10
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realised that my travis config was broken. That caused GitHub to mark these changes as broken but did not give any useful output. This should be fixed now. Most changes look good to me now, but unfortunately it does not run in Python3 yet due to an import error.
You should see the error detail on travis.
cluster/util.py
Outdated
|
||
from __future__ import print_function | ||
import logging | ||
from HDdistances import HD_profile_dimensions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depending on how the package is installed, this will not work. Python will look for HDistances
in the system-wide available packages and it will not be found. This is something that changed in Python 3!
To fix this, either use a relative import like:
from .HDistances import HD_profile_dimensions
or prefix it explicitly with the package name like:
from cluser.HDistances import HD_profile_dimensions
I think I also fixed the travis config, so you should see better output in the pull-request and you should immediately see if it successfully passes all tests in both Python 2 and 3 once you push. In order to benefit from this you should pull in the changes from my develop
branch before pushing back. Something like:
git pull https://github.com/exhuma/python-cluster.git develop
before pushing again. I hope it will not cause any conflicts. That depends on how long ago you last pulled from my repo.
Sorry for the late reply. I did not get any update e-mails from github :( |
change in util.py done |
I had another look at the code, and unfortunately the logic is too tightly coupled to your application logic ("users" and "keywords"). This means that the changes would only apply and work for your application. I would be willing to work on this together if you are still interested in getting the changes into the library. The main change would be to extract the application logic from the new functions and expose them via the function arguments. |
Hi Michel Albert Thanks for your feedback and interest. We (juan Ramos and me) have analyzed your issue and we propose you a possible solution. Let us know if you like it: "users" will not appear anymore. In its place, we will put "items". these changes, in combination with according changes on comments, could be considered a generic approach. If you like it, we will modify quickly (in one day) best regards |
New functionalities for High Dimensionality problem and improved performance:
The improvements achieved on cluster library are related with :
High dimensionality (HD) problems arºe those which have items with high number of dimensions. There are two types of HD problems:
a)set of items with large number of dimensions.
b)set of items with a limited number of dimensions from a large available number of dimensions:
The HD problems involves a high cost computation because distance functions in this case takes more operations than Low dimensionality problems.
For case "b" (valid also for "a"), a new distance for HD problems is available: HDdistItems() ,HDequals()
This distance function compares dimensions between 2 items.
Each dimension of item1 is searched in item2, and if it is found, then the distance takes into account the difference (mahatan style)
if the dimension does not exist in item2, a maximum value is added to the total distance between item1 and item2.
there is no difference with current usage::
Additionally, now the number of iterations can be limited in order to save time
Experimentally, we have concluded that 10 iterations is enough accurate for most cases.
The new HDgetClusters() function is linear. Avoid the recalculation of centroids
whereas original function getClusters() is N*N complex, because recalculate the
centroid when move an item from one cluster to another.
This new function can be used for low and high dimensionality problems, increasing
performance in both cases::
Other new available optimization inside HDcentroid() function in is the use of mean instead median at centroid calculation.
median is more accurate but involves more computations when N is huge.
The function HDcentroid() is invoked internally by HDgetclusters()