new functionalities for High Dimensionality problem and improved performance #19

jjaranda13 · 2015-07-20T13:23:40Z

New functionalities for the high dimensionality problem and improved performance:

The improvements made to the cluster library concern:

- the high dimensionality problem
- improved performance, making clustering linear

High dimensionality (HD) problems are those whose items have a high number of dimensions. There are two types of HD problems:

a) a set of items with a large number of dimensions.
b) a set of items, each with a limited number of dimensions drawn from a large number of available dimensions.

For example, given the dimensions X, Y, Z, K, L, M and the items:
item1=(X=2, Z=5, L=7)
item2=(X=6, Y=5, M=7)

HD problems are computationally expensive because the distance functions take more operations than in low dimensionality problems.

For case "b" (also valid for "a"), new functions for HD problems are available: HDdistItems() and HDequals().
The distance function compares the dimensions of two items.
Each dimension of item1 is searched for in item2. If it is found, the difference between the two values is added to the distance (Manhattan style);
if the dimension does not exist in item2, a maximum value is added to the total distance between item1 and item2.
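
The rule described above can be sketched roughly as follows. This is only an illustration and not the code of HDdistItems(): the dict-based item layout, the function name `sparse_manhattan` and the `max_value` parameter are assumptions made for the example.

```python
def sparse_manhattan(item1, item2, max_value=1.0):
    """Manhattan-style distance between two sparse items.

    Items are dicts mapping a dimension name to its value (assumed layout).
    """
    total = 0.0
    for dimension, value in item1.items():
        if dimension in item2:
            # Shared dimension: add the absolute difference of the values.
            total += abs(value - item2[dimension])
        else:
            # Dimension missing from item2: add the maximum penalty.
            total += max_value
    return total


item1 = {"X": 2, "Z": 5, "L": 7}
item2 = {"X": 6, "Y": 5, "M": 7}
print(sparse_manhattan(item1, item2, max_value=10))  # 4 + 10 + 10 = 24.0
```

Only the dimensions of item1 are visited, so the cost depends on the number of dimensions an item actually has, not on the total number of available dimensions. Note that a rule written this way is not symmetric in general.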

Usage is the same as before::

   cl = KMeansClustering(users, HDdistItems, HDequals)

Additionally, the number of iterations can now be limited in order to save time.
Experimentally, we have found that 10 iterations are accurate enough for most cases.
The new HDgetclusters() function is linear because it avoids recalculating centroids on every move,
whereas the original getClusters() function has N*N complexity because it recalculates the
centroid each time an item moves from one cluster to another.
The new function can be used for both low and high dimensionality problems, improving
performance in both cases::

   solution = cl.HDgetclusters(numclusters, max_iterations)
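
To illustrate the idea of capping the iterations and recomputing centroids once per pass instead of on every move, here is a minimal generic sketch. It is not the implementation of HDgetclusters(): the names `capped_kmeans` and `squared_distance`, the tuple-based points and the choice of the first k points as initial centroids are all assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def capped_kmeans(points, k, max_iterations=10):
    """Batch k-means sketch: centroids are recomputed once per pass."""
    # Simplification: the first k points serve as the initial centroids.
    centroids = list(points[:k])
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: one distance per (point, centroid) pair.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda c: squared_distance(point, centroids[c]))
            clusters[nearest].append(point)
        # Update step: each centroid is recomputed once, after all moves.
        new_centroids = []
        for index, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(col) / len(cluster) for col in zip(*cluster)))
            else:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[index])
        if new_centroids == centroids:
            break  # converged before reaching the iteration cap
        centroids = new_centroids
    return clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = capped_kmeans(points, k=2, max_iterations=10)
print(clusters)
```

Each pass costs on the order of N * k distance computations, so with a fixed cap on the number of passes the total work grows linearly with the number of items N.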

Another optimization, available inside the HDcentroid() function, is the use of the mean instead of the median when calculating centroids.
The median is more accurate but involves more computation when N is huge.
The function HDcentroid() is invoked internally by HDgetclusters().
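
The difference between the two centroid styles can be shown with a small sketch. It is not the code of HDcentroid(); the `centroid` helper and its `method` parameter are hypothetical, and the points are dense tuples for simplicity.

```python
from statistics import mean, median


def centroid(points, method="mean"):
    """Per-dimension centroid of a list of equal-length tuples."""
    aggregate = mean if method == "mean" else median
    # zip(*points) groups the values of each dimension together.
    return tuple(aggregate(column) for column in zip(*points))


points = [(1, 2), (2, 4), (9, 6)]
print(centroid(points, "mean"))
print(centroid(points, "median"))
```

The mean needs a single pass over each dimension, while the median needs the values of each dimension to be ordered first, which is where the extra cost for large N comes from.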

exhuma · 2015-07-20T13:39:20Z

I'm currently at EuroPython. I'll look into it as soon as possible. In case
you're here as well, we could meet up.
jjaranda13 · 2015-07-20T14:19:04Z

This pull request only adds a change provided by my colleague (Juan Ramos).

It is a pity, but I am not at EuroPython :-(

exhuma · 2018-05-13T18:46:30Z

Sorry for the very late reply. I've had a very crazy year 2017... it's slowly calming down and I will review this in the coming days.

tim-littlefair · 2018-05-14T11:37:57Z

Michel,

Thanks very much for getting to this. I've upgraded my environment to use version 1.4.1 of cluster and I can confirm it behaves identically to my hacked version. This was very timely for me, as I am just about to package my own work for publication on PyPI.

Thanks again for a great library.

Tim

exhuma

In general I am really happy with these changes. They add useful functionality. However, there are a few general remarks (please don't be shocked by the number of line-comments in this review):

The biggest issue I see is that some details in the code are strongly tied to your problem:
- In the code I saw the hard-coded values 10 and 20 in some places, which (I think) are specific to your case. These values should be determined automatically from the input (if possible), or otherwise be passed as function or cluster arguments (wherever it makes sense to you).
- A similar (but less problematic) case: I saw comments and variable names like "user" and "profile" in some places, which are again tightly coupled to your problem. Using more generic names will help users understand the code better.

I saw Spanish used in some places. As this is an open-source project, this would be better written in English.

Lastly, and also considering that this is an open-source project (and also for consistency inside the project itself), I saw quite a lot of (minuscule) violations of PEP8 which should be addressed. You can use a tool like flake8 on your changes to find those lines. The main problems are:
- Proper use of white-space around operators
- Capitalisation of methods

I will give you some time to address this, but I understand that this review comes late and that you maybe have other things to do by now. If that is the case, let me know and I will fix those issues myself. I would however prefer that the changes be made by the original author. This would identify the proper person as author of that particular line.

exhuma · 2018-05-16T06:09:58Z

HDexample.py

+num_users=100
+numsse=0
+numclusters=5 # starts at 5
+max_iteraciones=10


It would be nice to rename this to max_iterations to keep the code in English, so other readers have an easier time reading this.

Of course, I will rename it.

exhuma · 2018-05-16T06:12:04Z

HDexample.py

+numclusters=5 # starts at 5
+max_iteraciones=10
+ts = time.time()
+start_time=datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')


You can simplify these two lines by simply using:

start_time = datetime.datetime.now()

There is also no need to call time.time() first, and there is no need to run strftime. Python takes care of that when printing (but the format will be slightly different).

You are right.
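
As a small illustration of the simplification discussed above, the sketch below records a start time and an elapsed duration without time.time() or strftime. It assumes the class is imported with `from datetime import datetime`; with a plain `import datetime` the call is spelled `datetime.datetime.now()` instead.

```python
from datetime import datetime

start_time = datetime.now()

# ... the clustering work would run here ...

elapsed = datetime.now() - start_time  # a timedelta object

# Printing a datetime uses its default ISO-like format, no strftime needed.
print("started at", start_time)
print("elapsed:", elapsed)
```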

exhuma · 2018-05-16T06:12:32Z

HDexample.py

+ts = time.time()
+start_time=datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')
+while numclusters<=50: # compute SSE from num_clusters=5 to 50
+    supersol=0#supersolucion, distancias entre el clusters y los usuarios.


Please translate this comment to English.

Of course, I will do it.

exhuma · 2018-05-16T06:13:38Z

HDexample.py

+while numclusters<=50: # compute SSE from num_clusters=5 to 50
+    supersol=0#supersolucion, distancias entre el clusters y los usuarios.
+    users=[] # users are the items of this example
+    for i in range(num_users):#en el range el numero de usuarios


Please translate this comment to English

Of course, I will do it.

exhuma · 2018-05-16T06:14:10Z

HDexample.py

+    print " executing...",numclusters
+    ts = time.time()
+    st=datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')
+    print st


This is not Python3 compatible (print is a function in Python 3). You might want to add this to the beginning of the module (if you are using Python 2):

from __future__ import print_function

Thanks, I will do that.

exhuma · 2018-05-17T07:20:56Z

cluster/method/kmeans.py

+                    item, closest_centroid):
+                closest_cluster = my_centroids[centro]
+
+        if id(closest_cluster) != id(origin):


Is there a specific reason why you used id() here instead of the is operator? If not, this might be a bit more pythonic:

if closest_cluster is not origin: ...

You are right.
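
For readers less familiar with identity checks, the following sketch shows that comparing id() values and using the is / is not operators express the same test, and that both differ from equality. The variable names are made up for the example.

```python
origin = ["a", "b"]
same_object = origin        # a second name for the same list
equal_copy = list(origin)   # a different list with equal contents

print(same_object is origin)      # identity: same object
print(equal_copy == origin)       # equality: same contents
print(equal_copy is not origin)   # but not the same object

# The id() comparison and the identity operator agree.
print((id(equal_copy) != id(origin)) == (equal_copy is not origin))
```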

exhuma · 2018-05-17T07:22:44Z

cluster/util.py

+def HDcentroid(data):
+    dict_words={}
+    dict_weight={}
+    words_per_user=10 #10 words per user. This value is not used.


This line can be removed. The value is only used inside the scope of the following for loop, and only introduces "noise" here.

exhuma · 2018-05-17T07:24:22Z

cluster/util.py

+    dict_words={}
+    dict_weight={}
+    words_per_user=10 #10 words per user. This value is not used.
+    num_users_cluster=len(data)# len(data) is the number of users (user=item)


Assuming this code is used with a structure other than "users", this comment does not make much sense and could be rewritten to something more generic.

exhuma · 2018-05-17T07:26:33Z

cluster/util.py

+                dict_weight[word]=data[i][2*j+1]
+    #l is a ordered list of the keywords, with the sum of the weight of every popular keyword
+    l=dict_words.items()
+    l.sort(key=lambda x:10000000-x[1])


Why exactly has the value 10000000 been used here? This line needs a small comment explaining what's happening here, and why 10000000.
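
One possible reading of that line is that subtracting each weight from a large constant is a way to sort by weight in descending order. If that is the intent (an assumption, the author has not confirmed it), the same ordering can be expressed without a magic number. The sample data below is invented for the example.

```python
dict_words = {"python": 3.0, "cluster": 7.5, "kmeans": 5.0}

# Sort (word, weight) pairs by weight, largest first. sorted() also works
# on Python 3, where dict.items() returns a view without a .sort() method.
ranked = sorted(dict_words.items(), key=lambda pair: pair[1], reverse=True)

top_two = ranked[:2]
print(ranked)
print(top_two)
```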

exhuma · 2018-05-17T07:26:54Z

cluster/util.py

+    l=dict_words.items()
+    l.sort(key=lambda x:10000000-x[1])
+
+    words_per_centroid=min(10,len(l))


exhuma

I realised that my travis config was broken. That caused GitHub to mark these changes as broken but did not give any useful output. This should be fixed now. Most changes look good to me now, but unfortunately it does not run in Python3 yet due to an import error.

You should see the error detail on travis.

exhuma · 2018-07-05T08:53:42Z

cluster/util.py


 from __future__ import print_function
 import logging
+from HDdistances import HD_profile_dimensions


Depending on how the package is installed, this will not work. Python will look for HDdistances in the system-wide available packages and it will not be found. This is something that changed in Python 3!

To fix this, either use a relative import like:

from .HDdistances import HD_profile_dimensions

or prefix it explicitly with the package name like:

from cluster.HDdistances import HD_profile_dimensions

I think I also fixed the travis config, so you should see better output in the pull-request and you should immediately see if it successfully passes all tests in both Python 2 and 3 once you push. In order to benefit from this you should pull in the changes from my develop branch before pushing back. Something like:

git pull https://github.com/exhuma/python-cluster.git develop

before pushing again. I hope it will not cause any conflicts. That depends on how long ago you last pulled from my repo.

exhuma · 2018-07-05T08:58:34Z

Sorry for the late reply. I did not get any update e-mails from github :(

jjaranda13 · 2018-07-06T07:38:12Z

change in util.py done

exhuma · 2020-12-31T12:01:49Z

I had another look at the code, and unfortunately the logic is too tightly coupled to your application logic ("users" and "keywords"). This means that the changes would only apply and work for your application.

I would be willing to work on this together if you are still interested in getting the changes into the library.

The main change would be to extract the application logic from the new functions and expose them via the function arguments.

jjaranda13 · 2021-01-12T14:45:16Z

Hi Michel Albert,

Thanks for your feedback and interest. We (Juan Ramos and I) have analyzed your issue and would like to propose a possible solution. Let us know if you like it:

"users" will no longer appear. In their place, we will use "items".
A "user" is a "profile" composed of a certain number of (keyword, weight) pairs. We can replace them with (dimension, value) pairs.
In a nutshell:
users -> items
keyword -> dimension
weight -> value

These changes, combined with corresponding changes to the comments, could be considered a generic approach. If you like it, we will make the modifications quickly (in one day).
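
To make the proposed naming concrete, here is a small sketch of an item expressed as (dimension, value) pairs. The flat alternating layout is an assumption inferred from the `data[i][2*j+1]` indexing seen in the review, and the helper name `to_pairs` is hypothetical.

```python
def to_pairs(flat_item):
    """Turn [dim1, val1, dim2, val2, ...] into [(dim1, val1), (dim2, val2), ...]."""
    if len(flat_item) % 2 != 0:
        raise ValueError("expected an even number of entries")
    # Even positions hold dimensions, odd positions hold their values.
    return list(zip(flat_item[0::2], flat_item[1::2]))


item = ["X", 2, "Z", 5, "L", 7]
print(to_pairs(item))
```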

best regards

jjaranda13 added 13 commits July 20, 2015 11:58

Update README.rst

7cec18a

description of new functionalities for high dimensionality problems and improved performance

Update README.rst

2252ead

Update README.rst

f1aeaba

Update README.rst

fe70bc2

Update README.rst

7223010

Update README.rst

25ea0d3

Update AUTHORS

f4d416c

new contributors

Update util.py

56f7da5

added new function: HDcentroid()

Update kmeans.py

4dad547

Update util.py

55ae158

Create HDdistances.py

2ae8e4b

This file provides functionalities for High dimensionality problems but also for low dimensionality problems

Create HDexample.py

ee2e62c

Update HDexample.py

f275e9d

a High dimensionality example

juanrd0088 and others added 2 commits July 20, 2015 16:07

Update HDdistances.py

4b58056

bug in HddistItems: factor_len was not a real number, so it did not work.

Merge pull request #1 from juanrd0088/master

85e88d7

Update HDdistances.py

jjaranda13 closed this Jul 20, 2015

jjaranda13 reopened this Jul 20, 2015

exhuma requested changes May 17, 2018

solved pull request issues

4340c79

exhuma requested changes Jul 5, 2018

exuma request

5d818c8
