Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

new functionalities for High Dimensionality problem and improved performance #19

Open
wants to merge 17 commits into
base: master
Choose a base branch
from

Conversation

jjaranda13
Copy link

New functionalities for High Dimensionality problem and improved performance:

The improvements achieved on cluster library are related with :

  • High Dimensionality problem
  • improved performance, making clustering linear

High dimensionality (HD) problems arºe those which have items with high number of dimensions. There are two types of HD problems:

a)set of items with large number of dimensions.
b)set of items with a limited number of dimensions from a large available number of dimensions:

For example considering dimensions X, Y, Z, K, L, M and the items:
item1=(X=2, Z=5, L=7)
item2=(X=6, Y=5, M=7)

The HD problems involves a high cost computation because distance functions in this case takes more operations than Low dimensionality problems.

For case "b" (valid also for "a"), a new distance for HD problems is available: HDdistItems() ,HDequals()
This distance function compares dimensions between 2 items.
Each dimension of item1 is searched in item2, and if it is found, then the distance takes into account the difference (mahatan style)
if the dimension does not exist in item2, a maximum value is added to the total distance between item1 and item2.

there is no difference with current usage::

   cl = KMeansClustering(users,HDdistItems,HDequals);

Additionally, now the number of iterations can be limited in order to save time
Experimentally, we have concluded that 10 iterations is enough accurate for most cases.
The new HDgetClusters() function is linear. Avoid the recalculation of centroids
whereas original function getClusters() is N*N complex, because recalculate the
centroid when move an item from one cluster to another.
This new function can be used for low and high dimensionality problems, increasing
performance in both cases::

   solution = cl.HDgetclusters(numclusters,max_iterations);

Other new available optimization inside HDcentroid() function in is the use of mean instead median at centroid calculation.
median is more accurate but involves more computations when N is huge.
The function HDcentroid() is invoked internally by HDgetclusters()

jjaranda13 added 13 commits July 20, 2015 11:58
description of new functionalities for high dimensionality problems and improved performance
new contributors
added new function:
HDcentroid()
This file provides functionalities for High dimensionality problems but also for low dimensionality problems
a High dimensionality example
@exhuma
Copy link
Owner

exhuma commented Jul 20, 2015

I'm currently at EuroPython. I'll look into it as soon as possible. In case
you're here as well, we could meet up.
On 20 Jul 2015 15:24, "jjaranda13" [email protected] wrote:

New functionalities for High Dimensionality problem and improved
performance:

The improvements achieved on cluster library are related with :

  • High Dimensionality problem
  • improved performance, making clustering linear

High dimensionality (HD) problems arºe those which have items with high
number of dimensions. There are two types of HD problems:

a)set of items with large number of dimensions.
b)set of items with a limited number of dimensions from a large available
number of dimensions:

For example considering dimensions X, Y, Z, K, L, M and the items:
item1=(X=2, Z=5, L=7)
item2=(X=6, Y=5, M=7)

The HD problems involves a high cost computation because distance
functions in this case takes more operations than Low dimensionality
problems.

For case "b" (valid also for "a"), a new distance for HD problems is
available: HDdistItems() ,HDequals()
This distance function compares dimensions between 2 items.
Each dimension of item1 is searched in item2, and if it is found, then the
distance takes into account the difference (mahatan style)
if the dimension does not exist in item2, a maximum value is added to the
total distance between item1 and item2.

there is no difference with current usage::

cl = KMeansClustering(users,HDdistItems,HDequals);

Additionally, now the number of iterations can be limited in order to save
time
Experimentally, we have concluded that 10 iterations is enough accurate
for most cases.
The new HDgetClusters() function is linear. Avoid the recalculation of
centroids
whereas original function getClusters() is N*N complex, because
recalculate the
centroid when move an item from one cluster to another.
This new function can be used for low and high dimensionality problems,
increasing
performance in both cases::

solution = cl.HDgetclusters(numclusters,max_iterations);

Other new available optimization inside HDcentroid() function in is the
use of mean instead median at centroid calculation.
median is more accurate but involves more computations when N is huge.

The function HDcentroid() is invoked internally by HDgetclusters()

You can view, comment on, or merge this pull request online at:

#19
Commit Summary

  • Update README.rst
  • Update README.rst
  • Update README.rst
  • Update README.rst
  • Update README.rst
  • Update README.rst
  • Update AUTHORS
  • Update util.py
  • Update kmeans.py
  • Update util.py
  • Create HDdistances.py
  • Create HDexample.py
  • Update HDexample.py

File Changes

Patch Links:


Reply to this email directly or view it on GitHub
#19.

juanrd0088 and others added 2 commits July 20, 2015 16:07
bug en HddistItems: factor_len no era un numero real y no funcionaba.
@jjaranda13
Copy link
Author

This pull requests is only for add a change provided by my colleage (Juan Ramos)

it is a pity but i am not in europython :-(

@jjaranda13 jjaranda13 closed this Jul 20, 2015
@jjaranda13 jjaranda13 reopened this Jul 20, 2015
@exhuma
Copy link
Owner

exhuma commented May 13, 2018

Sorry for the very late reply. I've had a very crazy year 2017... it's slowly calming down and I will review this in the coming days.

@tim-littlefair
Copy link
Contributor

tim-littlefair commented May 14, 2018 via email

Copy link
Owner

@exhuma exhuma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I am really happy with these changes. They add useful functionality. However, there are a few general remarks (please don't be shocked by the number of line-comments in this review):

  • The biggest issue I see is that there are some details in the code that are strongly related to your problem:
    • In the code I saw the hard-coded use of the value 10 and 20 on some place, which (I think) are specific to your case. These values should be automatically determined by the input (if possible), or otherwise be passed as function or cluster argument (wherever it makes sense to you).
    • A similar (but less problematic case), I saw the usage of the comments & variable names like "user" and "profile" in some cases, which are again tightly coupled to your problem. Using more generic names will help users understand the code better.
  • I saw the usage of Spanish in some places. As this is an open-source project, this would be better written in English.
  • Lastly, and also considering that this is an open-source project (and also for consistency inside the project itself) I saw quite a lot of (minuscule) violations of PEP8 which should be addressed. You can use a tool like flake8 on your changes to find those lines. The main problems are:
    • Proper use of white-space around operator
    • capitalisation of methods.

I will give you some time to address this, but I understand that this review comes late and that you maybe have other things to do by now. If that is the case, let me know and I will fix those issues myself. I would however prefer that the changes be made by the original author. This would identify the proper person as author of that particular line.

HDexample.py Outdated
num_users=100
numsse=0
numclusters=5 # starts at 5
max_iteraciones=10
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to rename this to max_iterations to keep the code in English, so other readers have an easier time reading this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

of course, i will rename it

HDexample.py Outdated
numclusters=5 # starts at 5
max_iteraciones=10
ts = time.time()
start_time=datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can simplify these two lines by simply using:

start_time = datetime.now()

There is also no need to use time.time() first an there is also no need to run strftime. Python takes care of that when printing (but the format will be slightly different).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right

HDexample.py Outdated
ts = time.time()
start_time=datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')
while numclusters<=50: # compute SSE from num_clusters=5 to 50
supersol=0#supersolucion, distancias entre el clusters y los usuarios.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please translate this comment to English.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

of course, i will do it

HDexample.py Outdated
while numclusters<=50: # compute SSE from num_clusters=5 to 50
supersol=0#supersolucion, distancias entre el clusters y los usuarios.
users=[] # users are the items of this example
for i in range(num_users):#en el range el numero de usuarios
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please translate this comment to English

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

of course, i will do it

HDexample.py Outdated
print " executing...",numclusters
ts = time.time()
st=datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')
print st
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not Python3 compatible (print is a function in Python 3). You might want to add this to the beginning of the module (if you are using Python 2):

from __future__ import print_function

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, i will do

item, closest_centroid):
closest_cluster = my_centroids[centro]

if id(closest_cluster) != id(origin):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific reason why you used id( here instead of using the is operator? If not, this might be a bit more pythonic by writing:

if closest_cluster is origin:
    ...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right

cluster/util.py Outdated
def HDcentroid(data):
dict_words={}
dict_weight={}
words_per_user=10 #10 words per user. This value is not used.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line can be removed. The value is only used inside the scope of the following for loop, and only introduces "noise" here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree

cluster/util.py Outdated
dict_words={}
dict_weight={}
words_per_user=10 #10 words per user. This value is not used.
num_users_cluster=len(data)# len(data) is the number of users (user=item)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming this code is used with a structure other than "users" this comments does not make much sense and could be rewritten to something more generic.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree

cluster/util.py Outdated
dict_weight[word]=data[i][2*j+1]
#l is a ordered list of the keywords, with the sum of the weight of every popular keyword
l=dict_words.items()
l.sort(key=lambda x:10000000-x[1])
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why exactly has the value 1000000 been used here? This line need a small comment explaining what's happening here, and why 1000000.

cluster/util.py Outdated
l=dict_words.items()
l.sort(key=lambda x:10000000-x[1])

words_per_centroid=min(10,len(l))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why 10?

Copy link
Owner

@exhuma exhuma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realised that my travis config was broken. That caused GitHub to mark these changes as broken but did not give any useful output. This should be fixed now. Most changes look good to me now, but unfortunately it does not run in Python3 yet due to an import error.

You should see the error detail on travis.

cluster/util.py Outdated

from __future__ import print_function
import logging
from HDdistances import HD_profile_dimensions
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on how the package is installed, this will not work. Python will look for HDistances in the system-wide available packages and it will not be found. This is something that changed in Python 3!

To fix this, either use a relative import like:

from .HDistances import HD_profile_dimensions

or prefix it explicitly with the package name like:

from cluser.HDistances import HD_profile_dimensions

I think I also fixed the travis config, so you should see better output in the pull-request and you should immediately see if it successfully passes all tests in both Python 2 and 3 once you push. In order to benefit from this you should pull in the changes from my develop branch before pushing back. Something like:

git pull https://github.com/exhuma/python-cluster.git develop

before pushing again. I hope it will not cause any conflicts. That depends on how long ago you last pulled from my repo.

@exhuma
Copy link
Owner

exhuma commented Jul 5, 2018

Sorry for the late reply. I did not get any update e-mails from github :(

@jjaranda13
Copy link
Author

change in util.py done

@exhuma
Copy link
Owner

exhuma commented Dec 31, 2020

I had another look at the code, and unfortunately the logic is too tightly coupled to your application logic ("users" and "keywords"). This means that the changes would only apply and work for your application.

I would be willing to work on this together if you are still interested in getting the changes into the library.

The main change would be to extract the application logic from the new functions and expose them via the function arguments.

@jjaranda13
Copy link
Author

I had another look at the code, and unfortunately the logic is too tightly coupled to your application logic ("users" and "keywords"). This means that the changes would only apply and work for your application.

I would be willing to work on this together if you are still interested in getting the changes into the library.

The main change would be to extract the application logic from the new functions and expose them via the function arguments.

Hi Michel Albert

Thanks for your feedback and interest. We (juan Ramos and me) have analyzed your issue and we propose you a possible solution. Let us know if you like it:

"users" will not appear anymore. In its place, we will put "items".
A "user" is a "profile" composed by a certain number of pairs (keyword, weight). We can replace them by pairs of (dimension, value)
In a nutshell:
users-> items
keyword-> dimension
weight -> value

these changes, in combination with according changes on comments, could be considered a generic approach. If you like it, we will modify quickly (in one day)

best regards

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants