much improved NN graph #43

Open
wants to merge 92 commits into base: master

Commits (92)
1ee85b5
attempt to refactor nn graph building
naspert Mar 19, 2018
4bacd5c
update tests
naspert Mar 19, 2018
00bbcdd
fix typo
naspert Mar 19, 2018
b822333
fix tests (avoiding not implemented combinations)
naspert Mar 19, 2018
38aebd0
- fix missing space after colon in dictionary
naspert Mar 19, 2018
524c60f
fix (matlab) GSP url
naspert Mar 19, 2018
ae83814
throw exception when using FLANN + max_dist (produces incorrect results)
naspert Mar 19, 2018
62fc0ce
update test case to fit FLANN & max_dist exception
naspert Mar 19, 2018
6f473fa
implement nn graph using pdist using radius
naspert Mar 20, 2018
25ec6d2
implement radius nn graph with flann
naspert Mar 20, 2018
96b628e
flann returns the squared distance when called with 'euclidean' dista…
naspert Mar 20, 2018
09bbff4
compute sqrt of list properly
naspert Mar 20, 2018
27b9a03
use cyflann instead of pyflann (radius search not working)
naspert Mar 20, 2018
8a1f9b9
check nn graphs building against pdist reference
naspert Mar 20, 2018
6e9e2ac
cyflann needs the flann library to be installed on the system
naspert Mar 20, 2018
811de06
check nn graphs building against pdist reference
naspert Mar 20, 2018
813fe39
backport stuff from cyflann branch
naspert Mar 20, 2018
4a4d597
flann should (mostly) work for knn graphs
naspert Mar 20, 2018
53dffc1
fix pdist warnings
naspert Mar 21, 2018
1309e92
implement and use scipy-ckdtree as default (faster than kdtree)
naspert Mar 21, 2018
90ae9a8
Merge remote-tracking branch 'origin-nas/nn_cyflann' into nn_refactor
naspert Mar 22, 2018
648fa91
backport README changes from master
naspert Mar 22, 2018
96fa5f6
Merge branch 'master' of https://github.com/epfl-lts2/pygsp into nn_r…
naspert Mar 22, 2018
c26e449
Merge branch 'master' into nn_refactor
naspert Mar 22, 2018
8e7c553
add nmslib
naspert Mar 23, 2018
b83e467
test flann when not on windows
naspert Mar 26, 2018
28b7858
use the same code to build sparse matrix for knn and radius
naspert Mar 29, 2018
188c4a6
building the graph with rescale/center=False should also work
naspert Mar 29, 2018
59c131a
Merge pull request #1 from naspert/nmslib
naspert Mar 29, 2018
8e98b77
update doc for nmslib
naspert Mar 29, 2018
08ae29f
enable multithreading with ckdtree/nmslib
naspert Apr 9, 2018
57e9661
Merge branch 'master' into nn_refactor
naspert Jun 20, 2018
a562896
fix _get_extra_repr
naspert Jun 20, 2018
f69c694
Merge branch 'nn_refactor' of https://github.com/naspert/pygsp into n…
mdeff Feb 8, 2019
441341f
Merge branch 'master' into naspert-nn_refactor
mdeff Feb 13, 2019
8a51649
NNGraph: clean and doc (PR #21)
mdeff Feb 14, 2019
9b5d8c0
python 2.7 doesn't support keyword-only args
mdeff Feb 14, 2019
720646e
simplify test_nngraph (PR #21)
mdeff Feb 15, 2019
172d83f
python 2.7 doesn't support dict unpacking
mdeff Feb 15, 2019
be16da9
correct number of edges (PR #21)
mdeff Feb 15, 2019
a879818
order=3 by default (order=0 is not supported by all backends)
mdeff Feb 15, 2019
719d397
avoid deprecation warning
mdeff Feb 15, 2019
eb2ab0b
deal with empty neighborhood
mdeff Feb 15, 2019
b2bfb51
nngraph: don't store features
mdeff Feb 15, 2019
e1879ee
nngraph: further cleanup
mdeff Feb 15, 2019
bf7427f
nngraph: standardize instead of center and rescale
mdeff Feb 15, 2019
ec74ed7
nngraph: simplify default kernel_width
mdeff Feb 16, 2019
c1e1148
nngraph: test empty graph
mdeff Feb 16, 2019
57ce98c
no assertLogs in python 2.7
mdeff Feb 16, 2019
695272b
nngraph: fix symmetrization
mdeff Feb 16, 2019
505e456
nngraph: fix radius cKDTree (PR #21)
mdeff Feb 16, 2019
204ad19
compact code
mdeff Feb 19, 2019
2b25337
NNGraph: allow user to pass parameters to backends
mdeff Feb 19, 2019
8cc3539
fix flann distances
mdeff Feb 19, 2019
ebc5c05
NNGraph: test consistency across backends
mdeff Feb 19, 2019
1167f52
python 2.7 dict unpacking
mdeff Feb 19, 2019
3638cfd
pdist accepts no parameters
mdeff Feb 19, 2019
9b663aa
NNGraph: test distance on a circle
mdeff Feb 20, 2019
4af4118
NNGraph pdist: don't sort twice
mdeff Feb 20, 2019
624af23
NNGraph: fuse knn and radius implementations
mdeff Feb 20, 2019
043579e
nmslib: number of thread is automatically set to max
mdeff Feb 20, 2019
1d22376
order consistent with metric
mdeff Feb 20, 2019
0763076
cleaner error handling
mdeff Feb 20, 2019
3f0c2b5
nngraph: test standardization
mdeff Feb 20, 2019
26e12e3
nngraph: radius estimation
mdeff Feb 20, 2019
080bb5c
fix others uses of radius
mdeff Feb 21, 2019
dad4105
nngraph: check shape of features
mdeff Feb 24, 2019
f544e1e
nngraph: fix definition of gaussian kernel
mdeff Feb 24, 2019
9c8e86e
nngraph: allow users to choose the similarity kernel
mdeff Feb 24, 2019
bfef548
nngraph: fix attributes
mdeff Feb 24, 2019
5c2e856
nngraph: fix intermittent test failure of nmslib
mdeff Feb 24, 2019
af5aeca
nngraph: width = radius / 2
mdeff Feb 24, 2019
0fc8fd1
nngraph: doc and examples
mdeff Feb 25, 2019
cbb2537
nngraph: update history
mdeff Feb 25, 2019
1da0e55
Update nngraph.py
nperraud Feb 25, 2019
17dc1c6
nngraph: only warn for similarity > 1
mdeff Mar 1, 2019
ad5caee
show original exception if nmslib cannot be imported
mdeff Mar 13, 2019
f9cc066
add nn support
nperraud Jul 22, 2019
29d6fa6
make test work
nperraud Jul 22, 2019
280ae3c
fix tests
nperraud Jul 22, 2019
0b1242b
make k=4 to pass tests
nperraud Jul 22, 2019
ddc290e
test
nperraud Jul 25, 2019
9ad6bff
making test pass -- not very clean nn function
nperraud Jul 26, 2019
17e24a7
update tests
nperraud Aug 16, 2019
d933292
Merge branch 'master' into naspert-nn_refactor
nperraud Aug 17, 2019
b25b3ba
small fix
nperraud Aug 17, 2019
8e2ff0e
fix doc
nperraud Aug 17, 2019
74ea306
fix test
nperraud Aug 17, 2019
cc78ff2
fix test
nperraud Aug 17, 2019
92b6fbb
fix test_graphs
nperraud Aug 17, 2019
152bfae
update reduction
nperraud Aug 19, 2019
3212350
Merge branch 'master' into naspert-nn_refactor
mdeff Nov 9, 2019
3 changes: 2 additions & 1 deletion .travis.yml
@@ -25,9 +25,10 @@ addons:
- ubuntu-toolchain-r-test
- sourceline: 'deb http://downloads.skewed.de/apt/trusty trusty universe'
packages:
- libqt5gui5 # pyqt5>5.11 otherwise cannot load the xcb platform plugin
- libflann-dev
- python3-graph-tool
- python-graph-tool
- libqt5gui5 # pyqt5>5.11 fails to load the xcb platform plugin without it

install:
- pip install --upgrade --upgrade-strategy eager .[dev]
9 changes: 9 additions & 0 deletions README.rst
@@ -39,6 +39,15 @@ A (mostly unmaintained) `Matlab version <https://lts2.epfl.ch/gsp>`_ exists.
.. |conda| image:: https://anaconda.org/conda-forge/pygsp/badges/installer/conda.svg
:target: https://anaconda.org/conda-forge/pygsp


The PyGSP is a Python package to ease
`Signal Processing on Graphs <https://arxiv.org/abs/1211.0053>`_.
The documentation is available on
`Read the Docs <https://pygsp.readthedocs.io>`_
and development takes place on
`GitHub <https://github.com/epfl-lts2/pygsp>`_.
A (mostly unmaintained) `Matlab version <https://epfl-lts2.github.io/gspbox-html>`_ exists.

The PyGSP facilitates a wide variety of operations on graphs, like computing
their Fourier basis, filtering or interpolating signals, plotting graphs,
signals, and filters. Its core is spectral graph theory, and many of the
4 changes: 4 additions & 0 deletions doc/history.rst
@@ -24,13 +24,17 @@ History
* New implementation of the Sensor graph that is simpler and scales better.
* A new learning module with three functions to solve standard semi-supervised
classification and regression problems.
* A much improved, fixed, documented, and tested NNGraph. The user can now
select the backend and similarity kernel. The radius can be estimated and
features standardized. (PR #43)
* Import and export graphs and their signals to NetworkX and graph-tool.
* Save and load graphs and theirs signals to / from GraphML, GML, and GEXF.
* Documentation: path graph linked to DCT, ring graph linked to DFT.
* We now have a gallery of examples! That is convenient for users to get a
taste of what the library can do, and to start working from a code snippet.
* Merged all the extra requirements in a single dev requirement.


Experimental filter API (to be tested and validated):

* evaluate a filter bank with ``g(values)``
4 changes: 2 additions & 2 deletions doc/tutorials/optimization.rst
@@ -85,7 +85,7 @@ We start with the graph TV regularization. We will use the :class:`pyunlocbox.so
>>> prob1 = pyunlocbox.solvers.solve([d, r, f], solver=solver,
... x0=x0, rtol=0, maxit=1000)
Solution found after 1000 iterations:
objective function f(sol) = 2.250584e+02
objective function f(sol) = 2.256055e+02
stopping criterion: MAXIT
>>>
>>> fig, ax = G.plot(prob1['sol'])
@@ -107,7 +107,7 @@ This figure shows the label signal recovered by graph total variation regularization
>>> prob2 = pyunlocbox.solvers.solve([r, f], solver=solver,
... x0=x0, rtol=0, maxit=1000)
Solution found after 1000 iterations:
objective function f(sol) = 6.504290e+01
objective function f(sol) = 4.376481e+01
stopping criterion: MAXIT
>>>
>>> fig, ax = G.plot(prob2['sol'])
247 changes: 247 additions & 0 deletions pygsp/_nearest_neighbor.py
@@ -0,0 +1,247 @@
# -*- coding: utf-8 -*-

from __future__ import division

import numpy as np
from scipy import sparse, spatial
from pygsp import utils

_logger = utils.build_logger(__name__)

def _scipy_pdist(features, metric, order, kind, k, radius, params):
    if params:
        raise ValueError('unexpected parameters {}'.format(params))
    metric = 'cityblock' if metric == 'manhattan' else metric
    metric = 'chebyshev' if metric == 'max_dist' else metric
    params = dict(metric=metric)
    if metric == 'minkowski':
        params['p'] = order
    dist = spatial.distance.pdist(features, **params)
    dist = spatial.distance.squareform(dist)
    if kind == 'knn':
        neighbors = np.argsort(dist)[:, :k+1]
        distances = np.take_along_axis(dist, neighbors, axis=-1)
    elif kind == 'radius':
        distances = []
        neighbors = []
        for distance in dist:
            neighbor = np.flatnonzero(distance < radius)
            neighbors.append(neighbor)
            distances.append(distance[neighbor])
    return neighbors, distances


def _scipy_kdtree(features, _, order, kind, k, radius, params):
    if order is None:
        raise ValueError('invalid metric for scipy-kdtree')
    eps = params.pop('eps', 0)
    tree = spatial.KDTree(features, **params)
    params = dict(p=order, eps=eps)
    if kind == 'knn':
        params['k'] = k + 1
    elif kind == 'radius':
        params['k'] = None
        params['distance_upper_bound'] = radius
    distances, neighbors = tree.query(features, **params)
    return neighbors, distances


def _scipy_ckdtree(features, _, order, kind, k, radius, params):
    if order is None:
        raise ValueError('invalid metric for scipy-kdtree')
    eps = params.pop('eps', 0)
    tree = spatial.cKDTree(features, **params)
    params = dict(p=order, eps=eps, n_jobs=-1)
    if kind == 'knn':
        params['k'] = k + 1
    elif kind == 'radius':
        params['k'] = features.shape[0]  # number of vertices

I think there is a problem here. This does not scale.
According to the doc of scipy.spatial.cKDTree.query

If x has shape tuple+(self.m,), then d has shape tuple+(k,).

Which means tree.query() is going to return a huge matrix of NxN.

This is actually different from scipy.spatial.KDTree.query which, according to the doc, returns :

If k is None, then d is an object array of shape tuple, containing lists of distances.

Which, if I understand correctly, means that tree.query() is going to return an array of size N containing lists of varying length, depending on the number of neighbours in the epsilon ball.

Anyway, I am testing this locally with the fountain dataset that you made available here and it seems that constructing a radius NN graph using this branch of pygsp crashes

G = graphs.NNGraph(points, kind='radius', radius=0.01, 
                   standardize=False, kernel_width=0.1)

whereas using the pip version of pygsp does not.

G = graphs.NNGraph(points, NNtype='radius', epsilon=0.01, 
                   rescale=False, sigma=0.1)

In my initial experiments, I also used scipy.spatial.cKDTree.query_ball_point(), which only returns the indices, so distances need to be computed separately. Maybe that could be an alternative?

See PR #60 about the use of scipy.spatial.cKDTree.query_ball_point()
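
For context, a minimal sketch of that query_ball_point() alternative (hypothetical, not part of this PR, Euclidean metric only): it returns only the indices, so the distances are recomputed per vertex afterwards.

# Hypothetical sketch, not part of this PR: radius search with
# cKDTree.query_ball_point(), which returns only indices; the Euclidean
# distances to each neighborhood are then computed explicitly.
import numpy as np
from scipy import spatial

def radius_neighbors_ball_point(features, radius):
    tree = spatial.cKDTree(features)
    # List (one entry per vertex) of lists of neighbor indices within `radius`.
    neighbors = tree.query_ball_point(features, r=radius)
    # Recompute the distances vertex by vertex.
    distances = [np.linalg.norm(features[idx] - features[i], axis=1)
                 for i, idx in enumerate(neighbors)]
    return neighbors, distances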

        params['distance_upper_bound'] = radius
    distances, neighbors = tree.query(features, **params)
    if kind == 'knn':
        return neighbors, distances
    elif kind == 'radius':
        dist = []
        neigh = []
        for distance, neighbor in zip(distances, neighbors):
            mask = (distance != np.inf)
            dist.append(distance[mask])
            neigh.append(neighbor[mask])
        return neigh, dist


def _flann(features, metric, order, kind, k, radius, params):
    if metric == 'max_dist':
        raise ValueError('flann gives wrong results for metric="max_dist".')
    try:
        import cyflann as cfl
    except Exception as e:
        raise ImportError('Cannot import cyflann. Choose another nearest '
                          'neighbors backend or try to install it with '
                          'pip (or conda) install cyflann. '
                          'Original exception: {}'.format(e))
    cfl.set_distance_type(metric, order=order)
    index = cfl.FLANNIndex()
    index.build_index(features, **params)
    # I tried changing the algorithm and testing performance on huge matrices,
    # but the default parameters seem to work best.
    if kind == 'knn':
        neighbors, distances = index.nn_index(features, k+1)
        if metric == 'euclidean':
            np.sqrt(distances, out=distances)
        elif metric == 'minkowski':
            np.power(distances, 1/order, out=distances)
    elif kind == 'radius':
        distances = []
        neighbors = []
        if metric == 'euclidean':
            radius = radius**2
        elif metric == 'minkowski':
            radius = radius**order
        n_vertices, _ = features.shape
        for vertex in range(n_vertices):
            neighbor, distance = index.nn_radius(features[vertex, :], radius)
            distances.append(distance)
            neighbors.append(neighbor)
        if metric == 'euclidean':
            distances = list(map(np.sqrt, distances))
        elif metric == 'minkowski':
            distances = list(map(lambda d: np.power(d, 1/order), distances))
    index.free_index()
    return neighbors, distances


def _nmslib(features, metric, order, kind, k, _, params):
    if kind == 'radius':
        raise ValueError('nmslib does not support kind="radius".')
    if metric == 'minkowski':
        raise ValueError('nmslib does not support metric="minkowski".')
    try:
        import nmslib as nms
    except Exception as e:
        raise ImportError('Cannot import nmslib. Choose another nearest '
                          'neighbors backend or try to install it with '
                          'pip (or conda) install nmslib. '
                          'Original exception: {}'.format(e))
    n_vertices, _ = features.shape
    params_index = params.pop('index', None)
    params_query = params.pop('query', None)
    metric = 'l2' if metric == 'euclidean' else metric
    metric = 'l1' if metric == 'manhattan' else metric
    metric = 'linf' if metric == 'max_dist' else metric
    index = nms.init(space=metric, **params)
    index.addDataPointBatch(features)
    index.createIndex(params_index)
    if params_query is not None:
        index.setQueryTimeParams(params_query)
    results = index.knnQueryBatch(features, k=k+1)
    neighbors, distances = zip(*results)
    distances = np.concatenate(distances).reshape(n_vertices, k+1)
    neighbors = np.concatenate(neighbors).reshape(n_vertices, k+1)
    return neighbors, distances

def nearest_neighbor(features, metric='euclidean', order=2, kind='knn',
                     k=10, radius=None, backend='scipy-ckdtree', **kwargs):
    r'''Find nearest neighbors.

    Parameters
    ----------
    features : numpy array of shape (n_vertices, n_features)
    metric : {'euclidean', 'manhattan', 'minkowski', 'max_dist'}, optional
        Metric used to compute pairwise distances.

        * ``'euclidean'`` defines pairwise distances as
          :math:`d(v_i, v_j) = \| x_i - x_j \|_2`.
        * ``'manhattan'`` defines pairwise distances as
          :math:`d(v_i, v_j) = \| x_i - x_j \|_1`.
        * ``'minkowski'`` generalizes the above and defines distances as
          :math:`d(v_i, v_j) = \| x_i - x_j \|_p`
          where :math:`p` is the ``order`` of the norm.
        * ``'max_dist'`` defines pairwise distances as
          :math:`d(v_i, v_j) = \| x_i - x_j \|_\infty = \max_k |x_{ik} - x_{jk}|`,
          where the maximum is taken over the elements of the vector.

        More metrics may be supported for some backends.
        Please refer to the documentation of the chosen backend.
    kind : 'knn' or 'radius' (default 'knn')
    k : number of nearest neighbors if 'knn' is selected
    radius : radius of the search if 'radius' is selected

    order : float, optional
        The order of the Minkowski distance for ``metric='minkowski'``.
    backend : string, optional
        * ``'scipy-pdist'`` uses :func:`scipy.spatial.distance.pdist` to
          compute pairwise distances. The method is brute force and computes
          all distances. That is the slowest method.
        * ``'scipy-kdtree'`` uses :class:`scipy.spatial.KDTree`. The method
          builds a k-d tree to prune the number of pairwise distances it has
          to compute. That is an efficient strategy for low-dimensional spaces.
        * ``'scipy-ckdtree'`` uses :class:`scipy.spatial.cKDTree`. The same as
          ``'scipy-kdtree'`` but with C bindings, which should be faster.
          That is the default.
        * ``'flann'`` uses the `Fast Library for Approximate Nearest Neighbors
          (FLANN) <https://github.com/mariusmuja/flann>`_. That method is an
          approximation.
        * ``'nmslib'`` uses the `Non-Metric Space Library (NMSLIB)
          <https://github.com/nmslib/nmslib>`_. That method is an
          approximation. It should be the fastest in high-dimensional spaces.

        You can look at this `benchmark
        <https://github.com/erikbern/ann-benchmarks>`_ to get an idea of the
        relative performance of those backends. It's nonetheless wise to run
        some tests on your own data.
    '''
    if kind == 'knn':
        radius = None
    elif kind == 'radius':
        k = None
    else:
        raise ValueError('"kind" must be "knn" or "radius"')

    _orders = {
        'euclidean': 2,
        'manhattan': 1,
        'max_dist': np.inf,
        'minkowski': order,
    }
    order = _orders.pop(metric, None)
    try:
        function = globals()['_' + backend.replace('-', '_')]
    except KeyError:
        raise ValueError('Invalid backend "{}".'.format(backend))
    neighbors, distances = function(features, metric, order,
                                    kind, k, radius, kwargs)
    return neighbors, distances


def sparse_distance_matrix(neighbors, distances, symmetrize=True, safe=False,
                           kind=None, k=None):
    '''Build a sparse distance matrix from nearest neighbors.'''
    n_edges = [len(n) - 1 for n in neighbors]  # remove distance to self
    if safe and kind is None:
        raise ValueError('Please specify "kind" as "knn" or "radius" to use '
                         'the safe mode.')

    n_vertices = len(n_edges)
    if safe and kind == 'radius':
        n_disconnected = np.sum(np.asarray(n_edges) == 0)
        if n_disconnected > 0:
            _logger.warning('{} points (out of {}) have no neighbors. '
                            'Consider increasing the radius or setting '
                            'kind=knn.'.format(n_disconnected, n_vertices))

    value = np.empty(sum(n_edges), dtype=np.float)
    row = np.empty_like(value, dtype=np.int)
    col = np.empty_like(value, dtype=np.int)
    start = 0
    for vertex in range(n_vertices):
        if safe and kind == 'knn':
            assert n_edges[vertex] == k
        end = start + n_edges[vertex]
        value[start:end] = distances[vertex][1:]
        row[start:end] = np.full(n_edges[vertex], vertex)
        col[start:end] = neighbors[vertex][1:]
        start = end
    W = sparse.csr_matrix((value, (row, col)), (n_vertices, n_vertices))
    if symmetrize:
        # Enforce symmetry. May have been broken by k-NN. Checking symmetry
        # with np.abs(W - W.T).sum() is as costly as the symmetrization itself.
        W = utils.symmetrize(W, method='fill')
    return W
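
For illustration, a short usage sketch of the two helpers above (assumes the module is importable as pygsp._nearest_neighbor, as in this diff, and a NumPy version where np.float is still valid; not part of the diff):

import numpy as np
from pygsp._nearest_neighbor import nearest_neighbor, sparse_distance_matrix

rng = np.random.RandomState(42)
features = rng.uniform(size=(100, 3))

# 10-nearest-neighbor search with the default scipy-ckdtree backend.
neighbors, distances = nearest_neighbor(features, kind='knn', k=10)

# Assemble the symmetrized sparse distance matrix (self-distances are dropped).
W = sparse_distance_matrix(neighbors, distances, kind='knn', k=10, safe=True)
print(W.shape, W.nnz)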
4 changes: 2 additions & 2 deletions pygsp/filters/filter.py
@@ -255,7 +255,7 @@ def filter(self, s, method='chebyshev', order=30):
>>> _ = G.plot(s1, ax=axes[0])
>>> _ = G.plot(s2, ax=axes[1])
>>> print('{:.5f}'.format(np.linalg.norm(s1 - s2)))
0.26808
0.26995

Perfect reconstruction with Itersine, a tight frame:

@@ -448,7 +448,7 @@ def estimate_frame_bounds(self, x=None):
A=1.708, B=2.359
>>> A, B = g.estimate_frame_bounds(G.e)
>>> print('A={:.3f}, B={:.3f}'.format(A, B))
A=1.723, B=2.359
A=1.839, B=2.359

The frame bounds can be seen in the plot of the filter bank as the
minimum and maximum of their squared sum (the black curve):
2 changes: 1 addition & 1 deletion pygsp/graphs/fourier.py
@@ -85,7 +85,7 @@ def coherence(self):
>>> graph.compute_fourier_basis()
>>> minimum = 1 / np.sqrt(graph.n_vertices)
>>> print('{:.2f} in [{:.2f}, 1]'.format(graph.coherence, minimum))
0.75 in [0.12, 1]
0.87 in [0.12, 1]
>>>
>>> # Plot the most localized eigenvector.
>>> import matplotlib.pyplot as plt
4 changes: 1 addition & 3 deletions pygsp/graphs/nngraphs/bunny.py
@@ -34,7 +34,5 @@ def __init__(self, **kwargs):
'distance': 8,
}

super(Bunny, self).__init__(Xin=data['bunny'],
epsilon=0.02, NNtype='radius',
center=False, rescale=False,
super(Bunny, self).__init__(data['bunny'], kind='radius', radius=0.02,

Collaborator:
Do we want a radius graph by default for the bunny?

@mdeff (Collaborator, Author), Dec 18, 2020:
That comes from the Matlab version. It was, however, updated to a kNN graph in gspbox 0.7.0 (epfl-lts2/gspbox@6079047). Do you think we should do it here too? If yes, a small PR with justification is best (if you remember why it was radius before and kNN now).

plotting=plotting, **kwargs)
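
For reference, a hypothetical kNN variant of the call above, along the lines of the gspbox 0.7.0 change mentioned in the discussion (k=10 is an assumed value; not part of this PR):

from pygsp import graphs, utils

# Hypothetical alternative: a k-nearest-neighbor bunny instead of a radius one.
data = utils.loadmat('pointclouds/bunny')
G = graphs.NNGraph(data['bunny'], kind='knn', k=10)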