prototype support for `n_components=1` #83

mathematicalmichael · 2024-11-26T01:12:25Z

$ uv run --with 'ipython, git+https://github.com/mathematicalmichael/PaCMAP@feature/unit-dimension' ipython
 Updated https://github.com/mathematicalmichael/PaCMAP (42a630b)
Installed 26 packages in 243ms

Python 3.11.9 (main, Aug 14 2024, 04:17:21) [Clang 18.1.8 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.29.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from pacmap import PaCMAP

In [2]: PaCMAP(n_components=1)
Warning: Defaults were chosen around dimension 2. Dimension 1 has not been tested.
Out[2]: PaCMAP(n_components=1, random_state=0)

hyhuang00

Thank you for your update! Here are some suggestions regarding the current version of the pull request.

source/pacmap/pacmap.py

mathematicalmichael · 2024-11-26T04:57:54Z

@hyhuang00 I noticed

    if not self.apply_pca:
        logger.warning(
            "Running ANNOY Indexing on high-dimensional data. Nearest-neighbor search may be slow!")

in my opinion this warning is a little misleading, as it has no knowledge of the input data. PCA may not be necessary on low-but-still-challenging-to-visualize dimensions, so if I instantiate this class with the intention of fitting it on dim=3 data, this warning would not be relevant.

would you mind if I moved this check into `fit`, where the input shape is known?

proposal:

        # Preprocess the dataset
        n, dim = X.shape
        if (not self.apply_pca) and (dim > 20):
            logger.warning(
                "Running ANNOY Indexing on high-dimensional data. Nearest-neighbor search may be slow!")

maybe 20 is too high? thoughts? 3?
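The proposed rule can be tried in isolation before touching `fit`. The following is a minimal sketch, not the code in `pacmap.py`: `warn_if_slow_neighbor_search` and `HIGH_DIM_THRESHOLD` are hypothetical names, and the value 20 is only the number floated above.

```python
import logging

logger = logging.getLogger("pacmap_sketch")

# Hypothetical threshold taken from the proposal above; the right value is an open question.
HIGH_DIM_THRESHOLD = 20


def warn_if_slow_neighbor_search(n_features: int, apply_pca: bool,
                                 threshold: int = HIGH_DIM_THRESHOLD) -> bool:
    """Warn only when PCA is skipped AND the data is actually high-dimensional.

    Returns True if a warning was emitted, which makes the rule easy to test.
    """
    if (not apply_pca) and (n_features > threshold):
        logger.warning(
            "Running ANNOY Indexing on high-dimensional data. "
            "Nearest-neighbor search may be slow!"
        )
        return True
    return False
```

With this rule, fitting on dim=3 data with `apply_pca=False` stays silent, while dim=100 data still triggers the warning.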

thiswillbeyourgithub · 2024-11-26T06:20:23Z

Hi, maybe this page contains helpful examples to test the quality of 1dim reductions.

mathematicalmichael · 2024-11-26T06:34:28Z

> Hi, maybe this page contains helpful examples to test the quality of 1dim reductions.

The datasets on that page don't quite have a natural 1D embedding like they have a 2D clustering.
However - the S curve dataset arguably has a natural 1D representation.

Have you taken a look at my examples in #82? That's been a very good test-case for me because it admits several valid solutions in its equivalence class (one solution is orthogonal projection onto the diagonal of the RGB cube to sort by value, so it's interesting to see which hyperparams incentivize that one vs a hue or saturation-based ordering).
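For reference, the diagonal-projection solution mentioned above can be written down directly. This is a small illustrative sketch, not code from #82; `value_along_diagonal` is a hypothetical helper, and RGB channels are assumed to lie in [0, 1].

```python
import math


def value_along_diagonal(rgb):
    """Scalar coordinate of an RGB point along the unit diagonal (1, 1, 1) / sqrt(3)."""
    r, g, b = rgb
    # Dot product with the unit vector along the grey axis of the RGB cube.
    return (r + g + b) / math.sqrt(3.0)


# Sorting by this coordinate orders colors from dark to light.
colors = [(1.0, 1.0, 1.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
ordered = sorted(colors, key=value_along_diagonal)
```

The coordinate is proportional to the mean of the three channels, so any 1D embedding that recovers this ordering is effectively sorting by brightness rather than by hue or saturation.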

thiswillbeyourgithub · 2024-11-26T07:06:18Z

> The datasets on that page don't quite have a natural 1D embedding like they have a 2D clustering. However - the S curve dataset arguably has a natural 1D representation.

What I was thinking is that we can at least roughly see if we end up with 1 blue chunk and 1 yellow chunk.

> Have you taken a look at my examples in #82? That's been a very good test-case for me because it admits several valid solutions in its equivalence class (one solution is orthogonal projection onto the diagonal of the RGB cube to sort by value, so it's interesting to see which hyperparams incentivize that one vs a hue or saturation-based ordering).

Unfortunately I don't have the skills to have anything of value to add. But I'm thinking that there can't be a single test. What I'm after is just seeing if PaCMAP completely breaks down if n_components is 1 or if it's at least somewhat usable, in which case I should include it in my exploratory project.

Given the myriad of ways high dimensional data can be structured, the metrics at play, etc, if pacmap does not break down then it's got to be state of the art on something, right? I'm mostly looking for an alternative to umap for exploratory stuff currently.
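The "one blue chunk and one yellow chunk" check can be made roughly quantitative by sorting points along the 1D embedding and counting how many times the label changes. This is a minimal sketch of that idea, not an established metric; `count_label_runs` is a hypothetical helper and expects flat sequences (call `.ravel()` on an `(n, 1)` array first).

```python
def count_label_runs(embedding, labels):
    """Count contiguous same-label runs after sorting points by their 1D coordinate."""
    order = sorted(range(len(embedding)), key=lambda i: embedding[i])
    sorted_labels = [labels[i] for i in order]
    if not sorted_labels:
        return 0
    runs = 1
    for prev, cur in zip(sorted_labels, sorted_labels[1:]):
        if cur != prev:
            runs += 1
    return runs


# Two classes that separate cleanly give 2 runs; interleaved classes give many.
clean = count_label_runs([0.1, 0.2, 0.3, 2.0, 2.1], [0, 0, 0, 1, 1])
mixed = count_label_runs([0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1])
```

A run count close to the number of classes suggests the embedding kept each class in one chunk; a count close to the number of points suggests it did not.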

mathematicalmichael · 2024-11-26T17:59:56Z

@thiswillbeyourgithub I've only been playing with it for a day, but my initial impression is that PaCMAP works reasonably well on 1D reductions, though you'd be wise to adjust the params. Also: wow, repeng is very cool. thank you for sharing - I'm very much interested in this topic as well (albeit for separate use-cases)

but just for kicks, here's pacmap defaults on noisy_circles from this link

import matplotlib.pyplot as plt
import numpy as np  # needed for np.zeros_like below
from pacmap import PaCMAP
from sklearn import datasets

# Two toy datasets; each is a tuple of (points, labels)
n_samples = 1500
noisy_circles = datasets.make_circles(
    n_samples=n_samples, factor=0.5, noise=0.05, random_state=170
)
noisy_moons = datasets.make_moons(n_samples=n_samples, noise=0.05, random_state=170)

reducer = PaCMAP(n_components=1, n_neighbors=20, random_state=21, save_tree=False)

# Embed into 1D and plot along a horizontal line, colored by the true label
circles_pacmap = reducer.fit_transform(noisy_circles[0])
plt.scatter(circles_pacmap, np.zeros_like(circles_pacmap), c=noisy_circles[1])
plt.show()

and noisy_moons:

moons_pacmap = reducer.fit_transform(noisy_moons[0])
plt.scatter(moons_pacmap, np.zeros_like(moons_pacmap), c=noisy_moons[1])
plt.show()

in my opinion - when evaluating these 1D embeddings, think of them as points around a circle instead of a line.
the circles example, rescaled to [0, 2pi]:

SETUP: m1 mac, ephemeral virtualenv to launch ipython against my fork

uv run --with 'ipython, git+https://github.com/mathematicalmichael/PaCMAP@feature/unit-dimension, scikit-learn, matplotlib' ipython
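The circular reading suggested above (treat the 1D coordinate as an angle) can be sketched without any plotting library. This is a minimal illustration, assuming a plain min-max rescale to [0, 2π]; `to_unit_circle` is a hypothetical helper and not part of PaCMAP.

```python
import math


def to_unit_circle(values):
    """Min-max rescale 1D values to angles in [0, 2*pi] and return (x, y) points."""
    lo, hi = min(values), max(values)
    span = hi - lo
    points = []
    for v in values:
        # Degenerate case: all values equal, so place every point at angle 0.
        theta = 0.0 if span == 0 else 2.0 * math.pi * (v - lo) / span
        points.append((math.cos(theta), math.sin(theta)))
    return points


# The smallest and largest values land on the same point of the circle.
pts = to_unit_circle([0.0, 1.0, 2.0, 4.0])
```

Because the two ends of the line meet, a class that is split between the far left and the far right of the 1D embedding shows up as a single arc in this view.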

thiswillbeyourgithub · 2024-11-26T18:26:21Z

Just for clarification: I'm not the dev behind repeng. You can see my thoughts in this thread, starting from this message, which led me to create my own fork for testing a few things. Here's the fork.

Thanks a lot for taking the time to test on those datasets. I think it clearly shows that PaCMAP does not fully break down at n_components=1. I'm saying it that way because I don't know if some mathematical guarantees are violated by setting the dimension to 1.

mathematicalmichael · 2024-11-26T18:55:54Z

> Just for clarification: I'm not the dev behind repeng. You can see my thoughts in this thread, starting from this message, which led me to create my own fork for testing a few things. Here's the fork.
>
> Thanks a lot for taking the time to test on those datasets. I think it clearly shows that PaCMAP does not fully break down at n_components=1. I'm saying it that way because I don't know if some mathematical guarantees are violated by setting the dimension to 1.

ah, thanks for the clarification. that's a clever suggestion in your issue - and now I understand why you'd like the 1D representation. IMO there's a few nuances to swapping out umap for pacmap (consider hardcoding the params to override defaults), but directionally I agree with the approach of using it for better embedding control. of course, some research should be done on embedding quality - I'm hoping to noodle with that and write up something.

I'll have a look soon at the implementation details for repeng and your fork thereof.

@hyhuang00 regarding this PR: i'll lift the WIP from the title, as I think the "experimental support" is ready.

going back to warning, may revisit logging levels in the future as a separate PR

hyhuang00

LGTM

hyhuang00 · 2024-12-06T14:33:35Z

Thank you again for your effort. While this fix has been incorporated into the main branch, I think we will need to wait for the discussion under #84 to finish before I publish a formal release.

mathematicalmichael · 2024-12-06T15:39:42Z

perhaps. could just be user error though. maybe throw a ValueError with a helpful message when assumptions are not satisfied?

thiswillbeyourgithub · 2024-12-06T16:02:03Z

> perhaps. could just be user error though. maybe throw a ValueError with a helpful message when assumptions are not satisfied?

A possibility might be to add an env variable that bypasses the new assumption check if the user really wants to? This could be explained to the user in the ValueError message.
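The two suggestions above (a `ValueError` with a helpful message, plus an environment variable to bypass it) could be combined roughly as follows. This is only a sketch under stated assumptions: the environment variable name is hypothetical, and the condition checked here (more samples than neighbors) is a placeholder, not the assumption under discussion in #84.

```python
import os

# Hypothetical environment variable name, used only for this sketch.
BYPASS_ENV_VAR = "PACMAP_SKIP_ASSUMPTION_CHECKS"


def check_assumptions(n_samples: int, n_neighbors: int) -> None:
    """Raise a ValueError unless the placeholder assumption holds or the bypass is set."""
    if os.environ.get(BYPASS_ENV_VAR) == "1":
        return  # user explicitly opted out of the check
    # Placeholder assumption for illustration: more samples than neighbors.
    if n_samples <= n_neighbors:
        raise ValueError(
            f"n_samples={n_samples} must be greater than n_neighbors={n_neighbors}. "
            f"Set {BYPASS_ENV_VAR}=1 to skip this check at your own risk."
        )
```

Naming the bypass variable inside the error message means the user learns about the escape hatch exactly when they hit the check.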

In any case, thank you all very much. I'll leave it to you to decide whether to close #82 or not.

@mathematicalmichael you might be interested in my repeng PR; IIRC you were working on something related.

Update pacmap.py

42a630b

hyhuang00 requested changes Nov 26, 2024

address comments

18bb8ad

mathematicalmichael force-pushed the feature/unit-dimension branch from cda55ae to 18bb8ad Compare November 26, 2024 04:51

thiswillbeyourgithub mentioned this pull request Nov 26, 2024

Alternatives to PCA, such as umap vgel/repeng#27

Open

debug -> warning

73ee1d9

going back to warning, may revisit logging levels in the future as a separate PR

mathematicalmichael changed the title ~~WIP: prototype support for n_dimension=1~~ prototype support for n_components=1 Nov 26, 2024

mathematicalmichael requested a review from hyhuang00 December 3, 2024 02:28

hyhuang00 approved these changes Dec 6, 2024

hyhuang00 merged commit b44a01f into YingfanWang:master Dec 6, 2024

mathematicalmichael deleted the feature/unit-dimension branch December 8, 2024 05:18
