prototype support for n_components=1 #83

Merged

mathematicalmichael
Contributor

@mathematicalmichael mathematicalmichael commented Nov 26, 2024

see #82

$ uv run --with 'ipython, git+https://github.com/mathematicalmichael/PaCMAP@feature/unit-dimension' ipython
 Updated https://github.com/mathematicalmichael/PaCMAP (42a630b)
Installed 26 packages in 243ms
Python 3.11.9 (main, Aug 14 2024, 04:17:21) [Clang 18.1.8 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.29.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from pacmap import PaCMAP

In [2]: PaCMAP(n_components=1)
Warning: Defaults were chosen around dimension 2. Dimension 1 has not been tested.
Out[2]: PaCMAP(n_components=1, random_state=0)

Collaborator

@hyhuang00 hyhuang00 left a comment

Thank you for your update! Here are some suggestions on the current version of the pull request.

[2 resolved, outdated review comments on source/pacmap/pacmap.py]
@mathematicalmichael
Contributor Author

mathematicalmichael commented Nov 26, 2024

@hyhuang00 I noticed

    if not self.apply_pca:
        logger.warning(
            "Running ANNOY Indexing on high-dimensional data. Nearest-neighbor search may be slow!")

in my opinion this warning is a little misleading, as it has no knowledge of the input data. PCA may not be necessary on low-but-still-challenging-to-visualize dimensions, so if I instantiate this class with the intention of fitting it on dim=3 data, this warning would not be relevant.

would you mind if I moved this warning into fit, where the input shape is known?

proposal:

        # Preprocess the dataset
        n, dim = X.shape
        if (not self.apply_pca) and (dim > 20):
            logger.warning(
                "Running ANNOY Indexing on high-dimensional data. Nearest-neighbor search may be slow!")

maybe 20 is too high? thoughts? 3?
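A minimal sketch of the proposed rule as a standalone function, so the threshold can be tried out in isolation. The function name, logger name, and the `HIGH_DIM_THRESHOLD` constant are hypothetical and not part of PaCMAP; the cutoff of 20 is just the value floated above, not a tested recommendation.

```python
import logging

logger = logging.getLogger("pacmap_sketch")  # hypothetical logger name

HIGH_DIM_THRESHOLD = 20  # hypothetical cutoff; the right value is the open question


def warn_if_slow_search(n_features: int, apply_pca: bool,
                        threshold: int = HIGH_DIM_THRESHOLD) -> bool:
    """Warn only when PCA is off AND the data is actually high-dimensional.

    Returns True if a warning was emitted, which makes the rule easy to test.
    """
    if (not apply_pca) and (n_features > threshold):
        logger.warning(
            "Running ANNOY Indexing on high-dimensional data. "
            "Nearest-neighbor search may be slow!"
        )
        return True
    return False
```

With this rule, fitting on dim=3 data with PCA disabled stays silent, while dim=100 data with PCA disabled still triggers the warning.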

@thiswillbeyourgithub

Hi, maybe this page contains helpful examples to test the quality of 1dim reductions.

@mathematicalmichael
Contributor Author

mathematicalmichael commented Nov 26, 2024

Hi, maybe this page contains helpful examples to test the quality of 1dim reductions.

The datasets on that page don't quite have a natural 1D embedding like they have a 2D clustering.
However - the S curve dataset arguably has a natural 1D representation.

Have you taken a look at my examples in #82? That's been a very good test-case for me because it admits several valid solutions in its equivalence class (one solution is orthogonal projection onto the diagonal of the RGB cube to sort by value, so it's interesting to see which hyperparams incentivize that one vs a hue or saturation-based ordering).
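For reference, the "projection onto the diagonal" solution can be sketched in a few lines. This is an illustration only: the sample colors are made up, and "value" here means the coordinate along the black-to-white diagonal (proportional to the channel mean), which is not the same thing as the V channel in HSV.

```python
import math


def value_along_diagonal(rgb):
    """Scalar coordinate of an RGB point projected orthogonally onto the
    black-to-white diagonal of the unit cube (direction (1, 1, 1) / sqrt(3))."""
    r, g, b = rgb
    return (r + g + b) / math.sqrt(3)


# Hypothetical sample colors, channels in [0, 1].
colors = {
    "yellow": (1.0, 1.0, 0.0),
    "black": (0.0, 0.0, 0.0),
    "white": (1.0, 1.0, 1.0),
    "red": (1.0, 0.0, 0.0),
}

# Sorting by the projected coordinate gives one valid 1D ordering of the colors.
ordered = sorted(colors, key=lambda name: value_along_diagonal(colors[name]))
print(ordered)  # ['black', 'red', 'yellow', 'white']
```

A hue-based ordering of the same colors would look quite different, which is what makes the comparison across hyperparameters interesting.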

@thiswillbeyourgithub

The datasets on that page don't quite have a natural 1D embedding like they have a 2D clustering. However - the S curve dataset arguably has a natural 1D representation.

What I was thinking is that we can at least roughly see if we end up with 1 blue chunk and 1 yellow chunk.
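One rough way to turn the "1 blue chunk and 1 yellow chunk" check into a number, as a sketch rather than an established metric: sort the points by their 1D coordinate and count how many contiguous same-label runs appear. The function name is hypothetical.

```python
def count_label_runs(embedding, labels):
    """Sort points by their 1D coordinate and count contiguous same-label runs.

    Two well-separated classes give 2 runs; heavy interleaving gives many more.
    """
    order = sorted(range(len(embedding)), key=lambda i: embedding[i])
    runs = 0
    previous = object()  # sentinel that equals no real label
    for i in order:
        if labels[i] != previous:
            runs += 1
            previous = labels[i]
    return runs


# Separated classes: one chunk per label.
print(count_label_runs([0.1, 0.9, 0.2, 0.8], [0, 1, 0, 1]))  # 2
# Fully interleaved classes: every point starts a new run.
print(count_label_runs([0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1]))  # 4
```

For an embedding returned as an (n, 1) array, passing the flattened first column (for example `embedding[:, 0]`) should work. Noisy data will rarely hit exactly 2 runs, so the count is better read relative to the number of points.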

Have you taken a look at my examples in #82? That's been a very good test-case for me because it admits several valid solutions in its equivalence class (one solution is orthogonal projection onto the diagonal of the RGB cube to sort by value, so it's interesting to see which hyperparams incentivize that one vs a hue or saturation-based ordering).

Unfortunately I don't have the skills to have anything of value to add. But I'm thinking that there can't be a single test. What I'm after is just seeing if PaCMAP completely breaks down if n_components is 1 or if it's at least somewhat usable, in which case I should include it in my exploratory project.

Given the myriad of ways high dimensional data can be structured, the metrics at play, etc, if pacmap does not break down then it's got to be state of the art on something, right? I'm mostly looking for an alternative to umap for exploratory stuff currently.

@mathematicalmichael
Contributor Author

mathematicalmichael commented Nov 26, 2024

@thiswillbeyourgithub I've only been playing with it a day, but my initial impression is that PaCMAP works reasonably well on 1D reductions, but you'd be wise to adjust the params. Also: wow, repeng is very cool. thank you for sharing - I'm very much interested in this topic as well (albeit for separate use-cases)

but just for kicks, here's pacmap defaults on noisy_circles from this link

image

import numpy as np
from sklearn import datasets
from pacmap import PaCMAP
import matplotlib.pyplot as plt

n_samples = 1500
noisy_circles = datasets.make_circles(
    n_samples=n_samples, factor=0.5, noise=0.05, random_state=170
)
noisy_moons = datasets.make_moons(n_samples=n_samples, noise=0.05, random_state=170)

reducer = PaCMAP(n_components=1, n_neighbors=20, random_state=21, save_tree=False)

# embed to 1D, then plot along the x-axis (y fixed at 0), colored by class label
circles_pacmap = reducer.fit_transform(noisy_circles[0])
plt.scatter(circles_pacmap, np.zeros_like(circles_pacmap), c=noisy_circles[1])
plt.show()

and noisy_moons:
image

moons_pacmap = reducer.fit_transform(noisy_moons[0])
plt.scatter(moons_pacmap, np.zeros_like(moons_pacmap), c=noisy_moons[1])
plt.show()

in my opinion - when evaluating these 1D embeddings, think of them as points around a circle instead of a line.
the circles example, rescaled to [0, 2pi]:
image
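A sketch of the rescaling step, assuming a plain min-max mapping to [0, 2pi] (the exact rescaling used for the plot above is not shown in the thread, so this is a guess at the idea, and `to_circle` is a hypothetical helper):

```python
import math


def to_circle(values):
    """Rescale 1D coordinates to angles in [0, 2*pi] and place them on the unit circle."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Degenerate embedding: every point gets angle 0.
        angles = [0.0 for _ in values]
    else:
        angles = [2 * math.pi * (v - lo) / span for v in values]
    return [(math.cos(a), math.sin(a)) for a in angles]


# Four evenly spaced coordinates map to angles 0, 2pi/3, 4pi/3, and 2pi.
points = to_circle([0.0, 1.0, 2.0, 3.0])
```

Note that with this mapping the smallest and largest coordinates land on the same point of the circle, so the two ends of the line are treated as neighbors.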

SETUP: m1 mac, ephemeral virtualenv to launch ipython against my fork

uv run --with 'ipython, git+https://github.com/mathematicalmichael/PaCMAP@feature/unit-dimension, scikit-learn, matplotlib' ipython

@thiswillbeyourgithub

Just for clarification: I'm not the dev behind repeng, you can see my thoughts in this thread from this message which led me to creating my own fork for testing a few things. Here's the fork.

Thanks a lot for taking the time to test on those datasets. I think it clearly shows that PaCMAP does not fully break down at n_dim=1. I'm saying it that way because I don't know if some mathematical guarantees are violated by setting the dim to 1.

@mathematicalmichael
Contributor Author

Just for clarification: I'm not the dev behind repeng, you can see my thoughts in this thread from this message which led me to creating my own fork for testing a few things. Here's the fork.

Thanks a lot for taking the time to test on those datasets. I think it clearly shows that PaCMAP does not fully break down at n_dim=1. I'm saying it that way because I don't know if some mathematical guarantees are violated by setting the dim to 1.

ah, thanks for the clarification. that's a clever suggestion in your issue - and now I understand why you'd like the 1D representation. IMO there's a few nuances to swapping out umap for pacmap (consider hardcoding the params to override defaults), but directionally I agree with the approach of using it for better embedding control. of course, some research should be done on embedding quality - I'm hoping to noodle with that and write up something.

I'll have a look soon at the implementation details for repeng and your fork thereof.

@hyhuang00 regarding this PR: i'll lift the WIP from the title, as I think the "experimental support" is ready.

going back to warning, may revisit logging levels in the future as a separate PR
@mathematicalmichael mathematicalmichael changed the title from "WIP: prototype support for n_dimension=1" to "prototype support for n_components=1" Nov 26, 2024
Collaborator

@hyhuang00 hyhuang00 left a comment

LGTM

@hyhuang00 hyhuang00 merged commit b44a01f into YingfanWang:master Dec 6, 2024
@hyhuang00
Collaborator

Thank you again for your effort. While this fix has been incorporated into the main branch, I think we will need to wait for the discussion under #84 to finish before I publish a formal release.

@mathematicalmichael
Contributor Author

mathematicalmichael commented Dec 6, 2024

perhaps. could just be user error though. maybe throw a ValueError with a helpful message when assumptions are not satisfied?

@thiswillbeyourgithub

perhaps. could just be user error though. maybe throw a ValueError with a helpful message when assumptions are not satisfied?

A possibility might be to add an env variable that bypasses the new assumption check if the user really wants to? This could be explained to the user in the ValueError message.
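Roughly what that could look like, as a sketch only: the environment variable name `PACMAP_SKIP_CHECKS` and the `check_assumption` helper are hypothetical, and which assumption actually needs checking is still to be settled in #84.

```python
import os

SKIP_CHECKS_ENV = "PACMAP_SKIP_CHECKS"  # hypothetical variable name


def check_assumption(condition: bool, description: str) -> None:
    """Raise a ValueError unless `condition` holds or the bypass is set."""
    if condition:
        return
    if os.environ.get(SKIP_CHECKS_ENV) == "1":
        return  # user explicitly opted out of the check
    raise ValueError(
        f"Input assumption not satisfied: {description}. "
        f"Set {SKIP_CHECKS_ENV}=1 to bypass this check at your own risk."
    )
```

The error message names the bypass variable directly, so a user who hits the check learns about the escape hatch without having to read the docs.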

In any case thank you all very much and I'll leave it to you to decide whether to close #82 or not.

@mathematicalmichael you might be interested in my repeng PR IIRC you were working on something related.

@mathematicalmichael mathematicalmichael deleted the feature/unit-dimension branch December 8, 2024 05:18
3 participants