prototype support for n_components=1
#83
Conversation
Thank you for your update! Here are some of my suggestions w.r.t. the current version of the pull request.
Force-pushed from cda55ae to 18bb8ad
@hyhuang00 I noticed

```python
if not self.apply_pca:
    logger.warning(
        "Running ANNOY Indexing on high-dimensional data. Nearest-neighbor search may be slow!")
```

In my opinion this warning is a little misleading, as it has no knowledge of the input data. Would you mind if I modified it to the following?

```python
# Preprocess the dataset
n, dim = X.shape
if (not self.apply_pca) and (dim > 20):
    logger.warning(
        "Running ANNOY Indexing on high-dimensional data. Nearest-neighbor search may be slow!")
```

Maybe 20 is too high? Thoughts?
Hi, maybe this page contains helpful examples to test the quality of 1-dim reductions.
The datasets on that page don't quite have a natural 1D embedding like they have a 2D clustering. Have you taken a look at my examples in #82? That's been a very good test-case for me because it admits several valid solutions in its equivalence class (one solution is orthogonal projection onto the diagonal of the RGB cube to sort by value, so it's interesting to see which hyperparams incentivize that one vs a hue or saturation-based ordering).
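For reference, the diagonal-projection baseline mentioned above can be computed directly: projecting an RGB point onto the cube's main diagonal reduces to its summed channel intensity, so sorting by that coordinate sorts by brightness/value. A minimal sketch in plain Python (the color list is a made-up example, not data from #82):

```python
import math

def project_onto_diagonal(rgb):
    """Orthogonal projection of an RGB point onto the cube's main
    diagonal (the gray axis). The scalar coordinate along the unit
    diagonal (1,1,1)/sqrt(3) is (r + g + b) / sqrt(3)."""
    r, g, b = rgb
    return (r + g + b) / math.sqrt(3)

colors = {
    "black": (0, 0, 0),
    "red": (255, 0, 0),
    "yellow": (255, 255, 0),
    "white": (255, 255, 255),
}

# Sorting by the projected coordinate orders colors by brightness.
ordering = sorted(colors, key=lambda name: project_onto_diagonal(colors[name]))
```

Here `ordering` comes out as black, red, yellow, white: exactly the sort-by-value solution in the equivalence class.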
What I was thinking is that we can at least roughly see if we end up with 1 blue chunk and 1 yellow chunk.
Unfortunately I don't have the skills to have anything of value to add. But I'm thinking that there can't be a single test. What I'm after is just seeing if PaCMAP completely breaks down if n_components is 1 or if it's at least somewhat usable, in which case I should include it in my exploratory project. Given the myriad of ways high dimensional data can be structured, the metrics at play, etc, if pacmap does not break down then it's got to be state of the art on something, right? I'm mostly looking for an alternative to umap for exploratory stuff currently.
@thiswillbeyourgithub I've only been playing with it a day, but my initial impression is that PaCMAP works reasonably well on 1D reductions, but you'd be wise to adjust the params. Also: wow. Just for kicks, here's pacmap with defaults on sklearn's toy clustering datasets:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets

from pacmap import PaCMAP

n_samples = 1500
noisy_circles = datasets.make_circles(
    n_samples=n_samples, factor=0.5, noise=0.05, random_state=170
)
noisy_moons = datasets.make_moons(n_samples=n_samples, noise=0.05, random_state=170)

reducer = PaCMAP(n_components=1, n_neighbors=20, random_state=21, save_tree=False)

circles_pacmap = reducer.fit_transform(noisy_circles[0])
plt.scatter(circles_pacmap, np.zeros_like(circles_pacmap), c=noisy_circles[1])
plt.show()

moons_pacmap = reducer.fit_transform(noisy_moons[0])
plt.scatter(moons_pacmap, np.zeros_like(moons_pacmap), c=noisy_moons[1])
plt.show()
```

In my opinion, when evaluating these 1D embeddings, think of them as points around a circle instead of a line.

SETUP: m1 mac, ephemeral virtualenv launched with

```shell
uv run --with 'ipython, git+https://github.com/mathematicalmichael/PaCMAP@feature/unit-dimension, scikit-learn, matplotlib' ipython
```
Just for clarification: I'm not the dev behind repeng; you can see my thoughts in this thread from this message, which led me to creating my own fork for testing a few things. Here's the fork. Thanks a lot for taking the time to test on those datasets. I think it clearly shows that PaCMAP does not fully break down at n_dim=1. I'm saying it that way because I don't know if some mathematical guarantees are violated by setting the dim to 1.
Ah, thanks for the clarification. That's a clever suggestion in your issue, and now I understand why you'd like the 1D representation. IMO there are a few nuances to swapping out umap for pacmap (consider hardcoding the params to override defaults), but directionally I agree with the approach of using it for better embedding control. Of course, some research should be done on embedding quality; I'm hoping to noodle with that and write up something. I'll have a look soon at the implementation details. @hyhuang00, regarding this PR: I'll lift the WIP from the title, as I think the "experimental support" is ready.
Going back to the warning: we may revisit logging levels in the future as a separate PR.
n_dimension=1
n_components=1
LGTM
Thank you again for your effort. This fix has been incorporated into the main branch, but I think we will need to wait for the discussion under #84 to finish before I publish a formal release.
Perhaps. Could just be user error though. Maybe throw a ValueError with a helpful message when assumptions are not satisfied?
A possibility might be to add an env variable that bypasses the new assumption check if the user really wants to? This could be explained to the user in the ValueError message. In any case, thank you all very much, and I'll leave it to you to decide whether to close #82 or not. @mathematicalmichael you might be interested in my repeng PR; IIRC you were working on something related.
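The ValueError-plus-env-variable idea could look something like this. A hedged sketch only: the function name `check_n_components`, the variable name `PACMAP_ALLOW_LOW_DIM`, and the cutoff are all invented here, not actual PaCMAP behavior:

```python
import os

def check_n_components(n_components, tested_min=2):
    """Hypothetical sanity check: reject n_components below the tested
    range unless the user explicitly opts in via an environment
    variable. PACMAP_ALLOW_LOW_DIM is an invented name, not a real
    PaCMAP flag; the ValueError message tells the user how to bypass."""
    if n_components >= tested_min:
        return
    if os.environ.get("PACMAP_ALLOW_LOW_DIM") == "1":
        return  # user explicitly accepted the risk
    raise ValueError(
        f"n_components={n_components} is below the tested minimum "
        f"({tested_min}). Results may be unreliable; set "
        "PACMAP_ALLOW_LOW_DIM=1 to bypass this check."
    )
```

This keeps the default path strict while still giving experimenters an escape hatch that is documented in the error message itself.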
see #82