Alternatives to PCA, such as umap #27
Comments
Playing with UMAP currently! I have it working but it's pretty funky, needs small coefficients. Doesn't seem to be a huge improvement over PCA currently, but it's possible the way I'm doing it isn't ideal. Might include it experimentally in the upcoming release! (Generating a vector with UMAP is also ~30x slower than PCA currently.) |
Very interesting! Given the issues you describe with performance of training, there is a cuML GPU implementation of UMAP (and a lot of other dimensionality reduction algorithms which could be offered) - https://docs.rapids.ai/api/cuml/stable/api/#umap - certainly a larger dependency chain, but these days everyone's accepted NVIDIA's stack as being mandatory, so it might be good to make it optional at least. I think there is some tuning you can do with base UMAP's hyperparameters to improve speed and possibly the quality of the generated control vectors. A UMAP expert would be able to look over that and make sure it's set "correctly" given the data - unfortunately that is not me (and likely fewer than 100 of them exist in the world). As for why it requires smaller coefficients and why the performance may be hard to quantify as better - I'd love to see some analysis about this from others in the community, or even the UMAP creator himself (or at least one of the aforementioned 100). I'm extremely appreciative that you have implemented it yourself and tried it. Very happy to see such a rapid response and that it might even be made available to others. Thank you!!! |
|
Please feel free to use this issue to continue discussing umap and potential improvements! I'm not sure if the current method is the ideal usage of it. |
Thanks @vgel for all this. I don't have a GPU and will have little free time for quite a while, but I'm still very curious as to whether nonlinear dim reduction works "better". Here are a few thoughts:
Anyway, I won't have time for about 6-12 months but may do a PR eventually. If anyone's interested, please share your findings, especially negative results! |
Addendum to my thoughts above (I hope nobody will mind!):
|
Hello, I am back and have a tiny bit of free time to devote to exploring those ideas. I plan to document things in this fork. I saw that the owner of this repo already tried UMAP with some success. Can you share your remarks from testing it, in as much depth as your time allows, before I dive in myself? |
Btw, PaCMAP is an alternative to UMAP that does nonlinear dimension reduction, has somewhat fewer free parameters, appears much simpler to install and package, and can actually produce a 1-dimensional output (it was not initially possible, cf. this issue). The author will probably update the package soonish. |
I'm struggling to make UMAP work, can you tell me:
It's making it harder to investigate PaCMAP |
I am still interested in the answer :) I think I figured out that trying to preserve the scale of the train array helps a lot (see https://github.com/thiswillbeyourgithub/repeng/blob/c0722440ce5f67d8be112ebe7a2ff3fd8e97ae80/repeng/extract.py#L479 ). The same goes for applying a regularization norm inferred from the initial data. |
Btw something like that works great and is present in my fork |
Basically I do umap/pacmap in 3 dimensions to project the samples, then kmeans to find 2 clusters, then subtract the mean of each cluster from the samples of the other cluster, then apply pca_diff on the resulting data. It seems to work great. I can push the strength to like x5 and it stays coherent. Lots more things to try! Edit: also the directions are pretty much always orthogonal to what pca_diff would give, so it seems like there's a benefit to using umap/pacmap. |
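For anyone who wants to play with the idea, here is a rough sketch of how I read the steps above. It is not the code from the fork. To stay dependency-free it uses a plain NumPy PCA projection as a stand-in for umap/pacmap (the `reduce_fn` parameter is a hypothetical hook where a real reducer could be passed), a tiny hand-rolled 2-means instead of a library kmeans, and ordinary PCA on the shifted samples instead of repeng's pca_diff. It also assumes the cluster means are taken in the original space rather than in the 3-d projection. All function names are made up for illustration.

```python
import numpy as np


def pca_project(x, n_components):
    """Stand-in reducer (assumption): project centered rows onto the top principal axes."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def two_means(points, n_iter=50, seed=0):
    """Minimal Lloyd's k-means with k=2; returns a 0/1 label per row."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=2, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = points[labels == k].mean(axis=0)
    return labels


def cluster_diff_direction(hidden, reduce_fn=pca_project):
    """Project to 3-d, split into 2 clusters, cross-subtract cluster means, take top PC."""
    low = reduce_fn(hidden, 3)          # umap/pacmap would go here
    labels = two_means(low)
    mean0 = hidden[labels == 0].mean(axis=0)
    mean1 = hidden[labels == 1].mean(axis=0)
    shifted = hidden.copy()
    shifted[labels == 0] -= mean1       # each sample loses the *other* cluster's mean
    shifted[labels == 1] -= mean0
    centered = shifted - shifted.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0] / np.linalg.norm(vt[0])


# Toy data: two blobs separated along the first of 16 axes.
rng = np.random.default_rng(42)
offsets = np.zeros((80, 16))
offsets[:40, 0] = -3.0
offsets[40:, 0] = 3.0
hidden = offsets + 0.5 * rng.normal(size=(80, 16))
direction = cluster_diff_direction(hidden)
```

With a real reducer, something like `reduce_fn=lambda x, n: umap.UMAP(n_components=n).fit_transform(x)` should slot in, assuming umap-learn is installed. The sign of the returned direction is arbitrary, as with any PCA component.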
There's a large body of work on dimensionality reduction that handles nonlinearity better, e.g. UMAP: https://umap-learn.readthedocs.io/en/latest/
Is it simple to just "drop" this in place of PCA and get theoretically better results? If not, why?
What about other things, like NMF https://en.wikipedia.org/wiki/Non-negative_matrix_factorization ?