Automatically reset PCs based on var size #76

Closed
wants to merge 5 commits into from

Conversation

hewgreen
Collaborator

This is required for imaging omics datasets, where the number of genes (vars) is likely to be very low. Requesting more PCs than there are genes would make the PCA step fail.

@pinin4fjords pinin4fjords requested a review from nh3 February 21, 2020 11:02
@nh3
Collaborator

nh3 commented Feb 21, 2020

use_pc is not a command line option for pca(); I guess you meant n_comps?
There's also n_pcs for neighbors(). They do different things.

BTW, sc.pp.neighbors() does fail with an informative message if the supplied n_pcs exceeds the dimension of X_pca.

@pinin4fjords
Member

Thanks for reviewing, @nh3. We just don't want the PCA tool to fail when we happen to supply a matrix that invalidates the hard-coded number of PCs. Analogous to the variable genes fix the other day, it should just knock the number of PCs down to the maximum allowed value (perhaps with a warning).
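The "knock it down with a warning" behaviour could look roughly like the sketch below. It is only an illustration: `clamp_n_comps` and `max_allowed` are hypothetical names, not part of scanpy or scanpy-scripts, and how the maximum should be computed is left to the caller.

```python
import warnings


def clamp_n_comps(requested: int, max_allowed: int) -> int:
    """Return `requested`, or `max_allowed` (with a warning) if it is too large.

    Hypothetical helper: `max_allowed` is whatever upper bound the caller
    decides the PCA implementation can accept for the current matrix.
    """
    if max_allowed < 1:
        raise ValueError("max_allowed must be at least 1")
    if requested > max_allowed:
        # Warn rather than fail, so an unsupervised workflow keeps running.
        warnings.warn(
            f"n_comps={requested} exceeds the maximum of {max_allowed}; "
            f"using {max_allowed} instead."
        )
        return max_allowed
    return requested
```

With this helper, a request for 50 components against a bound of 32 returns 32 and emits a warning, while a request that is already within the bound is returned unchanged.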

@hewgreen
Collaborator Author

Sounds good to me. Should I swap to 'n_comps'?

@pinin4fjords
Member

> Sounds good to me. Should I swap to 'n_comps'?

Yep- see API doc for the function being called here: https://icb-scanpy.readthedocs-hosted.com/en/stable/api/scanpy.tl.pca.html

@hewgreen
Collaborator Author

I was looking here

I wondered why this was different to scanpy.

@pinin4fjords
Member

pinin4fjords commented Feb 21, 2020

> I was looking here
>
> I wondered why this was different to scanpy.

'n_comps': click.option(

(sorry for earlier incorrect response). It is confusing, but PC-related arguments are used in a number of different places in Scanpy, not always for the same thing. You just picked the wrong one is all.

@pinin4fjords
Member

Is this acceptable, @nh3? As an example, Matt had a matrix with 33 genes. The default of 50 causes a nasty error because 50 >= floor(30/2). It seems more graceful to reduce to the maximum possible.

@nh3
Collaborator

nh3 commented Feb 21, 2020

I think the method of modifying n_comps is fine. But I don't quite understand why it reduces to half as many dimensions rather than to n_var. Is that some kind of heuristic?

When use_rep is not specified and n_var < 50, the next step, neighbors(), automatically uses the expression matrix rather than the PCs; if use_rep is set to use the PCA, n_pcs can be used to reduce the number of PCs used. So I don't see the necessity of applying a heuristic here.

Also, reducing to the largest number of PCs mathematically allowed seems closer to what users would expect than reducing to some not-so-well-known fraction.

Just my two cents.

@pinin4fjords
Member

Thanks @nh3 - yes, that appears to be the heuristic required (not my own invention, honest!). The 33 genes in the above example required no more than 15 PCs; it really isn't n_var that's required as far as I can see.

The heuristic is needed because the PCA tool will die on small matrices if the default of 50 is left in place, killing the workflow, and we can't run the workflow unsupervised. That just doesn't seem like graceful behaviour, and it's nice to be able to say "50 when you can, otherwise as many as you can".

I actually experienced an issue with the neighbours step: it doesn't like the sparse matrices that come from using the expression matrix (that was going to be the subject of a future PR), so we're forcing it to use the PCs for now.

@nh3
Collaborator

nh3 commented Feb 21, 2020

Have you tried more datasets and other values of svd_solver? The heuristic sounds quite strange to me. scanpy actually handles the situation when n_comps > n_var by reducing n_comps to n_var - 1. If that still fails, I suggest filing an issue with those guys.

https://github.com/theislab/scanpy/blob/85acb6c8949d43d08a26437dceab4fa5db79e246/scanpy/preprocessing/_simple.py#L451
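For comparison, the two capping rules discussed in this thread can be written side by side. This is a sketch only: both function names are made up for illustration, the first follows the behaviour described in the comment above (reduce to n_var - 1 when n_comps > n_var), and the second follows the floor(n/2) expression quoted earlier in the conversation. Neither is copied from scanpy or from this PR.

```python
def cap_to_n_var_minus_one(n_comps: int, n_var: int) -> int:
    """Rule as described above: if n_comps > n_var, use n_var - 1."""
    return n_var - 1 if n_comps > n_var else n_comps


def cap_to_half_n_var(n_comps: int, n_var: int) -> int:
    """Heuristic from earlier in the thread: never exceed floor(n_var / 2)."""
    return min(n_comps, n_var // 2)


if __name__ == "__main__":
    default_n_comps = 50  # the default number of PCs mentioned in the thread
    for n_var in (33, 30, 2000):
        print(
            n_var,
            cap_to_n_var_minus_one(default_n_comps, n_var),
            cap_to_half_n_var(default_n_comps, n_var),
        )
```

For a default of 50 components, a 33-gene matrix gives 32 under the first rule and 16 under the second; a 30-gene matrix gives 29 and 15; a 2000-gene matrix keeps 50 under both.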

@pinin4fjords
Member

@hewgreen will have to have a go- I'm on annual leave for 2 weeks very soon ;-). Here's the error trace:

Traceback (most recent call last):
  File "/nfs/production3/ma/home/gxa_galaxy/_conda/envs/[email protected]/bin/scanpy-run-pca", line 10, in <module>
    sys.exit(PCA_CMD())
  File "/nfs/production3/ma/home/gxa_galaxy/_conda/envs/[email protected]/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/nfs/production3/ma/home/gxa_galaxy/_conda/envs/[email protected]/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/nfs/production3/ma/home/gxa_galaxy/_conda/envs/[email protected]/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/nfs/production3/ma/home/gxa_galaxy/_conda/envs/[email protected]/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/nfs/production3/ma/home/gxa_galaxy/_conda/envs/[email protected]/lib/python3.6/site-packages/scanpy_scripts/cmd_utils.py", line 40, in cmd
    func(adata, **kwargs)
  File "/nfs/production3/ma/home/gxa_galaxy/_conda/envs/[email protected]/lib/python3.6/site-packages/scanpy_scripts/lib/_pca.py", line 27, in pca
    sc.pp.pca(adata, **kwargs)
  File "/nfs/production3/ma/home/gxa_galaxy/_conda/envs/[email protected]/lib/python3.6/site-packages/scanpy/preprocessing/_simple.py", line 504, in pca
    X_pca = pca_.fit_transform(X)
  File "/nfs/production3/ma/home/gxa_galaxy/_conda/envs/[email protected]/lib/python3.6/site-packages/sklearn/decomposition/truncated_svd.py", line 175, in fit_transform
    " got %d >= %d" % (k, n_features))
ValueError: n_components must be < n_features; got 32 >= 15

... and here's the input object:

test.h5.zip

@pcm32
Member

pcm32 commented Feb 19, 2024

Ahh, this was merged in another PR afterwards (#78).

@pcm32 pcm32 closed this Feb 19, 2024