swappedDrops on STARSolo output #59

Open
mbatiuk opened this issue Feb 16, 2021 · 5 comments

Comments

@mbatiuk

mbatiuk commented Feb 16, 2021

Hi,

Is there any chance swappedDrops could be run on STARSolo output?

Currently STARSolo outputs barcodes.tsv, features.tsv and matrix.mtx but no h5 files, while swappedDrops requires an h5 file.

Thanks

@LTLA
Collaborator

LTLA commented Feb 16, 2021

tl;dr No.

The long answer is that you can use removeSwappedDrops() to run the same algorithm on any molecule-level information, regardless of format. The emphasis here is molecule-level; you need the UMI, assigned gene and assigned cell for each individual transcript molecule. This is what is returned by Cellranger in their molecule information HDF5 file.

It is not sufficient to run this on the count matrix, which aggregates the molecules into counts at the gene level. This discards information about individual molecules, preventing swappedDrops() from making a decision about whether a particular molecule is a swapping artifact. I'm not familiar enough with STARSolo to know whether molecule-level stats are generated.
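To illustrate why the count matrix is not enough, here is a minimal Python sketch. It is not the DropletUtils implementation, and the records are made up. It assumes that swapping is judged per molecule by comparing read support for the same (cell barcode, UMI, gene) combination across samples, keeping a molecule only in a sample that holds at least a hypothetical `min_frac` share of its reads. All names and numbers here are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical molecule-level records: (sample, cell barcode, UMI, gene, reads).
molecules = [
    ("S1", "AAAC", "GGTA", "GeneA", 95),
    ("S2", "AAAC", "GGTA", "GeneA", 5),   # same molecule seen in S2 with few reads
    ("S2", "AAAC", "TTCA", "GeneA", 40),
    ("S1", "CCGT", "ACGT", "GeneB", 30),
    ("S2", "CCGT", "ACGT", "GeneB", 30),  # ambiguous: no sample dominates
]


def remove_swapped(records, min_frac=0.8):
    """Keep a molecule only in the sample holding >= min_frac of its reads."""
    totals = defaultdict(int)
    for _sample, cell, umi, gene, reads in records:
        totals[(cell, umi, gene)] += reads
    return [
        rec for rec in records
        if rec[4] / totals[(rec[1], rec[2], rec[3])] >= min_frac
    ]


def count_matrix(records):
    """Aggregate molecules into per-sample (cell, gene) UMI counts."""
    counts = defaultdict(int)
    for sample, cell, _umi, gene, _reads in records:
        counts[(sample, cell, gene)] += 1
    return dict(counts)


kept = remove_swapped(molecules)
before = count_matrix(molecules)  # what a count matrix would contain
after = count_matrix(kept)        # counts once suspect molecules are dropped
```

The aggregated `before` counts only say that sample S2 has two GeneA molecules in cell AAAC. They cannot show that one of those molecules shares its UMI with a much better supported molecule in S1, which is the per-molecule comparison the filter relies on.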

@mbatiuk
Author

mbatiuk commented Feb 17, 2021

OK, thanks for your response. STARSolo can output BAM/SAM files but not HDF5.

I will ask the STAR developers if there is any way to get HDF5, or whether this functionality could be added.

@LTLA
Collaborator

LTLA commented Feb 17, 2021

Just so we're clear: the HDF5 file format is not a requirement. The requirement is just to get molecule-level information. Cellranger happens to store this in HDF5 format, but processing pipelines are free to use whatever format they want; as long as you can get the relevant pieces of information into R, you can call removeSwappedDrops() on it. Of course, if the pipeline produced a Cellranger-style HDF5 file, this would be easiest for everyone, but I won't presume to dictate formats to others.

@mbatiuk
Author

mbatiuk commented Feb 18, 2021

OK, I corrected my question to the STAR developers to be broader than HDF5.

On the other hand, could BAM/SAM files provide this info?

Or, if I understand correctly, STAR does a lot during read counting (collapsing UMI barcodes and correcting UMI barcode errors), so is SAM not a good source of molecule info?

@LTLA
Collaborator

LTLA commented Feb 18, 2021

> On the other hand, could BAM/SAM files provide this info?

Possibly. I've seen other cases where the processing pipeline deposits the identity of the assigned gene and cell barcodes into the SAM tags. One could also store the UMI sequence and indicate whether a particular read is a duplicate for a molecule.

In practice, custom BAM tags are a pain to work with; this would need to be handled on a case-by-case basis.
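As a rough idea of what such case-by-case handling might look like, the following Python sketch collapses SAM text records into molecules with read counts. It assumes the pipeline wrote the corrected cell barcode, corrected UMI and assigned gene into tags named `CB`, `UB` and `GX`, and that `-` marks an unassigned value. Both are assumptions that would need checking against the actual pipeline output, and the example records are invented. A BAM file would first have to be converted to SAM text, for example with `samtools view -h`.

```python
from collections import Counter

# Assumed tag names (not confirmed for any particular pipeline run).
CELL_TAG, UMI_TAG, GENE_TAG = "CB", "UB", "GX"


def parse_tags(sam_line):
    """Return the optional TAG:TYPE:VALUE fields of one SAM record as a dict."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:  # the first 11 SAM columns are mandatory
        tag, _type, value = field.split(":", 2)
        tags[tag] = value
    return tags


def molecules_from_sam(lines):
    """Count reads per (cell, UMI, gene); skip headers and unassigned reads."""
    reads_per_molecule = Counter()
    for line in lines:
        if line.startswith("@"):  # header line
            continue
        tags = parse_tags(line)
        key = (tags.get(CELL_TAG), tags.get(UMI_TAG), tags.get(GENE_TAG))
        if None in key or "-" in key:  # missing tag or assumed "-" placeholder
            continue
        reads_per_molecule[key] += 1
    return reads_per_molecule


# Invented records for illustration only.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tFFFF\tCB:Z:AAAC\tUB:Z:GGTA\tGX:Z:GeneA",
    "r2\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tFFFF\tCB:Z:AAAC\tUB:Z:GGTA\tGX:Z:GeneA",
    "r3\t0\tchr1\t200\t255\t4M\t*\t0\t0\tACGT\tFFFF\tCB:Z:AAAC\tUB:Z:-\tGX:Z:GeneA",
    "r4\t0\tchr1\t300\t255\t4M\t*\t0\t0\tACGT\tFFFF\tNH:i:1",
]
molecule_reads = molecules_from_sam(example)
```

Each key of `molecule_reads` is one (cell, UMI, gene) molecule and each value is its read count, which are the per-molecule pieces of information discussed above. Whether the tags reflect the same barcode and UMI corrections used for the count matrix depends on the pipeline and is not something this sketch can verify.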
