swappedDrops on STARSolo output #59

mbatiuk · 2021-02-16T23:03:49Z

Hi,

Is there any chance swappedDrops could be run on STARSolo output?

Currently, STARSolo outputs barcodes.tsv, features.tsv and matrix.mtx but no h5 files, whereas swappedDrops requires an h5 file.

Thanks

LTLA · 2021-02-16T23:14:13Z

tl;dr No.

The long answer is that you can use removeSwappedDrops() to run the same algorithm on any molecule-level information, regardless of format. The emphasis here is on "molecule-level": you need the UMI, assigned gene and assigned cell for each individual transcript molecule. This is what Cellranger returns in its molecule information HDF5 file.

It is not sufficient to run this on the count matrix, which aggregates the molecules into counts at the gene level. This discards information about individual molecules, preventing swappedDrops() from deciding whether a particular molecule is a swapping artifact. I'm not familiar enough with STARSolo to know whether molecule-level stats are generated.
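To illustrate why the count matrix is not enough, here is a minimal Python sketch. It is not the DropletUtils implementation; the records, function names and the `min_frac` threshold are hypothetical, and the rule is a simplified one that keeps a molecule only in the sample holding most of its reads. The point is that the decision needs the (cell, UMI, gene) identity of each molecule, which the aggregated counts no longer carry.

```python
from collections import defaultdict

# Hypothetical molecule-level records: (sample, cell barcode, UMI, gene, reads).
molecules = [
    ("A", "AAAC", "TTGCA", "Gene1", 95),
    ("B", "AAAC", "TTGCA", "Gene1", 5),   # same barcode/UMI/gene seen in sample B
    ("A", "AAAC", "GGCAT", "Gene1", 40),
    ("B", "CCGT", "ACGTA", "Gene2", 30),
]

def count_matrix(records):
    """Aggregate molecules into per-sample (cell, gene) UMI counts."""
    counts = defaultdict(int)
    for sample, cell, _umi, gene, _reads in records:
        counts[(sample, cell, gene)] += 1
    return dict(counts)

def remove_swapped(records, min_frac=0.8):
    """Keep each molecule only in the sample with >= min_frac of its reads.

    Molecules whose reads are spread too evenly across samples are dropped
    from every sample. The threshold value is an illustrative assumption.
    """
    groups = defaultdict(list)
    for rec in records:
        _sample, cell, umi, gene, _reads = rec
        groups[(cell, umi, gene)].append(rec)
    kept = []
    for recs in groups.values():
        total = sum(r[4] for r in recs)
        best = max(recs, key=lambda r: r[4])
        if best[4] / total >= min_frac:
            kept.append(best)
    return kept

# The count matrix only says "sample B, cell AAAC, Gene1: 1 UMI";
# it cannot reveal that this UMI is shared with sample A.
print(count_matrix(molecules))
print(remove_swapped(molecules))
```

With the molecule-level records, the low-read copy in sample B can be identified and dropped; with only the counts, it is indistinguishable from a genuine molecule.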

mbatiuk · 2021-02-17T12:51:22Z

OK, thanks for your response. STARSolo can output BAM/SAM files but not HDF5.

I will ask the STAR developers whether there is any way to get HDF5 output, or whether this functionality could be added.

LTLA · 2021-02-17T23:32:33Z

Just so we're clear: the HDF5 file format is not a requirement. The requirement is just to get molecule-level information. Cellranger happens to store this in a HDF5 format, but processing pipelines are free to use whatever format they want; as long as you can get the relevant pieces of information into R, you can call removeSwappedDrops() on it. Of course, if the pipeline produced a Cellranger-style HDF5 file, this would be easiest for everyone, but I won't presume to dictate formats to others.

mbatiuk · 2021-02-18T10:29:47Z

OK, I have broadened my question to the STAR developers so that it is not specific to HDF5.

On the other hand, could BAM/SAM files provide this info?

Or, if I understand correctly, STAR does a lot during read counting (collapsing UMI barcodes and correcting UMI barcode errors), so SAM is not a good source of molecule info?

LTLA · 2021-02-18T16:56:39Z

> On the other hand, could BAM/SAM files provide this info?

Possibly. I've seen other cases where the processing pipeline deposits the identity of the assigned gene and cell barcodes into the SAM tags. One could also store the UMI sequence and indicate whether a particular read is a duplicate for a molecule.

In practice, custom BAM tags are a pain to work with; this would need to be handled on a case-by-case basis.
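As a rough illustration of what that case-by-case handling might look like, here is a Python sketch that collapses SAM text records into molecules with read counts. The tag names `CB` (cell barcode), `UB` (UMI) and `GX` (gene) and the use of `-` for a missing value are assumptions based on a common convention; whether a given pipeline writes them, and whether the UMI stored there is already error-corrected, has to be checked against that pipeline's documentation.

```python
from collections import Counter

def molecules_from_sam(lines):
    """Collapse SAM records into a (cell, UMI, gene) -> read count table."""
    reads_per_molecule = Counter()
    for line in lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        # Optional fields start at column 12 and have the form TAG:TYPE:VALUE.
        tags = {}
        for field in fields[11:]:
            name, _type, value = field.split(":", 2)
            tags[name] = value
        # Tag names are assumed; adjust them to whatever the pipeline emits.
        cell, umi, gene = tags.get("CB"), tags.get("UB"), tags.get("GX")
        # Skip reads without a full assignment ("-" assumed to mean missing).
        if None in (cell, umi, gene) or "-" in (cell, umi, gene):
            continue
        reads_per_molecule[(cell, umi, gene)] += 1
    return reads_per_molecule

# Hypothetical records: two reads from one molecule, one read with no UMI.
example = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tFFFF\tCB:Z:AAAC\tUB:Z:TTGCA\tGX:Z:Gene1",
    "r2\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tFFFF\tCB:Z:AAAC\tUB:Z:TTGCA\tGX:Z:Gene1",
    "r3\t0\tchr1\t200\t255\t4M\t*\t0\t0\tACGT\tFFFF\tCB:Z:AAAC\tUB:Z:-\tGX:Z:Gene1",
]
print(molecules_from_sam(example))
```

The resulting table has the same pieces of information (cell, UMI, gene, reads per molecule) that removeSwappedDrops() needs, though it would still have to be loaded into R and arranged per sample.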

This was referenced Feb 17, 2021

. #60

Closed

HDF5/each individual transcript molecule info in STARSolo output alexdobin/STAR#1148

Open
