Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exon counts when there are genes on both strands #253

Open
skagawa2 opened this issue Oct 22, 2024 · 4 comments
Open

Exon counts when there are genes on both strands #253

skagawa2 opened this issue Oct 22, 2024 · 4 comments
Assignees
Labels
algorithm Issue requires algorithmic improvement enhancement New feature or request

Comments

@skagawa2
Copy link

Hello! We were looking at the exon counts again and saw some discrepancies from what we expected. This one comes from reads going on one strand counting as being excluded from the gene on the opposite strand. The example we found was with this gene below- F2RL2 and IQGAP2 (and we presume there are a handful more that get caught under this case). We saw that in exon_counts.tsv, the exclude counts for F2RL2 exons a sample that expressed IQGAP2 and not F2RL2 was high. This wasn't what we expected to happen when an exon is included or excluded since our reads are stranded.

image

We came up with the following patch that we think fixes this problem, but we're not sure if this messes anything else up in unexpected ways (like the transcript classification or other parts of the pipeline). If this was also intended (that exons aren't stranded), please let us know as well!

diff --git a/src/alignment_processor.py b/src/alignment_processor.py
index 8644a47..2833991 100644
--- a/src/alignment_processor.py
+++ b/src/alignment_processor.py
@@ -272,23 +272,23 @@ class AlignmentCollector:
         split_regions = AlignmentCollector.split_coverage_regions(current_region, alignment_storage)
 
         if len(split_regions) == 1:
-            yield self.process_alignments_in_region(current_region, alignment_storage.get_alignments())
+            current_region, alignments, strand = split_regions[0]
+            yield self.process_alignments_in_region(current_region, alignments, strand)
         else:
-            for new_region in split_regions:
-                alignments = alignment_storage.get_alignments(new_region)
-                yield self.process_alignments_in_region(new_region, alignments)
+            for new_region, alignments, strand in split_regions:
+                yield self.process_alignments_in_region(new_region, alignments, strand)
 
-    def process_alignments_in_region(self, current_region, alignment_storage):
-        logger.debug("Processing region %s" % str(current_region))
-        gene_info = self.get_gene_info_for_region(current_region)
+    def process_alignments_in_region(self, current_region, alignment_storage, strand):
+        logger.debug("Processing region %s on strand %s" % (str(current_region), strand))
+        gene_info = self.get_gene_info_for_region(current_region, strand=strand)
         if gene_info.empty():
-            assignment_storage = self.process_intergenic(alignment_storage, current_region)
+            assignment_storage = self.process_intergenic(alignment_storage, current_region, strand)
         else:
-            assignment_storage = self.process_genic(alignment_storage, gene_info, current_region)
+            assignment_storage = self.process_genic(alignment_storage, gene_info, current_region, strand)
 
         return gene_info, assignment_storage
 
-    def process_intergenic(self, alignment_storage, region):
+    def process_intergenic(self, alignment_storage, region, strand):
         assignment_storage = []
         if self.illumina_bam is not None:
             corrector = IlluminaExonCorrector(self.chr_id, region[0], region[1], self.illumina_bam)
@@ -310,6 +310,10 @@ class AlignmentCollector:
                 logger.warning("Read %s has no aligned exons" % read_id)
                 continue
 
+            mapped_strand = "-" if alignment.is_reverse else "+"
+            if strand != '.' and mapped_strand != strand:
+                continue
+
             #if len(alignment_info.read_exons) > 2 and not alignment.is_secondary and \
             #        alignment.mapping_quality < self.params.multi_intron_mapping_quality_cutoff:
             #    continue
@@ -338,7 +342,7 @@ class AlignmentCollector:
             read_assignment.corrected_introns = junctions_from_blocks(read_assignment.corrected_exons)
 
             read_assignment.read_group = self.read_groupper.get_group_id(alignment, self.bam_merger.bam_pairs[bam_index][1])
-            read_assignment.mapped_strand = "-" if alignment.is_reverse else "+"
+            read_assignment.mapped_strand = mapped_strand
             read_assignment.strand = self.get_assignment_strand(read_assignment)
             read_assignment.chr_id = self.chr_id
             read_assignment.multimapper = alignment.is_secondary
@@ -347,7 +351,7 @@ class AlignmentCollector:
             assignment_storage.append(read_assignment)
         return assignment_storage
 
-    def process_genic(self, alignment_storage, gene_info, region):
+    def process_genic(self, alignment_storage, gene_info, region, strand):
         assigner = LongReadAssigner(gene_info, self.params)
         profile_constructor = CombinedProfileConstructor(gene_info, self.params)
         exon_corrector = ExonCorrector(gene_info, self.params, self.chr_record)
@@ -369,6 +373,10 @@ class AlignmentCollector:
                 logger.warning("Read %s has no aligned exons" % read_id)
                 continue
 
+            mapped_strand = "-" if alignment.is_reverse else "+"
+            if strand != '.' and mapped_strand != strand:
+                continue
+
             alignment_info.add_polya_info(self.polya_finder, self.polya_fixer)
             if self.params.cage:
                 alignment_info.add_cage_info(self.cage_finder)
@@ -396,7 +404,7 @@ class AlignmentCollector:
             read_assignment.corrected_introns = junctions_from_blocks(read_assignment.corrected_exons)
 
             read_assignment.read_group = self.read_groupper.get_group_id(alignment, self.bam_merger.bam_pairs[bam_index][1])
-            read_assignment.mapped_strand = "-" if alignment.is_reverse else "+"
+            read_assignment.mapped_strand = mapped_strand
             read_assignment.strand = self.get_assignment_strand(read_assignment)
             AlignmentCollector.check_antisense(read_assignment)
             AlignmentCollector.import_bam_tags(alignment, read_assignment, self.params.bam_tags)
@@ -458,13 +466,13 @@ class AlignmentCollector:
 
         return indel_count, junctions_with_indels
 
-    def get_gene_info_for_region(self, current_region):
+    def get_gene_info_for_region(self, current_region, strand='.'):
         if not self.genedb:
             return GeneInfo.from_region(self.chr_id, current_region[0], current_region[1],
                                         self.params.delta, self.chr_record)
 
         gene_list = list(self.genedb.region(seqid=self.chr_id, start=current_region[0],
-                                            end=current_region[1], featuretype="gene"))
+                                            end=current_region[1], featuretype="gene", strand=strand))
         if not gene_list:
             return GeneInfo.from_region(self.chr_id, current_region[0], current_region[1],
                                         self.params.delta, self.chr_record)
@@ -478,7 +486,19 @@ class AlignmentCollector:
     def split_coverage_regions(genomic_region, alignment_storage):
         if interval_len(genomic_region) < AlignmentCollector.MAX_REGION_LEN and \
                 alignment_storage.get_read_count() < AlignmentCollector.MIN_READS_TO_SPLIT:
-            return [genomic_region]
+            pos_strand_alignments = []
+            neg_strand_alignments = []
+            for alignment in alignment_storage.get_alignments():
+                if not alignment[1].is_reverse:
+                    pos_strand_alignments.append(alignment)
+                else:
+                    neg_strand_alignments.append(alignment)
+            split_regions = []
+            if pos_strand_alignments:
+                split_regions.append((genomic_region, pos_strand_alignments, '+'))
+            if neg_strand_alignments:
+                split_regions.append((genomic_region, neg_strand_alignments, '-'))
+            return split_regions
 
         split_regions = []
         coverage_dict = alignment_storage.coverage_dict
@@ -492,8 +512,17 @@ class AlignmentCollector:
                     coverage_dict[pos] > max(AlignmentCollector.ABS_COV_VALLEY, max_cov * AlignmentCollector.REL_COV_VALLEY):
                 max_cov = max(max_cov, coverage_dict[pos])
                 pos += 1
-            split_regions.append((max(current_start * AbstractAlignmentStorage.COVERAGE_BIN + 1, genomic_region[0]),
-                                  min(pos * AbstractAlignmentStorage.COVERAGE_BIN, genomic_region[1])))
+            split_region = (max(current_start * AbstractAlignmentStorage.COVERAGE_BIN + 1, genomic_region[0]),
+                            min(pos * AbstractAlignmentStorage.COVERAGE_BIN, genomic_region[1]))
+            alignments = list(alignment_storage.get_alignments(split_region))
+            pos_strand_alignments = [(bam_index, aln) for bam_index, aln in alignments if not aln.is_reverse]
+            neg_strand_alignments = [(bam_index, aln) for bam_index, aln in alignments if aln.is_reverse]
+
+            if pos_strand_alignments:
+                split_regions.append((split_region, pos_strand_alignments, '+'))
+            if neg_strand_alignments:
+                split_regions.append((split_region, neg_strand_alignments, '-'))
+
             current_start = pos
             max_cov = coverage_dict[current_start]
             pos = min(current_start + 1, coverage_positions[-1] + 1)

Once again, thank you for creating this software!

@andrewprzh
Copy link
Collaborator

Dear @skagawa2

Thank you for taking your time in digging into the code so deeply!

Correct if I'm wrong, but this method will separate alignments by their mapping strand reported by minimap2, right?
If so, BAM file strand flag indicates whether a read sequence was reverse-complemented during mapping. It can be used if you are working with stranded protocols (i.e. dRNA), but in more frequent cDNA data reads are unstranded, and thus mapping flag is rather arbitrary since read split ~50/50.

In situation I think it will correct to use the assigned read strand, which IsoQuant computes using:

  • presence of polyA tail or polyT head
  • canonical splice sites
  • strand of assigned gene for consistent reads

It does make algorithm more complicated, but it's certainly the right way for overlapping genes from the opposite strands. I'll if I can fix this easily.

Best
Andrey

@andrewprzh andrewprzh self-assigned this Oct 25, 2024
@andrewprzh andrewprzh added enhancement New feature or request algorithm Issue requires algorithmic improvement labels Oct 25, 2024
@skagawa2
Copy link
Author

No problem! And yes you are correct, this method would separate alignments by their mapping strand. We happened to have a stranded library with trimmed polyA tails so I hadn't considered other cases and assumed you would know best haha. Please feel free to take this suggestion in any way you please, these changes just happened to work in our case.

@Geil625
Copy link

Geil625 commented Dec 25, 2024

Hi,
I would like to ask a question following this issue. When I used --d nanopore, does the --stranded flag automatically being set to "none"? Also what is the default value of --stranded?
Thanks

@andrewprzh
Copy link
Collaborator

@Geil625

Yes, the default value for --stranded is none for all data types, it needs to be specified explicitly if the data is stranded.

Best
Andrey

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
algorithm Issue requires algorithmic improvement enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants