Skip to content

Commit

Permalink
Generalize the counts attribute of an Alignment (biopython#4924)
Browse files Browse the repository at this point in the history
* create template

* tests

* tests

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* stockholm

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

---------

Co-authored-by: Michiel de Hoon <[email protected]>
Co-authored-by: Michiel Jan Laurens de Hoon <[email protected]>
  • Loading branch information
3 people authored Feb 4, 2025
1 parent c58b632 commit e838eb9
Show file tree
Hide file tree
Showing 24 changed files with 22,085 additions and 44 deletions.
359 changes: 335 additions & 24 deletions Bio/Align/__init__.py

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion Doc/README.rst
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@ Read the `Sphix reST Primer <https://www.sphinx-doc.org/en/master/usage/restruct
Building the documentation
--------------------------

Build the documentation in HTML by running ``make html`` in the Docs folder.
Build the documentation in HTML by running ``make html`` in the Doc folder.
Make sure that you have installed necessary requirements (see instructions above).

To view the generated documentation, open ``Doc/_build/html/index.html`` in your browser.
Expand Down
156 changes: 139 additions & 17 deletions Doc/Tutorial/chapter_align.rst
Original file line number Diff line number Diff line change
Expand Up @@ -576,18 +576,14 @@ alignment are indicated by -1:
array([ 0, 1, 3, 4, 5, -1, -1]),
array([0, 1, 2, 3, 4, 5])]
.. _`paragraph:alignment_counts`:

Counting identities, mismatches, and gaps
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``counts`` method calculates the number of identities, mismatches,
and gaps of a pairwise alignment. For an alignment of more than two
sequences, the number of identities, mismatches, and gaps are calculated
and summed for all pairs of sequences in the alignment. The three
numbers are returned as an ``AlignmentCounts`` object, which is a
``namedtuple`` with fields ``gaps``, ``identities``, and ``mismatches``.
This method currently takes no arguments, but in the future will likely
be modified to accept optional arguments allowing its behavior to be
customized.
The ``counts`` method counts the number of identities, mismatches, and gaps
(insertions and deletions) of an alignment. The return value is an
``AlignmentCounts`` object, from which the counts can be obtained as properties.

.. cont-doctest
Expand All @@ -598,15 +594,130 @@ customized.
0 .|-|||-- 8
query 0 AG-TTT-- 5
<BLANKLINE>
>>> pairwise_alignment.counts()
AlignmentCounts(gaps=3, identities=4, mismatches=1)
>>> counts = pairwise_alignment.counts()
>>> counts.aligned
5
>>> counts.identities
4
>>> counts.mismatches
1
>>> counts.gaps
3
>>> counts.insertions
0
>>> counts.deletions
3
>>> counts.internal_deletions
1
>>> counts.right_deletions
2
For an alignment of more than two sequences, the number of identities,
mismatches, and gaps are calculated and summed for all pairs of sequences in
the alignment.

.. cont-doctest
.. code:: pycon
>>> print(alignment)
1 CGGTTTTT 9
0 AG-TTT-- 5
0 AGGTTT-- 6
<BLANKLINE>
>>> alignment.counts()
AlignmentCounts(gaps=6, identities=14, mismatches=2)
>>> counts = alignment.counts()
>>> counts.aligned
16
>>> counts.identities
14
>>> counts.mismatches
2
>>> counts.insertions
1
>>> counts.deletions
5
Here, insertions are defined as sequence insertions of a later sequence into an
earlier sequence in the alignment. In contrast to the pairwise alignment above,
the distinction between insertions and deletions may not be meaningful for a
multiple sequence alignment, and you will probably be more interested in the
number of gaps (= insertions + deletions):

.. cont-doctest
.. code:: pycon
>>> counts.gaps
6
>>> counts.left_gaps
0
>>> counts.right_gaps
4
>>> counts.internal_gaps
2
For protein alignments, in addition to the number of identities and mismatches,
you can also count the number of positive matches by supplying a substitution
matrix (see Chapter :ref:`sec:substitution_matrices`):

.. cont-doctest
.. code:: pycon
>>> from Bio.Align import substitution_matrices
>>> substitution_matrix = substitution_matrices.load("BLOSUM62")
>>> protein_alignment = Alignment(["GQCGSCWSFS", "GACGSCWTFS"])
>>> print(protein_alignment)
target 0 GQCGSCWSFS 10
0 |.|||||.|| 10
query 0 GACGSCWTFS 10
<BLANKLINE>
>>> counts = protein_alignment.counts(substitution_matrix)
>>> counts.aligned
10
>>> counts.identities
8
>>> counts.mismatches
2
>>> counts.positives
9
where the number of positives is one higher than the number of identities
because the mismatch of S against T has a positive score (while the mismatch
score of Q against A is not positive):

.. cont-doctest
.. code:: pycon
>>> substitution_matrix["S", "T"]
1.0
>>> substitution_matrix["Q", "A"]
-1.0
An ``AlignmentCounts`` object has the following properties:

========================= =============================================================================================================
**property** **description**
========================= =============================================================================================================
``aligned`` The number of aligned letters in the alignment. This quantity is also calculated if some or all of the sequences are undefined. If all sequences are known, then ``aligned`` = ``identities`` + ``mismatches``. If some sequences are undefined, then ``aligned`` > ``identities`` + ``mismatches``.
``identities`` The number of identical letters in the alignment
``mismatches`` The number of mismatched letters in the alignment
``positives`` The number of aligned letters with a positive score
``left_insertions`` The number of insertions on the left side of the alignment
``left_deletions`` The number of deletions on the left side of the alignment
``right_insertions`` The number of insertions on the right side of the alignment
``right_deletions`` The number of deletions on the right side of the alignment
``internal_insertions`` The number of insertions in the interior of the alignment
``internal_deletions`` The number of deletions in the interior of the alignment
``insertions`` The total number of insertions, equal to ``left_insertions`` + ``right_insertions`` + ``internal_insertions``
``deletions`` The total number of deletions, equal to ``left_deletions`` + ``right_deletions`` + ``internal_deletions``
``left_gaps`` The number of gaps on the left side of the alignment, equal to ``left_insertions`` + ``left_deletions``
``right_gaps`` The number of gaps on the right side of the alignment, equal to ``right_insertions`` + ``right_deletions``
``internal_gaps`` The number of gaps in the interior of the alignment, equal to ``internal_insertions`` + ``internal_deletions``
``gaps`` The total number of gaps in the alignment, equal to ``left_gaps`` + ``right_gaps`` + ``internal_gaps``
========================= =============================================================================================================


Letter frequencies
~~~~~~~~~~~~~~~~~~
Expand Down Expand Up @@ -2574,11 +2685,22 @@ calling the ``counts`` method on the ``alignment`` object:
.. code:: pycon
>>> alignment.counts()
AlignmentCounts(gaps=19, identities=112, mismatches=0)
>>> counts = alignment.counts()
where ``AlignmentCounts`` is a ``namedtuple`` in the ``collections``
module in Python’s standard library.
where the ``counts`` variable is an ``AlignmentCounts`` object collecting
information on the number of gaps, matches, and mismatches in the alignment
library (see :ref:`paragraph:alignment_counts`)):

.. cont-doctest
.. code:: pycon
>>> counts.identities
112
>>> counts.mismatches
0
>>> counts.gaps
19
The consensus line shown between the two sequences is stored in the
``column_annotations`` attribute:
Expand Down
21 changes: 21 additions & 0 deletions Tests/test_Align_Alignment.py
Original file line number Diff line number Diff line change
Expand Up @@ -2366,6 +2366,27 @@ def test_substitutions(self):
self.assertAlmostEqual(m["C", "T"], 14.0)
self.assertAlmostEqual(m["T", "C"], 14.0)

def test_counts(self):
from Bio.Align import substitution_matrices

substitution_matrix = substitution_matrices.load("BLOSUM62")
alignment = self.alignment
counts = alignment.counts()
self.assertEqual(
str(counts),
"AlignmentCounts(left_insertions=0, left_deletions=0, internal_insertions=0, internal_deletions=0, right_insertions=80, right_deletions=4, aligned=3084, identities=3020, mismatches=64, positives=None)",
)
counts = alignment.counts(gaps_only=True)
self.assertEqual(
str(counts),
"AlignmentCounts(left_insertions=0, left_deletions=0, internal_insertions=0, internal_deletions=0, right_insertions=80, right_deletions=4, aligned=3084, identities=None, mismatches=None, positives=None)",
)
with self.assertRaises(ValueError):
alignment.counts(substitution_matrix=substitution_matrix, gaps_only=True)
for i, sequence in enumerate(alignment.sequences):
length = len(sequence)
alignment.sequences[i] = Seq(None, length)

def test_add(self):
self.assertEqual(
str(self.alignment[:, 50:60]),
Expand Down
86 changes: 86 additions & 0 deletions Tests/test_Align_a2m.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,9 @@
import numpy as np

from Bio import Align
from Bio.Align import substitution_matrices

substitution_matrix = substitution_matrices.load("BLOSUM62")


class TestA2MReadingWriting(unittest.TestCase):
Expand Down Expand Up @@ -142,6 +145,23 @@ def check_clustalw(self, alignment):
---------MSPQTETKASVGFKAGVKEYKLTYYTPEYETKDTDILAAFRVTPQPG-----------------VPPEEAGAAVAAESSTGT---------WTTVWTDGLTSLDRYKG-----RCYHIEPVPG-------------------EKDQCICYVAYPLDLFEEGSVTNMFTSIVGNVFGFKALRALRLEDLRIPVAYVKTFQGPPHGIQVERDKLNKYGRPLLGCTIKPKLGLSAKNYGRAVYECLRGGLDFTKDDENVNSQPFMRWRDRFLFCAEAIYKAQAETGEIKGHYLNATAG-----------------------TCEEMIKRAIFARELGVPIVMHDYLTGGFTANTSLAHYCRDNGLLLHIHRAMHAVIDRQKNHGMHFRVLAKALRLSGGDHIHSGTVVGKLEGERDITLGFVDLLRDDFIEKDRSRGIYFTQDWVSLPGVIPVASG-----------------------------GIHVWHMPALTEIFGDDSVLQFGGGTLGHPWGNAPGAVANRVA-----------VEACVKARNEG---RDLAAEGNAIIREACKWSPElAAACEVWKEIKFEFPAMD---
""",
)
counts = alignment.counts(substitution_matrix)
self.assertEqual(counts.left_insertions, 0)
self.assertEqual(counts.left_deletions, 9)
self.assertEqual(counts.right_insertions, 0)
self.assertEqual(counts.right_deletions, 3)
self.assertEqual(counts.internal_insertions, 1)
self.assertEqual(counts.internal_deletions, 116)
self.assertEqual(counts.left_gaps, 9)
self.assertEqual(counts.right_gaps, 3)
self.assertEqual(counts.internal_gaps, 117)
self.assertEqual(counts.insertions, 1)
self.assertEqual(counts.deletions, 128)
self.assertEqual(counts.gaps, 129)
self.assertEqual(counts.aligned, 472)
self.assertEqual(counts.identities, 64)
self.assertEqual(counts.mismatches, 408)
self.assertEqual(counts.positives, 126)

def test_msaprobs(self):
path = "Clustalw/msaprobs.a2m"
Expand Down Expand Up @@ -314,6 +334,23 @@ def test_msaprobs(self):
MKKLvl......SLS....LV---lA..FSSAta...............a.FAAIPQNIRIGTDPTYAPFESKNs.QGELVGFDIDLAKELCKRINTQCTFVENPLDALIPSLKAKKIDAIMSSLSITEKRQQEIAFTDKLYAADSRLVVAK.NSDIQPTVESLKGKRVGVLQGTTQETFGNEHWAPKGIEIVSYQGqdNIYSDLTAGRIDAAFQDEVAASEgFLKQPVgKDYKFGGPSVKdeklfGVGTGMGLRK--EDNELREALNKAFAEMRADGTYEKLAKKYFDFDVYG...g
""",
)
counts = alignment.counts(substitution_matrix)
self.assertEqual(counts.left_insertions, 0)
self.assertEqual(counts.left_deletions, 0)
self.assertEqual(counts.right_insertions, 11)
self.assertEqual(counts.right_deletions, 19)
self.assertEqual(counts.internal_insertions, 285)
self.assertEqual(counts.internal_deletions, 293)
self.assertEqual(counts.left_gaps, 0)
self.assertEqual(counts.right_gaps, 30)
self.assertEqual(counts.internal_gaps, 578)
self.assertEqual(counts.insertions, 296)
self.assertEqual(counts.deletions, 312)
self.assertEqual(counts.gaps, 608)
self.assertEqual(counts.aligned, 6948)
self.assertEqual(counts.identities, 2353)
self.assertEqual(counts.mismatches, 4595)
self.assertEqual(counts.positives, 3745)
self.check_reading_writing(path)

def test_muscle(self):
Expand Down Expand Up @@ -422,6 +459,22 @@ def test_muscle(self):
.................................................................---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ATGAACAAAGTAGCGAGGAAGAA------------------------------CAAAACATC----------------------------------------------------------------------------------------------------------------------------------------------------------------------------AGCAAAGAAAACGATCTGTCTCCGTCGTAACACACAGTTTTTCGAGACCCTTTGCTTCTTCGGCGCCGGTGGACACGTCAGCATCTCCGGTATCCTAGACTTCTTGGCTTTCGGGGTACAACAACCGCCTGGTGACGTCAGCACCGCTGCTGGGGATGGAGAGGGAACAGAGTAg
""",
)
counts = alignment.counts()
self.assertEqual(counts.left_insertions, 65)
self.assertEqual(counts.left_deletions, 491)
self.assertEqual(counts.right_insertions, 2)
self.assertEqual(counts.right_deletions, 0)
self.assertEqual(counts.internal_insertions, 0)
self.assertEqual(counts.internal_deletions, 404)
self.assertEqual(counts.left_gaps, 556)
self.assertEqual(counts.right_gaps, 2)
self.assertEqual(counts.internal_gaps, 404)
self.assertEqual(counts.insertions, 67)
self.assertEqual(counts.deletions, 895)
self.assertEqual(counts.gaps, 962)
self.assertEqual(counts.aligned, 1034)
self.assertEqual(counts.identities, 974)
self.assertEqual(counts.mismatches, 60)
self.check_reading_writing(path)

def test_kalign(self):
Expand Down Expand Up @@ -475,6 +528,22 @@ def test_kalign(self):
# fmt: on
)
)
counts = alignment.counts()
self.assertEqual(counts.left_insertions, 0)
self.assertEqual(counts.left_deletions, 0)
self.assertEqual(counts.right_insertions, 0)
self.assertEqual(counts.right_deletions, 0)
self.assertEqual(counts.internal_insertions, 1)
self.assertEqual(counts.internal_deletions, 0)
self.assertEqual(counts.left_gaps, 0)
self.assertEqual(counts.right_gaps, 0)
self.assertEqual(counts.internal_gaps, 1)
self.assertEqual(counts.insertions, 1)
self.assertEqual(counts.deletions, 0)
self.assertEqual(counts.gaps, 1)
self.assertEqual(counts.aligned, 26)
self.assertEqual(counts.identities, 25)
self.assertEqual(counts.mismatches, 1)
self.check_reading_writing(path)

def test_probcons(self):
Expand Down Expand Up @@ -618,6 +687,23 @@ def test_probcons(self):
# fmt: on
)
)
counts = alignment.counts(substitution_matrix)
self.assertEqual(counts.left_insertions, 8)
self.assertEqual(counts.left_deletions, 2)
self.assertEqual(counts.right_insertions, 0)
self.assertEqual(counts.right_deletions, 0)
self.assertEqual(counts.internal_insertions, 14)
self.assertEqual(counts.internal_deletions, 48)
self.assertEqual(counts.left_gaps, 10)
self.assertEqual(counts.right_gaps, 0)
self.assertEqual(counts.internal_gaps, 62)
self.assertEqual(counts.insertions, 22)
self.assertEqual(counts.deletions, 50)
self.assertEqual(counts.gaps, 72)
self.assertEqual(counts.aligned, 904)
self.assertEqual(counts.identities, 427)
self.assertEqual(counts.mismatches, 477)
self.assertEqual(counts.positives, 554)
self.check_reading_writing(path)

def test_empty(self):
Expand Down
Loading

0 comments on commit e838eb9

Please sign in to comment.