%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% ELIFE ARTICLE TEMPLATE
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% PREAMBLE
\documentclass[9pt,lineno]{elife}
% Use the onehalfspacing option for 1.5 line spacing
% Use the doublespacing option for 2.0 line spacing
% Please note that these options may affect formatting.
% Additionally, the use of the \newcommand function should be limited.
\usepackage{algorithm2e}
\usepackage{mathtools}
\usepackage{color, colortbl}
\usepackage{array} % for defining a new column type
\usepackage{varwidth} %for the varwidth minipage environment
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% ARTICLE SETUP
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\title{TRex, a fast multi-animal tracking system with markerless identification, and 2D estimation of posture and visual fields}
\author[1,2,3*]{Tristan Walter}
\author[1,2,3*]{Iain D Couzin}
\affil[1]{Max Planck Institute of Animal Behavior, Germany}
\affil[2]{Centre for the Advanced Study of Collective Behaviour, University of Konstanz, Germany}
\affil[3]{Department of Biology, University of Konstanz, Germany}
\corr{[email protected]}{TW}
\corr{[email protected]}{IDC}
%\presentadd[\authfn{1}]{Department of Collective Behaviour, Max Planck Institute of Animal Behavior, D-78457 Konstanz, Germany}
%\presentadd[\authfn{2}]{Centre for the Advanced Study of Collective Behaviour, University of Konstanz, Universitätsstraße 10, D-78457 Konstanz, Germany}
%\presentadd[\authfn{3}]{Department of Biology, University of Konstanz, Universitätsstraße 10, D-78457 Konstanz, Germany}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% ARTICLE START
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\figref}[1]{\textit{\textbf{\ref{#1}}}}
\newcommand{\vidref}[1]{\textit{\textbf{\ref{#1}}}}
\newcommand{\tableref}[1]{\textit{\textbf{\ref{tab:#1}}}\xspace}
\newcommand{\videoref}[1]{video~\textit{\textbf{\ref{#1}}}}
%\DeclarePairedDelimiterX\set[1]\lbrace\rbrace{\def\given{\;\delimsize\vert\;}#1}
\NewDocumentCommand{\up}{som}{%
\IfBooleanTF{#1}
{\upext{#3}}
{#3\IfNoValueTF{#2}{\mathord}{#2}\uparrow}%
}
\NewDocumentCommand{\upext}{m}{%
\mleft.\kern-\nulldelimiterspace#1\mright\uparrow
}
\DeclarePairedDelimiterX{\given}[1]{(}{)}{%
\ifnum\currentgrouptype=16 \else\begingroup\fi
\activatebar#1
\ifnum\currentgrouptype=16 \else\endgroup\fi
}
\DeclarePairedDelimiterX{\givenset}[1]{\{}{\}}{%
\ifnum\currentgrouptype=16 \else\begingroup\fi
\activatebar#1
\ifnum\currentgrouptype=16 \else\endgroup\fi
}
\newcommand{\idtracker}{\protect\path{ idtracker.ai}}
\newcommand{\TRex}{\protect\path{TRex}}
\newcommand{\TGrabs}{\protect\path{TGrabs}}
\definecolor{Gray}{gray}{0.9}
\newcommand{\innermid}{\nonscript\;\vert\nonscript\;} % \delimsize
\newcommand{\activatebar}{%
\begingroup\lccode`\~=`\|
\lowercase{\endgroup\let~}\innermid
\mathcode`|=\string"8000
}
\DeclarePairedDelimiter\ceil{\lceil}{\rceil}
\newcommand{\expnumber}[2]{{#1}\mathrm{e}{#2}}
\DeclarePairedDelimiter{\nint}\lfloor\rceil
\newcommand{\argmax}[1]{\underset{#1}{\operatorname{arg}\,\operatorname{max}}\;}
\newcommand*\mean[1]{\bar{#1}}
\newcommand{\Tau}{\mathcal{T}}
\newcommand{\direction}[1]{\overrightarrow{#1}\;}
\DeclareMathOperator*{\median}{median}
\DeclareMathOperator{\atantwo}{atan2}
\renewcommand{\thefigure}{Figure~\arabic{figure}}
\captionsetup*[figure]{name={\hspace{-2.5pt}},font={color=eLifeDarkBlue,small},skip=\smallskipamount,justification=justified}
\captionsetup*[table]{name={\hspace{-2.5pt}},font={color=eLifeDarkBlue,small},margin=0pt,indention=0cm,justification=justified}
\renewcommand{\thetable}{Table~\arabic{table}}
\newcommand{\changemade}[1]{#1}
%\renewcommand{\changemade}[1]{{\color{blue}#1}}
\makeatletter\newcommand\newtag[2]{#1\def\@currentlabel{#1\hspace{-2pt}}\label{#2}}\makeatother%
\makeatletter
\newcommand*{\inlineequation}[2][]{%
\begingroup
% Put \refstepcounter at the beginning, because
% package `hyperref' sets the anchor here.
\refstepcounter{equation}%
\ifx\\#1\\%
\else
\label{#1}%
\fi
% prevent line breaks inside equation
\relpenalty=10000 %
\binoppenalty=10000 %
\ensuremath{%
% \displaystyle % larger fractions, ...
#2%
}%
~\@eqnnum
\endgroup
}
\makeatother
\begin{document}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% INTRODUCTION AND ABSTRACT
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\maketitle
\begin{abstract}
Automated visual tracking of animals is rapidly becoming an indispensable tool for the study of behavior. It offers a quantitative methodology by which organisms' sensing and decision-making can be studied in a wide range of ecological contexts. Despite this, existing solutions tend to be challenging to deploy in practice, especially when considering long and/or high-resolution video\changemade{-}streams. Here, we present TRex, a fast and easy-to-use solution for tracking a large number of individuals simultaneously \changemade{using background-subtraction}, with real-time (60Hz) tracking performance for up to approximately 256 individuals, which also estimates 2D \changemade{visual fields, outlines, and head/rear positions of bilateral animals}, both in open and closed-loop contexts. Additionally, TRex offers highly accurate, deep-learning-based visual identification of up to approximately 100 unmarked individuals, where it is between 2.5 and 46.7 times faster, and requires 2 to 10 times less memory, than comparable software (with relative performance increasing for more organisms\changemade{/}longer videos) and provides interactive data-exploration within an intuitive, platform-independent graphical user interface.
\end{abstract}
\section{Introduction}
Tracking multiple moving animals (and multiple objects, generally) is important in various fields of research such as behavioral studies, ecophysiology, biomechanics, and neuroscience (\cite{dell2014automated}). Many tracking algorithms have been proposed in recent years (\cite{ohayon2013automated}, \cite{fukunaga2015grouptracker}, \cite{burgos2012social}, \cite{rasch2016closing}), often limited to/only tested with a particular organism (\cite{hewitt2018novel}, \cite{branson2009high}) or type of organism (e.g. protists, \cite{pennekamp2015bemovi}; fly larvae and worms, \cite{risse2017fimtrack}). Relatively few have been tested with a range of organisms and scenarios (\cite{idtracker}, \cite{sridhar2019tracktor}, \cite{rodriguez2018toxtrac}). Furthermore, many existing tools offer only a specialized set of features, struggle with very long or high-resolution ($\ge$ 4K) videos, or simply take too long to yield results. Existing fast algorithms are often severely limited with respect to the number of individuals that can be tracked simultaneously; for example xyTracker (\cite{rasch2016closing}) allows for real-time tracking at 40Hz while accurately maintaining identities, and thus is suitable for closed-loop experimentation (experiments where stimulus presentation can depend on the real-time behaviors of the individuals, e.g. \cite{bath2014flymad}, \cite{brembs2000operant}, \cite{bianco2015visuomotor}), but is limited to tracking only 5 individuals simultaneously. ToxTrac (\cite{rodriguez2018toxtrac}), a software comparable to xyTracker in its set of features, is limited to 20 individuals and relatively low frame-rates ($\leq$25fps). Others, while implementing a wide range of features and offering high-performance tracking, are costly and thus limited in access (\cite{noldus2001ethovision}). Perhaps with the exception of proprietary software, one major problem at present is the severe fragmentation of features across the various software solutions. For example, experimentalists must typically construct work-flows from many individual tools: One tool might be responsible for estimating the animals' positions, another for estimating their posture, another for reconstructing visual fields (which in turn probably also estimates animal posture, but does not export it in any way), and one for keeping identities -- correcting results of other tools post-hoc. It can take a very long time to make them all work effectively together, adding what is often considerable overhead to behavioral studies.
\TRex{}, the software released with this publication (available at \href{https://trex.run}{trex.run} under an Open-Source license), has been designed to address these problems, and thus to provide a powerful, fast and easy to use tool that will be of use in a wide range of behavioral studies. It allows users to track moving objects/animals, as long as there is a way to separate them from the background (e.g. static backgrounds, custom masks, as discussed below). In addition to the positions of individuals, our software provides other per-individual metrics such as body shape and, if applicable, head-/tail-position. This is achieved using a basic posture analysis, which works out of the box for most organisms, and, if required, can be easily adapted for others. Posture information, which includes the body center-line, can be useful for detecting e.g. courtship displays and other behaviors that might not otherwise be obvious from mere positional data. Additionally, with the visual sense often being one of the most important modalities to consider in behavioral research, we include the capability for users to obtain a computational reconstruction of the visual fields of all individuals (\citealt{strandburg2013visual}, \citealt{rosenthal2015revealing}). This not only reveals which individuals are visible from an individual's point-of-view, as well as the distance to them, but also which parts of others' bodies are visible.
Included in the software package is a task-specific tool, \TGrabs{}, that is employed to pre-process existing video files and which allows users to record directly from cameras capable of live-streaming to a computer (with extensible support from generic webcams to high-end machine vision cameras). It supports most of the above-mentioned tracking features (positions, posture, visual field) and provides access to results immediately while continuing to record/process. This not only saves time, since tracking results are available immediately after the trial, but makes closed-loop support possible for large groups of individuals ($\leq$ 128 individuals). \TRex{} and \TGrabs{} are written in \verb!C++! but, as part of our closed-loop support, we are providing a \verb!Python!-based general scripting interface which can be fully customized by the user without the need to recompile or relaunch. This interface allows for compatibility with external programs (e.g. for closed-loop stimulus-presentation) and other custom extensions.
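
To give a concrete impression of what such a script can look like, the following minimal sketch implements a simple closed-loop rule in \verb!Python! (trigger an external stimulus whenever any individual enters a region of interest). All function and field names used here (\verb!update!, \verb!position!, \verb!trigger_stimulus!) are placeholders chosen for illustration and are not part of the actual \TRex{} scripting interface, which is documented online.
\begin{verbatim}
# Hypothetical closed-loop callback; all names are illustrative only
# and do not correspond to the actual TRex scripting API.
import numpy as np

ROI_CENTER = np.array([0.5, 0.5])  # normalized arena coordinates (assumed)
ROI_RADIUS = 0.1

def update(frame, individuals):
    """Assumed to be called once per tracked frame."""
    positions = np.array([ind["position"] for ind in individuals])
    inside = np.linalg.norm(positions - ROI_CENTER, axis=1) < ROI_RADIUS
    if inside.any():
        trigger_stimulus(np.flatnonzero(inside))

def trigger_stimulus(ids):
    # placeholder: forward the event to external stimulus software
    print("stimulus for individuals:", list(ids))
\end{verbatim}
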
The fast tracking described above employs information about the kinematics of each organism in order to try to maintain their identities. This is very fast and useful in many scenarios, e.g. where general assessments about group properties (group centroid, alignment of individuals, density, etc.) are to be made. However, when making conclusions about \textit{individuals} instead, maintaining identities perfectly throughout the video is a critical requirement. Every tracking method inevitably makes mistakes, which, for small groups of two or three individuals or short videos, can be corrected manually -- at the expense of spending much more time on analysis, which rapidly becomes prohibitive as the number of individuals to be tracked increases. To make matters worse, when multiple individuals stay out of view of the camera for too long (such as if individuals move out of frame, under a shelter, or occlude one another) there is no way to know who is whom once they re-emerge. With no baseline truth available (e.g. using physical tags as in \cite{alarcon2018automated}, \cite{nagy2013context}; or marker-less methods as in \cite{idtracker}, \cite{idtrackerai}, \cite{rasch2016closing}), these mistakes can not be corrected and accumulate over time, until eventually all identities are fully shuffled. To solve this problem (and without the need to mark, or add physical tags to individuals), \TRex{} can, at the cost of spending more time on analysis (and thus not during live-tracking), automatically learn the identity of up to approximately 100 unmarked individuals based on their visual appearance. This machine-learning based approach, herein termed \textit{visual identification}, provides an independent source of information on the identity of individuals, which is used to detect and correct potential tracking mistakes without the need for human supervision.
\changemade{In this paper, we evaluate the most important functions of our software} in terms of speed and reliability using a wide range of experimental systems, including termites, fruit flies, locusts and multiple species of schooling fish (although we stress that our software is not limited to such species).
Specifically regarding the visual identification of unmarked individuals in groups, \idtracker{} is currently state-of-the-art, yielding high-accuracy (>99\% in most cases) in maintaining consistent identity assignments across entire videos (\cite{idtrackerai}). Similarly to \TRex{}, this is achieved by training an artificial neural network to visually differentiate between individuals, and using identity predictions from this network to avoid/correct tracking mistakes. Both approaches work without human supervision, and are limited to approximately 100 individuals. Given that \idtracker{} is the only currently available tool with visual identification for such large groups of individuals, and also because of the quality of results, we will use it as a benchmark for our visual identification system. Results will be compared in terms of both accuracy and computation speed, showing \TRex{}' ability to achieve the same high level of accuracy but typically at far higher speeds, and with a much reduced memory requirement.
\TRex{} is platform-independent and runs on all major operating systems (Linux, Windows, macOS), and offers complete batch processing support, allowing users to efficiently process entire sets of videos without requiring human intervention. All parameters can be accessed either through settings files, from within the graphical user interface (or \textit{GUI}), or using the command-line. The user interface supports off-site access using a built-in web-server (although it is recommended to only use this from within a secure VPN environment). Available parameters are explained in the documentation directly as part of the GUI and on an external website (see below). Results can be exported to independent data-containers (\texttt{NPZ}, or \texttt{CSV} \changemade{for plain-text type data}) for further analyses in software of the user's choosing. We will not go into detail regarding the many GUI functions since, albeit of great utility to the researcher, they are only the means to easily apply the features presented herein. Some examples will be given in the main text and appendix, but a comprehensive collection of all of them, as well as detailed documentation, is available in the up-to-date online-documentation which can be found at \href{https://trex.run/docs}{trex.run/docs}.
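
As a brief example of downstream analysis, exported \texttt{NPZ} containers can be read directly with \path{numpy}. Note that the file name and array keys used below are assumptions for illustration only; the actual exported fields are configurable and described in the online documentation.
\begin{verbatim}
# Minimal sketch: reading exported per-individual data with numpy.
# File name and array keys are assumptions; the exported fields
# depend on the chosen output options (see online documentation).
import numpy as np

data = np.load("video_fish0.npz")   # one container per individual (assumed)
print(sorted(data.files))           # list all exported arrays

x, y = data["X"], data["Y"]         # per-frame positions (assumed keys)
speed = np.hypot(np.diff(x), np.diff(y))
print("median speed (px/frame):", np.median(speed))
\end{verbatim}
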
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% FEATURES
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[!hb]
\begin{fullwidth}
\includegraphics[width=1.0\linewidth]{figures/software-overview.pdf}
\captionsetup{margin=0pt,calcmargin={0pt,-4.5cm}}
\caption{Videos are typically processed in four main stages, illustrated here each with a list of prominent features. Some of them are accessible from both \TRex{} and \TGrabs{}, while others are software specific (as shown at the very top). (a) The video is either recorded directly with our software (\TGrabs{}), or converted from a pre-recorded video file. Live-tracking enables users to perform closed-loop experiments, for which a virtual testing environment is provided. (b) Videos can be tracked and parameters adjusted with visual feedback. Various exploration and data presentation features are provided and customized data streams can be exported for use in external software. (c) After successful tracking, automatic visual identification can, optionally, be used to refine results. An artificial neural network is trained to recognize individuals, helping to automatically correct potential tracking mistakes. In the last stage, many graphical tools are available to users of \TRex{}, a selection of which is listed in (d).}
\label{fig:software_overview}
\videosupp{\changemade{This video shows an overview of the typical chronology of operations when using our software. Starting with the raw video, segmentation using \TGrabs{} (\figref{fig:software_overview}a) is the first and only step that is not optional. Tracking (\figref{fig:software_overview}b) and posture estimation (both also available for live-tracking in \TGrabs{}) are usually performed in that order, but can be partly parallelized (e.g. performing posture estimation in parallel for all individuals). Visual identification (\figref{fig:software_overview}c) is only available in \TRex{} due to relatively long processing times. All clips from this composite video have been recorded directly in \TRex{}. \url{https://youtu.be/g9EOi7FZHM0}}}
\end{fullwidth}
\end{figure}
\section{Results}\label{sec:methods_evaluation}
\changemade{Our software package consists of two task-specific tools, \TGrabs{} and \TRex{}, with different specializations. \TGrabs{} is primarily designed to connect to cameras and to be very fast. It employs the same program code as \TRex{} to achieve real-time online tracking, such as could be employed for closed-loop experiments (the user can launch \TGrabs{} from the opening dialog of \TRex{}). However, its focus on speed comes at the cost of not having access to the rich graphical user interface or more sophisticated (and thus slower) processing steps, such as deep-learning based identification, that \TRex{} provides. \TRex{} focusses on the more time-consuming tasks, as well as visual data exploration, re-tracking existing results -- but sometimes it simply functions as an easier-to-use graphical interface for tracking and adjusting parameters. Together they provide a wide range of capabilities to the user and are often used in sequence as part of the same work-flow. Typically, such a sequence can be summarized in four stages (see also \figref{fig:pipeline_overview} for a flow diagram):}
%The workflow for using our software is straightforward and can be summarized in four stages:
\begin{enumerate}
\item \textbf{Segmentation} in \TGrabs{}. When recording a video or converting a previously recorded file (e.g. MP4, AVI), it is segmented into background and foreground-objects (\verb!blobs!), the latter typically being the entities to be tracked (the general principle is illustrated in the sketch after this list). Results are saved to a custom, non-proprietary video format (\verb!PV!) (\figref{fig:software_overview}a).
\item \textbf{Tracking} the video, either directly in \TGrabs{}, or \changemade{in \TRex{} after pre-processing,} with access to customizable visualizations and the ability to change tracking parameters on-the-fly. Here, we will describe two types of data available within \TRex{}, 2D posture- and visual-field estimation, as well as real-time applications of such data (\figref{fig:software_overview}b).
\item \changemade{\textbf{Automatic identity correction} (\figref{fig:software_overview}c), a way of utilizing the power of a trained neural network to perform visual identification of individuals, is available in \TRex{} only.} This step may not be necessary in many cases, but it is the only way to guarantee consistent identities throughout the video. It is also the \changemade{most processing-heavy (and thus usually the most time-consuming)} step, as well as the only one involving machine learning. All previously collected posture- and other tracking-related data are utilized in this step, placing it late in a typical workflow.
\item Data visualization is a critical component of any research project, especially for unfamiliar datasets, but manually crafting one for every new experiment can be very time-consuming. Thus, \TRex{} offers a universal, highly customizable way to make all collected data available for interactive \textbf{exploration} (\figref{fig:software_overview}d) -- allowing users to change many display options and to record video clips for external playback. Tracking parameters can be adjusted on the fly (many with visual feedback) -- important, e.g., when preparing closed-loop feedback experiments with a new species or setup.
\end{enumerate}
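
The segmentation step referenced in item 1 can be illustrated by a generic background-subtraction sketch. This is not \TGrabs{}' implementation (which offers several thresholding and masking modes), but it captures the underlying principle: subtract a static background image, threshold the difference, and extract connected foreground regions as blobs. The file names and threshold value below are arbitrary assumptions.
\begin{verbatim}
# Generic background-subtraction sketch (not TGrabs' actual code):
# difference against a static background, threshold, extract blobs.
import cv2

background = cv2.imread("background.png", cv2.IMREAD_GRAYSCALE)  # assumed
capture = cv2.VideoCapture("video.mp4")

while True:
    ok, frame = capture.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(gray, background)
    _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)  # arbitrary
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(mask)
    # component 0 is the background; 1..n-1 are candidate blobs
    blobs = [centroids[i] for i in range(1, n)
             if stats[i, cv2.CC_STAT_AREA] > 10]
capture.release()
\end{verbatim}
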
\begin{figure}[h]
\centering
\includegraphics[width=1\linewidth]{figures/pipeline_overview.pdf}
\caption{An overview of the interconnection between \TRex{}, \TGrabs{} and their data in- and output formats, with titles on the left corresponding to the stages in \figref{fig:software_overview}. Starting at the top of the figure, video is either streamed to \TGrabs{} from a file or directly from a compatible camera. At this stage, preprocessed data are saved to a \textit{.pv} file which can be read by \TRex{} later on. Thanks to its integration with parts of the \TRex{} code, \TGrabs{} can also perform online tracking for limited numbers of individuals, and save results to a \textit{.results} file (that can be opened by \TRex{}) along with individual tracking data saved to \protect\path{numpy} data-containers (\textit{.npz}) \changemade{or standard CSV files}, which can be used for analysis in third-party applications. If required, videos recorded directly using \TGrabs{} can also be streamed to a \textit{.mp4} video file which can be viewed in commonly available video players like \protect\path{VLC}.}
\label{fig:pipeline_overview}
\end{figure}
%\begin{featurebox}
%\caption{Compatibility with other segmentation software}
%\label{box:background-sub-compat}
%\medskip
%\end{featurebox}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% EVALUATION
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\section{Results}
Below we assess the performance \changemade{of our software} regarding the four properties that are most important when using it (or in fact any tracking software) in practice: (i) the time it takes to perform tracking, (ii) the time it takes to perform automatic identity correction, (iii) the peak memory consumption when correcting identities (since this is where memory consumption is maximal), and (iv) the accuracy of the produced trajectories after visual identification.
%(i) The accuracy of the produced trajectories in terms of keeping identities (ii) the time it took to produce results and (iii) the memory-consumption of the process.
While accuracy is an important metric and specific to identification tasks, time and memory are typically of considerable practical importance for all tasks. For example, tracking speed may be the difference between only being able to run a few trials and producing more reliable results with a much larger number of trials. In addition, tracking speed can make a major difference as the number of individuals increases. Furthermore, memory constraints can be prohibitive, making tracking over long video sequences and/or for a large number of individuals extremely time-consuming, or even impossible, for the user.
\begin{table}[t]
% Use "S" column identifier to align on decimal point
\begin{tabular}{l l l l l l r}
\toprule
ID & species & common name & {\# ind.} & fps (Hz) & duration & size ($\mathrm{px}^2$) \\
\midrule
\newtag{ 0 }{vid:reversals3m_1024_dotbot_20181025_105202.stitched} & \textit{Leucaspius delineatus} & sunbleak & 1024 & 40 & 8min20s & $3866\times 4048$\\
\newtag{ 1 }{vid:reversals3m_512_dotbot_20191111_165201.stitched} & \textit{Leucaspius delineatus} & sunbleak & 512 & 50 & 6min40s & $3866\times 4140$\\
\newtag{ 2 }{vid:reversals3m_512_dotbot_20190122_155201.stitched} & \textit{Leucaspius delineatus} & sunbleak & 512 & 60 & 5min59s & $3866\times 4048$\\
\newtag{ 3 }{vid:reversals3m_256_dotbot_20191122_154201.stitched} & \textit{Leucaspius delineatus} & sunbleak & 256 & 50 & 6min40s & $3866\times 4140$\\
\newtag{ 4 }{vid:reversals3m_256_dotbot_20181214_151202.stitched} & \textit{Leucaspius delineatus} & sunbleak & 256 & 60 & 5min59s & $3866\times 4048$\\
\newtag{ 5 }{vid:reversals3m_128_dotbot_20181211_153201.stitched} & \textit{Leucaspius delineatus} & sunbleak & 128 & 60 & 6min & $3866\times 4048$\\
\newtag{ 6 }{vid:reversals3m_128_dotbot_20190116_135201.stitched} & \textit{Leucaspius delineatus} & sunbleak & 128 & 60 & 5min59s & $3866\times 4048$\\
\newtag{ 7 }{vid:video_example_100fish_1min} & \textit{Danio rerio} & zebrafish & 100 & 32 & 1min & $3584\times 3500$\\
\newtag{ 8 }{vid:flies_N59} & \textit{Drosophila melanogaster} & fruit-fly & 59 & 51 & 10min & $2306\times 2306$\\
\newtag{ 9 }{vid:15locusts1h} & \textit{Schistocerca gregaria} & locust & 15 & 25 & 1h0min & $1880\times 1881$\\
\newtag{ 10 }{vid:N05HHS2019-10S-V1} & \textit{Constrictotermes cyphergaster} & termite & 10 & 100 & 10min5s & $1920\times 1080$\\
\newtag{ 11 }{vid:group_3} & \textit{Danio rerio} & zebrafish & 10 & 32 & 10min10s & $3712\times 3712$\\
\newtag{ 12 }{vid:group_2} & \textit{Danio rerio} & zebrafish & 10 & 32 & 10min3s & $3712\times 3712$\\
\newtag{ 13 }{vid:group_1} & \textit{Danio rerio} & zebrafish & 10 & 32 & 10min3s & $3712\times 3712$\\
\newtag{ 14 }{vid:guppy_8_t46_d1_20191207_102508} & \textit{Poecilia reticulata} & guppy & 8 & 30 & 3h15min22s & $3008\times 3008$\\
\newtag{ 15 }{vid:guppy_8_t36_d15_20191212_085800} & \textit{Poecilia reticulata} & guppy & 8 & 25 & 1h12min & $3008\times 3008$\\
\newtag{ 16 }{vid:guppy_8_t20_d1_20190512_115801} & \textit{Poecilia reticulata} & guppy & 8 & 35 & 3h18min13s & $3008\times 3008$\\
\newtag{ 17 }{vid:singleguppy_f2_d9} & \textit{Poecilia reticulata} & guppy & 1 & 140 & 1h9min32s & $1312\times 1312$\\
\bottomrule
\end{tabular}
\medskip
\caption{\label{tab:videos}A list of the videos used in this paper as part of the evaluation of \TRex{}, along with the species of animals in the videos and their common names, as well as other video-specific properties. Videos are given an incremental ID to make references in the following text more efficient, and are sorted by the number of individuals in the video. Individual quantities are given accurately, except for the videos with more than 100 individuals, where the exact number may be slightly more or less. These videos have been analysed using \TRex{}' dynamic analysis mode that supports unknown quantities of animals.}
\tabledata{\changemade{Videos \vidref{vid:video_example_100fish_1min} and \vidref{vid:flies_N59}, as well as \vidref{vid:group_1}-\vidref{vid:group_3}, are available as part of the original \texttt{idtracker} paper (\cite{idtracker}). Many of the videos are part of as yet unpublished data: guppy videos have been recorded by A. Albi, and videos with sunbleak (\textit{Leucaspius delineatus}) have been recorded by D. Bath. The termite video has been kindly provided by H. Hugo and the locust video by F. Oberhauser. Due to the size of some of these videos (>150GB per video), they can only be made available upon specific request. Raw versions of these videos (some trimmed), as well as full preprocessed versions, are available as part of the dataset published alongside this paper \cite{walter2020dataset}.}}
\end{table}
In all of our tests we used a relatively modest computer system, which could be described as a mid-range consumer or gaming PC:
\begin{enumerate} [label=\textnormal{$\bullet$}]
\item \label{ref:hardware_recommend} Intel Core i9-7900X CPU
\item NVIDIA Geforce 1080 Ti
\item 64GB RAM
\item NVMe PCIe x4 hard-drive
\item Debian bullseye (\href{https://www.debian.org/devel/debian-installer/}{debian.org})
\end{enumerate}
As can be seen in the following sections (memory consumption, processing speeds, etc.), using a high-end system is not necessary to run \TRex{} and, anecdotally, we did not observe noticeable improvements when using a solid-state drive versus a normal hard drive. A video card (presently an NVIDIA card due to the requirements of TensorFlow) is recommended for tasks involving visual identification, as such computations will take much longer without it -- however, it is not required. We decided to employ this system because it includes a relatively cheap, compatible graphics card, and to ensure that we had an easy way to produce direct comparisons with \idtracker{} -- which, according to their website, requires large amounts of RAM (32-128GB, \href{https://idtrackerai.readthedocs.io/en/latest/how_to_install.html}{idtrackerai online documentation}) and a fast solid-state drive.
\tableref{videos} shows the entire set of videos used in this paper, which have been obtained from multiple sources (credited under the table) and span a wide range of different organisms, demonstrating \TRex{}' ability to track anything as long as it moves occasionally. Videos involving a large number (>100) of individuals are all the same species of fish since these were the only organisms we had available in such quantities. However, this is not to say that only fish could be tracked efficiently in these quantities. We used the full dataset with up to 1024 individuals in one video (\videoref{vid:reversals3m_1024_dotbot_20181025_105202.stitched}) to evaluate raw tracking speed without visual identification and identity corrections (next sub-section). However, since such numbers of individuals exceed the capacity of the neural network used for automatic identity corrections (compare also \cite{idtrackerai} who used a similar network), we only used a subset of these videos (videos \vidref{vid:video_example_100fish_1min} through \vidref{vid:guppy_8_t20_d1_20190512_115801}) to look specifically into the quality of our visual identification in terms of keeping identities and its memory consumption.
\subsection{Tracking: Speed and Accuracy}
In evaluating the \nameref{sec:tracking} portion of \TRex{}, the main focus lies on processing speed, while accuracy in terms of keeping identities is of secondary importance. Tracking is required in all other parts of the software, making it an attractive target for extensive optimization. Especially with regard to closed-loop and live-tracking situations, there may be no room to lose even a millisecond between frames without risking dropped frames. We therefore designed \TRex{} to support the simultaneous tracking of many ($\geq$256) individuals \textit{quickly} and to achieve reasonable \textit{accuracy} for up to 100 individuals -- the two claims we investigate in the following.
Trials were run without posture/visual-field estimation enabled, where tracking generally, and consistently, reaches speeds faster than real-time (processing times of 1.5-40\% of the video duration, 25-100Hz) even for a relatively large number of individuals (77-94.77\% for up to 256 individuals, see \tableref{absolute_speeds_no_posture}). Videos with more individuals (>500) were still tracked within reasonable time of 235\% to 358\% of the video duration. As would be expected from these results, we found that combining tracking and recording in a single step generally leads to higher processing speeds. The only situation where this was not the case was a video with 1024 individuals, which suggests that live-tracking (in \TGrabs{}) handles cases with many individuals slightly worse than offline tracking (in \TRex{}). Otherwise, 5\% to 35\% shorter total processing times were measured (14.55\% on average, see \tableref{timings}), compared to running \TGrabs{} separately and then tracking in \TRex{}. These percentage differences, in most cases, reflect the ratio between the video duration and the time it takes to track it, suggesting that most time is spent -- by far -- on the conversion of videos. This additional cost can be avoided in practice when using \TGrabs{} to record videos, by directly writing to a custom format recognized by \TRex{}, and/or using its live-tracking ability to export tracking data immediately after the recording is stopped.
We also investigated trials that were run with posture estimation \textit{enabled} and we found that real-time speed could be achieved for videos with $\leq$ 128 individuals (see column "tracking" in \tableref{timings}). Tracking speed, when posture estimation is enabled, depends more strongly on the size of individuals in the image.
Generally, tracking software becomes slower as the number of individuals to be tracked increases, as a result of an exponentially growing number of combinations to consider during matching. \changemade{\TRex{} uses a novel tree-based algorithm by default (see \nameref{sec:tracking}), but circumvents problematic situations by falling back on the \textit{Hungarian method} (also known as the \textit{Kuhn–Munkres algorithm}, \cite{kuhn1955hungarian})} when necessary. Comparing our mixed approach (see \nameref{sec:tracking}) to purely using the Hungarian method shows that, while both perform similarly for few individuals, the Hungarian method is easily outperformed by our algorithm for larger groups of individuals (as can be seen in \figref{fig:approx_accurate}). This might be due to custom optimizations regarding local cliques of individuals, whereby we ignore objects that are too far away, and to our optimized pre-sorting. The Hungarian method has the advantage of not leading to combinatorial explosions in some situations -- and thus has a lower \textit{maximum} complexity, while proving to be less optimal in the \textit{average} case. For further details, see the appendix: \nameref{sec:matching_graph}.
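
For reference, the assignment problem solved by the Hungarian method can be sketched with off-the-shelf tooling. The example below uses \path{scipy} purely as an illustration of the principle, not of \TRex{}' own implementation (which is written in \verb!C++! and uses the tree-based algorithm by default); the coordinates are made up.
\begin{verbatim}
# Sketch of frame-to-frame matching via the Hungarian method using
# scipy (illustrative only; not TRex' own implementation).
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

predicted = np.array([[10., 12.], [40., 41.], [80., 79.]])  # expected
detected  = np.array([[11., 13.], [79., 80.], [41., 40.]])  # current blobs

cost = cdist(predicted, detected)          # pairwise Euclidean distances
rows, cols = linear_sum_assignment(cost)   # minimizes the total distance
for i, j in zip(rows, cols):
    print(f"individual {i} -> blob {j} ({cost[i, j]:.2f} px)")
\end{verbatim}
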
\label{sec:evaluation_accuracy}
In addition to speed, we also tested the accuracy of our tracking method, with regards to the consistency of identity assignments, comparing its results to the manually reviewed data (the methodology of which is described in the next section). In order to avoid counting follow-up errors as "new" errors, we divided each trajectory in the uncorrected data into "uninterrupted" segments of frames, instead of simply comparing whole trajectories. A segment is interrupted when an individual is lost (for any of the reasons given in \nameref{sec:segments}) and starts again when it is reassigned to another object later on. We term these (re-)assignments \textit{decisions} here. Each segment of every individual can be uniquely assigned to a similar/identical segment in the baseline data and its identity. Following one trajectory in the uncorrected data, we can detect these wrong decisions by checking whether the baseline identity associated with one segment of that trajectory changes in the next. We found that roughly 80\% of such decisions made by the tree-based matching were correct, even with relatively high numbers of individuals (100). For trajectories where no manually reviewed data were available, we used automatically corrected trajectories as a base for our comparison -- we evaluate the accuracy of these automatically corrected trajectories in the following section. Even though we did not investigate accuracy in situations with more than 100 individuals, we suspect similar results since the property with the strongest influence on tracking accuracy -- individual density -- is limited physically and most of the investigated species school tightly in either case.
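
The decision-based metric described above can be summarized in a short sketch, assuming that the baseline identity associated with each uninterrupted segment of one uncorrected trajectory has already been determined (all names and values are illustrative):
\begin{verbatim}
# Sketch of the decision-based accuracy metric: a (re-)assignment is
# correct if the baseline identity does not change from one segment
# of the uncorrected trajectory to the next.
def decision_accuracy(segment_baseline_ids):
    """Baseline identity per segment, in temporal order."""
    pairs = list(zip(segment_baseline_ids[:-1], segment_baseline_ids[1:]))
    if not pairs:
        return 1.0
    correct = sum(1 for a, b in pairs if a == b)
    return correct / len(pairs)

# example: 4 of 5 re-assignments keep the same baseline identity
print(decision_accuracy([3, 3, 3, 7, 7, 7]))  # -> 0.8
\end{verbatim}
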
\begin{table}
% Use "S" column identifier to align on decimal point
\begin{tabular}{l l l l r}
\toprule
video & {\# ind.} & N (\TRex{} trials) & \% similar individuals & $\diameter$ final uniqueness \\
\midrule
\vidref{vid:video_example_100fish_1min} & $100$ & $5$ & $99.8346\pm 0.5265$ & $0.9758\pm 0.0018$ \\
\vidref{vid:flies_N59} & $59$ & $5$ & $98.6885\pm 2.1145$ & $0.9356\pm 0.0358$ \\
\vidref{vid:group_1} & $10$ & $5$ & $99.9902\pm 0.3737$ & $0.9812\pm 0.0013$ \\
\vidref{vid:group_3} & $10$ & $5$ & $99.9212\pm 1.1208$ & $0.9461\pm 0.0039$ \\
\vidref{vid:group_2} & $10$ & $5$ & $99.9546\pm 0.8573$ & $0.9698\pm 0.0024$ \\
\vidref{vid:guppy_8_t46_d1_20191207_102508} & $8$ & $5$ & $98.8359\pm 5.8136$ & $0.9192\pm 0.0077$ \\
\vidref{vid:guppy_8_t36_d15_20191212_085800} & $8$ & $5$ & $99.2246\pm 4.4486$ & $0.9576\pm 0.0023$ \\
\vidref{vid:guppy_8_t20_d1_20190512_115801} & $8$ & $5$ & $99.7704\pm 2.1994$ & $0.9481\pm 0.0025$ \\
\bottomrule
\end{tabular}
\medskip
\caption{\label{tab:recognition_acc}Evaluating comparability of the automatic visual identification between \idtracker{} and \TRex{}. Columns show various video properties, as well as the associated uniqueness score (see \autoref{box:uniqueness_score}) and a similarity metric. Similarity (\textit{\% similar individuals}) is calculated by comparing the positions for each identity exported by both tools, choosing the closest matches overall and counting the ones that are differently assigned per frame. An individual is classified as "wrong" in that frame if the Euclidean distance between the matched solutions from \idtracker{} and \TRex{} exceeds 1\% of the video width. The column "\% similar individuals" shows percentage values, where a value of $99\%$ would indicate that, on average, 1\% of the individuals are assigned differently. To demonstrate how uniqueness corresponds to the quality of results, the last column shows the average uniqueness achieved across trials.}
\tabledata{\changemade{This file contains all X and Y positions for each trial and each software combined into one very large table. This data is also available in different formats in \cite{walter2020dataset}.}}
\tabledata{\changemade{Assignments between identities from multiple solutions, as calculated by a bipartite-graph matching algorithm. For each permutation of trials from \TRex{} and \idtracker{} for the same video, the algorithm sought to match the trajectories of the same physical individuals in both trials by finding the ones with the smallest mean Euclidean distance per frame between them. Available at \url{http://dx.doi.org/10.17617/3.4y}, as \protect\path{T2_source_data.zip}.}}
\end{table}
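
The similarity metric reported in \tableref{recognition_acc} can be sketched as follows, assuming per-frame position arrays for each identity from both tools. This is an illustration of the procedure described in the caption, not the original analysis script, and it assumes that every identity has valid positions in at least some frames.
\begin{verbatim}
# Sketch of the similarity metric: match identities between two
# solutions by mean per-frame distance (bipartite matching), then
# count individuals whose matched positions differ by more than 1%
# of the video width in a given frame. Illustrative only.
import numpy as np
from scipy.optimize import linear_sum_assignment

def percent_similar(pos_a, pos_b, video_width):
    """pos_a, pos_b: arrays of shape (frames, individuals, 2)."""
    diff = pos_a[:, :, None, :] - pos_b[:, None, :, :]
    mean_dist = np.nanmean(np.linalg.norm(diff, axis=-1), axis=0)
    rows, cols = linear_sum_assignment(mean_dist)  # best global matching
    per_frame = np.linalg.norm(pos_a[:, rows] - pos_b[:, cols], axis=-1)
    wrong = per_frame > 0.01 * video_width
    return 100.0 * (1.0 - np.mean(wrong))
\end{verbatim}
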
\subsection{Visual Identification: Accuracy} \label{sec:maintaining_identities}
Since the goal of using visual identification is to generate consistent identity assignments, we evaluated the accuracy of our method in this regard. As a benchmark, we compare it to manually reviewed datasets as well as results from \idtracker{} for the same set of videos (where possible). In order to validate trajectories exported by either software, we manually reviewed multiple videos with the help of a tool within \TRex{} that allows the user to view each crossing and correct possible mistakes in place. Assignments were deemed incorrect, and subsequently corrected by the reviewer, if the centroid of a given individual was not contained within the object it was assigned to (e.g. the individual was not part of the correct object). Double assignments per object are impossible due to the nature of the tracking method. Individuals were also forcibly assigned to the correct objects if they were visible but not detected by the tracking algorithm. After manual corrections had been applied, "clean" trajectories were exported -- providing a per-frame baseline truth for the respective videos. A complete table of reviewed videos, and the percentage of reviewed frames per video, can be found in \tableref{reviewed_crossings}. For longer videos (>1h) we relied entirely on a comparison between results from \idtracker{} and \TRex{}. Their paper (\cite{idtrackerai}) suggests a very high accuracy of over 99.9\% correctly identified individual images for most videos, which should suffice for most relevant applications and provide a good baseline truth. As long as both tools produce sufficiently similar trajectories, we therefore know they have found the correct solution.
A direct comparison between \TRex{} and \idtracker{} was not possible for videos \vidref{vid:15locusts1h} and \vidref{vid:N05HHS2019-10S-V1}, where \idtracker{} frequently exceeded hardware memory-limits and caused the application to be terminated, or did not produce usable results within multiple days of run-time. However, we were able to successfully analyse these videos with \TRex{} and evaluate its performance by comparing to manually reviewed trajectories (see below in \nameref{sec:maintaining_identities}). Due to the stochastic nature of machine learning, and thus the inherent possibility of obtaining different results in each run, as well as other potential factors influencing processing time and memory consumption, both \TRex{} and \idtracker{} have been executed repeatedly (5x \TRex{}, 3x \idtracker{}).
\begin{table}
\begin{tabular}{l l | l l | l l}
\toprule
\multicolumn{2}{c|}{video metrics} & \multicolumn{2}{c|}{review stats} & \multicolumn{2}{c}{\% correct} \\
\midrule
video & \textbf{{\# ind.}} & reviewed (\%) & of that interpolated (\%) & \TRex{} & \idtracker{} \\
\midrule
\vidref{vid:video_example_100fish_1min} & 100 & 100.0 & $ 0.23 $ & $ 99.97 \pm 0.013 $ &
$ 98.95 \pm 0.146 $ \\
\vidref{vid:flies_N59} & 59 & 100.0 & $ 0.15 $ & $ 99.68 \pm 0.533 $ &
$ 99.94 \pm 0.0 $ \\
\vidref{vid:15locusts1h} & 15 & 22.2 & $ 8.44 $ & $ 95.12 \pm 6.077 $ &
N/A \\
\vidref{vid:N05HHS2019-10S-V1} & 10 & 100.0 & $ 1.21 $ & $ 99.7 \pm 0.088 $ &
N/A \\
\vidref{vid:group_1} & 10 & 100.0 & $ 0.27 $ & $ 99.98 \pm 0.0 $ &
$ 99.96 \pm 0.0 $ \\
\vidref{vid:group_2} & 10 & 100.0 & $ 0.59 $ & $ 99.94 \pm 0.006 $ &
$ 99.63 \pm 0.0 $ \\
\vidref{vid:group_3} & 10 & 100.0 & $ 0.5 $ & $ 99.89 \pm 0.009 $ &
$ 99.34 \pm 0.002 $ \\
\bottomrule
\end{tabular}
\medskip
\caption{\label{tab:reviewed_crossings} Results of the human validation for a subset of videos. Validation was performed by going through all problematic situations (e.g. individuals lost) and correcting mistakes manually, creating a fully corrected dataset for the given videos. This dataset may still have missing frames for some individuals, if they could not be detected in certain frames (as indicated by "of that interpolated"). This was usually a very low percentage of all frames, except for \videoref{vid:15locusts1h}, where individuals tended to rest on top of each other -- and were thus not tracked -- for extended periods of time. This baseline dataset was compared to all other results obtained using the automatic visual identification by \TRex{} ($N=5$) and \idtracker{} ($N=3$) to estimate correctness. We were not able to track videos \vidref{vid:15locusts1h} and \vidref{vid:N05HHS2019-10S-V1} with \idtracker{}, which is why correctness values are not available.}
\tabledata{\changemade{A table of positions for each individual of each manually approved and corrected trial.}}
\end{table}
The trajectories exported by both \idtracker{} and \TRex{} were very similar throughout (see \tableref{recognition_acc}). While occasional disagreements happened, similarity scores were higher than \changemade{98\% in all and higher than 99\% in most cases} (i.e. less than 1\% of individuals have been differently assigned in each frame on average). Most difficulties that \textit{did} occur were, after manual review, attributable to situations where multiple individuals cross over excessively within a short time-span. In each case that was manually reviewed, identities switched back to the correct individuals -- even after temporary disagreement. We found that both solutions occasionally experienced these same problems, which often occur when individuals repeatedly come in and out of view in quick succession (e.g. overlapping with other individuals). Disagreements were expected for videos with many such situations, due to the different ways in which the two algorithms deal with them: \idtracker{} assigns identities only based on the network output. In many cases, individuals continue to partly overlap even while already being tracked, which results in visual artifacts that can lead to unstable predictions by the network, causing \idtracker{'s} approach to fail. Comparing results from both \idtracker{} and \TRex{} to manually reviewed data (see \tableref{reviewed_crossings}) shows that both solutions consistently provide high accuracy results of above 99.5\% for most videos, but that \TRex{} is slightly more accurate in all cases while also having a better overall frame coverage per individual (99.65\% versus \idtracker{'s} 97.93\%, where 100\% would mean that all individuals are tracked in every frame; not shown). This suggests that the splitting algorithm (see appendix, \nameref{box:splitting-algorithm}) is working to \TRex{}' advantage here.
\begin{figure}[h]
\centering
%\captionsetup[subfigure]{justification=centering}
%\begin{subfigure}[b]{0.9\textwidth}
% \centering
% \includegraphics[width=\textwidth]{activations15locusts1h.pdf}
% \caption{Locusts from \videoref{vid:15locusts1h} with 15 tagged individuals (N: 5101, 7942, 9974) -- the only video with physical tags. The network activates more strongly in regions close to the tag, as well as the bottom right corner.
% }
% \label{fig:activate_locusts}
% \end{subfigure}
% \begin{subfigure}[b]{0.9\textwidth}
% \centering
% \includegraphics[width=\textwidth]{activationsguppy_8_t36_d15_20191212_085800.pdf}
% \caption{Guppies from \videoref{vid:guppy_8_t36_d15_20191212_085800} (N: 46378, 34733, 34745). Activations are less focussed and less consistent across individuals.}
% \label{fig:activate_guppies}
% \end{subfigure}
% \begin{subfigure}[b]{0.9\textwidth}
% \centering\includegraphics[width=\textwidth]{activationsflies_N59.pdf}
% \caption{Flies from \videoref{vid:flies_N59} (N: 993, 1986, 993). Activations are not similar between individuals and show various "hotspots" across the entire body.}
% \label{fig:activate_flies}
% \end{subfigure}
% \begin{subfigure}[b]{0.9\textwidth}
% \centering\includegraphics[width=\textwidth]{activationsN05HHS2019-10S-V1.pdf}
% \caption{Termites from \videoref{vid:N05HHS2019-10S-V1} (N: 27097, 31135, 22746). Here, the connections between body-segments show strong activations -- in contrast to very weak ones in other parts of the body. %Anecdotally, but as can be seen here for the first two individuals, activations seem to be strong especially close to connections of body-segments.
% }
% \label{fig:activate_termites}
% \end{subfigure}
\includegraphics[width=\textwidth]{figures/fig_activations.pdf}
\caption{\label{fig:network_activations}Activation differences for images of randomly selected individuals from four videos, next to a median image of the respective individual -- which hides thin extremities, such as legs in (a) and (c). The captions in (a-d) detail the species per group and number of samples per individual. Colors represent the relative activation differences, with hotter colors suggesting bigger magnitudes, which are computed by performing a forward-pass through the network up to the last convolutional layer (using \href{https://github.com/philipperemy/keract}{keract}). The outputs for each identity are averaged and stretched back to the original image size by cropping and scaling according to the network architecture. Differences shown here are calculated per cluster of pixels corresponding to each filter, comparing average activations for images from the individual's class to activations for images from other classes.}
\figdata{\changemade{Code, as well as images/weights needed to produce this figure (see README).}}
\end{figure}
Additionally, while \TRex{} could successfully track individuals in all videos without tags, we were interested to see the effect of tags (in this case QR tags attached to locusts, see \figref{fig:network_activations}a) on network training. In \figref{fig:network_activations} we visualise differences in network activation, depending on the visual features available for the network to learn from, which are different between species (or due to physically added tags, as mentioned above). The "hot" regions indicate larger between-class differences for that specific pixel (values are the result of activation in the last convolutional layer of the trained network, see figure legend). Differences are computed separately within each group and are not directly comparable between trials/species in value. However, the distribution of values -- reflecting the network's reactivity to specific parts of the image -- is. Results show that the most apparent differences are found for the stationary parts of the body (not in absolute terms, but following normalization, as shown in \figref{fig:datasets_comparison}c), which makes sense seeing as (i) this part is the easiest to learn, due to it being in exactly the same position every time, (ii) larger individuals stretch further into the corners of a cropped image, making the bottom right of each image a source of valuable information (especially in \figref{fig:network_activations}a/\figref{fig:network_activations}b), and (iii) details that often occur in the head-region (like the distance between the eyes) can also play a role here. "Hot" regions in the bottom right corner of the activation images (e.g. in \figref{fig:network_activations}d) suggest that the network also reacts to pixels that are explicitly \textit{not} part of the focal individual but belong to other individuals -- likely this corresponds to the network making use of size/shape differences between them.
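
A simplified version of this procedure can be sketched with \href{https://github.com/philipperemy/keract}{keract} as shown below. Here, \verb!model!, \verb!images!, \verb!labels! and the layer name are assumed inputs, and the full analysis behind \figref{fig:network_activations} additionally crops and rescales the activation maps back to image coordinates (see the figure legend).
\begin{verbatim}
# Simplified sketch of per-class activation differences using keract.
# `model`, `images`, `labels` and the layer name are assumed inputs;
# the actual figure additionally crops/scales maps back to image size.
import numpy as np
from keract import get_activations

def activation_difference(model, images, labels, layer, focal_class):
    acts = get_activations(model, images, layer_names=[layer])[layer]
    focal = acts[labels == focal_class].mean(axis=0)   # mean for focal id
    others = acts[labels != focal_class].mean(axis=0)  # mean for the rest
    return np.abs(focal - others)                      # per-filter difference
\end{verbatim}
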
\begin{table}
% Use "S" column identifier to align on decimal point
\begin{tabular}{l l l l l r}
\toprule
video & \# ind. & length & max. consec. & TRex memory (GB) & idtracker.ai memory (GB) \\
\midrule
\vidref{vid:group_2} & $10$ & $10\mathrm{min}$ & $26.03\mathrm{s}$ & $ \diameter\ 4.88\pm 0.23, \max 6.31$ & $ \diameter\ 8.23\pm 0.99, \max 28.85$ \\
\vidref{vid:group_1} & $10$ & $10\mathrm{min}$ & $36.94\mathrm{s}$ & $ \diameter\ 4.27\pm 0.12, \max 4.79$ & $ \diameter\ 7.83\pm 1.05, \max 29.43$ \\
\vidref{vid:group_3} & $10$ & $10\mathrm{min}$ & $28.75\mathrm{s}$ & $ \diameter\ 4.37\pm 0.32, \max 5.49$ & $ \diameter\ 6.53\pm 4.29, \max 29.32$ \\
\vidref{vid:video_example_100fish_1min} & $100$ & $1\mathrm{min}$ & $5.97\mathrm{s}$ & $ \diameter\ 9.4\pm 0.47, \max 13.45$ & $ \diameter\ 15.27\pm 1.05, \max 24.39$ \\
\vidref{vid:guppy_8_t36_d15_20191212_085800} & $8$ & $72\mathrm{min}$ & $79.4\mathrm{s}$ & $ \diameter\ 5.6\pm 0.22, \max 8.41$ & $ \diameter\ 35.2\pm 4.51, \max 91.26$ \\
\vidref{vid:N05HHS2019-10S-V1} & $10$ & $10\mathrm{min}$ & $1.91\mathrm{s}$ & $ \diameter\ 6.94\pm 0.27, \max 10.71$& N/A \\
\vidref{vid:15locusts1h} & $15$ & $60\mathrm{min}$ & $7.64\mathrm{s}$ & $ \diameter\ 13.81\pm 0.53, \max 16.99$& N/A \\
\vidref{vid:flies_N59} & $59$ & $10\mathrm{min}$ & $102.35\mathrm{s}$ & $ \diameter\ 12.4\pm 0.56, \max 17.41$ & $ \diameter\ 35.3\pm 0.92, \max 50.26$ \\
\vidref{vid:guppy_8_t46_d1_20191207_102508} & $8$ & $195\mathrm{min}$ & $145.77\mathrm{s}$ & $ \diameter\ 12.44\pm 0.8, \max 21.99$ & $ \diameter\ 35.08\pm 4.08, \max 98.04$ \\
\vidref{vid:guppy_8_t20_d1_20190512_115801} & $8$ & $198\mathrm{min}$ & $322.57\mathrm{s}$ & $ \diameter\ 16.15\pm 1.6, \max 28.62$ & $ \diameter\ 49.24\pm 8.21, \max 115.37$ \\
\bottomrule
\end{tabular}
\medskip
\caption{\label{tab:memory_table}Both \TRex{} and \idtracker{} analysed the same set of videos, while continuously logging their memory consumption using an external tool. Rows have been sorted by $\mathrm{video\_length} * \mathrm{\#individuals}$, which seems to be a good predictor for the memory consumption of both solutions. \idtracker{} has mixed mean values, which, at low individual densities, are similar to \TRex{}' results. Mean values can be misleading here, since more time spent in low-memory states skews results. The maximum, however, is more reliable since it marks the memory that is necessary to run the system. Here, \idtracker{} reaches significantly higher values (almost always more than double those of \TRex{}).}
\tabledata{\changemade{Data from log files for all trials as a single table, where each row is one sample. The total memory of each sample is calculated as $\mathrm{SWAP} + \mathrm{PRIVATE} + \mathrm{SHARED}$. Each row indicates at which exact time, by which software, and as part of which trial it was taken.}}
\end{table}
\begin{figure}
\centering
\includegraphics[width=0.9\linewidth]{figures/memory_consumption.pdf}
\caption{The maximum memory consumption of \TRex{} and \idtracker{} when tracking a subset of all videos (the same videos as in \tableref{recognition_acc}). Results are plotted as a function of video length (min) multiplied by the number of individuals. We have to emphasize here that, for the videos in the upper length regions of multiple hours (\vidref{vid:guppy_8_t20_d1_20190512_115801}, \vidref{vid:guppy_8_t46_d1_20191207_102508}), we had to set \idtracker{} to store segmentation information on disk -- as opposed to in RAM. This uses less memory, but is also slower. For the video with flies we tried both options and also settled on on-disk storage, since otherwise the system ran out of memory. Even then, the curve still accelerates much faster for \idtracker{}, ultimately leading to problems with most computer systems. To minimize the impact that hardware compatibility has on research, we implemented switches limiting memory usage while always trying to maximize performance given the available data. \TRex{} can be used on modern laptops and normal consumer hardware at slightly lower speeds, but without any \textit{fatal} issues.}
\label{fig:memory_per_video_length}
\figdata{\changemade{Each data-point from \figref{fig:memory_per_video_length} as plotted, indexed by video and software used.}}
\end{figure}
\begin{figure}[h]
\centering
\includegraphics[width=0.9\linewidth]{figures/raw_posture_moments.pdf}
\caption{Convergence behavior of the network training for three different normalization methods. This shows the maximum achievable validation accuracy after 100 epochs for 100 individuals (\videoref{vid:video_example_100fish_1min}), when sub-sampling the number of examples per individual. Tests were performed using a manually corrected training dataset to generate the images in three different ways, using the same, independent script (see \figref{fig:datasets_comparison}): Using no normalization (blue), using normalization based on image moments (green, similar to \idtracker{}), and using posture information (red, as in \TRex{}). Higher numbers of samples per individual result in higher maximum accuracy overall, but -- unlike the other methods -- posture-normalized runs already reach an accuracy above the 90\% mark for $\geq$75 samples. This property can help significantly in situations with more crossings, when longer global segments are harder to find.}
\label{fig:maximum_val_acc_per_samples}
\figdata{\changemade{Raw data-points as plotted in \figref{fig:maximum_val_acc_per_samples}.}}
\end{figure}
As would be expected, distinct patterns can be recognized in the resulting activations after training as soon as physical tags are attached to individuals (as in \figref{fig:network_activations}a). %\figref{fig:activate_locusts}).
While other parts of the image are still heavily activated (probably exploiting size/shape differences between individuals), tags always account for a large share of where activations concentrate. The network evidently makes use of the additional information provided by the experimenter wherever it is available. This suggests that, while certainly not necessary, adding tags is unlikely to hurt training accuracy and may even improve it in difficult cases, since it allows the network to exploit any available source of inter-individual variation.
\subsection{Visual Identification: Memory Consumption}
In order to generate comparable results for both tested software solutions, the same external script was used to measure the shared, private and swap memory of \idtracker{} and \TRex{}. There are a number of ways to determine the memory usage of a process. For automation purposes we decided to use a tool called \href{https://github.com/jeetsukumaran/Syrupy}{syrupy}, which can automatically start a specified command and log information about it. We modified it slightly to obtain separate, more accurate measurements of swap, shared and private memory using \href{http://www.pixelbeat.org/scripts/ps_mem.py}{ps\_mem}.
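To illustrate how such logs can be aggregated into the values reported here, the short sketch below computes the total memory of each sample as $\mathrm{SWAP} + \mathrm{PRIVATE} + \mathrm{SHARED}$ and reduces the samples to per-trial mean and maximum values. This is a minimal example for orientation only -- the file name and column names are placeholders and depend on how the (modified) logging script writes its output.
\begin{verbatim}
# Sketch of the log aggregation; file and column names are illustrative
# and depend on how the (modified) logging script writes its output.
import pandas as pd

log = pd.read_csv("memory_log.csv")   # one row per memory sample
# total memory per sample, as used throughout this section
log["total_mb"] = log["swap"] + log["private"] + log["shared"]

# mean and maximum memory per software/video combination
summary = log.groupby(["software", "video"])["total_mb"].agg(["mean", "max"])
print(summary)
\end{verbatim}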
As expected, differences in memory consumption are especially prominent for long videos (4-7x lower maximum memory for \TRex{}), and for videos with many individuals (2-3x lower). Since we already experienced significant problems tracking a long video (>3h) of only 8 individuals with \idtracker{}, we did not attempt to further study its behavior in long videos with many individuals. However, we would expect \idtracker{'s} memory usage to increase even more rapidly than is visible in \figref{fig:memory_per_video_length}, since it retains a lot of image data (segmentation/pixels) in memory and we already had to "allow" it to offload data to the hard-disk in our efforts to make it work for Videos \vidref{vid:flies_N59}, \vidref{vid:guppy_8_t46_d1_20191207_102508} and \vidref{vid:guppy_8_t20_d1_20190512_115801} (which slows down analysis). The maximum memory consumption across all trials was on average 5.01$\pm$2.54 times higher for \idtracker{}, ranging from 1.81 to 10.85 times the maximum memory consumption of \TRex{} for the same video.
Overall memory consumption of \TRex{} also includes posture data, which contributes substantially to RAM usage. Especially with longer videos, disabling posture can lower the hardware requirements for running our software. If posture is to be retained, the user can still (more modestly) reduce memory requirements by changing the outline re-sampling scale (1 by default), which adjusts the outline resolution between sub- and super-pixel accuracy. While analysis will be faster -- and memory consumption lower -- when posture is disabled (limited only by the matching algorithm, see \figref{fig:approx_accurate}), users of visual identification might experience a decrease in training accuracy or speed (see \figref{fig:maximum_val_acc_per_samples}).
\begin{table}[!h]
% Use "S" column identifier to align on decimal point
\begin{tabular}{l l l l l l || l | r}
\toprule
video & {\# ind.} & length & sample & \TGrabs{} (min) & \TRex{} (min) & ours (min) & \idtracker{} (min) \\
\midrule
\vidref{vid:video_example_100fish_1min} & 100 & $ 1 \mathrm{min} $ & $ 1.61 \mathrm{s} $ & $ 2.03 \pm 0.02 $ & $ 74.62 \pm 6.75 $ & $ 76.65 $ &
$ 392.22 \pm 119.43 $ \\
\vidref{vid:flies_N59} & 59 & $ 10 \mathrm{min} $ & $ 19.46 \mathrm{s} $ & $ 9.28 \pm 0.08 $ & $ 96.7 \pm 4.45 $ & $ 105.98 $ &
$ 4953.82 \pm 115.92 $ \\
\vidref{vid:15locusts1h} & 15 & $ 60 \mathrm{min} $ & $ 33.81 \mathrm{s} $ & $ 13.17 \pm 0.12 $ & $ 101.5 \pm 1.85 $ & $ 114.67 $ &
N/A \\
\vidref{vid:group_3} & 10 & $ 10 \mathrm{min} $ & $ 12.31 \mathrm{s} $ & $ 8.8 \pm 0.12 $ & $ 21.42 \pm 2.45 $ & $ 30.22 $ &
$ 127.43 \pm 57.02 $ \\
\vidref{vid:group_2} & 10 & $ 10 \mathrm{min} $ & $ 10.0 \mathrm{s} $ & $ 8.65 \pm 0.07 $ & $ 23.37 \pm 3.83 $ & $ 32.02 $ &
$ 82.28 \pm 3.83 $ \\
\vidref{vid:group_1} & 10 & $ 10 \mathrm{min} $ & $ 36.91 \mathrm{s} $ & $ 8.65 \pm 0.07 $ & $ 12.47 \pm 1.27 $ & $ 21.12 $ &
$ 79.42 \pm 4.52 $ \\
\vidref{vid:N05HHS2019-10S-V1} & 10 & $ 10 \mathrm{min} $ & $ 16.22 \mathrm{s} $ & $ 4.43 \pm 0.05 $ & $ 35.05 \pm 1.45 $ & $ 39.48 $ &
N/A \\
\vidref{vid:guppy_8_t46_d1_20191207_102508} & 8 & $ 195 \mathrm{min} $ & $ 67.97 \mathrm{s} $ & $ 109.97 \pm 2.05 $ & $ 70.48 \pm 3.67 $ & $ 180.45 $ &
$ 707.0 \pm 27.55 $ \\
\vidref{vid:guppy_8_t36_d15_20191212_085800} & 8 & $ 72 \mathrm{min} $ & $ 79.36 \mathrm{s} $ & $ 32.1 \pm 0.42 $ & $ 30.77 \pm 6.28 $ & $ 62.87 $ &
$ 291.42 \pm 16.83 $ \\
\vidref{vid:guppy_8_t20_d1_20190512_115801} & 8 & $ 198 \mathrm{min} $ & $ 134.07 \mathrm{s} $ & $ 133.1 \pm 2.28 $ & $ 68.85 \pm 13.12 $ & $ 201.95 $ &
$ 1493.83 \pm 27.75 $ \\
\bottomrule
\end{tabular}
\medskip
%\tabledata{\changemade{}
\caption{\label{tab:recognition_timings}Evaluating time-cost for automatic identity correction -- comparing to results from \idtracker{}. Timings consist of preprocessing time in \TGrabs{} plus network training in \TRex{}, which are shown separately as well as combined (\textit{ours (min)}, $N=5$). The time it takes to analyse videos strongly depends on the number of individuals and on how many usable samples per individual the initial segment provides. The length of the video factors in as well, as does the stochasticity of the gradient descent (training). \idtracker{} timings ($N=3$) contain the whole tracking and training process from start to finish, using its \texttt{terminal\_mode} (v3). Parameters were manually adjusted per video and setting, to the best of our abilities, spending at most one hour per configuration. For videos \vidref{vid:guppy_8_t20_d1_20190512_115801} and \vidref{vid:guppy_8_t46_d1_20191207_102508} we had to set \idtracker{} to store segmentation information on disk (as opposed to in RAM) to prevent the program from being terminated for running out of memory.}
\tabledata{\changemade{Preprocessed log files (see also \protect\path{notebooks.zip} in \cite{walter2020dataset}) in a table format. The total processing time (s) of each trial is indexed by video and software used -- \TGrabs{} for conversion and \TRex{} and \idtracker{} for visual identification. This data is also used in \tableref{timings}.}}
\end{table}
\subsection{Visual Identification: Processing Time}
Automatically correcting the trajectories (to produce consistent identity assignments) means that additional time is spent on the training and application of a network, specifically for the video in question. Visual identification builds on some of the other methods described in this paper (tracking and posture estimation), naturally making it by far the most complex and time-consuming process in \TRex{} -- we thus evaluated how much time is spent on the entire sequence of all required processes. For each run of \TRex{} and \idtracker{}, we saved precise timing information from start to finish. Since \idtracker{} reads videos \textit{directly} and preprocesses them again each run, we used the same starting conditions with our software for a direct comparison:
A trial starts by converting/preprocessing a video in \TGrabs{} and then immediately opening it in \TRex{}, where automatic identity corrections were applied. \TRex{} terminated automatically after satisfying a correctness criterion (high uniqueness value) according to equation \eqref{eq:gooduniqueness}. It then exported trajectories, as well as validation data (similar to \idtracker{}), concluding the trial. The sum of time spent within \TGrabs{} and \TRex{} gives the total amount of time for that trial. For the purpose of this test it would not have been fair to compare only \TRex{} processing times to \idtracker{}, but it is important to emphasize that conversion could be skipped entirely by using \TGrabs{} to record videos directly from a camera instead of opening an existing video file.
%Conversion times correlated strongly with the total video-length (in frames) and not the number of individuals, suggesting conversion was only constrained by video-decoding/reading speeds and not by (pre-)processing.
\changemade{In \tableref{recognition_timings} we can see that video length and processing times (in \TRex{}) did not correlate directly. Indeed, a 10 minute video (\videoref{vid:flies_N59}) took significantly longer than one that was 72 minutes long (\videoref{vid:guppy_8_t36_d15_20191212_085800}). The reason for this, initially counterintuitive, result is that the process of learning identities requires sufficiently long video sequences: longer samples have a higher likelihood of capturing more of the total possible intra-individual variance, which helps the algorithm to represent each individual's appearance more comprehensively. Longer videos naturally provide more material for the algorithm to choose from and, simply due to their length, have a higher probability of containing at least one higher-quality segment that allows higher uniqueness-regimes to be reached more quickly (see \nameref{sec:training_quality} and \nameref{sec:recognition_stopping}). Thus, it is important to use sufficiently long video sequences for visual identification, and longer sequences can lead to better results -- both in terms of quality and processing time.}
%a sample that, due to its length, likely contains more of the total variance
%conversion times in \TGrabs{} often overtook processing times in \TRex{} with increasing video-length if the number of individuals remained the same -- suggesting that
%In \tableref{recognition_timings} we can see that video length and processing times did not correlate directly. \changemade{Conversion times correlated with the total video-length (in frames) and not the number of individuals, suggesting conversion was only constrained by video-decoding/reading speeds and not by (pre-)processing. Indeed, conversion times in \TGrabs{} often overtook processing times in \TRex{} with increasing video-length if the number of individuals remained the same. Furthermore, the time it took to track and correct a video was shorter when the initial segment (column "sample" in the table) was longer (and as such likely capturing more visual intra-individual variation). Longer videos often provide more material for the algorithm to choose from and (simply due to their length) have a higher probability of producing at least one such higher-quality segment. Starting with a sample containing most of the variance, likely produces higher uniqueness-scores (\nameref{sec:training_quality}) early on and prompt the algorithm to terminate early. Perhaps counter-intuitively, using longer video-sequences instead of shorter ones should, in terms of time and accuracy, can thus be expected to produce better results and more quickly.}
Compared to \idtracker{}, \TRex{} (conversion + visual identification) shows both considerably lower computation times ($2.57\times$ to $46.74\times$ faster for the same video), as well as lower variance in the timings ($79\%$ lower for the same video on average).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% LIMITATIONS, DISCUSSION
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Discussion}
\changemade{We have designed \TRex{} to be a versatile and fast program that can enable researchers to track} animals (and other mobile objects) in a wide range of situations. It maintains the identities of up to 100 un-tagged individuals and produces corrected tracks, along with posture estimation\changemade{, visual-field reconstruction, and other features that enable the quantitative study of animal behavior.} Even videos that cannot be tracked by other solutions, such as videos with over 500 animals, can now be tracked within the same day of recording.
While all options are available from the command-line and a screen is not required, \TRex{} offers a rich, yet straightforward to use, interface to local as well as remote users. Accompanied by integrated documentation for all parameters, each stating purpose, type and value ranges, as well as a comprehensive online documentation, \changemade{new users are provided with all the information required for a quick adoption of our software.} Especially to the benefit of new users, we evaluated the parameter space \changemade{using videos of diverse species} (fish, termites, locusts) and determined which parameters work best in most use-cases to set their default values.
\begin{figure}[h]
%\begin{fullwidth}
\includegraphics[width=1.0\linewidth]{figures/trex_screenshot.pdf}
%\captionsetup{margin=0pt,calcmargin={0pt,-4.5cm}}
\caption{An overview of \TRex{'} main interface, which is part of the documentation at \href{https://trex.run/docs}{trex.run/docs}. Interface elements are sorted into categories in the four corners of the screen (labelled here in black). The omni-box in the bottom-left corner allows users to change parameters on-the-fly, helped by live auto-completion and documentation for all settings. Only some of the many available features are displayed here. Generally, interface elements can be toggled on or off using the bottom-left display options or moved out of the way with the cursor. Users can customize the tinting of objects (e.g. deriving it from their speed) to generate interesting effects, which can also be recorded for use in presentations. Additionally, all exportable metrics (such as border-distance, size, x/y, etc.) can be shown as an animated graph for a number of selected objects. Keyboard shortcuts are available for select features such as loading, saving, and terminating the program. Remote access is supported and offers the same graphical user interface, e.g. in case the software is executed without an application window (for batch processing purposes).}
\label{fig:trex_screenshot}
%\end{fullwidth}
\end{figure}
The interface is structured into groups (see \figref{fig:trex_screenshot}), categorized by the typical use-case:
\begin{enumerate}
\item The main menu, containing options for loading/saving, options for the timeline and reanalysis of parts of the video
\item Timeline and current video playback information
\item Information about the selected individual
\item Display options and an interactive "omni-box" for viewing and changing parameters
\item General status information about \TRex{} and the \texttt{Python} integration
\end{enumerate}
The tracking accuracy of \TRex{} is at the current state of the art, while typically being $2.57\times$ to $46.74\times$ faster than comparable software and having lower hardware requirements -- \changemade{especially} regarding RAM. In addition to visual identification and tracking, it provides a rich assortment of additional data, including body posture, visual fields, and other kinematic as well as group-related information (such as derivatives of position, border and mean neighbor distance, group compactness, etc.) -- even in live-tracking and closed-loop situations.
Raw tracking (without visual identification) still achieved roughly 80\% accuracy per decision (as compared to >99\% with visual identification). We have found that real-time performance can be achieved, even on relatively modest hardware, for all numbers of individuals $\leq$256 without posture estimation ($\leq$128 with posture estimation). More than 256 individuals can be tracked as well, remarkably still delivering frame-rates of about 10-25 frames per second using the same settings.
Not only do the increased processing speeds benefit researchers, but the contributions we provide to data exploration should not be underestimated either -- merely making data such as visual fields and live-heatmaps more easily accessible right out-of-the-box has the potential to reveal features of group- and individual-level behaviour which have not been visible before. \TRex{} makes information on multiple timescales of events available simultaneously, and sometimes this is the only way to detect interesting properties (e.g. trail formation in termites).
%\subsection{Future extensions}
Since the software is already actively used within the Max Planck Institute of Animal Behavior, reported issues have been taken into consideration during development. However, certain theoretical, as well as practically observed, limitations remain:
\begin{itemize}
\item Posture: While almost all shapes can be detected correctly (by adjusting parameters), some shapes -- especially round ones -- are hard to interpret in terms of "tail" or "head". In these cases only the alternative image-alignment method (moments) can be used, which introduces some limitations, e.g. visual fields cannot be calculated.
\item Tracking: If the wrong direction of movement is assumed, predictions may end up far away from the object's actual position. Objects are then "lost" for a fixed amount of time (a parameter). This can be mitigated by shortening that time-period, though doing so leads to different problems when the software does not wait long enough for individuals to reappear.
\item General: Barely visible individuals have to be tracked with the help of deep learning (e.g. using \cite{Cae+17}) and a custom-made mask per video frame, prepared in an external program of the user's choosing.
\item Visual identification: All individuals have to be \textit{visible} and \textit{separate} at the same time, at least once, for identification to work at all. Visual identification, e.g. with very high densities of individuals, can thus be very difficult. This is a hard restriction for any software of this kind, since finding consecutive global segments is the underlying principle for the successful recognition of individuals.
\end{itemize}
We will continue updating the software, increasingly addressing the above issues (and likely others), as well as potentially adding new features. During development we noticed a couple of areas where improvements could be made, both theoretical and practical in nature. Specifically, incremental improvements in analysis speed could be made regarding visual identification by using the trained network more sporadically -- e.g. it is not necessary to predict every image of very long consecutive segments, since, even with fewer samples, prediction values are likely to converge to a certain value early on. A likely more potent change would be an improved "uniqueness" algorithm, which, during the accumulation phase, is better at predicting which consecutive segment will improve training results the most. This could be done, for example, by taking into account the variation between images of the same individual. Other planned extensions include:
\begin{itemize}
\item (Feature): We want to provide a more general plugin interface, allowing users to work with the data in live-mode and apply their own filters -- including, specifically, the ability to write a plugin that detects different species and annotates them in the video.
\item (Crossing solver): Additional method optimized for splitting overlapping, solid-color objects. The current method, simply using a threshold, is effective for many species but often produces large holes when splitting objects consisting of largely the same color.
\end{itemize}
To obtain the most up-to-date version of \TRex{}, please download it at \href{https://trex.run}{trex.run} or update your existing installation according to our instructions listed on \href{https://trex.run/docs/install.html}{trex.run/docs/install.html}.
\section{Materials \& Methods}
\changemade{In the following sections we describe the methods implemented in \TRex{} and \TGrabs{}, as well as their most important features in a typical order of operations (see \figref{fig:pipeline_overview} for a flow diagram), starting out with a raw video. We will then describe how trajectories are obtained and end with the most technically involved features.}
% \newcolumntype{M}{>{\begin{varwidth}{4cm}}l<{\end{varwidth}}}
% \begin{table}[h]
% % Use "S" column identifier to align on decimal point
% \begin{tabular}{p{2cm}|p{4cm}|p{3.5cm}|p{3cm}}%|p{2cm}}
% \toprule
% Resource & Designation & Source or reference & Identifiers \\ %& Additional information \\
% \midrule
% Syrupy & Measuring memory consumption during the runtime of a process & \protect\href{https://github.com/jeetsukumaran/Syrupy}{github/jeetsukumaran} \\
% ps\_mem & Adding additional information to Syrupy output & \href{http://www.pixelbeat.org/scripts/ps_mem.py}{pixelbeat.org} \\
% Jupyter Lab & Analysis & \href{https://github.com/jupyterlab/jupyterlab}{github/jupyterlab} & \protect\path{RRID:SCR_018315}\\
% Python & Analysis & \href{https://python.org}{python.org} & \protect\path{RRID:SCR_008394} \\
% Debian & Operating system & &\protect\path{RRID:SCR_006638}\\
% \bottomrule
% \end{tabular}
% \medskip
% %\tabledata{\changemade{}
% \caption{Do I need this}
% \end{table}
\subsection{Segmentation}
When an image is first received from a camera (or a video file), the objects of interest potentially present in the frame must be \changemade{found} and cropped out. Several technologies are available to separate the foreground from the background (segmentation). Various machine learning algorithms are frequently used to great effect, even for the most complex environments (\citealt{hughey2018challenges}, \citealt{robie2017machine}, \citealt{francisco2019low}). These more advanced approaches are typically beneficial for the analysis of field-data or organisms that are very hard to see in video (e.g. very transparent or low contrast objects/animals in the scene). \changemade{In these situations, where integrated methods might not suffice, it is possible to segment objects from the background using external, e.g. deep-learning based, tools (see next paragraph).} However, for most laboratory experiments, simpler (and also much faster), classical image-processing methods yield satisfactory results. \changemade{Thus, we provide as a generically-useful capability \emph{background-subtraction}, which is the default method by which objects are segmented. This can be used immediately in experiments where the background is relatively static. Backgrounds are generated automatically by uniformly sampling images from the source video(s) -- different modes are available (min/max, mode and mean) for the user to choose from. More advanced image-processing techniques like luminance equalization (which is useful when lighting varies between images), image undistortion, and brightness/contrast adjustments are available in \TGrabs{} and can enhance segmentation results -- but come at the cost of slightly increased processing time.} Importantly, since many behavioral studies rely on $\ge$ 4K resolution videos, we heavily utilize the GPU (if available) to speed up most of the image-processing, allowing \TRex{} to scale well with increasing image resolution.
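The sketch below illustrates the basic idea behind this default mode -- uniformly sampling frames to estimate a (here: mean) background, subtracting it from a frame of interest and thresholding the difference -- using plain \verb!OpenCV! in \verb!Python!. It is a simplified stand-in rather than \TGrabs{'} actual, GPU-accelerated implementation; the file name, the number of samples and the threshold value are arbitrary placeholders.
\begin{verbatim}
# Minimal sketch of background-subtraction segmentation; not TGrabs'
# actual implementation. File name, sample count and threshold are
# placeholders.
import cv2
import numpy as np

cap = cv2.VideoCapture("video.mp4")
n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

# uniformly sample frames and average them to estimate a static background
samples = []
for idx in np.linspace(0, n_frames - 1, num=50, dtype=int):
    cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
    ok, frame = cap.read()
    if ok:
        samples.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32))
background = np.mean(samples, axis=0).astype(np.uint8)

# segment one frame: absolute difference to the background, then threshold
cap.set(cv2.CAP_PROP_POS_FRAMES, 0)
ok, frame = cap.read()
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
diff = cv2.absdiff(gray, background)
_, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)

# connected components correspond to candidate objects
n, labels, stats, centroids = cv2.connectedComponentsWithStats(mask)
\end{verbatim}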
\changemade{\TGrabs{} can generally find any object in the video stream, and subsequently pass it on to the tracking algorithm (next section),} as long as either (i) the background is relatively static while the objects move at least occasionally, (ii) the objects/animals of interest have enough contrast to the background, or (iii) the user provides an additional binary mask per frame which is used to separate the objects \changemade{of interest} from the background, the typical means of doing this being deep-learning based segmentation (e.g. \citealt{Cae+17}). These masks are expected to be in a video-format themselves and to correspond 1:1 in length and dimensions to the video that is to be analyzed. They are expected to be binary, marking individuals in white and background in black. Of course, these binary videos could be \changemade{used} on their own, but would not retain any grey-scale information \changemade{of the objects}. This approach is useful in many situations, but generally whenever individuals are very hard to detect visually and need to be recognized by different software (e.g. a machine-learning-based \changemade{segmentation} like \citealt{Man+18b}); individual frames can then be connected into trajectories using our software as a second step.
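To clarify the expected mask format, the following sketch writes such a binary video: one mask frame per input frame, identical dimensions, individuals in white on black. The segmentation itself is only stubbed here with a plain threshold and would, in practice, be replaced by the output of the external (e.g. deep-learning based) tool; codec and file names are placeholders, and a lossless codec should be preferred so frames remain strictly binary.
\begin{verbatim}
# Sketch of producing an external binary mask video (same length and
# dimensions as the original; white = individual, black = background).
import cv2
import numpy as np

cap = cv2.VideoCapture("video.mp4")
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = cap.get(cv2.CAP_PROP_FPS)

# FFV1 is a lossless codec; availability depends on the OpenCV build
out = cv2.VideoWriter("mask.avi", cv2.VideoWriter_fourcc(*"FFV1"),
                      fps, (w, h), isColor=False)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # stand-in for an external segmentation model producing a boolean mask
    mask = gray > 127
    out.write(mask.astype(np.uint8) * 255)   # 255 = individual, 0 = background
out.release()
\end{verbatim}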
The detected objects are saved to a custom non-proprietary compressed file format (Preprocessed Video or \protect\path{PV}, see appendix \nameref{sec:pv_files}), that stores only the most essential information from the original video stream: the objects and their pixel positions and values. This format is optimized for quick random index access by the tracking \changemade{algorithm (see next section)} and stores other meta-information (like frame timings) utilized during playback or analysis. When recording videos directly from a camera, they can also be streamed to an additional and independent MP4 container format (plus information establishing the mapping between \protect\path{PV} and MP4 video frames).
\subsection{Tracking} \label{sec:tracking}
Once animals (or, more generally, "objects", as they will be termed henceforth) have been successfully segmented from the background, we can either use the live-tracking feature in \TGrabs{} or open a pre-processed file in \TRex{} to generate the trajectories of these objects. This process uses information regarding an object's movement (i.e. its kinematics) to follow it across frames, estimating future positions based on previous velocity and angular speed. It will be referred to as "tracking" in the following text, and is a required step in all workflows.
Note that this approach alone is very fast, but, as will be shown, is subject to error with respect to maintaining individual identities. If that is required, there is a further step, outlined in \nameref{sec:visual_recognition} below, which can be applied at the cost of processing speed. First, however, we will discuss the general basis of tracking, which is common to approaches that do, and do not, require identities to be maintained with high-fidelity. Tracking can occur for two distinct categories, which are handled slightly differently by our software:
\begin{enumerate}
\item there is a known number of objects
\item there is an unknown number of objects
\end{enumerate}
The first case assumes that the number of tracked objects in a frame cannot exceed a certain expected number of objects (\changemade{calculated} automatically\changemade{,} or set by the user). This allows the algorithm to make stronger assumptions, for example regarding noise, where otherwise "valid" objects (conforming to size expectations) are ignored due to their positioning in the scene (e.g. too far away from previously lost individuals). In the second case, new objects may be generated until all viable objects in a frame are assigned. While being more susceptible to noise, this is useful for tracking a large number of objects, where counting objects may not be possible, or where there is a highly variable number of objects to be tracked.
For a given video, our algorithm processes every frame sequentially, extending existing trajectories (if possible) for each of the objects found in the current frame. Every object can only be assigned to one trajectory, but some objects may not be assigned to any trajectory (e.g. in case the number of objects exceeds the allowed number of individuals) and some trajectories might not be assigned to any object (e.g. while objects are out of view). To estimate object identities across frames we use an approach akin to the popular Kalman filter \citep{kalman1960new} which makes predictions based on multiple noisy data streams (here, positional history and posture information).
In the initial frame, objects are simply assigned from top-left to bottom-right. In all other frames, assignments are made based on probabilities (see appendix \nameref{sec:matching_graph}) calculated for every combination of object and trajectory. These probabilities represent the degree to which the program believes that "it makes sense" to extend an existing trajectory with an object in the current frame, given its position and speed. Our tracking algorithm only considers assignments with probabilities larger than a certain threshold, generally constrained to a certain proximity around an object assigned in the previous frame.
Matching a set of objects in one frame with a set of objects in the next frame is a typical assignment problem, which can be solved in polynomial time (e.g. using the Hungarian method, \citealt{kuhn1955hungarian}). However, we found that, in practice, the computational complexity of the Hungarian method can constrain analysis speed to such a degree that we decided to implement a custom algorithm, which we term tree-based matching, and which has a better \textit{average-case} performance (see evaluation), even while having a comparatively bad \textit{worst-case} complexity. Our algorithm constructs a tree of all possible object/trajectory combinations in the frame and tries to find a compatible set of choices (such that no objects/trajectories are assigned twice), maximizing the sum of probabilities amongst these choices (described in detail in the appendix \nameref{sec:matching_graph}). Situations where a large number of objects are in close proximity of one another are problematic, since the number of possible sets of choices grows exponentially. These situations are avoided by using a mixed approach: tree-based matching is used most of the time, but as soon as the combinatorial complexity of a given situation becomes too great, our software falls back on the Hungarian method. If videos are known to be problematic throughout (e.g. with >100 individuals consistently very close to each other), the user may choose to use an approximate method instead (described in the appendix \autoref{sec:matching_graph}), which simply iterates through all objects and assigns each to the trajectory for which it has the highest probability, without considering whether another object has an even higher probability for that trajectory. Since the approximate method does not consider all possible combinations, it is "wrong" in this sense and more sensitive to parameter choice -- which is why it is not recommended unless strictly necessary. It does, however, scale much better with very large numbers of objects and produces results that are good enough to be useful in very large groups (see \tableref{decisions}). %The requirement being well-chosen parameters, such as maximum speed, to reduce the number of possible mistakes/choices per individual as much as possible.
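The following sketch contrasts the optimal assignment (using the Hungarian method, here via \verb!scipy!) with the approximate first-come-first-serve strategy described above. The score function is a deliberately simplified stand-in -- decaying linearly with the distance to a trajectory's predicted position and zero beyond a maximum travel distance -- whereas the probabilities actually used by \TRex{} are defined in the appendix (\nameref{sec:matching_graph}); the tree-based algorithm itself is not reproduced here.
\begin{verbatim}
# Illustrative comparison of optimal vs. approximate matching; the score
# function is a simplified stand-in for the probabilities defined in the
# appendix.
import numpy as np
from scipy.optimize import linear_sum_assignment

def score_matrix(predicted, detected, max_dist=50.0):
    """Rows: trajectories (predicted positions), columns: detected objects."""
    d = np.linalg.norm(predicted[:, None, :] - detected[None, :, :], axis=-1)
    return np.clip(1.0 - d / max_dist, 0.0, None)   # zero beyond max_dist

def match_optimal(S):
    # Hungarian method: maximizes the sum of scores over all combinations
    rows, cols = linear_sum_assignment(-S)
    return [(r, c) for r, c in zip(rows, cols) if S[r, c] > 0]

def match_greedy(S):
    # approximate method: each object takes the best still-available
    # trajectory and earlier decisions are never revisited
    taken, pairs = set(), []
    for c in range(S.shape[1]):
        free = [(S[r, c], r) for r in range(S.shape[0]) if r not in taken]
        if not free:
            continue
        s, r = max(free)
        if s > 0:
            taken.add(r)
            pairs.append((r, c))
    return pairs
\end{verbatim}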
% that it may occasionally spike in terms of the time it takes to analyse a frame --, it performs better in almost all cases. Since problems may occur with a large number of objects in close proximity of one another, where computational complexity grows exponentially, we employ a mixed approach: Generally, \TRex{} uses the tree-based algorithm, but circumvents problematic situations by falling back on using the Hungarian method when necessary. It also offers the option to use an \textit{approximate} matching algorithm for an entire video, in case the video is especially problematic throughout. This approximate algorithm is not mathematically \textit{correct}, in that it works based on a first-come-first-serve principle: It iterates through all objects and assigns each object to the trajectory for which it has the highest probability. This is of course wrong, seeing as it does not consider all possible combinations, but scales significantly better for very large numbers of objects and produces results good enough for it to be useful in very large groups (see evaluation \nameref{sec:evaluation_accuracy}).
%Since matching is a global optimization, choosing sub-optimally on an object level can still result in an overall greater probability sum. While being "wrong" mathematically speaking, and not yielding a significant improvement in performance for a few objects, combinatorically it still scales significantly better for very large numbers of objects and produces good enough results to be useful in very large groups.
Situations where objects/individuals touch, partly overlap, or even completely overlap are an issue that all tracking solutions have to deal with in some way. The first problem is the \textit{detection} of such an overlap/crossing, the second is its \textit{resolution}. \idtracker{}, for example, deals only with the first problem: It trains a neural network to detect crossings and essentially ignores the involved individuals until the problem is resolved by movement of the individuals themselves. However, such an image-based approach can never be fully independent of the species or even of the specific video (it has to be retrained for each experiment), while also being time-costly to use. In some cases the size of objects might indicate that they contain multiple overlapping individuals, while other cases might not allow for such an easy distinction -- e.g. when sexually dimorphic animals (or multiple species) are present at the same time. We propose a method, similar to that of \verb!xyTracker!, which uses the objects' movement history to detect overlaps. If there are fewer objects in a region than would be expected based on previous frames, an attempt is made to split the biggest objects in that area. The size of that area is estimated using the maximal speed objects are allowed to travel per frame (a parameter, see documentation \protect\path{track_max_speed}). This, of course, requires relatively good predictions or, alternatively, high frame-rates relative to the objects' movement speeds (which are likely necessary anyway to observe behavior at the appropriate time-scales).
By default, objects suspected to contain overlapping individuals are split by thresholding their background-difference image (see appendix \autoref{box:splitting-algorithm}), continuously increasing the threshold until the expected number (or more) of similarly sized objects is found. Greyscale values and, more generally, the shading of three-dimensional objects and animals often produce a natural gradient (see for example \figref{fig:datasets_comparison}), making this process surprisingly effective for many of the species we tested. Even when there is almost no visible gradient and thresholding produces holes inside objects, objects are still successfully separated with this approach, and missing pixels from inside the objects can be regenerated afterwards. The algorithm fails, however, if the remaining objects are too small or too different in size, in which case the overlapping objects will not be assigned to any trajectory until all involved objects are found again separately in a later frame.
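A condensed sketch of this splitting idea is given below: the threshold applied to the object's background-difference image is raised step-wise until at least the expected number of sufficiently large connected components appears. Step size, minimum area and the similar-size check of the actual algorithm (appendix \autoref{box:splitting-algorithm}) are simplified or omitted here.
\begin{verbatim}
# Simplified sketch of threshold-raising to split a suspected merged
# object; the production algorithm additionally checks for similar sizes
# and regenerates missing pixels.
import cv2
import numpy as np

def try_split(diff_patch, expected, min_area=10, start=20, step=5):
    """diff_patch: background-difference image (uint8) of the merged object."""
    for t in range(start, 255, step):
        _, binary = cv2.threshold(diff_patch, t, 255, cv2.THRESH_BINARY)
        n, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
        areas = stats[1:, cv2.CC_STAT_AREA]       # skip the background label
        if np.count_nonzero(areas >= min_area) >= expected:
            return t, labels    # threshold at which enough parts separate
    return None, None           # failed: leave the merged object unassigned
\end{verbatim}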
After an object is assigned to a specific trajectory, two kinds of data (posture and visual-fields) are calculated and made available to the user, which will each be described in one of the following subsections. In the last subsection, we outline how these can be utilized in real-time tracking situations.
\subsubsection{Posture Analysis}
Groups of animals are often modeled as systems of simple particles (\citealt{inada2002order}, \citealt{cavagna2010empirical}, \citealt{perez2011collective}), a reasonable simplification which helps to formalize/predict behavior. However, intricate behaviors, like courtship displays, can only be fully observed once the body shape and orientation are considered (e.g. using tools such as DeepPoseKit, \citealt{graving2019deepposekit}, LEAP \cite{pereira2019fast}/SLEAP \cite{Pereira2020.08.31.276246}, and DeepLabCut, \citealt{mathis2018deeplabcut}). \TRex{} does not track individual body parts apart from the head and tail (where applicable), but even the included simple and fast 2D posture estimator already allows for deductions to be made about how an animal is positioned in space, bent and oriented -- crucial e.g. when trying to estimate the position of eyes/antennae as part of an analysis, where this is required (e.g. \citealt{strandburg2013visual}, \citealt{rosenthal2015revealing}). \changemade{When detailed tracking of all extremities is required, \TRex{} offers an option that allows it to interface with third-party software like DeepPoseKit (\citealt{graving2019deepposekit}), SLEAP (\citealt{Pereira2020.08.31.276246}), or DeepLabCut (\citealt{mathis2018deeplabcut}). This option (\protect\path{output_image_per_tracklet}), when set to true, exports cropped and (optionally) normalised videos per individual that can be imported directly into these tools -- where they might perform better than the raw video. Normalisation, for example, can make it easier for machine-learning algorithms in these tools to learn where body-parts are likely to be (see \figref{fig:maximum_val_acc_per_samples}) and may even reduce the number of clicks required during annotation.}
In \TRex{}, the 2D posture of an animal consists of (i) an outline around the outer edge of a blob, (ii) a center-line (or midline for short) that curves with the body and (iii) positions on the outline that represent the front and rear of the animal (typically head and tail). Our only assumptions here are that the animal is bilateral with a mirror-axis through its center and that it has a beginning and an end, and that the camera-view is roughly perpendicular to this axis. This is true for most animals, but may not hold e.g. for jellyfish (with radial symmetry) or animals with different symmetries (e.g. radiolaria (protozoa) with spherical symmetry). Still, as long as the animal is not exactly circular from the perspective of the camera, the midline will follow its longest axis and a posture can be estimated successfully. The algorithm implemented in our software is run for every (cropped out) image of an individual and processes it as follows:
i. A tree-based approach follows edge pixels around an object in a clock-wise manner. Drawing the line \emph{around} pixels, as implemented here, instead of through their centers, as done in comparable approaches, helps with very small objects (e.g. one single pixel would still be represented as a valid outline, instead of a single point).
ii. The pointiest end of the outline is assumed, by default, to be either the tail or the head (based on curvature and the area between the outline points in question). The assignment of head vs. tail can be set by the user, seeing as some animals might have "pointier" heads than tails (e.g. termite workers, one of the examples we employ). Posture data coming directly from an image can be very noisy, which is why the program offers options to simplify outline shapes using an Elliptical Fourier Transform (EFT, see \citealt{iwata2015genomic}, \citealt{kuhl1982elliptic}) or smoothing via a simple weighted average across points of the curve (inspired by common subdivision techniques, see \citealt{warren2001subdivision}). The EFT allows the user to set the desired level of approximation detail (via the number of elliptic fourier descriptors, EFDs) and thus make the outline "rounder" and less jittery. Using an EFT with just two descriptors is equivalent to fitting an ellipse to the animal's shape (as, for example, \verb!xyTracker! does), which is the simplest supported representation of an animal's body.
iii. The reference-point chosen in (ii) marks the start for the midline-algorithm. It walks both left and right from this point, always trying to move approximately the same distance on the outline (with limited wiggle-room), while at the same time minimizing the distance from the left to the right point. This works well for most shapes and also automatically yields distances between a midline point and its corresponding two points on the outline, estimating thickness of this object's body at this point.
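The following is a heavily simplified sketch of step (iii): starting at the chosen tip, points are paired by walking the same number of steps along the outline in both directions; each pair's midpoint becomes a midline point and the distance between the paired points a local thickness estimate. The limited "wiggle-room" \TRex{} allows when choosing the next point on either side is omitted here.
\begin{verbatim}
# Simplified midline sketch: pair outline points walked equally far from
# the tip in both directions; midpoints form the midline, pair distances
# estimate local thickness.
import numpy as np

def simple_midline(outline, tip_index):
    """outline: (N, 2) array of points ordered along the closed contour."""
    n = len(outline)
    midline, thickness = [], []
    for i in range(n // 2 + 1):
        left = outline[(tip_index + i) % n]    # walk one way around the outline
        right = outline[(tip_index - i) % n]   # ... and the other way
        midline.append((left + right) / 2.0)             # center-line point
        thickness.append(np.linalg.norm(left - right))   # local body width
    return np.asarray(midline), np.asarray(thickness)
\end{verbatim}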
Compared to the tracking itself, posture estimation is a time-consuming process and can be disabled. It is, however, required to estimate -- and subsequently normalize -- an animal's orientation in space (e.g. required later in \nameref{sec:visual_recognition}), or to reconstruct their visual field as described in the following sub-section.
\begin{figure}
\centering
\includegraphics[width=\textwidth]{figures/screenshot_visual_field.jpg}
\caption{Visual field estimate of the individual in the center (zoomed in, the individuals are approximately 2-3cm long, \videoref{vid:guppy_8_t36_d15_20191212_085800}). Right (blue) and left (orange) fields of view intersect in the binocular region (pink). Most individuals can be seen directly by the focal individual (1, green), which has a wide field of view of $260^\circ$ per eye. Individual 3 on the top-left is not detected by the focal individual directly and not part of its first-order visual field. However, second-order intersections (visualized by grey lines here) are also saved and accessible through a separate layer in the exported data.}
\label{fig:occlusion}
\videosupp{\changemade{A clip from \videoref{vid:guppy_8_t36_d15_20191212_085800}, showing \TRex{'} visual-field estimation for Individual 1. \url{https://youtu.be/yEO_3lpZIzU}}}
\end{figure}
\subsubsection{Reconstructing 2D Visual Fields}
Visual input is an important modality for many species (e.g. fish \citealt{strandburg2013visual}, \citealt{bilotta2001zebrafish} and humans \citealt{colavita1974human}). Due to its importance in widely used model organisms like zebrafish (\emph{Danio rerio}), we decided to include the capability to conduct a 2-dimensional reconstruction of each individual's visual field as part of the software. The requirements for this are successful posture estimation and that individuals are viewed from above, as is usually the case in laboratory studies.
The algorithm makes use of the fact that outlines have already been calculated during posture estimation. Eye positions are estimated to be equidistant from the "snout" and are spaced apart depending on the thickness of the body at that point (the distance is based on a ratio, relative to body-size, which can be adjusted by the user). Eye orientation is also adjustable, which influences the size of the stereoscopic part of the visual field. We then use ray-casting to intersect rays from each of the eyes with all other individuals as well as the focal individual itself (self-occlusion). Individuals not detected in the current frame are approximated using the last available posture. Data are organized as a multi-layered 1D-image of fixed size for each frame, with each image representing angles from $-180^{\circ}$ to $180^{\circ}$ for the given frame (a minimal sketch of this layout follows the list below). Simulating a limited field-of-view would thus be as simple as cropping parts of these images off the left and right sides. The different layers per pixel encode:
\begin{enumerate}
\item identity of the occluder
\item distance to the occluder
\item body-part that was hit (distance from the head on the outline in percent)
\end{enumerate}
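The sketch below illustrates only this data layout (the ray-casting that produces the hits is omitted): per eye and frame, a fixed-width, multi-layered 1D array whose columns correspond to viewing angles between $-180^{\circ}$ and $180^{\circ}$, keeping only the closest hit per column. The angular resolution and field names are illustrative and do not correspond to the names used in \TRex{'} exported data.
\begin{verbatim}
# Illustration of the multi-layered 1D visual-field layout; resolution and
# field names are chosen for this sketch only.
import numpy as np

WIDTH = 512   # angular resolution (columns); value chosen for illustration

def empty_visual_field():
    return {
        "identity":  np.full(WIDTH, -1, dtype=np.int32),        # -1: nothing seen
        "distance":  np.full(WIDTH, np.inf, dtype=np.float32),
        "body_part": np.full(WIDTH, np.nan, dtype=np.float32),  # % along outline
    }

def register_hit(field, angle_deg, identity, distance, body_part):
    # map an angle in [-180, 180) degrees to a column index
    col = int((angle_deg + 180.0) / 360.0 * WIDTH) % WIDTH
    if distance < field["distance"][col]:   # keep only the closest occluder
        field["identity"][col] = identity
        field["distance"][col] = distance
        field["body_part"][col] = body_part
\end{verbatim}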
While individuals viewed from above on a computer screen look 2-dimensional, one major disadvantage of any 2D approach is, of course, that it is merely a projection of the 3D scene. Any visual field estimator has to assume that, from an individual's perspective, other individuals act as occluders in all instances (see \figref{fig:occlusion}). This may only be partly true in the real world, depending on the experimental design, as other individuals may be able to move slightly below, or above, the focal individual's line-of-sight, revealing otherwise occluded conspecifics behind them. We therefore support multiple occlusion-layers, allowing second-order and $N$th-order occlusions to be calculated for each individual. %\emph{Utilizing individual size differences in consecutive frames (e.g. because of diving) might even yield a proxy for approximating occlusions during post-processing.}
% This also holds true for a commonly used model-organism in behavioral ecology: zebrafish larvae (\emph{Danio rerio}), who, up to a certain age, due to anatomical restrictions do not possess other sensing abilities (\emph{cite}).
\subsubsection{Realtime Tracking Option for Closed-Loop Experiments}
Live tracking is supported, as an option to the user, during the recording, or conversion, of a video in \TGrabs{}. When closed-loop feedback is enabled, \TGrabs{} focusses on maintaining stable recording frame-rates and may not track recorded frames if tracking takes too long. This is done to ensure that the recorded file can later be tracked again in full/with higher accuracy (thus no information is lost) if required, and to help the closed-loop feedback to stay synchronized with real-world events.
During development we worked with a mid-range gaming computer and Basler cameras at $90$fps and $2048^2$px resolution, where drawbacks did not occur. \changemade{Running the program on hardware with specifications below our recommendations (see \nameref{ref:hardware_recommend}), however, may affect frame-rates as described below.}
\TRex{} loads a prepared \verb!Python! script, handing down an array of data per individual in every frame. Which data fields are generated and sent to the script is selected by the script itself (a purely hypothetical script skeleton is sketched after the list below). Available fields are:
\begin{itemize}[label=\textnormal{$\bullet$}]
\item Position
\item Midline information
\item Visual field
\end{itemize}
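The skeleton below gives an idea of what such a script might look like. It is purely hypothetical: the callback name, the way fields are requested and the field names are illustrative only and do not reflect \TRex{'} actual scripting interface -- please refer to the documentation at \href{https://trex.run/docs}{trex.run/docs} for the real entry points.
\begin{verbatim}
# Hypothetical closed-loop script skeleton; names and structure are
# illustrative only and do not reflect TRex' actual scripting interface.
import numpy as np

# hypothetical: the script declares which fields it wants to receive
REQUESTED_FIELDS = ["position", "midline"]

def on_frame(individuals):
    """Hypothetical per-frame callback; `individuals` maps identity -> fields."""
    positions = np.array([ind["position"] for ind in individuals.values()])
    if len(positions) < 2:
        return
    centroid = positions.mean(axis=0)
    spread = np.linalg.norm(positions - centroid, axis=1).mean()
    if spread > 100.0:           # example condition (pixels): group dispersed
        trigger_stimulus()

def trigger_stimulus():
    # talk to external hardware/programs here (serial port, socket, ...)
    pass                         # kept as a stub in this sketch
\end{verbatim}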
If the script (\changemade{or any other part of the recording process}) takes too long to execute \changemade{in one} frame, \changemade{consecutive frames may be} dropped until a stable frame-rate can be achieved. This scales well for all computer-systems, \changemade{but results in fragmented tracking data, causing worse identity assignment, and reduces the number of frames and quality of data available for closed-loop feedback. However, since even untracked frames are saved to disk, these inaccuracies can be fixed in \TRex{} later. Alternatively, if live-tracking is enabled but closed-loop feedback is disabled, the program maintains detected objects in memory and tracks them in an asynchronous thread (potentially introducing wait time after the recording stops).} When the program terminates, the tracked individual's data are exported -- along with a \verb!results! file that can be loaded by the \verb!tracker! at a later time.
In order to make this interface easy to use for prototyping and to debug experiments, the script may be changed during its run-time and will be reloaded if necessary. Errors in the \verb!Python! code lead to a temporary pause of the closed-loop part of the program (not the recording) until all errors have been fixed.
Additionally, thanks to \verb!Python! being a fully-featured scripting language, it is also possible to call and send information to other programs during real-time tracking. Communication with other external programs may be necessary whenever easy-to-use \verb!Python! interfaces are not available for e.g. hardware being used by the experimenter.
%\textit{the following is basically a lie, until i can actually get back to the lab:}
%Closed-loop has been tested with up to 30 (mock) individuals present at the same time. With posture (and thus also visual field) being a very costly process, more individuals can be tracked at higher speeds with it disabled.
\subsection{Automatic Visual Identification Based on Machine Learning} \label{sec:visual_recognition}
Tracking, when it is based only on an individual's positional history, can be very accurate under good circumstances and is currently the fastest way to analyse video recordings or to perform closed-loop experiments. However, such tracking methods simply do not have access to enough information to ensure that identities are maintained for the entire duration of most trials -- small mistakes can and will happen. There are cases, e.g. when studying polarity (based only on short trajectory segments), or other general group-level assessments, where this is acceptable and identities do not have to be maintained perfectly. However, consistent identities are required in many individual-level assessments, and with no baseline truth available to correct mistakes, errors start accumulating until eventually all identities are fully shuffled. Even a hypothetical, \emph{perfect} tracking algorithm would not be able to yield correct results in all situations, as multiple individuals might go out of view at the same time (e.g. hiding under cover or being occluded by other animals). There is no way to tell who is whom once they re-emerge.
The only way to solve this problem is by providing an independent source of information from which to infer identity of individuals, which is of course a principle we make use of all the time in our everyday lives: Facial identification of con-specifics is something that \changemade{is easy for most humans}, to an extent where we sometimes recognize face-like features where there aren't any. Our natural tendency to find patterns enables us to train experts on recognizing differences between animals, even when they belong to a completely different taxonomic order. Tracking individuals is a demanding task, especially with large numbers of moving animals (\citealt{liu2009effect} shows humans to be effective for up to 4 objects). Human observers are able to solve simple memory recall tasks for 39 objects at only 92\% correct (see \citealt{humphrey1992recognizing}), where the presented objects do not even have to be identified individually (just classified as old/new) and contain more inherent variation than most con-specific animals would. Even with this being true, human observers are still the most efficient solution in some cases (e.g. for long-lived animals in complex habitats). Enhancing visual inter-individual differences by attaching physical tags is an effective way to make the task easier and more straight-forward to automate. RFID tags are useful in many situations, but are also limited since individuals have to be in very close proximity to a sensor in order to be detected \citep{bonter2011applications}. Attaching \changemade{fiducial markers (such as QR codes)} to animals allows for a very large number \changemade{(thousands) of individuals to be uniquely identified at the same time (see \citealt{Gernat1433}, \citealt{Wild2020.05.06.076943}, \citealt{mersch2013tracking}, \citealt{crall2015beetag}) -- and over a much greater distance than RFID tags.} Generating codes can also be automated, generating tags with optimal visual inter-marker distances \citep{garrido2016generation}, making it feasible to identify a large number of individuals with minimal tracking mistakes.
While physical tagging is often an effective method by which to identify individuals, it requires animals to be caught and manipulated, which can be difficult \citep{mersch2013tracking} and is subject to the physical limitations of the respective system. Tags have to be large enough for a program to recognize them in a video stream. Even worse, especially with increased relative tag-size, the animal's behavior may be affected by the presence of the tag \changemade{or during its application (\citealt{DENNIS20081939}, \citealt{pankiw2003effect}, \citealt{SOCKMAN2001205}),} and there might be no way for experimenters to know whether this occurred \changemade{(except with considerable effort, see \citealt{switzer2016bombus})}. In addition, for some animals, like fish and termites, attachment of tags that are effective for discriminating among a large number of individuals can be problematic, or impossible.
Recognizing such issues, \citealt{idtracker} first proposed an algorithm termed \textit{idtracker}, generalizing the process of pattern recognition for a range of different species. Training an expert program to tell individuals apart, by detecting slight differences in the patterning of their bodies, allows identities to be corrected without any human involvement. Even while being limited to about 15 individuals per group, this was a very promising approach. It was much improved upon only a few years later by the same group with their software \idtracker{} \citep{idtrackerai}, implementing a paradigm shift from explicit, hard-coded, color-difference detection to more general machine learning methods -- increasing the supported group size by an order of magnitude.
%It works in stages (or \textit{protocols}) and adapts to problem complexity by skipping later steps if the estimated training quality is deemed good enough. Stages otherwise build on previous progress, continually improving results. First, individuals are tracked by (i) detecting objects of interest and (ii) following them in the next frame by finding objects overlapping with the pixels from the last frame. In order to ensure that individuals do not merge, a secondary network is trained to distinguish between crossing and singular individuals. After individuals have been tracked, a part of the video is selected where all individuals are visible and separated from each other. This \textit{global segment} marks a starting point for the following training procedure, ensuring individual sequences to be unobstructed by crossings or other visibility issues. While a \emph{global segment} spans a certain (short) range of frames, lengths of associated segments per individual may vary: for each individual, frames extending before and after the global segment can be assumed to be correctly assigned as well, until the individual "disappears" (e.g. overlaps with another individual, moves too fast, etc.).
%Using the first set of generated samples, training commences until certain stopping criteria are fulfilled (in their paper referenced as \emph{protocol 1}). If the training quality is deemed to be good enough, training may stop here. This is the case, if (i) no two individuals are predicted to be of the same identity and (ii) the predicted probabilities per individual are certain enough. If these conditions do \textit{not} hold, other global segments have to be added to the training dataset (\emph{protocol 2}). This procedure extends the dataset step by step, until at least $99.95\%$ of images in global segments have been accumulated. The last protocol, \emph{protocol 3}, trains the first half of the network (convolutional layers) separately from the rest and then starts iterating \emph{protocol 2} again - this time only training the classification (dense) part of the network (see \figref{fig:software_overview}c).
%\subsection{Visual identification in \TRex{}}% \label{sec:preparation}
We employ a method for visual identification in \TRex{} that is similar to the one used in \idtracker{}, where a neural network is trained to visually recognize individuals and is used to correct tracking mistakes automatically, without human intervention -- the network layout (see \figref{fig:software_overview}c) is almost the same as well (differing only by the addition of a pre-processing layer and using 2D- instead of 1D-dropout layers). However, in \TRex{}, processing speed and chances of success are improved (the former being greatly improved) by (i) minimizing the variance landscape of the problem and (ii) exploring the landscape to our best ability, optimally covering all poses and lighting-conditions an individual can be in, as well as (iii) shortening the training duration by significantly altering the training process -- e.g. choosing new samples more adaptively and using different stopping-criteria (accuracy, as well as speed, are part of the later evaluation).
While \nameref{sec:tracking} already \textit{tries} to consistently follow the same individual within each trajectory, there is no way to ensure or check the validity of this process without an independent source of identity information. Generating this source of information, based on the visual appearance of individuals, is what the algorithm for visual identification, described in the following subsections, aims to achieve. Stated simply, the goal of automatic visual identification is to obtain reliable predictions of the identities of all (or most) objects in each frame. Assuming these predictions are of sufficient quality, they can be used to detect and correct potential mistakes made during \nameref{sec:tracking} by looking for identity switches within trajectories. Ensuring that predicted identities within trajectories are consistent, by proxy, also ensures that each trajectory is consistently associated with a single, real individual. In the following, before describing the four stages of the algorithm, we point out key aspects of how tracking and image data are processed, how we addressed points (i)-(iii) above, and which features ultimately improved performance compared to other solutions.
\subsubsection{Preparing Tracking-Data} \label{sec:segments}
Visual identification starts out only with the trajectories that the \nameref{sec:tracking} provides.
Tracking, on its own, is already an improvement over other solutions, especially since (unlike e.g. \idtracker{}) \TRex{} makes an effort to separate overlapping objects (see the \nameref{box:splitting-algorithm}) and is thus able to keep track of individuals for longer (see \figref{fig:segment_lengths}). Here, we -- quite conservatively -- assume that, after every problematic situation (defined in the list below), the assignments made by our tracking algorithm are wrong. Whenever a problematic situation is encountered as part of a trajectory, we split the trajectory at that point. This way, the trajectories of all individuals in a video become an assortment of trajectory snippets (termed "segments" from here on), each of which is clear of problematic situations, and for each of which the goal is to find the correct identity ("correct" meaning that identities are consistently assigned to the same \textit{real} individual throughout the video). Situations are considered "problematic", and cause the trajectory to be split, when:
\begin{itemize}[label=\textnormal{$\bullet$}]
\item \textbf{The individual has been lost for at least one frame.} For example when individuals are moving unexpectedly fast, are occluded by other individuals/the environment, or simply not present anymore (e.g. eaten).
\item \textbf{Uncertainty of assignment was too high ($>50\%$)} e.g. due to very high movement speeds or extreme variation in size between frames. With simpler tracking tasks in mind, these segments are kept as \emph{connected} tracks, but regarded as separate ones here.
\item \textbf{Timestamps suggest skipped frames.} Missing frames in the video may cause wrong assignments and are thus treated as if the individuals have been lost. This distinction can only be made if accurate frame timings are available (when recording using \TGrabs{} or provided alongside the video files in separate \protect\path{npz} files).
\end{itemize}
Unless one of the above conditions becomes true, a segment is assumed to be consecutive and connected; that is, throughout the whole segment, no mistakes have been made that lead to identities being switched. Frames where all individuals are currently within one such segment at the same time will henceforth be termed \emph{global segments}.
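To make the segmentation rule concrete, the following Python sketch shows one possible way to split per-individual trajectories at problematic situations and to derive global segments from the result. The data layout, the names and the uncertainty threshold are illustrative assumptions, not \TRex{'} actual implementation.
\begin{verbatim}
# Sketch: split per-individual trajectories into "segments" free of
# problematic situations, then intersect them across individuals to
# obtain global segments. Not the actual TRex code.
from typing import NamedTuple

class FrameRecord(NamedTuple):   # hypothetical per-frame record
    frame: int
    lost: bool           # individual not found in this frame
    uncertainty: float   # assignment uncertainty (0..1)
    frame_skipped: bool  # timestamps suggest missing video frames

def split_into_segments(records, max_uncertainty=0.5):
    """Return a list of (first_frame, last_frame) tuples."""
    segments, start, prev = [], None, None
    for r in records:
        problematic = r.lost or r.uncertainty > max_uncertainty \
                      or r.frame_skipped
        gap = prev is not None and r.frame != prev + 1
        if problematic or gap:
            if start is not None:
                segments.append((start, prev))
            start = None if problematic else r.frame
        elif start is None:
            start = r.frame
        prev = r.frame
    if start is not None:
        segments.append((start, prev))
    return segments

def global_segments(per_individual_segments, first_frame, last_frame):
    """Frame ranges in which *all* individuals are inside a segment."""
    def covered(frame, segments):
        return any(a <= frame <= b for a, b in segments)
    frames = [f for f in range(first_frame, last_frame + 1)
              if all(covered(f, s) for s in per_individual_segments)]
    ranges, start = [], None
    for i, f in enumerate(frames):
        if start is None:
            start = f
        if i + 1 == len(frames) or frames[i + 1] != f + 1:
            ranges.append((start, f))
            start = None
    return ranges
\end{verbatim}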
Since we know that there are no problematic situations inside each per-individual segment, and thus also not across individuals within the range of a global segment, we can choose any global segment as a basis for an initial, arbitrary assignment of identities to trajectories. One of the most important steps of the identification algorithm then becomes deciding which global segment is the best starting point for the training. If a mistake is made here, consecutive predictions for other segments will fail and/or produce unreliable results in general. %The next sub-section describes the process of estimating the quality of global segments, allowing the program to assign a priority to each one and order them accordingly.
Only a limited set of global segments is kept -- striking a balance between respecting user-given constraints and capturing as much of the variance as possible. In many of the videos used for evaluation, we found that only few segments had to be considered -- however, computation time is ultimately bounded by reducing the number of qualifying segments. At the same time, it is beneficial to avoid auto-correlation by incorporating samples from all sections of the video instead of sourcing them from only a small portion. To balance these concerns, global segments are binned by their middle frame into four bins (each quarter of the video being one bin) and the number of segments inside each bin is then reduced. To that end, we sort the segments within each bin by their "quality" -- a combination of two factors:
\begin{enumerate}
\item To capture as much as possible of the variation due to an individual's own movement, as well as of the background that it moves across, a "good" segment is one in which all individuals move as much as possible and travel as large a distance as possible. Thus, we derive a per-individual \textit{spatial coverage descriptor} for the given segment by (virtually) dissecting the arena into a grid of equally sized, rectangular "cells" (their exact dimensions depending on the aspect ratio of the video). Each time an individual's center-point moves from one cell to the next, a counter is incremented for that individual. To avoid situations where, for example, all individuals but one are moving, we use only the lowest per-individual spatial coverage value to represent a given segment.
\item It is beneficial to have more examples for the network to learn from. Thus, as a second sorting criterion, we use the average number of samples per individual.
\end{enumerate}
After being sorted according to these two metrics, the list of segments per bin is reduced, according to a user-defined variable (4 by default), leaving only the most viable options per quarter of video.
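A condensed sketch of this ranking, in Python, is given below; the grid resolution and the assumed data layout (a dictionary of center-points per individual) are illustrative choices, not \TRex{'} internals.
\begin{verbatim}
# Sketch of the bin-wise segment ranking: spatial coverage (grid cells
# visited) first, average samples per individual second.
def cells_visited(points, arena_w, arena_h, grid=10):
    """Count transitions of a center-point between grid cells."""
    cw, ch = arena_w / grid, arena_h / grid
    count, prev = 0, None
    for x, y in points:
        cell = (int(x // cw), int(y // ch))
        if prev is not None and cell != prev:
            count += 1
        prev = cell
    return count

def segment_quality(segment, arena_w, arena_h):
    centers = segment["centers"]            # {individual: [(x, y), ...]}
    coverage = min(cells_visited(p, arena_w, arena_h)
                   for p in centers.values())
    avg_samples = sum(len(p) for p in centers.values()) / len(centers)
    return (coverage, avg_samples)

def reduce_global_segments(segments, video_length, per_bin=4,
                           arena_w=1024, arena_h=1024):
    bins = [[] for _ in range(4)]           # one bin per video quarter
    for seg in segments:
        middle = (seg["start"] + seg["end"]) // 2
        bins[min(3, middle * 4 // video_length)].append(seg)
    kept = []
    for b in bins:
        b.sort(key=lambda s: segment_quality(s, arena_w, arena_h),
               reverse=True)
        kept.extend(b[:per_bin])            # keep the best 4 per bin
    return kept
\end{verbatim}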
The number of visited cells may, at first, appear to be essentially equivalent to a spatially normalized \textit{distance travelled} (as used in \idtracker{}). In edge cases, where individuals never stop or never move, both metrics are indeed very similar. However, one can imagine an individual continuously moving around in the same corner of the arena, which would be counted as an equally good segment for that individual as if it had traversed the whole arena (thus capturing all variable environmental factors). In most cases, using such highly restricted movement for training is problematic, and worse than using a shorter segment of the individual moving diagonally through the entire space, since the latter captures more of the variation in background, lighting conditions and the animal's movement.
\subsubsection{Minimizing the Variance Landscape by Normalizing Samples} \label{sec:posture_normalization}
A big strength of machine learning approaches is their resistance to noise in the data. Generally, any machine learning method will likely still converge -- even with noisy data. Eliminating unnecessary noise and degrees of freedom in the dataset, however, will typically help the network to converge much more quickly: tasks that are easier to solve will of course also be solved more accurately within similar or smaller timescales. This is because the optimizer does not have to consider various parts of the possible parameter-space during training or, put differently, because the overall parameter-space is shrunk to the smallest possible size without losing important information. The simplest such optimization, included in most tracking and visual identification approaches, is to segment out the objects and center the individuals in the cropped-out images. This means that (i) the network does not have to consider the whole image, (ii) it only needs to consider one individual at a time and (iii) the corners of the image can most likely be neglected.
\begin{figure}[t]
\centering
\includegraphics[width=\textwidth]{figures/fig_normalization.pdf}
\caption{Comparison of different normalization methods. Images all stem from the same video and belong to the same identity. The video has previously been automatically corrected using the visual identification. Each object visible here consists of $N$ images $M_i, i\in[0,N]$ that have been accumulated into a single image using $\min_{i\in [0,N]}M_i$, with $\min$ being the element-wise minimum across images. The columns represent the same samples from the same frames, but normalized in three different ways: In (a), images have not been normalized at all. Images in (b) have been normalized by aligning the objects along their main axis (calculated using \textit{image-moments}), which only gives the axis within 0 to 180 degrees. In (c), all images have been aligned using posture information generated during the tracking process. As the images become more and more recognizable to \textit{us} from left to right, the same applies to a network trying to tell identities apart: Reducing noise in the data speeds up the learning process.}
\label{fig:datasets_comparison}
\end{figure}
Further improving on this, approaches like \idtracker{} align all objects along their most-elongated axis, essentially removing global orientation as a degree of freedom. The orientation of an arbitrary object can be calculated e.g. using an approach often referred to as image-moments \citep{hu1962visual}, yielding an angle within $[0,180)^\circ$. Of course, this means that
\begin{enumerate}
\item circular objects have a random (noisy) orientation
\item elongated objects (e.g. fish) can be either head-first or flipped by $180^\circ$ and there is no way to discriminate between those two cases (see second row, \figref{fig:datasets_comparison})
\item a C-shaped body deformation, for example, results in a slightly bent axis, meaning that the head will not be in exactly the same position as with a straight posture of the animal.
\end{enumerate}
Each of these issues adds to what the network has to learn to account for, widening the parameter-space to be searched and increasing computation time. However, barring the first point, each problem can be tackled using the already available posture information. Knowing the positions of head and tail, as well as points along the individual's center-line, each individual's head can be locked roughly into a single position. This leaves room only for the rear end to move, reducing variation in the data to a minimum (see \figref{fig:datasets_comparison}). In addition to faster convergence, this also results in better generalization right from the start, even with a smaller number of samples per individual (see \figref{fig:maximum_val_acc_per_samples}). \changemade{For further discussion of highly deformable bodies, such as those of rodents, please see the Appendix (\nameref{sec:deformable_bodies}).}
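A minimal sketch of such a posture-based normalization, using OpenCV, is shown below. It assumes that the head position and a heading angle are available from posture estimation; the output size, the target head position and the sign convention of the rotation are illustrative assumptions.
\begin{verbatim}
# Sketch: rotate and translate a cropped image so that the head lands on
# a fixed position and the body axis points in a fixed direction.
import cv2

def normalize_sample(image, head_xy, heading_deg, out_size=80,
                     head_target=(0.6, 0.5)):
    # rotate around the head so all individuals face the same direction
    # (the sign of the angle depends on how the heading is defined)
    M = cv2.getRotationMatrix2D(tuple(map(float, head_xy)),
                                heading_deg, 1.0)
    # shift the head onto its target location in the output image
    M[0, 2] += head_target[0] * out_size - head_xy[0]
    M[1, 2] += head_target[1] * out_size - head_xy[1]
    return cv2.warpAffine(image, M, (out_size, out_size),
                          flags=cv2.INTER_LINEAR, borderValue=0)
\end{verbatim}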
\subsubsection{Guiding the Training Process} \label{sec:training_quality}
Per batch, the stochastic gradient descent is directed by the local accuracy (the fraction of correct predictions within the batch), which is a simple and commonly used metric that requires no prior knowledge of where the samples within a batch come from. This has the desirable consequence that no knowledge about the temporal arrangement of images is necessary in order to train and, more importantly, to apply the network later on.
In order to achieve accurate results quickly across batches, while at the same time making it possible to point the user to potentially problematic sequences within the video, we devised a metric that can be used to estimate local as well as global training quality: we term it \textit{uniqueness}. It combines information about all objects within a frame, following the principle of non-duplication: images of individuals within the same frame are required to be assigned different identities by the network's predictions.
\begin{featurebox}
\caption{Calculating uniqueness for a frame}
\label{box:uniqueness_score}
\begin{algorithm}[H]
\DontPrintSemicolon
\KwData{
frame $x$
}
\KwResult{Uniqueness score for frame $x$}
uids = map\{\}\;
$\hat{p}\given{i|b}$ is the probability of blob $b$ to be identity $i$\;
$f(x)$ returns a list of the tracked objects in frame $x$\;
$E(v) = \left(1 + \exp(-\pi)\right) / \left(1 + \exp(-\pi v)\right)$ is a shift of roughly $+0.5$ and non-linear scaling of values $0\leq v\leq 1$\;
\;
\ForEach{object $b \in f(x)$}{
$\mathrm{maxid} = \argmax{i} \hat{p}\given{i|b}$ with $i \in \mathrm{identities}$\;
\eIf{maxid $\in$ uids}{
$\mathrm{uids}[\mathrm{maxid}] = \max(\mathrm{uids}[\mathrm{maxid}], \hat{p}(\mathrm{maxid}, b))$
}{
$\mathrm{uids}[\mathrm{maxid}] = \hat{p}(\mathrm{maxid}, b)$
}
}
\Return{$|f(x)|^{-1}|\mathrm{uids}| * E\left(|\mathrm{uids}|^{-1} \left(\sum_{i \in \mathrm{uids}} \mathrm{uids}[i]\right)\right)$}\;
\caption{The algorithm used to calculate the uniqueness score for an individual frame. Probabilities $\hat{p}\given{i|b}$ are predictions by the pre-trained network. During the accumulation these predictions will gradually improve proportional to the global training quality. Multiplying the unique fraction $|f(x)|^{-1}|\mathrm{uids}|$ (the number of uniquely predicted identities divided by the number of objects) by the (scaled) mean probability deals with cases of low accuracy, where individuals switch every frame (but uniquely).}
\end{algorithm}
\end{featurebox}
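A direct transcription of Box \ref{box:uniqueness_score} into Python reads as follows, with the unique fraction written explicitly as the number of uniquely predicted identities divided by the number of tracked objects in the frame:
\begin{verbatim}
# Python transcription of the uniqueness score for a single frame.
# prob_vectors: one predicted probability vector per tracked object.
import numpy as np

def E(v):
    """Shift of roughly +0.5 and non-linear scaling of 0 <= v <= 1."""
    return (1.0 + np.exp(-np.pi)) / (1.0 + np.exp(-np.pi * v))

def uniqueness(prob_vectors):
    uids = {}
    for p in prob_vectors:
        maxid = int(np.argmax(p))
        uids[maxid] = max(uids.get(maxid, 0.0), float(p[maxid]))
    unique_fraction = len(uids) / len(prob_vectors)
    mean_prob = sum(uids.values()) / len(uids)
    return unique_fraction * E(mean_prob)
\end{verbatim}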
The program generates image data for evenly spaced frames across the entire video. All images of tracked individuals within the selected frames are, after every epoch of the training, passed on to the network. It returns a vector of probabilities $p_{ij}$ for each image $i$ to be identity $j\in\{1,\dots,N\}$, with $N$ being the number of individuals. Based on these probabilities, uniqueness can be calculated as in Box \ref{box:uniqueness_score}, evenly covering the entire video. The magnitude of this probability vector per image is taken into account, rewarding strong predictions of $\max_j \left\{ p_{ij} \right\}=1$ and penalizing weak predictions of $\max_j \left\{ p_{ij} \right\} <1$.
Uniqueness is not integrated as part of the loss function, but is instead used as a global gradient, evaluated before and after each training unit in order to detect global improvements. Based on the average uniqueness calculated before and after a training unit, we can determine whether to stop the training, or whether training on the current segment made our results worse (e.g. due to faulty data). If uniqueness is consistently high throughout the video, then training has been successful and we may terminate early. Otherwise, valleys in the uniqueness curve indicate bad generalization and thus currently missing information regarding some of the individuals. In order to detect problematic sections of the video we search for values below $1-\frac{0.5}{N}$, meaning that the section potentially contains new information we should be adding to our training data. Using accuracy per batch, and uniqueness to determine global progress, we get the best of both worlds: a context-free prediction method that is trained on global segments that are strategically selected by utilizing local context information.
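As a small illustration, the check for problematic sections can be expressed as follows (frame indices and the container type are arbitrary):
\begin{verbatim}
# Flag frames whose uniqueness falls below 1 - 0.5/N, i.e. sections that
# likely contain information missing from the training data.
def problematic_frames(uniqueness_per_frame, n_identities):
    threshold = 1.0 - 0.5 / n_identities
    return [f for f, u in sorted(uniqueness_per_frame.items())
            if u < threshold]
\end{verbatim}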
The closest example of such a procedure in \idtracker{} is the termination criterion after \textit{protocol 1}, which states that individual segments have to be consistent and certain enough in all global segments in order to stop iterating. While this seems to be similar at first, the way accuracy is calculated and the terminology here are quite different: (i) Every metric in \idtracker{'s} final assessment after \textit{protocol 1} is calculated at segment-level, not utilizing per-frame information. \textit{Uniqueness} works per-frame, not per segment, and considers individual frames to be entirely independent from each other. It can be considered a much stronger constraint set upon the network's predictive ability, seeing as it essentially counts the number of times mistakes are estimated to have happened within single frames. Averaging only happens \textit{afterwards}. (ii) The terminology of identities being unique is used in \idtracker{} only once, after \textit{protocol 1}, and essentially as a binary value, not recognizing its potential as a descendable gradient. Images are simply added until a certain percentage of images has been reached, at which point accumulation is terminated. (iii) Testing uniqueness is much faster than testing network accuracy across segments, seeing as the same images are tested over and over again (meaning they can be cached) and the testing dataset can be much smaller due to its locality. \textit{Uniqueness} thus provides a stronger gradient estimation, while at the same time being more local (meaning it can be used independently of whether images are part of global segments), as well as more manageable in terms of speed and memory size.
In the next four sections, we describe the training phases of our algorithm (1-3), and how the successfully trained network can be used to automatically correct trajectories based on its predictions (4).
\subsubsection{1. The Initial Training Unit}
All global segments are considered and sorted by the criteria listed in \nameref{sec:accumulation_quality_criteria} below. The most suitable segment (the first in that sorted set) is used as the initial dataset for the network. Images are split into a training and a validation set (4:1 ratio). Efforts are made to equalize the sample sizes per class/identity beforehand, but there is always a trade-off between similar sample sizes (encouraging unbiased priors) and having as many samples as possible available for the network to learn from. Thus, in order to alleviate some of the severity of dealing with imbalanced datasets, performance during training iterations is evaluated using a categorical focal loss function \citep{lin2017focal}. Focal loss down-weights classes that are already reliably predicted by the network and in turn emphasizes neglected classes. An Adam optimizer \citep{kingma2014adam} is used to traverse the loss landscape towards the global (or at least a local) minimum.
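For reference, the categorical focal loss takes the form $-\alpha (1-p_t)^{\gamma} \log(p_t)$ for the probability $p_t$ assigned to the true class. A NumPy sketch is given below; the parameter values are the commonly used defaults from \citep{lin2017focal}, not necessarily the values used in \TRex{}.
\begin{verbatim}
# NumPy sketch of a categorical focal loss; gamma down-weights classes
# that are already predicted reliably. Parameter values are the commonly
# used defaults, not necessarily those used by TRex.
import numpy as np

def categorical_focal_loss(y_true, y_pred, gamma=2.0, alpha=0.25,
                           eps=1e-7):
    """y_true: one-hot (samples x classes); y_pred: softmax output."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    cross_entropy = -y_true * np.log(y_pred)
    weight = alpha * np.power(1.0 - y_pred, gamma)
    return float(np.sum(weight * cross_entropy, axis=-1).mean())
\end{verbatim}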
The network layout used for the classification in \TRex{} (see \figref{fig:software_overview}c) is a typical Convolutional Neural Network (CNN). The concepts of "convolutional" and "downsampling" layers, as well as the back-propagation used during training, are not new. They were introduced in \cite{fukushima1988neocognitron}, inspired originally by the work of Hubel and Wiesel on cats and rhesus monkeys (\citealt{hubel1959receptive}, \citealt{hubel1963receptive}, \citealt{wiesel1966spatial}), describing receptive fields and their hierarchical structure in the visual cortex. Soon afterward, in \cite{lecun1989backpropagation}, CNNs, in combination with back-propagation, were already successfully used to recognize handwritten ZIP codes -- for the first time, the learning process was fully automated; a critical step towards making their application practical, and the reason they are popular today.
The network architecture used in our software is similar to the identification module of the network in \cite{idtrackerai} and is, as in most typical CNNs, (reverse-)pyramid-like. However, key differences between \TRex{'} and \idtracker{'s} procedures lie in the way that training data are prepared (see previous sections) and how further segments are accumulated and evaluated (see next section). Furthermore, contrary to \idtracker{'s} approach, images in \TRex{} are augmented (during training) before being passed on to the network. While this augmentation is relatively simple (a random shift of the image in x-direction), it can help to account for positional noise introduced e.g. by the posture estimation or the video itself when the network is used for predictions later on \citep{perez2017effectiveness}. We do not flip or rotate the image in this step, since this would defeat the purpose of using orientation normalization in the first place (as in \nameref{sec:posture_normalization}, see \figref{fig:datasets_comparison}). Here, in fact, normalization of object orientation (during training and predictions) can be seen as a superior alternative to data augmentation.
The input data for \TRex{'} network is a single, cropped grayscale image of an individual (see \figref{fig:software_overview}c). This image is first passed through a "lambda" layer (blue) that normalizes the pixel values, dividing them by half the value limit ($255 / 2 = 127.5$) and subtracting $1$ -- moving them into the range $[-1,1]$. From then on, sections are a combination of convolutional layers (kernel sizes of 16, 64 and 100 pixels), each followed by a 2D (2x2) max-pooling and a 2D spatial dropout layer (with a rate of 0.25). Within each of these blocks the input data is reduced further, focusing it down to the information deemed important. Towards the end, the data are flattened and flow into a densely connected layer (100 units), followed by an output layer with exactly as many units as the number of classes. The output is a vector with values between $0$ and $1$, which, due to softmax-activation, sum to $1$.
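A \texttt{tf.keras} sketch of such an architecture is given below. The input resolution, kernel sizes, activation functions and the interpretation of 16/64/100 as the number of filters per block are assumptions based on the description above, not a verbatim copy of the network shipped with \TRex{}.
\begin{verbatim}
# Sketch of the described CNN: normalization, three convolutional blocks
# (each with max-pooling and spatial dropout), then dense layers.
import tensorflow as tf
from tensorflow.keras import layers

def build_identification_network(n_classes, input_size=80):
    inputs = tf.keras.Input(shape=(input_size, input_size, 1))
    # normalize pixel values from [0, 255] into [-1, 1]
    x = layers.Lambda(lambda t: t / 127.5 - 1.0)(inputs)
    for filters in (16, 64, 100):
        x = layers.Conv2D(filters, kernel_size=5, padding="same",
                          activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=(2, 2))(x)
        x = layers.SpatialDropout2D(rate=0.25)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(100, activation="relu")(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
\end{verbatim}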
Training commences by performing a stochastic gradient descent (using the Adam optimizer, see \citealt{kingma2014adam}), which iteratively minimizes the error between network predictions and previously known associations of images with identities -- the original assignments within the initial frame segment. The optimizer's behavior in the last five epochs is continuously observed and training is terminated immediately if one of the following criteria is met:
\begin{itemize}[label=\textnormal{$\bullet$}]
\item the maximum number of iterations is reached (150 by default, but can be set by the user)
\item a plateau is achieved at a high per-class accuracy
\item overfitting/overly optimizing for the training data at the loss of generality
\item no further improvements can be made (due to the accuracy within the current training data already being $1$)
\end{itemize}
The initial training unit is also by far the most important, as it determines the predicted identities within further segments that are to be added. Obtaining high-quality training results here thus outweighs the risk of overfitting, and the algorithm has to be relatively conservative regarding termination criteria. Later iterations, however, are only meant to extend an already existing dataset and thus (with computation speed in mind) allow for additional termination criteria to be added (see the sketch after the following list):
\begin{itemize}[label=\textnormal{$\bullet$}]
\item plateauing at/circling around a certain \protect\path{val_loss} level
\item plateauing around a certain uniqueness level
\end{itemize}
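The sketch below condenses these termination checks into a single function. The five-epoch window follows the description above; the individual thresholds are illustrative assumptions, not \TRex{'} default values.
\begin{verbatim}
# Sketch of the termination checks applied to the last five epochs.
# history: list of {'acc': ..., 'val_acc': ...} entries, one per epoch.
def should_stop(history, max_epochs=150, plateau_eps=1e-3,
                high_accuracy=0.97, overfit_margin=0.1):
    if len(history) >= max_epochs:
        return True                      # iteration limit reached
    if len(history) < 5:
        return False
    accs = [h["acc"] for h in history[-5:]]
    val_accs = [h["val_acc"] for h in history[-5:]]
    if max(accs) - min(accs) < plateau_eps and \
            min(val_accs) >= high_accuracy:
        return True                      # plateau at high accuracy
    if min(accs) - max(val_accs) > overfit_margin:
        return True                      # training >> validation accuracy
    if min(accs) >= 1.0:
        return True                      # nothing left to learn
    return False
\end{verbatim}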
\subsubsection{2. Accumulation of Additional Segments and Stopping-Criteria}
If necessary, initial training results can be improved by adding more samples to the active dataset. This could be done manually by the user, always trying to select the most promising segment next, but requiring such manual work is not acceptable for high-throughput processing. Instead, in order to translate this idea into features that can be calculated automatically, the following set of metrics is re-generated per (yet inactive) segment after each successful step:
\begin{enumerate} \label{sec:accumulation_quality_criteria}
\item Average uniqueness index (rounded to an integer percentage in 5\% steps)
\item Minimal distance to regions that have previously been trained on (rounded to the next power of two), larger is better as it potentially includes samples more different from the already known ones
\item Minimum \textit{cells visited} per individual (larger is better for the same reason as 2)
\item Minimum average samples per individual (larger is better)
\item Whether its image data has already been generated before (mostly for saving memory)
\item The uniqueness value is smaller than $U_{prev}^2$ after 5 steps, with $U_{prev}$ being the best uniqueness value previous to the current accumulation step
\end{enumerate}
With the help of these values, the segment list is sorted and the best segment is selected to be considered next. Adding a segment to the set of already active samples requires us to correct the identities inside it, potentially switching temporary identities so that they represent the same \textit{real} identities as in our previous data. This is done by predicting identities for the new samples using the network that has been trained on the old samples. Making mistakes here can lead to significant subsequent problems, so only plausible segments are added -- meaning only those samples are accepted for which the predicted IDs are \textit{unique} within each unobstructed sequence of frames for every temporary identity. If multiple temporary individuals are predicted to be the same real identity, the segment is saved for later and the search continues.
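In essence, a candidate segment passes this plausibility check only if the (averaged) predictions map each of its temporary identities onto a distinct real identity, as in the following sketch (the data layout is hypothetical):
\begin{verbatim}
# Accept a candidate segment only if the predicted identities are unique
# across its temporary identities (tracks).
import numpy as np

def plausible_assignment(predictions_per_track):
    """predictions_per_track: {track_id: [probability vectors, ...]}"""
    assigned = {}
    for track_id, vectors in predictions_per_track.items():
        mean_prob = np.mean(vectors, axis=0)
        predicted = int(np.argmax(mean_prob))
        if predicted in assigned.values():
            return None        # two tracks claim the same real identity
        assigned[track_id] = predicted
    return assigned            # {track_id: predicted real identity}
\end{verbatim}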
If multiple additional segments are found, the program tries to actively improve local uniqueness valleys by adding samples first from regions with comparatively \textit{low} accuracy predictions. Seeing as low-accuracy regions will also most likely fail to produce unique predictions, it is important to emphasize here that this is generally not a problem for the algorithm: failed segments are simply ignored and can be inserted back into the queue later. Smoothing the uniqueness curve also ensures that regions close to valleys are preferred, making the algorithm follow the valley walls upwards in both directions.
Finishing a training unit does not necessarily mean that it was successful. Only network states that improve upon the results from previous units are considered and saved. Any training result -- except the initial one -- may be rejected after training if the uniqueness score has neither improved globally nor at least remained within 99\% of the previous best value. This ensures stability of the process, even with tracking errors present (which can be corrected for later on, see next section). If a segment is rejected, the network is restored to the best recorded state.
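Condensed into code, this acceptance rule amounts to:
\begin{verbatim}
# A training unit is kept only if global uniqueness improved, or at
# least stayed within 99% of the best value recorded so far.
def accept_training_unit(uniqueness_after, best_uniqueness):
    return uniqueness_after >= 0.99 * best_uniqueness
\end{verbatim}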
Each new segment is always combined with regularly sampled data from previous steps, ensuring that identities do not switch back and forth between steps due to uncertain predictions. If switching did occur, the uniqueness and accuracy values could never reach high-value regimes -- leading to the training unit being discarded as a result. The contribution of each previously added segment $R$ is limited by sub-sampling it with a step size of $\ceil{|R_S| / ( \mathrm{samples\_max} * |R| / N )}$, i.e. only every $n$-th image is kept, with $N$ being the total number of frames in global segments for this individual and $\mathrm{samples\_max}$ a constant that is calculated from image size and memory constraints (1GB by default). $R_S$ is the actual \textit{usable} number of images in segment $R$. This limitation is an attempt not to bias the priors of the network, by sub-sampling segments according to their contribution to the total number of frames in global segments.
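A worked example of this sub-sampling step is shown below; the value chosen for $\mathrm{samples\_max}$ is purely illustrative.
\begin{verbatim}
# Worked example: keep only every n-th image of a previously added
# segment R. samples_max is an arbitrary illustrative value.
import math

def step_size(usable_images, segment_length, total_global_frames,
              samples_max=10000):
    per_class_budget = samples_max * segment_length / total_global_frames
    return max(1, math.ceil(usable_images / per_class_budget))

# a segment of 2000 frames with 1800 usable images, out of 20000
# global-segment frames in total for this individual:
print(step_size(1800, 2000, 20000))   # -> 2, i.e. keep every 2nd image
\end{verbatim}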
Training is considered to be globally successful as soon as either (i) the accumulated per-individual gaps between sampled regions amount to less than 25\% of the video length for all individuals, or (ii) uniqueness has reached a value higher than \changemade{
\inlineequation[eq:gooduniqueness]{1-\frac{0.5}{N_{\mathrm{id}}}} }
so that almost all detected identities are present exactly once per frame. Otherwise, training will be continued as described above with additional segments -- each time extending the percentage of images seen by the network further.
Training accuracy/consistency could potentially be further improved by letting the program add an arbitrary number of additional segments; however, we found this not to be necessary in any of our test-cases. Users can set a custom limit if required for their specific cases.
\subsubsection{3. The Final Training Unit}
After the accumulation phase, one last training step is performed. In previous steps, validation data has been kept strictly separate from the training set to get a better gauge of how generalizable the results are to unseen parts of the video. \changemade{This is especially important during early training units, since "overfitting" is much more likely to occur in smaller datasets and we still potentially need to add samples from different parts of the video. Now that we are not going to extend our training dataset anymore, maintaining generalizability is no longer the main} objective -- so why not use \textit{all} of the available data? The entire dataset is simply merged and sub-sampled again, according to the memory strategy used. Network training is started, with a maximum of $\max\{ 3; \mathrm{max\_epochs} * 0.25 \}$ iterations (max\_epochs is 150 by default). During this training, the same stopping-criteria apply as during the initial step.
\changemade{Even if we tolerate the risk of potentially overfitting on the training data, there is still a way to detect} overfitting if it occurs: \changemade{Only training steps that lead to improvements in mean uniqueness across the video} are saved. \changemade{Often, if prediction results become worse (e.g. due to overfitting), multiple individuals in a single frame are predicted to be the same identity -- precisely the problem which our uniqueness metric was designed to detect.}
\changemade{F}or some videos, this is the step where most progress is made (e.g. \videoref{vid:15locusts1h}). The reason is that this is the first time that all of the training data from all segments is considered at once (instead of mostly the current segment plus fewer samples from previously accepted segments), and samples from all parts of the video \changemade{have} an equal likelihood of being used in training after possible reduction due to memory-constraints.
\subsubsection{4. Assigning Identities Based on Network Predictions}
After the network has been successfully trained, all parts of the video which were not part of the training are packaged together and the network calculates predictive probabilities for each image of each individual to be any of the available identities. The vectors returned by the network are then averaged per consecutive segment per individual. The average probability vectors of all overlapping segments are weighed against each other -- usually forcing assignment to the most likely identity (ID) for each segment, given that no other segments have similar probabilities. A segment, in this context, simply refers to a number of consecutive frames of one individual for which the tracker is fairly sure that \textit{no} mix-ups occurred. We also implemented a way to detect tracking mistakes, which is described at the end of this section.
If an assignment is ambiguous, meaning that multiple segments $S_j, \dots, S_M$ overlapping in time have the same maximum probability index $\argmax{i\in[0,N]} \left\{ P\given{i|S_j} \right\}$ (for the segment to belong to a certain identity $i$), a decision has to be made. Assignments are deferred if the ratio
$$ R_\mathrm{max} = \max\left\{
\frac{P\given{i | S_j}}{P\given{i | S_k}}, \forall S_{j\not= k}\in \mathrm{\ overlapping\ segments} \right\} $$
between any two maximal probabilities is \textit{larger than} $0.6$ for said $i$ ($R_\mathrm{max}$ is inverted if it is greater than $1$). In such a case, we rely on the general purpose tracking algorithm to pick a sensible option -- other identities might even be successfully assigned (using network predictions) in following frames, which is a complexity we do not have to deal with here. In case all ratios are \textit{below} $0.6$, when the best choices per identity are not too ambiguous, the following steps are performed to resolve remaining conflicts:
\begin{enumerate}
\item count the number of samples $N_{me}$ in the current segment, and the number of samples $N_{he}$ in the other segment that this segment is compared to
\item calculate average probability vectors $P_{me}$ and $P_{he}$
\item if $S(P_{me}, N_{me}) \ge S(P_{he}, N_{he})$, then assign the ID in question to the current segment, otherwise to the other segment, where:
\begin{equation}
\begin{split}
\mathrm{norm}(x) = \frac{x}{N_{me} + N_{he}},\ &
\mathrm{sig}(x) = \left(1 + e^{2\pi(0.5-x)}\right)^{-1} \\
S(p,x) = \mathrm{sig}(p) &+ \mathrm{sig}(\mathrm{norm}(x)) .
\end{split}
\end{equation}
\end{enumerate}
This procedure prefers segments with larger numbers of samples over segments with fewer samples, ensuring that identities are not switched around randomly whenever a short segment (e.g. of noisy data) is predicted to be the given identity for a few frames -- at least as long as a better alternative is available. The non-linearity in $S(p,x)$ exaggerates differences between lower values and dampens differences between higher values: For example, the quality of a segment with $4000$ samples is barely different from a segment with $5000$ samples; however, there is likely to be a significant quality difference between segments with $10$ and $100$ samples.
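The comparison itself can be written compactly as follows; here, $p$ stands for the averaged probability of the identity in question within the respective segment, which is an interpretation of the probability vectors above.
\begin{verbatim}
# Tie-breaking between two overlapping segments competing for the same
# identity, using S(p, x) = sig(p) + sig(norm(x)) as defined above.
import math

def sig(x):
    return 1.0 / (1.0 + math.exp(2.0 * math.pi * (0.5 - x)))

def S(p, n_samples, n_total):
    return sig(p) + sig(n_samples / n_total)

def assign_to_current(p_me, n_me, p_he, n_he):
    total = n_me + n_he
    return S(p_me, n_me, total) >= S(p_he, n_he, total)
\end{verbatim}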
In case something goes wrong during the tracking, e.g. an individual is switched with another individual without the program knowing that this might have happened, the training might still be successful (for example, if that particular segment has not been used for training). In such cases, the program tries to correct for identity switches mid-segment by calculating a running-window median identity throughout the whole segment. If the predicted identity switches for a significant length of time, the segment is split up at the point of the first change within the window (before identities are assigned to segments) and the two parts are handled as separate segments from then on.
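A simple version of such a running-window median check might look like this (the window size is an illustrative assumption):
\begin{verbatim}
# Detect a sustained identity switch within a segment by comparing a
# running-window median of predicted identities to the initial median.
import statistics

def find_switch(predicted_ids, window=25):
    """Return the index of the first sustained change, or None."""
    if len(predicted_ids) < window:
        return None
    reference = statistics.median_low(predicted_ids[:window])
    for i in range(len(predicted_ids) - window + 1):
        if statistics.median_low(predicted_ids[i:i + window]) != reference:
            return i            # split the segment at this point
    return None
\end{verbatim}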
\section{Software and Licenses}
\TRex{} is published under the GNU GPLv3 license (see \href{https://choosealicense.com/licenses/gpl-3.0/}{here} for permissions granted by GPLv3). All of the code has been written by the first author of this paper (a few individual lines of code from other sources have been marked inside the code). While none of the following libraries are distributed alongside \TRex{} (they have to be provided separately), they are used by it: OpenCV (\href{https://opencv.org/about/}{opencv.org}) is a core library, used for all kinds of image manipulation. GLFW (\href{https://www.glfw.org}{glfw.org}) helps with opening application windows and maintaining graphics contexts, while DearImGui (\href{https://github.com/ocornut/imgui}{github.com/ocornut/imgui}) provides some more abstractions regarding graphics. \texttt{pybind11} (\cite{pybind11}) is used for Python integration within the C++ environment. miniLZO (\href{http://www.oberhumer.com/opensource/lzo/\#minilzo}{oberhumer.com/opensource/lzo}) is used for compression of PV frames. Optional bindings are available for the FFMPEG (\href{http://ffmpeg.org}{ffmpeg.org}) and libpng libraries, if present. GNU Libmicrohttpd (\href{https://www.gnu.org/software/libmicrohttpd/}{gnu.org/software/libmicrohttpd}), if available, can be used for an HTTP interface to the software, but is non-essential.
\section{Acknowledgments}
We thank A. Albi, F. Nowak, H. Hugo, D. E. Bath, F. Oberhauser, H. Naik, J. Graving, and I. Etheredge for their insights, for providing videos, for comments on the manuscript, for testing the software and for frequent coffee breaks during development. The development of this software would not have been possible without them. \changemade{We thank D. Mink and M. Groettrup for providing additional video material of mice. We thank the reviewers and editors for their constructive and useful comments and suggestions.} IDC acknowledges support from the NSF (IOS-1355061), the Office of Naval Research grant (ONR, N00014-19-1-2556), the Struktur- und Innovationsfunds f\"{u}r die Forschung of the State of Baden-W\"{u}rttemberg, the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy--EXC 2117-422037984, and the Max Planck Society.
\bibliography{elife-sample}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% APPENDICES
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\appendix
\setcounter{table}{0}
%\textit{\textbf{Appendix \arabic{appendix}