manuscript.Rnw

% Template for PLoS
% Version 3.4 January 2017
% -- FIGURES AND TABLES
%
% Please include tables/figure captions directly after the paragraph where they are first cited in the text.
%
% DO NOT INCLUDE GRAPHICS IN YOUR MANUSCRIPT
% - Figures should be uploaded separately from your manuscript file. 
% - Figures generated using LaTeX should be extracted and removed from the PDF before submission. 
% - Figures containing multiple panels/subfigures must be combined into one image file before submission.
% For figure citations, please use "Fig" instead of "Figure".
% See http://journals.plos.org/plosone/s/figures for PLOS figure guidelines.
%
% Tables should be cell-based and may not contain:
% - spacing/line breaks within cells to alter layout or alignment
% - do not nest tabular environments (no tabular environments within tabular environments)
% - no graphics or colored text (cell background color/shading OK)
% See http://journals.plos.org/plosone/s/tables for table guidelines.
%
% For tables that exceed the width of the text column, use the adjustwidth environment as illustrated in the example table in text below.
%
% % % % % % % % % % % % % % % % % % % % % % % %

\documentclass[10pt,letterpaper]{article}
\usepackage[top=0.85in,left=2.75in,footskip=0.75in]{geometry}

\usepackage{amsmath,amssymb}
\usepackage{changepage}
\usepackage[utf8x]{inputenc}
\usepackage{textcomp,marvosym}
\usepackage{cite}
\usepackage{nameref,hyperref}
\usepackage[right]{lineno}
\usepackage{microtype}
\DisableLigatures[f]{encoding = *, family = * }
\usepackage[table]{xcolor}
\usepackage{array}

% create "+" rule type for thick vertical lines
\newcolumntype{+}{!{\vrule width 2pt}}

% create \thickcline for thick horizontal lines of variable length
\newlength\savedwidth
\newcommand\thickcline[1]{%
  \noalign{\global\savedwidth\arrayrulewidth\global\arrayrulewidth 2pt}%
  \cline{#1}%
  \noalign{\vskip\arrayrulewidth}%
  \noalign{\global\arrayrulewidth\savedwidth}%
}

% \thickhline command for thick horizontal lines that span the table
\newcommand\thickhline{\noalign{\global\savedwidth\arrayrulewidth\global\arrayrulewidth 2pt}%
\hline
\noalign{\global\arrayrulewidth\savedwidth}}

% Remove comment for double spacing
%\usepackage{setspace} 
%\doublespacing

% Text layout
\raggedright
\setlength{\parindent}{0.5cm}
\textwidth 5.25in 
\textheight 8.75in

% Bold the 'Figure #' in the caption and separate it from the title/caption with a period
% Captions will be left justified
\usepackage[aboveskip=1pt,labelfont=bf,labelsep=period,justification=raggedright,singlelinecheck=off]{caption}
\renewcommand{\figurename}{Fig}

% Use the PLoS provided BiBTeX style
\bibliographystyle{plos2015}
%\bibliographystyle{apalike}

% Remove brackets from numbering in List of References
\makeatletter
\renewcommand{\@biblabel}[1]{\quad#1.}
\makeatother

% Leave date blank
\date{}

% Header and Footer with logo
\usepackage{lastpage,fancyhdr,graphicx}
\usepackage{epstopdf}
\pagestyle{myheadings}
\pagestyle{fancy}
\fancyhf{}
\setlength{\headheight}{27.023pt}
\lhead{\includegraphics[width=2.0in]{PLOS-submission.eps}}
\rfoot{\thepage/\pageref{LastPage}}
\renewcommand{\footrule}{\hrule height 2pt \vspace{2mm}}
\fancyheadoffset[L]{2.25in}
\fancyfootoffset[L]{2.25in}
\lfoot{\sf PLOS}


\reversemarginpar
%\usepackage[colorinlistoftodos]{todonotes}
\usepackage[colorinlistoftodos,disable]{todonotes}

%% Include all macros below
\newcommand{\citationneeded}[2][]{\todo[color=blue, fancyline, #1]{\textbf{Citation Needed:} #2}}
\newcommand{\TODO}[2][]{\todo[color=red, fancyline, #1]{\textbf{TODO:} #2}}


\newcommand{\lorem}{{\bf LOREM}}
\newcommand{\ipsum}{{\bf IPSUM}}

%% END MACROS SECTION


\begin{document}
\vspace*{0.2in}

% Title must be 250 characters or less.
\begin{flushleft}
{\Large
\textbf\newline{Rethomics: an R framework to analyse high-throughput behavioural data} 
}
\newline
% Insert author names, affiliations and corresponding author email (do not include titles, positions, or degrees).
\\
Quentin Geissmann\textsuperscript{1*},
Luis Garcia Rodriguez\textsuperscript{2},
Esteban J. Beckwith\textsuperscript{1},
Giorgio F. Gilestro\textsuperscript{1*}
\\
\bigskip
\textbf{1} Department of Life Sciences, Imperial College London, London, United Kingdom
\\
\textbf{2} Institute for Neuro- and Behavioral Biology, Westf{\"a}lische Wilhelms University, 48149 M{\"u}nster, Germany
\\
\bigskip


% Use the asterisk to denote corresponding authorship and provide email address in note below.
* qgeissmann@gmail.com, giorgio@gilest.ro

\end{flushleft}
% Please keep the abstract below 300 words
\section*{Abstract}
%Ethomics, a quantitative and high-throughput approach to animal behaviour, is a new and exciting field.
The recent development of automatised methods to score various behaviours on a large number of animals
provides biologists with an unprecedented set of tools to decipher these complex phenotypes. 
Analysing such data comes with several challenges that are largely shared across acquisition platform and paradigms.
Here, we present \texttt{rethomics}, a set of \texttt{R} packages that unifies the analysis of behavioural datasets in an efficient and flexible manner.
%We implemented it in \texttt{R}\cite{r_core_team_r:_2017} since it is widely taught and adopted by computational biologists.
%It is also complemented with a vast ecosystem of open-source packages.
\texttt{rethomics} offers a computational solution to storing, manipulating and visualising large amounts of behavioural data.
We propose it as a tool to bridge the gap between behavioural biology and data sciences, thus connecting computational and behavioural scientists.
\texttt{rethomics} comes with a extensive documentation as well as a set of both practical and theoretical tutorials (available at \href{https://rethomics.github.io}{https://rethomics.github.io}).

%In this article, we describe it and show an example of its application to the study of circadian rhythms and blooming field of sleep fruit flies.
%The \texttt{rethomics} framework is available and documented at \href{https://rethomics.github.io}{https://rethomics.github.io}.


\listoftodos

% Please keep the Author Summary between 150 and 200 words
% Use first person. PLOS ONE authors please skip this step. 
% Author Summary not valid for PLOS ONE submissions.   
%\section*{Author summary}

% TODO revert for sumbmisson
\linenumbers
\TODO{revert line numbers before submission}
% TODO also embed bib

% Use "Eq" instead of "Equation" for equation citations.
\section*{Introduction}
%The behaviour of an animal is a complex phenotypical manifestation of the interaction between its nervous system, its internal state and its environment.


The biological implications and the determinism of animal behaviour have long been a subject of significant scientific interest.
In the 1960s, behavioural geneticists, such as Seymour Benzer, showed that some apparently complex behaviours could in fact be governed by simple genetic determinants, which connected genetics and behaviour\cite{sokolowski_drosophila_2001}.
In the last few decades, our ability to acquire and analyse vast amounts of biological data has tremendously increased\cite{stephens_big_2015},
deeply transforming both genetics\cite{schatz_biological_2015} and neuroscience\cite{sejnowski_putting_2014}.
%The rise of high-throughput methods and computational approaches, have particularly transformed genetics and neurobiology.
In fact, ethology itself is undergoing its own transition towards data sciences, which has prompted the terms `ethomics' \cite{branson_high-throughput_2009, reiser_ethomics_2009} and `computational ethology'\cite{anderson_toward_2014}. 
It is now accepted that the study of behaviour can also benefit from quantitative sciences such as machine learning, physics and computational linguistics \cite{brown_ethology_2018, berman_measuring_2018}.

%The availability of large amounts of data, in combination with the use of methods borrowed from the data sciences, allow for in-depth quantitative analyses which, in turn, leads to the characterisation of new principles and ultimately a better understanding of their underlying biology\cite{brown_ethology_2018, berman_measuring_2018}.
%In the last few decades, our ability to sequence vast quantities of genetic and phenoypic data has tremendously increased\cite{stephens_big_2015}.
%The scoring of various behavioural variables is certainly not an exception to this trend.
%High-throughput approaches to ethology has been proposed as a new discipline, prompting the concepts of `ethomics' \cite{branson_high-throughput_2009, reiser_ethomics_2009} and `computational ethology'\cite{anderson_toward_2014}.

%In the 1960's, Seymounr Benzer had shown:  behavioural genetics
%

Although general questions regarding the environmental, evolutionary, neural and genetic determinants of behaviours are shared within the community,
the multiplicity of model organisms, hypotheses and paradigms has led to the existence of a very diverse palette of specific recording techniques.
Some tools were developed, for instance,
to continuously record simple behavioural features such as walking activity\cite{faville_how_2015} and  position\cite{pelkowski_novel_2011} over long durations (days or weeks);
to score more complex ones such as feeding\cite{itskov_automated_2014,ro_flic_2014}, aggression\cite{dankert_automated_2009} and courtship\cite{tsai_image_2012};
and to study the behaviour of multiple interacting animals \cite{swierczek_high-throughput_2011,perez-escudero_idtracker_2014,robie_mapping_2017}.
Whilst most recording platforms are unrelated to each other, there are also some attempts to build general purpose tools that can be adapted by researchers to
suit their specific goals \cite{branson_high-throughput_2009,kabra_jaaba_2013,lopes_bonsai_2015,geissmann_ethoscopes_2017}.
However, when it comes to the subsequent analysis of the generated results, there is still no unified programmatic framework that could be used as a set of building blocks in a pipeline.


The fields of structural biology and  bioinformatics are good examples of communities that have taken advantage of sharing standard files formats, modular command line tools\cite{roumpeka_review_2017} and software packages\cite{huber_orchestrating_2015} that can be assembled into pipelines\cite{leipzig_review_2017}.
In these research areas, which are closely linked to data sciences and statistics, scripting interfaces are the standard since they help to deliver reproducible results \cite{peng_reproducible_2011,stodden_toward_2013}.
In addition, they can be used on remote resources such as computer clusters, which makes them more scalable to the context of `big data'\cite{hashem_rise_2015}.
Since many aspects of behaviour analysis are also becoming increasingly dependent on data sciences, the development of such common tools and data structures would be very valuable.

At first, it may seem as though behavioural experiments are prohibitively heterogeneous -- in terms of model organisms, paradigm and timescale -- for a similar community to arise.
However, low-level conceptual consistencies and methodological challenges are shared across experiments.
For instance, the results (\emph{i.e.} the `data')  feature a set of long time series (sometimes multivariate and irregular), but also contain a formal description of the treatment applied to each individual, the `metadata'.
Storing and accessing data and metadata efficiently involves the implementation of a nested data structure which, in principle,
can be shared between acquisition platforms and experimental paradigms.

Here, we describe the \texttt{rethomics} platform, an effort to promote the interaction between behavioural biologists and data scientists.
\texttt{rethomics} is implemented as a collection of interconnected packages, offering solutions to importing, storing, manipulating and visualising large amounts of behavioural results.
We also present two practical examples of its application to the analysis of behavioural rhythm in fruit flies, a widely studied subject.


\section*{Design and Implementation}
\texttt{rethomics} is implemented in \texttt{R}\cite{r_core_team_r_2017}
as a collection of packages (Fig~\ref{fig:fig-1}).
A summary of the functionalities of all packages is represented in a supplementary table (\nameref{S1-Table}).
Such modular architecture follows the model of modern frameworks such as the \texttt{tidyverse}\cite{wickham_tidyverse_2017}, which results in increased testability, maintainability and adaptability.
In this model, each task of the analysis workflow (\emph{i.e.} data import, manipulation and visualisation) is handled by a different package, and new ones can be designed to suit specific needs.
At the core of \texttt{rethomics} lies the \texttt{behavr} table, a structure used to store large amounts of data (\emph{e.g.} position and activity) and metadata (\emph{e.g.} treatment and genotype) in a unique \texttt{data.table}-derived object\cite{dowle_data.table_2017}.
Any input package will import experimental data as a \texttt{behavr} table which can, in turn, be analysed and visualised regardless of the original input platform.
Numerical results and plots are standard objects that can, therefore, be further analysed inside the wide \texttt{R} package ecosystem.

% Place figure captions after the first paragraph in which they are cited.


\begin{figure}[!h]
%	\includegraphics[width=1\textwidth]{fig/fig-1.pdf}
	\caption{{\bf The \texttt{rethomics} workflow.}
		Diagram representing the interplay between, from left to right, the raw data, the \texttt{rethomics} packages (in blue) and the rest of the \texttt{R} ecosystem.}
	\label{fig:fig-1}
\end{figure}


\subsection*{Internal data structure}

Ethomics results can easily scale and data structure therefore gains from being computationally efficient -- both in terms of memory footprint and processing speed.
For instance, there could be very long time series, sampled several times per second, over multiple days, for each individual. %(typically, $k_i > 10^8, \forall i \in [1,n]$) 
In addition, time series can be multivariate -- encoding coordinates, orientation, dimensions, activity, colour intensity and so on.
Furthermore, experiments may  feature a large number of individuals. %(typically $n > 100$).
Each individual is also associated with some metadata: a set of `metavariables' that describe experimental conditions.
For instance, metadata stores information regarding the date and location of the experiment, treatment, genotype, sex, \emph{post hoc} observations and other arbitrary metavariables.
A large set of metavariables is an important asset since they can later be treated as covariates. 

\texttt{behavr} tables link metadata and data within the same object, extending the syntax of \texttt{data.table} to manipulate, join and access metadata (Fig~\ref{fig:fig-2} and \nameref{S1-Fig}).
This approach guarantees that any data point can be mapped correctly to its parent metadata thanks to a shared key (\texttt{id}).
Furthermore, it allows implicit update of metadata when data is altered.
For instance, when data is subsetted, only the remaining individuals should be in the new metadata. 
It is also important that metadata and data can interoperate --
for example, when updating a variable according to the value of a metavariable (say, altering the variable \texttt{x} only for animals with the metavariable \texttt{sex = `male'}).
The online tutorials and documentation provide a detailed set of examples and concrete use cases of \texttt{behavr}. 


\begin{figure}[!h]
% 	\includegraphics[width=1\textwidth]{fig/fig-2.pdf}
	\caption{{\bf \texttt{behavr} table.}
	Illustration of a \texttt{behavr} object, the core data structure in \texttt{rethomics}.
		The metadata holds a single row for each of the $n$ individuals. 
		Its columns, the $p$ metavariables, are one of two kinds: either required -- and defined by the acquisition platform (\emph{i.e.} used to fetch the data) -- or user-defined (\emph{i.e.} arbitrary).
		In the data, each row is a `read' (\emph{i.e.} information about one individual at one time-point).
		It is formed of $q$ variables and is expected to have a very large number of reads, $k$, for each individual $i\in [1,n]$.
		Data and metadata are implicitly joined on the \texttt{id} field.
		Note that the names used for variables and metavariable in this example are only plausible cases which will likely differ in practice. 
	}
	\label{fig:fig-2}
\end{figure}

\subsection*{Data import}
Data import packages translate results from a specific recording platform (\emph{e.g.} text files and relational databases) into a single \texttt{behavr} object.
Currently, we provide two packages: one to import data from single and multi-beam Drosophila Activity Monitor Systems (Trikinetics Inc.) 
and another one for Ethoscopes \cite{geissmann_ethoscopes_2017}.
Although the structure of the raw results is very different, conceptually, loading data is very similar.
In all cases, users must provide a metadata table, with one row per individual,
and featuring both mandatory and optional columns (Fig~\ref{fig:fig-2}).
The mandatory ones are the necessary and sufficient information to fetch data (\emph{e.g.} machine id, region of interest and date). 
The optional columns are user-defined arbitrary fields that translate experimental conditions (\emph{e.g.} treatment, genotype and sex).

In this respect, the metadata file is a standardised and comprehensive data frame describing an experiment.
It explicitly lists all treatments and individuals, which facilitates interspersion of conditions.
%(without it, users are tempted to simplify their experimental design by, for instance, confounding device/location and treatment).
Furthermore, it streamlines the inclusion and analysis of further replicates in the same workflow.
Indeed, additional replicates can simply be added as new rows -- and the \texttt{id} of the replicate later used, if needed, as a covariate.	


\subsection*{Visualisation}

To integrate visualisation in \texttt{rethomics}, we implemented \texttt{ggetho},
a package that offers new tools that extend the widely adopted \texttt{ggplot2}\cite{wickham_ggplot2_2016}.
\texttt{ggetho} makes full use of the internal \texttt{behavr} structure to summarise temporal trends.
We implemented a set of new `layers' and `scales' that particularly applies to the visualisation of long experiments, with the ability to, for instance, display `double-plotted actograms', periodograms, annotate light and dark phases and wrap time over a given period. 
Importantly, \texttt{ggetho} is fully compatible with \texttt{ggplot2}. 
For instance, \texttt{ggplot2} operations such as faceting, transforming axes and adding new layers functions natively with \texttt{ggetho}.

<<'child-results.Rnw', child='results.Rnw'>>=
@

% We present an small example examining the circadian behaviour of 128 fruit flies in recorded in a DAM2 -- a paradigm very widely adopted. 
% A less formal description of this case and others are explained at \href{https://rethomics.github.io/}{https://rethomics.github.io/}.
% In order to prodive a more comprehensive and didactic example, data was altered and simplified.	
% Fig~\ref{fig:experiment} describes our case experiment (A) and the corresponding metadata (B).
% Briefly, three monitors were used each containing 16 males and 16 females of a different genotype (A, B and C).
% Two replicates were performed at different times.
% raw data and metadata files are publicly available TODO cite zenodo


\section*{Discussion}

The framework we presented in this article describes a general computational set of tools to store, manipulate and analyse behavioural data. 
As a practical case study, in this article, we showed how \texttt{rethomics} can readily be used to analyse circadian and sleep data.
As such, it already fills a clear gap in the sleep and circadian community by providing an open, extensible and programmatic toolbox as opposed to a rigid array of GUI tools which cannot interoperate. 
In particular, within the R environment, rethomic features a streamlined workflow (Fig.~\ref{fig:fig-1}): data loading; quality control (Fig.~\ref{fig:fig-3}); per-individual visualisation (Fig.~\ref{fig:fig-4}A) and computation (Fig.~\ref{fig:fig-4}B); group averages and statistical analysis (Fig.~\ref{fig:fig-5}), ultimately resulting in publication-quality figures. 
In addition to the canonical analysis we presented in this study, the richness of the \texttt{R} environment makes \texttt{rethomics} particularly interesting for exploratory work: as a case in point, we showed here the use of continuous wavelet transform to study the modulation of walking pace (Fig.~\ref{fig:fig-6}). Altogether, \texttt{rethomics} will allow the community to share solutions for the analysis and visualisation of behavioural data originated from a plethora of recording equipment. 
By design, \texttt{rethomics} is meant to be flexible and certainly not limited to the study of circadian rhythms.
In fact, \texttt{behavr} tables (Fig.~\ref{fig:fig-2}) are a general structure to link data to metadata that come with no assumption regarding the type and number of variables stored.
Therefore, we anticipate it will apply to a variety of contexts and scales.
In the last decade, multiple devices have been developed -- either commercially or academically -- to record longitudinal data on several animal models (i.e. feeding, metabolic activity, heart rate, position and so on).
All these devices and behavioural paradigm could take advantage of the skeletal structure in \texttt{rethomics} and lead to  the development of device-specific packages.
Such an approach becomes particularly useful in order to compare methods or study behaviours that are recorded by different devices on the same individual animals.


\section*{Availability}
All \texttt{rethomics} packages are listed at \href{https://rethomics.github.io/intro.html}{https://rethomics.github.io/intro.html}.
and are hosted on the official Comprehensive \texttt{R} Archive Network (CRAN) under the terms of the GPLv3 license. 
Installation instructions, as well as reproducible demos and tutorials, are available at
\href{https://rethomics.github.io/}{https://rethomics.github.io/}.
All packages are continuously integrated and unit tested on several versions of \texttt{R} to minimise the risk of present and future issues.

\section*{Supporting information}

% Include only the SI item label in the paragraph heading. Use the \nameref{label} command to cite SI items in the text.


\paragraph*{S1 Table.}
\label{S1-Table}
{\bf List of packages and main functionalities implemented in \texttt{rethomics}.}
The last column shows a list of user-space functions for each role.
Extensive list of all functions is available in the documentation of each package.

\paragraph*{S1 Fig.}
\label{S1-Fig}
{\bf Non-exhaustive list of uses of a \texttt{behavr} table (referred as \texttt{dt}).}
In addition to operations on data, which are inherited from \texttt{data.table},
we provide utilities designed specifically to act on both metadata and data.  
Commented examples are prefixed by `\texttt{>}'.


\paragraph*{S2 Fig.}
\label{S2-Fig}
{\bf Complete version of Fig~\ref{fig:fig-4}.}
See Fig~\ref{fig:fig-4} for legend.


\section*{Acknowledgements}
We would like to thank people who have directly or indirectly contributed to the this manuscript.
In particular, Han Kim, for his invaluable comments on the early versions of \texttt{rethomics} and his dedicated contribution to the tutorials;
Maite Ogueta and Ralf Stanewsky, for making the DAMS results data available;
Anne Petzold, Alice French, Hannah Jones, Diana Bicazan and Florencia Fernandez-Chiappe for their comments as early users;
Marcus Ghosh and Tara Kane for their feedback on the manuscript;
Brenna Williams, for her help to support multi-beam DAMS;
Patrick Kr{\"a}tschmer, for his time discussing the conceptual framework.


% Include only the SI item label in the paragraph heading. Use the \nameref{label} command to cite SI items in the text.
%\paragraph*{S1 Fig.}
%\label{S1_Fig}
%{\bf Bold the title sentence.} Add descriptive text after the title of the item (optional).

\nolinenumbers

% Either type in your references using
% \begin{thebibliography}{}
% \bibitem{}
% Text
% \end{thebibliography}
%
% or
%
% Compile your BiBTeX database using our plos2015.bst
% style file and paste the contents of your .bbl file
% here. See http://journals.plos.org/plosone/s/latex for 
% step-by-step instructions.
% 


\bibliography{manuscript}{}


%\begin{thebibliography}{10}
%\end{thebibliography}


\end{document}