Analysis workflow

Sangeeta Bhatia edited this page Jan 9, 2025 · 22 revisions

This document provides an overview of the PERG analysis workflow as discussed at the Hackathon on 11 April 2024. This is not set in stone, but we should work according to this document until we as a group decide to update it.

We have two packages related to this project:

  • priority-pathogens: the repository where we perform our pathogen analyses. Although it is public, it is not intended to be public-facing, and it is not currently an R package.
  • epireview: our public-facing R (data) package which has two components: the cleaned data and functionality to view and interrogate the data.

We can think of priority-pathogens as a use case of epireview. When developing in epireview, we must always be mindful that it should be a standalone R package, so we want to minimise dependencies on other packages.

Here we outline the steps in the process for pathogen teams in producing their analyses for each pathogen. Note throughout that we are working in orderly2.

Please see example workflow scripts from the Ebola and Lassa teams for reference, and look at Figures 1 and 2 at the end of this document for example schematics.

priority-pathogens workflow

When you are ready to begin analysis on a specific pathogen (typically after extractions are completed),

  • Check out a new branch in the priority-pathogens repo from main.
  • Extract data from the databases and compile the single- and double-extracted files into one file (see below for details).
  • Create new orderly tasks for analysis of models, outbreaks and each parameter.
  • Develop one task at a time, and when a logical chunk of your work is ready (most commonly a single orderly task), raise a PR to merge your work into main. Before doing this, make sure that your task runs as an orderly task on your machine and on someone else's machine. For instance, if you are developing task "mypathogen_task1", then navigate to the root of the orderly project and run:
orderly2::orderly_run("mypathogen_task1")

This should run without any errors. Only then can your work be merged into main.

  • Once the work on your branch has been merged into main, delete your local branch, check out main, pull, and create a new branch for your next piece of work.
  • To ensure your work does not break anything else: if you modify any "common" code, notify the outbreaks channel and ask pathogen leads to run their workflows and check their outputs.

Read databases into R and cleaning

  • Work undertaken in priority-pathogens.
  • Databases are kept on Teams (DIDE – WP/outbreaks/databases/). Database files will never be made public.
  • Data should not be made public until the corresponding preprint has been published, and should therefore remain off both GitHub repos.

There is already an extensive amount of code to do this. Specifically, it:

  • Produces CSV files of extracted information, stratified by extraction type (article, model, outbreak, parameter) and by whether it was single or double extracted.
  • Identifies double extracted data and whether or not they match.
  • Combines the single and double extraction databases together.

  • Each function has pathogen-specific logic that can be extended for a new pathogen (e.g. specific cleaning tasks).
  • This is high-level cleaning; there will likely be further cleaning downstream for specific analyses/tasks.
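As an illustration of the double-extraction check, here is a minimal base-R sketch. All column names and values below are hypothetical, not the actual database schema:

```r
# Hypothetical data: two extractors' versions of the same double-extracted rows,
# plus some single-extracted rows.
single   <- data.frame(id = c(1, 2), value = c(0.5, 1.2))
double_a <- data.frame(id = c(3, 4), value = c(2.0, 3.1))
double_b <- data.frame(id = c(3, 4), value = c(2.0, 9.9))

# Join the two double extractions on a shared ID and flag disagreements.
m <- merge(double_a, double_b, by = "id", suffixes = c("_a", "_b"))
m$match <- m$value_a == m$value_b

mismatches <- m[!m$match, ]  # rows needing manual resolution (here: id 4)

# Combine single extractions with the double extractions that agree.
combined <- rbind(single, double_a[double_a$id %in% m$id[m$match], ])
```

The real pipeline is more involved, but the core idea is the same: join on a common identifier, flag mismatches for manual resolution, and stack the agreed rows into one file.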

Data analysis

As we work on more pathogens, a lot of the functionality that you need for your analysis is likely to have been written already. These generic functions are hosted in epireview, and you can call them from priority-pathogens (see the epireview repository on GitHub for installation instructions).
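To make epireview's functions available in your priority-pathogens session, something like the following should work (the installation route is a sketch; if the epireview repository's own instructions differ, follow those):

```r
# Install the development version of epireview from GitHub, then load it.
# install.packages("remotes")  # if remotes is not already installed
remotes::install_github("mrc-ide/epireview")
library(epireview)
```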

  • Work undertaken in priority-pathogens.
  • Each analysis task can be roughly categorised as follows:
  1. Code exists which is useable already for my use-case;
  2. Code exists but needs minor adaptations for my use-case;
  3. Nothing exists that is useable for my use-case.
  • Over time, case 2 is likely to be the most common situation you find yourself in.
  • Below we specify the actions required for each of the three cases above. For cases 2 and 3, there are further steps required, both for your analysis and for updating epireview based on your work.

Case 1: Code already useable for my use-case

No further actions!

Case 2: Code exists but needs minor adaptations for my use-case

  • Wherever the function is housed (this should be epireview), create a new branch whose name relates to the update you are making to the function.
  • Add your logic to the function so that it performs as you need, being extremely careful not to change any of the existing functionality, as this could break workflows for other pathogens.
  • When you are happy with your additions, test that everything works as you expect, thinking of the extreme cases that are most likely to break the function.

If your tests fail: go back to the previous step and repeat.

If your tests pass: update the function documentation.

  • Raise a pull request into the develop branch.
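A common pattern for Case 2 is adding pathogen-specific behaviour behind a new argument with a safe default, so that existing callers are untouched. A toy sketch (the function and argument below are hypothetical, not real epireview code):

```r
# Existing behaviour: trim whitespace. The new, pathogen-specific behaviour
# only triggers when the new argument is supplied, so existing callers see
# no change.
clean_values <- function(x, pathogen = NULL) {
  x <- trimws(x)
  if (identical(pathogen, "lassa")) {
    x <- toupper(x)  # hypothetical lassa-specific cleaning step
  }
  x
}

clean_values("  abc ")            # existing callers: unchanged result "abc"
clean_values("  abc ", "lassa")   # new use-case: "ABC"
```

Defaulting the new argument to NULL is what makes the change backwards-compatible: every existing call site keeps its old behaviour without edits.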

Case 3: Nothing exists that is useable for my use-case

  • Where this function is housed depends on its generalisability and general utility. In the majority of cases we want the function to go into epireview, but in some cases (e.g. because of package dependencies) we may want it to stay in priority-pathogens.
  • Remain working in priority-pathogens while you develop your new function.
  • Write your function ensuring it is as generalisable as possible, so that it could potentially be reused by other pathogens.
  • We then follow steps similar to Case 2 to add the function to epireview:

Create a new branch with the function name as the branch name.

Test that it is working as you expect, thinking of the extreme cases that are most likely to break the function.

If your tests fail: go back to the previous step and repeat.

If your tests pass: update the function documentation.

  • Raise a pull request into the develop branch.
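"Extreme cases" here means inputs like empty vectors, all-NA data, or a single row. A base-R sketch of such checks (safe_mean is a made-up example function; in the package itself, tests would typically live under tests/testthat/):

```r
# A small example function: mean that tolerates NAs and empty input.
safe_mean <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA_real_)  # extreme case: nothing left to average
  mean(x)
}

# Checks covering the normal case and two extreme cases.
stopifnot(
  safe_mean(c(1, 2, 3)) == 2,
  safe_mean(c(1, NA)) == 1,        # NAs dropped before averaging
  is.na(safe_mean(numeric(0)))     # empty input returns NA rather than erroring
)
```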

Live-updating analysis

  • Each pathogen will have an R Markdown file in which the preprint tables and figures can be reproduced, and subsequently updated with the latest data, which may have been added to since publication.
  • Please see detailed instructions for this here.
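Regenerating the report is then a single render call (the file name below is a placeholder for your pathogen's R Markdown file):

```r
# Re-render the pathogen report against whatever data is currently available.
rmarkdown::render("mypathogen_report.Rmd")
```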

Preprint release

  • When the preprint is released, the accompanying cleaned data should be put into epireview.

Top tips from the team

Please add to this list anything you have found particularly useful (or not!) when working on your analyses, to help future pathogen teams.

  • Making tables of the data is great for spotting mistakes; we would iterate between making/updating tables and cleaning.
  • Transmission/genomic parameters (e.g. attack rates, evolutionary rates, substitution rates, …) were the most inaccurately extracted, so pay close attention here.
  • If certain parameter estimates are particularly important to your analysis (e.g. used in maps or meta-analyses), double-check that the values were extracted accurately.
  • You can export a PRISMA flowchart directly from Covidence, which has to be included as one of the figures

Schematic diagrams

[Figure 1: example workflow schematic]

[Figure 2: example workflow schematic]