chapter 19 data set linelist_cleaned.rds - suggestions to better adapt it for more realistic analytical results #151

Open
AmyMikhail opened this issue Apr 8, 2024 · 0 comments


Problem statement:

I recently used the data set from chapter 19 (univariate and multivariable regression) in a tutorial on descriptive statistics and logistic regression for an R coaching group for epidemiologists in MSF. The group meets regularly and suggests topics that have come up during their routine work. You can find the tutorial here.

The tutorial is based on a fictitious case study, in which the aim is to identify risk factors for death. It takes users from descriptive statistics with dplyr and gtsummary through to calculating odds ratios for different potential risk factors. While the data set does provide the opportunity to do this type of analysis (using death as the outcome that defines a case), some of the results didn't make a lot of epidemiological sense. Although one could come up with other plausible explanations for the results, I think it would be more helpful if the data in this data set were redistributed to tell a more realistic story. Some examples below:

  1. The patient population at St Mark's maternity hospital includes quite a number of adult males and children, i.e. people outside the demographic that type of hospital would normally be expected to target. The age and sex distribution in this hospital is not very different to the other hospitals in this data set.
  2. Longer periods between symptom onset and hospitalization in this data set appear to have a protective effect against death (OR < 1), which is contrary to what one might expect.
  3. The range of PCR cycle threshold (Ct) values is unusually narrow. While late Cts (as a proxy for low viral load) are protective against death (OR < 1), as would be expected, the descriptive statistics don't illustrate that (for example, the median Ct for patients who died is higher than for those who recovered).
  4. The infector variable cannot be used in a multivariable model as it causes convergence problems (likely due to small numbers).
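The kind of check behind points 2 and 3 can be sketched as follows: compare the medians by outcome, then fit one univariate logistic regression per variable and read off the odds ratio. This is only a minimal illustration on simulated data, not the analysis from the tutorial. The data frame `toy`, the helper `univariate_or()`, the column names `ct_blood`, `days_onset_hosp` and `died`, and all coefficients are assumptions made for the example.

```r
# Toy stand-in for the linelist: simulated values, assumed column names
set.seed(19)
n <- 500
toy <- data.frame(
  ct_blood        = round(rnorm(n, mean = 21, sd = 1.5)),
  days_onset_hosp = rpois(n, lambda = 2)
)
# Simulated outcome: higher Ct lowers the odds of death (illustrative coefficients)
toy$died <- rbinom(n, 1, plogis(-0.3 - 0.3 * (toy$ct_blood - 21)))

# Descriptive view: median of each variable among survivors (0) and deaths (1)
medians <- aggregate(cbind(ct_blood, days_onset_hosp) ~ died,
                     data = toy, FUN = median)

# Univariate logistic regression: odds ratio with a Wald 95% confidence interval
univariate_or <- function(var, data) {
  fit <- glm(reformulate(var, response = "died"),
             data = data, family = binomial)
  est <- exp(cbind(OR = coef(fit), confint.default(fit)))
  est[var, ]
}

ors <- t(sapply(c("ct_blood", "days_onset_hosp"), univariate_or, data = toy))
print(medians)
print(round(ors, 2))
```

Putting the two outputs side by side makes it easy to spot cases where the direction of the odds ratio and the direction of the difference in medians disagree.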

Suggestions:

  • Redistribute the data so that patients at the maternity hospital are predominantly women of childbearing age
  • One of the other hospitals could have a predominance of cases from early in the outbreak and higher deaths during this period (introduces the possibility to use the data set for time series analysis as well)
  • Possibly have more deaths among those that had longer times to hospitalization (though other stories to explain the opposite result are also possible, see examples in the tutorial).
  • Increase variation in the Ct values so that the descriptive statistics are more interesting (but maintain the relationship between early Ct, as a proxy for high viral load, and death).
  • Make one of the infectors a super-spreader
  • Add more exposure categories to the source column, e.g. healthcare, education, workplace, mass-gathering event and associate one or more of these with a higher risk of death
  • Associate some combination of symptoms or clinical characteristics (low or high BMI, high temperature, cough?) with death.
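As a rough illustration of the first suggestion, hospital assignment can be made to depend on age and sex, so that the maternity hospital mostly receives women of childbearing age while a few anomalies remain. This is a base R sketch on simulated data. The hospital labels, the 15 to 49 age band and the assignment probabilities are all hypothetical choices, not values from the actual data set.

```r
set.seed(151)
n <- 1000

# Simulated patients: assumed column names, arbitrary distributions
toy <- data.frame(
  age    = sample(0:80, n, replace = TRUE),
  gender = sample(c("f", "m"), n, replace = TRUE)
)

# Women aged 15 to 49 (hypothetical definition of childbearing age)
childbearing <- toy$gender == "f" & toy$age >= 15 & toy$age <= 49

# Hypothetical probabilities of being admitted to the maternity hospital;
# the small non-zero value keeps a few anomalies in the data
p_maternity <- ifelse(childbearing, 0.60, 0.02)
to_maternity <- rbinom(n, 1, p_maternity) == 1

# Everyone else is spread evenly over placeholder hospitals
other <- sample(c("Hospital A", "Hospital B", "Hospital C"), n, replace = TRUE)
toy$hospital <- ifelse(to_maternity, "Maternity hospital", other)

# Share of maternity patients who are women of childbearing age
prop_target <- mean(childbearing[toy$hospital == "Maternity hospital"])
print(table(toy$hospital, childbearing))
print(round(prop_target, 2))
```

The same pattern (an assignment probability that depends on other columns) could be reused for the other suggestions, for example making death more likely for a given exposure category.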

Technical:

I was wondering if there is an R package that would facilitate redistributing the data in this way, and there is: the {simstudy} package seems to do exactly this (see vignette here). There are also other options mentioned in this blog. @nsbatra you may already be aware of these, but I thought I would put them here just in case.
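To give an idea of what this could look like, here is a minimal sketch using `defData()` and `genData()` from {simstudy}, assuming the package is installed. The variable names, distributions and coefficients are purely illustrative assumptions and are not taken from the existing data set. The outcome is defined so that a higher Ct lowers the odds of death and a longer delay to hospitalization raises them.

```r
library(simstudy)

set.seed(151)

# Hypothetical data definitions: names and parameters are illustrative only
def <- defData(varname = "ct_value", dist = "normal",
               formula = 28, variance = 36)
def <- defData(def, varname = "days_onset_hosp", dist = "poisson",
               formula = 3)
# Binary outcome on the logit scale, depending on the two variables above
def <- defData(def, varname = "died", dist = "binary",
               formula = "-1 - 0.10 * (ct_value - 28) + 0.20 * days_onset_hosp",
               link = "logit")

sim <- genData(2000, def)

# Check that the simulated data reproduce the intended directions of effect
fit <- glm(died ~ ct_value + days_onset_hosp, data = sim, family = binomial)
ors <- exp(coef(fit))
print(round(ors, 2))
```

Because the relationships are declared explicitly in the definition table, it is straightforward to keep a deliberate anomaly in one variable while making the others tell a more realistic story.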

It is probably good to leave some complexities and anomalies in the data, so I wouldn't necessarily advocate implementing all of the above suggestions. Implementing some of them, however, might help to build a story that makes it easier for people learning analytical approaches in R with this data set to relate it to their own data sets or experience.
