forked from EpiCoronaHack/Hackathon2020
-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathsingapore.Rmd
460 lines (350 loc) · 20.5 KB
/
singapore.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
---
title: "Singapore"
author: "Caroline Colijn and Michelle Coombe"
date: "25/02/2020"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(survminer)
library(survival)
library(tidyverse)
library(lubridate)
library(icenReg)
library(igraph)
library(visNetwork)
library(stringr)
options(digits=3)
```
## Data
Thanks to EpiCoronaHack Cluster team. These data are manually entered from postings from the Government of Singapore website:
* source1: TO CONFIRM WITH LAUREN
```{r}
spdata <- read_csv("~/EpiCoronaHack_Cluster/Clustering/data/COVID-19_Singapore.csv")
spdata <- read_csv("~/Dropbox/Transmission/nCov2019/Singapore/COVID-19_Singapore_feb29_1pm.csv")
# Ensure properly imported
glimpse(spdata)
colSums(is.na(spdata))
# Rename columns 2, 3 and 4 so no spaces
spdata <- rename(spdata, related_cases = starts_with("Related"),
cluster_links = "Cluster links",
relationship_notes = starts_with("Relation"))
# Change date columns into date objects
spdata <- mutate(spdata, presumed_infected_date = dmy(presumed_infected_date),
last_poss_exposure = dmy(last_poss_exposure),
symp_presumed_infector = dmy(symp_presumed_infector),
date_onset_symptoms = dmy(date_onset_symptoms),
date_quarantine = dmy(date_quarantine),
date_hospital = dmy(date_hospital),
date_confirmation = dmy(date_confirmation),
date_discharge = dmy(date_discharge))
# make sure dates parsed properly
range(spdata$presumed_infected_date, na.rm = T)
range(spdata$last_poss_exposure, na.rm = T)
range(spdata$symp_presumed_infector, na.rm = T)
range(spdata$date_onset_symptoms, na.rm = T)
range(spdata$date_quarantine, na.rm = T)
range(spdata$date_hospital, na.rm = T)
range(spdata$date_confirmation, na.rm = T)
range(spdata$date_discharge, na.rm = T)
# Note that case 36 is listed has having symptoms 16 days AFTER being hospitalized; suspect a typo in the month, fixing:
# spdata$date_onset_symptoms[spdata$CaseID==36] <- ymd("2020-01-24")
# Note that the date of symp_presumed_infector for CaseID 79 changed was originally listed as 2020-02-07 (based on online visualizations) but was changed to 2020-02-10, due to Feb 10, 2020 being on the earliest date of onset of symptoms from case 72, as from online info provided, presumed infective contact for CaseID 79 is from 72 (family member), rather than directly from case 52
spdata$symp_presumed_infector[spdata$CaseID == 79] <- ymd("2020-02-10")
# Change symp_presumed_infector to Feb 10, 2020 (date of symptom onset from caseID 72, the presumed infector)
spdata <- filter(spdata, !is.na(date_onset_symptoms)) #Remove all the cases that do not have info on date of symptom onset
# NOTE NOTE 12 of these, but they have a date of confiramation and dates of presumed infection - COULD FIX
```
## Incubation period
The incubation period is the time between exposure and the onset of symptoms. We estimate this directly from the stated start and end times for cases' exposure windows. These are explicitly listed for the Tianjin dataset but in Singapore they are approximated using contact tracing and the route by which a case was exposed. Because it is explicitly about the symptom onset, we remove those who don't have symptom onset defined. (These are a small minority of 12 cases and the alternative would be to impute their symptom onset time using the others' delay to confirmation time. For now, we remove them).
Then, if no other end time for the exposure is given or if the end of the exposure time is after the time of symptom onset, set the last exposure time to the symptom onset time. This is because they must have been exposed before symptom onset. We use four ideas to set the end time for the exposure window:
* 1: the end source is last possible exposure, if this is given
* 2: if it is not given, then we set the end of the exposure window to the time of symptoms of the presumed infector plus a noise term epsilon (eps)
* 3: and if neither the last possible expsure or the symptom time of the presumed infector are given, the last exposure time is set to the time of symptom onset.
* 4 Finally, we do not let the last possible exposure time be later than the time of symptom onset
```{r}
spdata$end_source = spdata$last_poss_exposure # 1 above
eps=4
hasPresInf = which(is.na(spdata$last_poss_exposure) & !(is.na(spdata$symp_presumed_infector))) # 2 above
spdata$end_source[hasPresInf] = spdata$presumed_infected_date[hasPresInf]+eps
hasNone = which(is.na(spdata$last_poss_exposure) & is.na(spdata$symp_presumed_infector)) # 3 above
spdata$end_source[hasNone] = spdata$date_onset_symptoms[hasNone]
spdata$end_source = pmin(spdata$end_source, spdata$date_onset_symptoms) # 4
```
Model the start source
* 1 if the time of presumed infector is given, use that - epsilon
* If it is not given use symptom onset minus say 20 days, based on prior
knowledge
```{r}
spdata$start_source = spdata$presumed_infected_date - eps # 1
spdata$start_source[is.na(spdata$presumed_infected_date)] = spdata$date_onset_symptoms[is.na(spdata$presumed_infected_date)]-20
```
```{r}
spdata$minIncTimes <- spdata$date_onset_symptoms - spdata$end_source
spdata$maxIncTimes <- spdata$date_onset_symptoms - spdata$start_source
```
We use survival analysis in the icenReg package to make parametric estimates, and we use the regular survival package to estimate the time to onset of symptoms.
```{r}
ggsurvplot(
fit <- survfit(Surv(spdata$minIncTimes, spdata$maxIncTimes, type="interval2") ~ 1, data = spdata),
xlab="Days",
ylab = "Overall probability of no symptoms yet")
```
I'll just try one where I stratify by whether the person has a last possible exposure given, or not.
```{r}
spcopy = spdata; spcopy$has_last = as.factor(!(is.na(spdata$last_poss_exposure)))
spcopyfit <- ic_par(Surv(spcopy$minIncTimes, spcopy$maxIncTimes, type="interval2") ~ has_last, data = spcopy, dist = "weibull")
summary(spcopyfit)
getFitEsts(spcopyfit, newdata = data.frame(has_last=as.factor(TRUE)), p
=c(0.025, 0.05, 0.25, 0.5, 0.75, 0.95, 0.975))
getFitEsts(spcopyfit, newdata = data.frame(has_last=as.factor(FALSE)), p
=c(0.025, 0.05, 0.25, 0.5, 0.75, 0.95, 0.975))
# OK - so for those who have a last poss exposure we have inc of 5.22 days , and for everyone, 7.46 days (!) suggesting that using the infected times for those presumed
# infectors is not correct. there are missing intermediate cases.
ggsurvplot(
fit <- survfit(Surv(spcopy$minIncTimes, spcopy$maxIncTimes, type="interval2") ~ spcopy$has_last), data = spcopy,
xlab="Days",
ylab = "Overall probability of no symptoms yet",
surv.median.line = c('hv'))
ggsave("inc_sing_by_haslastexp.pdf", height = 6, width = 8)
```
Based on this result I am going to create a supplementary analysis using only those people who had a last possible exposure; then can create the same for all cases or for just those who don't.
We use interval censoring, because we know only that exposure was some time between the minimum and maximum possible values.
```{r}
# sum(is.na(spdata$minIncTimes)) # 0
# to switch: choose from these two lines
# spfirst = spcopy[which(spcopy$has_last ==TRUE),]
spfirst = spdata
spfit <- ic_par(Surv(spfirst$minIncTimes, spfirst$maxIncTimes, type="interval2") ~ 1, data = spdata, dist = "weibull")
summary(spfit)
spfit_gamma<- ic_par(Surv(spfirst$minIncTimes, spfirst$maxIncTimes, type="interval2") ~ 1, data = spdata, dist = "gamma")
summary(spfit_gamma)
spfit_lnorm = ic_par(Surv(spfirst$minIncTimes, spfirst$maxIncTimes, type="interval2") ~ 1, data = spdata, dist = "lnorm")
summary(spfit_lnorm)
```
The log of the shape parameter is `r spfit$coefficients[1]` $\pm$ `r sqrt(spfit$var[1,1])`, which gives a shape parameter of `r exp(spfit$coefficients[1])` with a 1.96-sd (in the log) giving the range (`r exp(spfit$coefficients[1]-1.96*sqrt(spfit$var[1,1]))`, `r exp(spfit$coefficients[1]+1.96*sqrt(spfit$var[1,1]))`).
Similarly the log scale parameter is `r spfit$coefficients[2]` $\pm$ `r sqrt(spfit$var[2,2])`, which gives a scale parameter of `r exp(spfit$coefficients[2])` with a one-sd (in the log) giving the range (`r exp(spfit$coefficients[2]-1.96*sqrt(spfit$var[2,2]))`, `r exp(spfit$coefficients[2]+1.96*sqrt(spfit$var[2,2]))`).
```{r}
interqs <- getFitEsts(spfit, newdata = NULL, p
=c(0.025, 0.05, 0.25, 0.5, 0.75, 0.95, 0.975)) #
interqs
interqs_gamma <- getFitEsts(spfit_gamma, newdata=NULL, p
=c(0.025, 0.05, 0.25, 0.5, 0.75, 0.95, 0.975))
interqs_gamma
interqs_lnorm <- getFitEsts(spfit_lnorm, newdata=NULL, p
=c(0.025, 0.05, 0.25, 0.5, 0.75, 0.95, 0.975))
interqs_lnorm
```
Information for a table for the paper:
```{r}
# weibull shape scale then quantiles:
c( exp(spfit$coefficients[1]), scale = exp(spfit$coefficients[2]) , getFitEsts(spfit, newdata = NULL, p =c(0.025, 0.5, 0.975)))
# gamma shape, scale then quants
c( shape = exp(spfit_gamma$coefficients[1]), scale = exp(spfit_gamma$coefficients[2]), getFitEsts(spfit_gamma, newdata=NULL, p =c(0.025, 0.5, 0.975)))
# lnorm meanlog, sdlog and quants
c( meanlog = spfit_lnorm$coefficients[1], sdlog = exp(spfit_lnorm$coefficients[2]), getFitEsts(spfit_lnorm, newdata=NULL, p =c(0.025, 0.5, 0.975)))
```
The median is `r interqs[4]` days and the 0.95 at `r interqs[6]`.
Here is a plot of the estimated distribution together with the empirical survival curve from the data.
```{r}
spdays <- seq(0,20, by=0.05)
spdensity <- dweibull(spdays, shape = exp(spfit$coefficients[1]), scale = exp(spfit$coefficients[2]))
spdens_gamma=dgamma(spdays, shape = exp(spfit_gamma$coefficients[1]), scale = exp(spfit_gamma$coefficients[2]))
spdens_lnorm=dlnorm(spdays, meanlog = spfit_lnorm$coefficients[1], sdlog = exp(spfit_lnorm$coefficients[2]))
ggsp = ggsurvplot(
fit=survfit(Surv(spfirst$minIncTimes, spfirst$maxIncTimes, type="interval2")~1, data=spfirst),
xlab="Days", ylab = "Overall probability of no symptoms yet")
pdata <- data.frame(days=rep(spdays,3),
fitsurv=c(1-pweibull(spdays, shape = exp(spfit$coefficients[1]), scale = exp(spfit$coefficients[2])),
1-pgamma(spdays, shape = exp(spfit_gamma$coefficients[1]), scale = exp(spfit_gamma$coefficients[2])),
1-plnorm(spdays, meanlog = spfit_lnorm$coefficients[1], sdlog = exp(spfit_lnorm$coefficients[2]))),distn=c(rep("Weibull", length(spdays)), rep("Gamma",length(spdays)), rep("Lognorm", length(spdays)) ))
ggsp$plot + geom_line(data = pdata, aes(x = days, y = fitsurv,color=distn))
# ggsave(filename = "inc_Sing_firstonly.pdf", width = 8, height = 6)
```
## Serial interval
The simplest serial interval estimate we can make with these data is a direct estimate based on the time of symptoms of the presumed infector, and the time of symptoms of the case. However, this does not account for the fact that the presumed infector is not necessarily the infector.
```{r}
directSI=spdata$date_onset_symptoms - spdata$symp_presumed_infector
directSI=as.numeric(directSI[!is.na(directSI)])
hist(directSI,breaks = 10)
mean(directSI)
sd(directSI)
```
We will estimate the serial interval using the 'interval case to case' approach given in Vink et al (https://academic.oup.com/aje/article/180/9/865/2739204).
The dataset has several instances where a putative infector or contact is known. These are listed in the 'related_cases' column. We first make a graph in which nodes are individuals and edges are present from cases listed as possible sources, to the cases for whom they are possible sources.
```{r}
spnodes <- spdata$CaseID
## How to extract caseIDs from related_cases column - there are multiple values in some cells, separated by commas
spdata$related_cases #7 max within one cell
# Split into separate columns
spdata <- separate(spdata,
col = related_cases,
into = paste("contactID", 1:7, sep = "_"),
fill = "right")
# Turn into numeric values
spdata <- mutate(spdata,
contactID_1 = as.numeric(contactID_1),
contactID_2 = as.numeric(contactID_2),
contactID_3 = as.numeric(contactID_3),
contactID_4 = as.numeric(contactID_4),
contactID_5 = as.numeric(contactID_5),
contactID_6 = as.numeric(contactID_6),
contactID_7 = as.numeric(contactID_7))
# Select down to columns of interest
spedges <- select(spdata, c(CaseID, starts_with("contactID")))
# Remove rows with NAs for at least one contact
spedges <- filter(spedges, !is.na(spedges$contactID_1)) #43 CasesIDs with 1 or more possible contacts
```
That is nice but visNetwork and igraph require an edge list with from, to nodes. So for each row of spedges we create entries like these.
NOTE still need to check whether the related cases came prior to the stated cases.. (but this may come out in the wash, in the ICC method)
```{r}
singedges = data.frame(from=2,to=1)
for (n in 1:nrow(spedges)) {
for (k in 2:ncol(spedges)) {
if (!is.na(spedges[n,k])) {
singedges=rbind(singedges, c(spedges[[n,k]],spedges[[n,1]]))
}
}
}
singedges=singedges[-1,]
# create undirected graph by removing duplicates
undir=data.frame(from = pmin(singedges[,1],singedges[,2]),
to=pmax(singedges[,1], singedges[,2]))
undir=unique(undir)
undir = undir[-which(undir[,1]==undir[,2]),]
fedges = data.frame(from=paste("case",undir[,1],sep=""),
to=paste("case",undir[,2],sep=""))
```
From this edge list we can use visNetwork to visualise the graph. Make 'group' based on source of probably infection. Colours are from the infection source column (but we could have a better colour scheme, like date of symptom onset).
```{r}
# Turn 'presumed_reason' into lower case and get trim any whitespace so don't have issues with case sensitivity, etc
spdata$presumed_reason <- str_to_lower(spdata$presumed_reason)
spdata$presumed_reason <- str_trim(spdata$presumed_reason)
table(spdata$presumed_reason)
sum(is.na(spdata$presumed_reason)) #15 NAs
# Make a new column where we group the 'presumed_reason' under a label (known relationship, gathering, wuhan travel) for each of the above three groups
spdata <- mutate(spdata, presumed_reason_group = case_when(!is.na(str_match(presumed_reason, "symptom onset|via")) ~ "Known relationship",
!is.na(str_match(presumed_reason, "grace")) ~ "Grace Assembly of God",
!is.na(str_match(presumed_reason, "grand")) ~ "Grand Hyatt Singapore",
!is.na(str_match(presumed_reason, "life")) ~ "Life Church",
!is.na(str_match(presumed_reason, "seletar")) ~ "Seletar Aerospace Heights",
!is.na(str_match(presumed_reason, "yong")) ~ "Yong Thai Hang",
!is.na(str_match(presumed_reason, "wuhan|airport")) ~ "Wuhan travel", #'airport' case (CaseID 17) does not have 'wuhan' in reason but does have it under 'Case' column that they are from Wuhan
is.na(presumed_reason) ~ "Unknown",
TRUE ~ "other")) #should not be any other, so is just a double check this has run correctly, especially as dataset grows
table(spdata$presumed_reason_group)
```
```{r}
nodes.df <- data.frame(id=paste("case",spdata$CaseID,sep=""), label=spdata$CaseID, group=spdata$presumed_reason_group)
glimpse(nodes.df)
spdata$graphID = paste("case",spdata$CaseID,sep="")
visNetwork(nodes.df, fedges) %>% visLegend()
```
Now we estimate the serial interval using the ICC method; for this we first construct a graph. The "interval case to case" data are from identifying a putative first infector each small cluster in the graph, and finding the times between symptom onset in the first observed case and the others. See Vink et al.
```{r}
sgraph = graph_from_edgelist(as.matrix(fedges[,1:2]), directed = FALSE)
ccs=components(sgraph)
spdata$component=vapply(spdata$graphID, function(x)
{ if (x %in% names(ccs$membership)) { return(ccs$membership[match(x, names(ccs$membership))])
} else {
return(NA)}}, FUN.VALUE = 3)
```
Now knowing the components of the graph I can extract the ICC intervals.
I did this in a few ways (commented out lines): taking the first
case for each cluster to be the first reported symptoms (I get a 5 day serial interval); the first start exposure time (now there are negative ICCs so I get a 4.5 day serial interval) and the latest end exposure time.
Extract ICC interval data: a function
```{r}
getICCs <- function(thisdata, ccs, K, orderby= "onset" ) {
iccs=1
for (n in 1:max(ccs$membership)) {
mycases = which(thisdata$component==n)
if (orderby == "onset")
{ myonsets = sort(thisdata$date_onset_symptoms[mycases])[1:min(K, length(mycases))]}
if (orderby == "exposure") {
myonsets =thisdata$date_onset_symptoms[mycases][order(thisdata$end_source[mycases])][1:min(K,length(mycases))]
# myonsets = spdata$date_onset_symptoms[mycases[order(spdata$start_source[mycases])]] # alternative also ORDERS by earliest exposure
}
iccs =c(iccs, myonsets[-1]-myonsets[1])
}
return(iccs[-1])
}
```
```{r}
iccall = getICCs(spdata,ccs,9)
icc3 = getICCs(spdata,ccs,3)
icc4 = getICCs(spdata,ccs,4)
icc5 = getICCs(spdata,ccs,5)
icc6 = getICCs(spdata,ccs,6)
icc_expose = getICCs(spdata, ccs, 4, orderby ="exposure")
```
```{r}
source("TianjinSI_VinkWallinga_CC.R")
myestimate = serial_mix_est(data=icc4, N=100, startmu=10, startsig =4)
myestimate
myest3 = serial_mix_est(data=icc3, N=50, startmu=10, startsig =4)
myest4 = serial_mix_est(data=icc4, N=50, startmu=10, startsig =4)
myest5 = serial_mix_est(data=icc5, N=50, startmu=10, startsig =4)
myest6 = serial_mix_est(data=icc6, N=50, startmu=10, startsig =4)
myest_exp= serial_mix_est(data=icc_expose, N=50, startmu=10, startsig =4)
mm=rbind(myestimate,myest3, myest4, myest5,myest6, myest_exp)
colnames(mm)=c("mu","sig")
mm=as.data.frame(mm)
mm$K=c(4, 3,4, 5, 6, 4)
mm$ordering = c("Onset","Onset","Onset","Onset","Onset","LastExposure")
print(mm[,c(4,3,1,2)])
```
```{r,eval=FALSE}
days = seq(from=0, to=10, by=0.1)
density= dnorm(days, mean = myestimate[1], sd = myestimate[2])
ggplot(data=data.frame(days=days, density=density), aes(x=days,y=density)) + geom_line() + ggtitle("ICC estimate of the Singapore cluster serial interval")
ggsave(file="sing_serialint.pdf", height = 4, width = 6)
```
I note that the serial interval gets longer if we include more cases per cluster (because the mixture of 4 pathways in Vink et al does not include longer transmission chains, which forces the assumption that everyone in the cluster was infected by the initial case, which in turn lengthens the estimated serial interval). We do not know the true infection pathways but it is reasonable not to constrain the model to enforce that most are infected by the first few cases.
```{r, eval=FALSE}
# bootstrap analysis
Nboot=100
bestimates=myestimate # NOTE this loop had errors a few times; I just restarted it.
for (kk in 1:Nboot) {
bdata = sample(x=icc4, size = length(icc4), replace = T)
bestimates = rbind(bestimates, serial_mix_est(data=bdata, N=50, startmu=10, startsig =4))
}
#The mean of the mean serial intervals is`r mean(bestimates[,1])` days and the standard deviation of these means is `r sd(bestimates[,1])`.
```
```{r, eval=FALSE}
hist(bestimates[,1],breaks = 30)
bootdf=data.frame(mu=bestimates[,1], sig=bestimates[,2])
ggplot(bootdf, aes(x=mu, y=sig))+geom_point()
ggplot(bootdf, aes(x=mu))+geom_histogram()
ggsave(file = "bootst_SI_sing.pdf", width = 6, height = 4)
mean(bestimates[,1]) # higher estimates also have high uncertainty
median(bestimates[,1])
sd(bestimates[,1])
mean(bestimates[,2])
sd(bestimates[,2])
# range for mean serial interval in the paper is 4.49 (result for k=4) plus/min 1.96*sd(bestimates)
```
We estimate R0 from Wallinga and Lipsitch Proc. Roy. Soc. B 2007 using the equation $R=\exp{r \mu - 1/2 r^2 \sigma^2}$. To obtain CIs for R, we could use our bootstrap estimates of $\mu$ and $\sigma^2$ and simply resample R using this equation.
Jung et al Scenario 1
```{r,eval=FALSE}
load("sing_boots_100.Rdata") # in case in Rmd with above evals set to FALSE
myrate=0.15
Rs=0*(1:100)
for (n in 1:100) {
Rs[n]= exp(myrate*bestimates[n,1] - 0.5*(myrate)^2*bestimates[n,2]^2)
}
hist(Rs,breaks = 30)
mean(Rs)
sd(Rs)
hist(Rs)
quantile(Rs, probs = c(0.025, 0.975))
```
```{r,eval=FALSE}
myrate=0.29
Rs=0*(1:100)
for (n in 1:100) {
Rs[n]= exp(myrate*bestimates[n,1] - 0.5*(myrate)^2*bestimates[n,2]^2)
}
hist(Rs,breaks = 30)
mean(Rs)
quantile(Rs, probs = c(0.025, 0.975))
```