Duplicates handling in miaTime #69
Instead of separately dealing with this in every possible function, I would create a utility function that can be used to remove such duplicates, if needed. But I am not sure this would make sense in general, because the details of duplicate removal may depend very much on each particular study. As a first pass, perhaps just an example in an appropriate place in the vignette (a subsection, perhaps?), mentioning that this can be a problem and then showing how to deal with such situations and filter out duplicates, is enough? |
The task here is to add an example showing how to handle duplicate entries in colData()?
|
Did you try to run that code? It doesn't seem correct to me (duplicated_rows is a DataFrame, not a logical vector, yet you are taking its negation ("!")), and "unique_indices" is not defined in the code (it is used in the last row). Make sure the code examples work before pasting. Anyway, the most relevant case could be to identify cases where a time point is duplicated for a given grouping variable (e.g. subject), like if subject A has two measurements on day 2 (or time point 42.1). These are potentially problematic cases and would sometimes need to be flagged and/or removed. We could have a simple flagging function to detect such cases, but I am not sure this is worth a wrapper. -> Prepare a minimal reproducible example showing how to deal with such a case, including flagging and then removal? |
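A hedged sketch of what such a flagging helper could look like (this is not an existing miaTime function; the `tse` object and the default column names "Subject" and "Time" are assumptions for illustration):

```r
library(SummarizedExperiment)

# Hypothetical helper: flag samples whose combination of grouping variables
# (e.g. subject + time point) occurs more than once in colData(tse).
flagDuplicateTimePoints <- function(tse, by = c("Subject", "Time")) {
    combos <- colData(tse)[, by, drop = FALSE]
    # Mark *all* members of a duplicated group, not only the later occurrences.
    dup <- duplicated(combos) | duplicated(combos, fromLast = TRUE)
    if (any(dup)) {
        warning(sum(dup), " sample(s) share the same ",
                paste(by, collapse = " + "), " combination.")
    }
    colData(tse)$duplicate_time_point <- dup
    tse
}
```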
You're right, there was an oversight in the example. I was unable to find a dataset with duplicate entries for subjects and time points, so I am going to create one from an existing dataset.
We can use duplicated() against colData() to catch the multiple entries across subjects and time points.
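A minimal sketch of that idea, assuming a TreeSummarizedExperiment `tse` whose colData has (hypothetical) columns "Subject" and "Time":

```r
library(SummarizedExperiment)

# Fabricate a duplicate entry by repeating the first sample of an existing dataset.
tse_dup <- cbind(tse, tse[, 1])

# Detect samples whose Subject/Time combination has already been seen.
cd  <- colData(tse_dup)
dup <- duplicated(cd[, c("Subject", "Time")])

table(dup)                          # how many duplicated entries there are
cd[dup, c("Subject", "Time")]       # which subject/time combinations they are
```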
|
Great - though I think we are more interested in picking the non-duplicated set. Perhaps the example could be simplified a bit into just:
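If picking the non-duplicated set is all that is needed, the example could perhaps boil down to a single subsetting call (column names again assumed):

```r
# Keep only the first occurrence of each Subject/Time combination.
tse <- tse[, !duplicated(colData(tse)[, c("Subject", "Time")])]
```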
Hmm, sometimes other variables might also need to be considered in addition to subject and time, for instance subject + time + body site. I am not sure it is easy to make any sensible added-value wrapper to flag or remove duplicates, so let's keep it like this. It is said that "SilvermanAGutData" has duplicates on time + subject. If this holds, check whether there are examples with this dataset in OMA and see if it is sensible (or not) to remove duplicates as a processing step. |
I think we should also take this into account in our methods. Requiring the user to remove data is suboptimal. At least this could be done in the divergence functions:
1. Give a warning if there are duplicated time points.
2. When reference samples are assigned, we can assign all samples from the previous time point. This means that the reference vector is longer than the number of samples (some samples get multiple reference samples because time points are duplicated), so we extend the TreeSE accordingly.
3. We can calculate divergence just like before.
4. After calculating the divergences, we can summarise them by taking the mean, median, etc. (a rough sketch of this follows below).
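This is only a rough sketch of that idea, not the actual miaTime implementation: the colData columns "Subject" and "Time", the assay name "counts", and the use of Bray-Curtis dissimilarity from vegan are all assumptions made for illustration.

```r
library(SummarizedExperiment)
library(vegan)

cd  <- colData(tse)
mat <- t(assay(tse, "counts"))   # vegdist() expects samples as rows

# For sample i, take *all* samples from the previous time point of the same
# subject as reference samples, compute a divergence against each of them,
# and then summarise (here: mean) over the possibly duplicated references.
divergence_to_previous <- function(i) {
    earlier <- cd$Subject == cd$Subject[i] & cd$Time < cd$Time[i]
    if (!any(earlier)) {
        return(NA_real_)                      # no previous time point
    }
    prev_time <- max(cd$Time[earlier])
    ref <- which(cd$Subject == cd$Subject[i] & cd$Time == prev_time)
    d <- vapply(ref,
                function(j) as.numeric(vegdist(mat[c(i, j), ], method = "bray")),
                numeric(1))
    mean(d)                                   # or median(), etc.
}

colData(tse)$time_divergence <- vapply(seq_len(ncol(tse)),
                                       divergence_to_previous, numeric(1))
```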
|
Hmm, there is no example in OMA using
the only time variable here is |
Regarding @TuomasBorman's comments:
On @Daenarys8's comment:
In conclusion, my suggestion is to either just ignore this issue for now and close it, or add a minimal example in the miaTime vignette on how to deal with duplicates (perhaps even an example of averaging over samples, like Tuomas demonstrated). Open to other suggestions. |
A vignette on averaging over samples will be good. |
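A hedged sketch of what such an averaging example could look like; the column names "Subject" and "Time" and the assay name "counts" are assumptions, and a real vignette would likely rely on dedicated mia/miaTime tooling rather than this base-R construction.

```r
library(SummarizedExperiment)

cd    <- colData(tse)
group <- interaction(cd$Subject, cd$Time, drop = TRUE)

# Average the counts over samples that share the same Subject/Time combination.
counts   <- assay(tse, "counts")
groups   <- unique(as.character(group))        # first-occurrence order
averaged <- sapply(groups, function(g) rowMeans(counts[, group == g, drop = FALSE]))

# Keep one row of sample metadata per group (the first occurrence) and align names.
keep <- !duplicated(group)
colnames(averaged) <- rownames(cd)[keep]

tse_avg <- SummarizedExperiment(assays  = list(counts = averaged),
                                colData = cd[keep, ])
```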
It seems that the Silverman data in miaTime has duplicated time points for each vessel |
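A quick way to check that (loading mechanism and column names are guesses; "Vessel" and "Time" would need to be matched against the actual colData of the dataset):

```r
library(miaTime)

# Load the Silverman data shipped with miaTime (assuming a lazy-loaded data object).
data("SilvermanAGutData", package = "miaTime")
cd <- colData(SilvermanAGutData)

# Inspect the available columns first, then count shared vessel/time combinations.
colnames(cd)
table(duplicated(cd[, c("Vessel", "Time")]))
```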
Some datasets might have replicates for the same subject/group at the same time point, as for instance the "SilvermanAGutData" dataset does. Should current miaTime methods and upcoming ones be set up to detect such cases and handle them accordingly (e.g. discarding, averaging, ...)?