Replicated samples in divergence functions #102

Merged
merged 7 commits into from
Jan 23, 2025


TuomasBorman
Collaborator

getStepwiseDivergence:

I think it was not optimal to expect users to subset the data so that it contains no replicated samples. Now the user gets a warning if a time point has replicated samples, and their average is calculated.

@TuomasBorman
Collaborator Author

Unit tests are still needed.

@TuomasBorman
Collaborator Author

Now, if there are repeated samples, i.e.,

  • getBaselineDivergence: there are multiple samples in the first time point of a single group
  • getStepwiseDivergence: there are multiple samples in a single time point of a single group

the repeated samples are handled by calculating their average. The output looks like this:

> res <- getStepwiseDivergence(
+     tse, time.col = "time", group = "group", method = "euclidean") 
Warning message:
Some samples are associated with multiple reference samples. In these cases, the reference time point includes multiple samples, and their average is used. 
> res
DataFrame with 20 rows and 3 columns
         divergence time_diff                   ref_samples
          <numeric> <numeric>                        <list>
sample1      189737         2               sample5,sample9
sample2      379473         5                      sample14
sample3      326769         3 sample1,sample11,sample13,...
sample4      202386        94  sample2,sample8,sample10,...
sample5          NA        NA                            NA
...             ...       ...                           ...
sample16    63245.6         5                      sample14
sample17   316227.8         2               sample5,sample9
sample18   265631.3        94  sample2,sample8,sample10,...
sample19   379473.3         2               sample5,sample9
sample20   328876.9        94  sample2,sample8,sample10,...
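
To make the ref_samples column above more concrete, here is a minimal base R sketch of the idea, not the miaTime implementation. The metadata, the abundance values, and the helpers `previous_samples` and `euclidean` are all made up for illustration. The warning text does not say whether the average is taken over the reference profiles or over the per-reference divergences; this sketch assumes the latter.

```r
# Hypothetical sample metadata: group A has two samples at time 0.
meta <- data.frame(
    sample = c("s1", "s2", "s3", "s4", "s5"),
    group  = c("A", "A", "A", "A", "B"),
    time   = c(0, 0, 2, 5, 0),
    stringsAsFactors = FALSE
)

# Hypothetical abundance matrix: rows are taxa, columns are samples.
mat <- matrix(
    c(3, 4,   # s1
      6, 8,   # s2
      0, 0,   # s3
      1, 1,   # s4
      2, 2),  # s5
    nrow = 2,
    dimnames = list(c("taxon1", "taxon2"), meta$sample)
)

# Illustrative helper: samples of the same group at the previous time point.
previous_samples <- function(meta, i) {
    same_group <- meta$group == meta$group[i]
    earlier <- meta$time[same_group & meta$time < meta$time[i]]
    if (length(earlier) == 0) {
        return(character(0))
    }
    meta$sample[same_group & meta$time == max(earlier)]
}

euclidean <- function(a, b) sqrt(sum((a - b)^2))

# Reference samples for every sample; more than one means replicates.
refs <- lapply(seq_len(nrow(meta)), function(i) previous_samples(meta, i))
names(refs) <- meta$sample

# Assumption: with several reference samples, average the divergences.
divergence <- vapply(meta$sample, function(s) {
    r <- refs[[s]]
    if (length(r) == 0) {
        return(NA_real_)  # no previous time point, as for sample5 above
    }
    mean(vapply(r, function(x) euclidean(mat[, s], mat[, x]), numeric(1)))
}, numeric(1))

print(refs[["s3"]])
print(divergence)
```

In this toy data, s3 has two reference samples (s1 and s2), so its value is the mean of two Euclidean distances, while s1, s2 and s5 have no earlier time point and get NA.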

@TuomasBorman TuomasBorman linked an issue Jan 23, 2025 that may be closed by this pull request
@TuomasBorman
Collaborator Author

#69 (comment)

Hmm, sometimes other variables might also need to be considered in addition to subject and time. For instance, subject + time + bodysite.

I think the simplest solution for this is to add a new grouping variable to colData.

tse[["group_bodysite"]] <- paste0(tse[["group"]], "_", tse[["bodysite"]])
res <- getBaselineDivergence(tse, time.col = "time", group = "group_bodysite")

Also, one can always supply one's own reference samples.
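
One detail worth checking when building such a combined grouping variable: if the columns are concatenated without a separator, different combinations can collapse into the same label. The base R sketch below uses made-up `group` and `bodysite` values on a plain data.frame (not a TreeSummarizedExperiment) to show the effect.

```r
# Hypothetical metadata with three distinct group/bodysite combinations.
meta <- data.frame(
    group    = c("a",  "ab", "a",  "ab"),
    bodysite = c("bc", "c",  "bc", "d"),
    stringsAsFactors = FALSE
)

# Without a separator, ("a", "bc") and ("ab", "c") both become "abc".
plain <- paste0(meta$group, meta$bodysite)

# With a separator that does not occur in the values, labels stay distinct.
separated <- paste(meta$group, meta$bodysite, sep = "_")

print(length(unique(plain)))
print(length(unique(separated)))
```

The separator only helps as long as it does not itself appear in the values being combined.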

@TuomasBorman
Collaborator Author

The handling of "multiple reference samples" has now been moved to mia::getDivergence, which is a more straightforward solution: microbiome/mia#680

@TuomasBorman TuomasBorman merged commit e4908a7 into devel Jan 23, 2025
3 checks passed
@TuomasBorman TuomasBorman deleted the repeated_samples branch January 23, 2025 16:42
Development

Successfully merging this pull request may close these issues.

Duplicates handling in miaTime