expand supervised ML chapter #671
base: devel
Conversation
Looks good! Check the comments.
I am wondering if we should more clearly separate the topics into:
- Background
- Preprocessing
- Training
    - Binary classification
    - Regression task: for instance age (e.g., data(hitchip1006, package = "miaTime"), even though the data is not 16S or shotgun)
- Model metrics
    - Typical model metrics for classification; also say a couple of words about multiclass
    - Metrics for regression
- Visualization
    - Classification
    - Regression
- Feature importance
::: {.callout-note}
## Note: ML in multi-omics data analysis
ML applications for the integration of multi-omic datasets is covered in
[@sec-cross-correlation], [@sec-multiassay_ordination] and
@sec-cross-correlation is just about calculating correlations, so it can be removed from here
##
## healthy     T2D
##     193     170
This is printed, so the comment can be removed.
I added #| output: false as part of the code chunk options, so the output of that table() call won't be printed (at least it didn't when rendering locally). Do you want me to remove the comment and allow the outputs to be shown then?
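For reference, a minimal sketch of such a chunk (the label and the grouping column name "diagnosis" are assumptions; use the actual colData variable):

#| label: superML-class-counts
#| output: false
# Count samples per class; with output: false the result is computed
# but not shown in the rendered chapter
table(colData(tse)$diagnosis)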
Sorry, I did not look carefully enough. I think you could print the table. It might be easier if we decide to change the dataset at some point.
Same thing also with the text: if there are specific interpretations of the results, e.g., "this bacterium x is the most important feature in prediction", the "x" could be inline code. It would then update automatically for a new dataset.
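To illustrate the idea (a sketch only; imp_df and its column names are hypothetical stand-ins for whatever feature importance table the chapter produces):

# Hypothetical: pick the name of the feature with the largest importance
top_feature <- imp_df$feature[which.max(imp_df$importance)]

and then write in the text: "The most important feature in the prediction was `r top_feature`.", so the name is filled in at render time.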
tse_prev <- subsetByPrevalent(tse,
    assay.type = "relative_abundance",
    prevalence = 10/100)
To harmonize the style, use:

tse_prev <- subsetByPrevalent(
    tse,
    assay.type = "relative_abundance",
    prevalence = 10/100
)

with 4 spaces indentation.
Check also other chunks
And the line width should be max 80 characters
# calculate all available alpha diversity measures
vars_before <- colnames(colData(tse))
tse_prev <- addAlpha(tse_prev, assay.type = "relative_abundance")

# We calculate all available alpha diversity measures
variables_before <- colnames(colData(tse))
tse <- addAlpha(tse, assay.type = "relative_abundance")
# By comparing variables, we get which indices were calculated
index <- colnames(colData(tse))[!colnames(colData(tse)) %in% variables_before]
index <- colnames(colData(tse_prev))[!colnames(colData(tse_prev)) %in% vars_before]
This could be done in a simpler way by retrieving all column names that have a "diversity", "evenness", or other index suffix.
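For example, something along these lines (a sketch; the exact set of suffixes produced by addAlpha() is an assumption and may need adjusting):

# Keep colData columns whose names end with a known alpha diversity suffix
index <- grep(
    "diversity|evenness|richness|dominance|divergence",
    colnames(colData(tse_prev)),
    value = TRUE
)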
# Preprocessing step 3 -
# Group predictors (taxa or alpha diversities) with perfect correlation
m <- cor(assay)
cor_df <- data.frame(row = rownames(m)[as.vector(row(m))],
    col = colnames(m)[as.vector(col(m))],
    cor = c(m)) |>
    filter(row > col & abs(cor) == 1)

##
##                      row                    col cor
## simpson_lambda_dominance gini_simpson_diversity  -1
##       relative_dominance          dbp_dominance   1
##        observed_richness         chao1_richness   1
mikropml does have a function that can be used to do this --> use it instead
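A sketch of how that might look with mikropml::preprocess_data(), which collapses perfectly correlated predictors into groups (the data frame df and the outcome column "disease" are assumptions; check the function documentation for the exact arguments and return values):

library(mikropml)
# Default preprocessing; grp_feats reports which predictors were grouped
# together because they were perfectly correlated
prep <- preprocess_data(dataset = df, outcome_colname = "disease")
prep$grp_feats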
It was already shown how to obtain the performance of the model. However,
those results are valid for a particular 80/20 split of train and test
data(determined by the seed used), and thus it represents the performance of
data(determined by the seed used) --> add space
rf_list <- multiple_rf
rf_list[[3]] <- rf
Does this work?
rf_list <- c(rf_list, lrf)
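As a side note on the semantics (a sketch, not from the PR): when appending a single fitted model to a list of models, it usually needs to be wrapped in list() so it becomes one element instead of having its components spliced in:

# multiple_rf is assumed to be a list of run_ml() results, rf a single result
rf_list <- c(multiple_rf, list(rf))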
#| label: superML5.1 - Plot AUCs

# Join model's performance df of each iteration of `run_ml`
rf_performance_df <- map(.x = rf_list,
        .f = pluck,
        "performance") |>
    list_rbind() |>
    # Get training and test metrics
    select(seed,
        method,
        training = cv_metric_AUC,
        test = AUC) |>
    # Prepare data for plotting
    pivot_longer(cols = c(training, test),
        names_to = "split",
        values_to = "AUC") |>
    mutate(split = factor(x = split,
        levels = c("training", "test"))) |>
    mutate(mean_AUC = mean(AUC), .by = split)

# Plot the performance of each iteration
rf_performance_df |>
    ggplot(aes(x = split, y = AUC, fill = split)) +
    geom_point(size = 4,
        shape = 21,
        show.legend = FALSE,
        alpha = 0.65) +
    stat_summary(geom = "pointrange",
        fun.data = mean_se,
        color = "black",
        show.legend = FALSE) +
    scale_y_continuous(limits = c(0.5, 1)) +
    theme_classic()
Can this be somehow simplified? This is rather complex and hard to follow for ordinary user
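One possible simplification, sketched here under the assumption that each element of rf_list is a run_ml() result whose performance data frame contains cv_metric_AUC and AUC (dplyr, tidyr and ggplot2 assumed loaded):

# Collect the per-seed performance tables and reshape to long format
rf_performance_df <- lapply(rf_list, function(x) x$performance) |>
    bind_rows() |>
    select(seed, method, training = cv_metric_AUC, test = AUC) |>
    pivot_longer(c(training, test), names_to = "split", values_to = "AUC") |>
    mutate(split = factor(split, levels = c("training", "test")))

# Training vs test AUC across iterations, with mean and standard error overlaid
ggplot(rf_performance_df, aes(x = split, y = AUC)) +
    geom_point(alpha = 0.65) +
    stat_summary(fun.data = mean_se, color = "black") +
    theme_classic()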
## Inspect model's performance

It was already shown how to obtain the performance of the model. However,
those results are valid for a particular 80/20 split of train and test
data(determined by the seed used), and thus it represents the performance of
I wonder if we should direct user to mikropml documentation instead of implementing this multimodel training to avoid duplication. Or does this include information that is not included in here: https://www.schlosslab.org/mikropml/articles/parallel.html
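For comparison, the approach in that vignette is roughly the following (a sketch; the dataset df, outcome column "diagnosis" and number of workers are assumptions, and the exact API may differ between package versions):

library(mikropml)
library(future)
library(future.apply)
plan(multisession, workers = 2)

# One run_ml() call per seed, executed in parallel; each seed gives a
# different train/test split, so the spread of AUCs reflects that variability
rf_list <- future_lapply(1:5, function(seed) {
    run_ml(df, method = "rf", outcome_colname = "diagnosis", seed = seed)
}, future.seed = TRUE)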
data that lacks essential information. The dataset do not have sufficient
amount of data on carcinoma to build an accurate predictive model.
This plot shows the features with the highest effect on model's performance,
or in other words, which features have most of the information for
predicting the outcome. Notice that within the most influential features
there are some grouped variables.
We could emphasize more the interpretation of results. What are the most important features? What this means?
Also for the ROC plot: explain what the ROC plot shows and what we can say based on it.
Added new content, code and references to the chapter. Removed a library that is no longer used from DESCRIPTION.