
expand supervised ML chapter #671

Open · wants to merge 1 commit into base: devel
Conversation

Benjamin-Valderrama

Added new content, code, and references to the chapter. Removed a library that is no longer used from DESCRIPTION.

@TuomasBorman (Contributor) left a comment

Looks good! Check the comments.

I am wondering if we should more clearly separate the topics into

  1. Background
  2. Preprocessing
  3. Training
    • Binary classification
    • Regression task: for instance age (e.g., data(hitchip1006, package = "miaTime"), even though the data is not 16S or shotgun; see the sketch after this list)
  4. Model metrics
    • Typical model metrics for classification; also say a couple of words about the multiclass case
    • Metrics for regression
  5. Visualization
    • Classification
    • Regression
  6. Feature importance
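
A minimal sketch of a possible starting point for that regression example (assuming the colData of hitchip1006 contains an age column; tse_age is just an illustrative name):

library(mia)

# Load the dataset suggested above
data(hitchip1006, package = "miaTime")
tse_age <- hitchip1006

# Age (a continuous variable) would be the outcome of the regression task
summary(colData(tse_age)$age)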

::: {.callout-note}
## Note: ML in multi-omics data analysis
ML applications for the integration of multi-omic datasets are covered in
[@sec-cross-correlation], [@sec-multiassay_ordination] and

@sec-cross-correlation is just about calculating correlations, so it can be removed from here

Comment on lines +93 to +95
##
## healthy     T2D
##     193     170

This is printed, so the comment can be removed


I added #| output: false as part of the code chunk options, so the output of that table call won't be printed (at least it didn't when rendering locally). Do you want me to remove the comment and allow the output to be shown?


Sorry, I did not look carefully enough. I think you could print the table. It might be easier if we decide to change the dataset at some point.

The same applies to the text. If there are specific interpretations of the results, e.g., "bacterium x is the most important feature in the prediction", the "x" could be inline code. It would then update automatically for a new dataset.
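
A rough sketch of what that could look like in the Quarto source (the object names here are hypothetical and would come from the actual feature importance results):

# In an earlier chunk: pick the top feature (hypothetical objects)
top_feature <- importance_df$feature[1]

Then in the text: "The most important feature in the prediction was `r top_feature`."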

Comment on lines +137 to +140
tse_prev <- subsetByPrevalent(tse,
assay.type = "relative_abundance",
prevalence = 10/100)


To harmonize the style, use:

tse_prev <- subsetByPrevalent(
    tse, 
    assay.type = "relative_abundance",
    prevalence = 10/100
)

with 4-space indentation


Check the other chunks as well


And the line width should be at most 80 characters

Comment on lines +141 to +146
# calculate all available alpha diversity measures
vars_before <- colnames(colData(tse))
tse_prev <- addAlpha(tse_prev, assay.type = "relative_abundance")

# By comparing variables, we get which indices were calculated
index <- colnames(colData(tse_prev))[!colnames(colData(tse_prev)) %in% vars_before]

This could be done in a simpler way by retrieving all column names that have "diversity", "evenness", or another such suffix
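
For instance, something along these lines could work (a sketch only; the exact set of suffixes depends on which indices addAlpha added):

index <- grep(
    "diversity|evenness|dominance|richness",
    colnames(colData(tse_prev)),
    value = TRUE
)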

Comment on lines +174 to +186
# Preprocessing step 3 -
# Group predictors (taxa or alpha diversities) with perfect correlation
m <- cor(assay)
cor_df <- data.frame(row = rownames(m)[as.vector(row(m))],
                     col = colnames(m)[as.vector(col(m))],
                     cor = c(m)) |>
    filter(row > col & abs(cor) == 1)
##
##                        row                    col cor
##   simpson_lambda_dominance gini_simpson_diversity  -1
##         relative_dominance          dbp_dominance   1
##          observed_richness         chao1_richness   1


mikropml does have a function that can be used to do this --> use it instead
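
A minimal sketch of how that could look, assuming the predictors and outcome have been combined into a data.frame df with the outcome column "disease" (if I remember right, preprocess_data() groups perfectly correlated features by default):

library(mikropml)

# Standard mikropml preprocessing, including collapsing perfectly
# correlated predictors (df and "disease" are placeholder names)
preprocessed <- preprocess_data(
    dataset = df,
    outcome_colname = "disease"
)
df_ready <- preprocessed$dat_transformed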


It was already shown how to obtain the performance of the model. However,
those results are valid for a particular 80/20 split of train and test
data(determined by the seed used), and thus it represents the performance of

data(determined by the seed used) --> add space

Comment on lines +354 to +355
rf_list <- multiple_rf
rf_list[[3]] <- rf

Does this work?

rf_list <- c(multiple_rf, list(rf))

Comment on lines +365 to +399
#| label: superML5.1 - Plot AUCs

# Join model's performance df of each iteration of `run_ml`
rf_performance_df <- map(.x = rf_list,
                         .f = pluck,
                         "performance") |>
    list_rbind() |>
    # Get training and test metrics
    select(seed,
           method,
           training = cv_metric_AUC,
           test = AUC) |>
    # Prepare data for plotting
    pivot_longer(cols = c(training, test),
                 names_to = "split",
                 values_to = "AUC") |>
    mutate(split = factor(x = split,
                          levels = c("training", "test"))) |>
    mutate(mean_AUC = mean(AUC), .by = split)

# Plot the performance of each iteration
rf_performance_df |>
    ggplot(aes(x = split, y = AUC, fill = split)) +
    geom_point(size = 4,
               shape = 21,
               show.legend = FALSE,
               alpha = 0.65) +
    stat_summary(geom = "pointrange",
                 fun.data = mean_se,
                 color = "black",
                 show.legend = FALSE) +
    scale_y_continuous(limits = c(0.5, 1)) +
    theme_classic()


Can this be simplified somehow? This is rather complex and hard to follow for an ordinary user
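
One possible direction (a sketch only, using the same packages as the chunk above; it drops the precomputed mean_AUC column, since stat_summary() already computes the mean):

# Collect training/test AUCs from each run_ml result into long format
rf_performance_df <- map(rf_list, "performance") |>
    list_rbind() |>
    select(seed, training = cv_metric_AUC, test = AUC) |>
    pivot_longer(c(training, test), names_to = "split", values_to = "AUC")

# Plot individual runs plus mean and standard error per split
ggplot(rf_performance_df, aes(factor(split, c("training", "test")), AUC)) +
    geom_jitter(width = 0.1, alpha = 0.65) +
    stat_summary(fun.data = mean_se, color = "black") +
    scale_y_continuous(limits = c(0.5, 1)) +
    theme_classic() +
    labs(x = NULL)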

Comment on lines +319 to +323
## Inspect model's performance

It was already shown how to obtain the performance of the model. However,
those results are valid for a particular 80/20 split of train and test
data(determined by the seed used), and thus it represents the performance of

I wonder if we should direct the user to the mikropml documentation instead of implementing this multi-model training, to avoid duplication. Or does this include information that is not already covered in the vignette: https://www.schlosslab.org/mikropml/articles/parallel.html
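
For reference, the pattern in that vignette is roughly the following (a sketch; df_ready and the outcome column "disease" are placeholders for whatever the chapter ends up using):

library(mikropml)
library(future.apply)

# Run the training rounds in parallel sessions
plan(multisession, workers = 2)

# Train the same random forest over several seeds
results <- future_lapply(seq_len(3), function(seed) {
    run_ml(df_ready, "rf", outcome_colname = "disease", seed = seed)
}, future.seed = TRUE)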

Comment on lines -255 to +503
data that lacks essential information. The dataset does not have a sufficient
amount of data on carcinoma to build an accurate predictive model.
This plot shows the features with the highest effect on the model's
performance, or in other words, which features carry most of the information
for predicting the outcome. Notice that among the most influential features
there are some grouped variables.

We could put more emphasis on the interpretation of the results. What are the most important features? What do they mean?


Likewise for the ROC plot: what does the ROC plot show, and what can we say based on it?
