expand supervised ML chapter #671
base: devel
Conversation
Looks good! Check the comments.
I am wondering if we should more clearly separate the topics into:
- Background
- Preprocessing
- Training
    - Binary classification
    - Regression task: for instance age (e.g., data(hitchip1006, package = "miaTime"), even though the data is not 16S or shotgun)
- Model metrics
    - Typical model metrics for classification; also say a couple of words about multiclass
    - Metrics for regression
- Visualization
    - Classification
    - Regression
- Feature importance
::: {.callout-note}
## Note: ML in multi-omics data analysis
ML applications for the integration of multi-omic datasets is covered in
[@sec-cross-correlation], [@sec-multiassay_ordination] and
@sec-cross-correlation is just about calculating correlations, so it can be removed from here
##
## healthy     T2D
##     193     170
This is printed, so the comment can be removed.
I added #| output: false as part of the code chunk options, so the output of that table() call won't be printed (at least it didn't when rendering locally). Do you want me to remove the comment and allow the outputs to be shown then?
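For reference, a minimal sketch of such a chunk (the label and the grouping column name "diagnosis" are assumptions; use the actual colData variable):

#| label: superML-class-counts
#| output: false
# Count samples per class; with output: false the result is computed
# but not shown in the rendered chapter
table(colData(tse)$diagnosis)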
Sorry, I did not look carefully enough. I think you could print the table. It might be easier if we decide to change the dataset at some point.
Same thing also with the text: if there are specific interpretations of the results, e.g., "this bacterium x is the most important feature in prediction", the "x" could be inline code. It would then update automatically for a new dataset.
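To illustrate the idea (a sketch only; imp_df and its column names are hypothetical stand-ins for whatever feature importance table the chapter produces):

# Hypothetical: pick the name of the feature with the largest importance
top_feature <- imp_df$feature[which.max(imp_df$importance)]

and then write in the text: "The most important feature in the prediction was `r top_feature`.", so the name is filled in at render time.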
tse_prev <- subsetByPrevalent(tse,
    assay.type = "relative_abundance",
    prevalence = 10/100)
To harmonize the style, use:

tse_prev <- subsetByPrevalent(
    tse,
    assay.type = "relative_abundance",
    prevalence = 10/100
)

with 4 spaces indentation.
Check also other chunks
And the line width should be max 80 characters
# calculate all available alpha diversity measures
vars_before <- colnames(colData(tse))
tse_prev <- addAlpha(tse_prev, assay.type = "relative_abundance")

# We calculate all available alpha diversity measures
variables_before <- colnames(colData(tse))
tse <- addAlpha(tse, assay.type = "relative_abundance")
# By comparing variables, we get which indices were calculated
index <- colnames(colData(tse))[!colnames(colData(tse)) %in% variables_before]
index <- colnames(colData(tse_prev))[!colnames(colData(tse_prev)) %in% vars_before]
This could be done in a simpler way by retrieving all column names that have a "diversity", "evenness", or other index suffix.
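For example, something along these lines (a sketch; the exact set of suffixes produced by addAlpha() is an assumption and may need adjusting):

# Keep colData columns whose names end with a known alpha diversity suffix
index <- grep(
    "diversity|evenness|richness|dominance|divergence",
    colnames(colData(tse_prev)),
    value = TRUE
)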
# Preprocessing step 3 -
# Group predictors (taxa or alpha diversities) with perfect correlation
m <- cor(assay)
cor_df <- data.frame(row = rownames(m)[as.vector(row(m))],
    col = colnames(m)[as.vector(col(m))],
    cor = c(m)) |>
    filter(row > col & abs(cor) == 1)

##
##                      row                    col cor
## simpson_lambda_dominance gini_simpson_diversity  -1
##       relative_dominance          dbp_dominance   1
##        observed_richness         chao1_richness   1
mikropml does have a function that can be used to do this --> use it instead
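A sketch of how that might look with mikropml::preprocess_data(), which collapses perfectly correlated predictors into groups (the data frame df and the outcome column "disease" are assumptions; check the function documentation for the exact arguments and return values):

library(mikropml)
# Default preprocessing; grp_feats reports which predictors were grouped
# together because they were perfectly correlated
prep <- preprocess_data(dataset = df, outcome_colname = "disease")
prep$grp_feats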
It was already shown how to obtain the performance of the model. However,
those results are valid for a particular 80/20 split of train and test
data(determined by the seed used), and thus it represents the performance of
data(determined by the seed used) --> add space
rf_list <- multiple_rf
rf_list[[3]] <- rf
Does this work?
rf_list <- c(rf_list, lrf)
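As a side note on the semantics (a sketch, not from the PR): when appending a single fitted model to a list of models, it usually needs to be wrapped in list() so it becomes one element instead of having its components spliced in:

# multiple_rf is assumed to be a list of run_ml() results, rf a single result
rf_list <- c(multiple_rf, list(rf))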
#| label: superML5.1 - Plot AUCs

# Join model's performance df of each iteration of `run_ml`
rf_performance_df <- map(.x = rf_list,
        .f = pluck,
        "performance") |>
    list_rbind() |>
    # Get training and test metrics
    select(seed,
        method,
        training = cv_metric_AUC,
        test = AUC) |>
    # Prepare data for plotting
    pivot_longer(cols = c(training, test),
        names_to = "split",
        values_to = "AUC") |>
    mutate(split = factor(x = split,
        levels = c("training", "test"))) |>
    mutate(mean_AUC = mean(AUC), .by = split)

# Plot the performance of each iteration
rf_performance_df |>
    ggplot(aes(x = split, y = AUC, fill = split)) +
    geom_point(size = 4,
        shape = 21,
        show.legend = FALSE,
        alpha = 0.65) +
    stat_summary(geom = "pointrange",
        fun.data = mean_se,
        color = "black",
        show.legend = FALSE) +
    scale_y_continuous(limits = c(0.5, 1)) +
    theme_classic()
Can this be somehow simplified? This is rather complex and hard to follow for ordinary user
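One possible simplification, sketched here under the assumption that each element of rf_list is a run_ml() result whose performance data frame contains cv_metric_AUC and AUC (dplyr, tidyr and ggplot2 assumed loaded):

# Collect the per-seed performance tables and reshape to long format
rf_performance_df <- lapply(rf_list, function(x) x$performance) |>
    bind_rows() |>
    select(seed, method, training = cv_metric_AUC, test = AUC) |>
    pivot_longer(c(training, test), names_to = "split", values_to = "AUC") |>
    mutate(split = factor(split, levels = c("training", "test")))

# Training vs test AUC across iterations, with mean and standard error overlaid
ggplot(rf_performance_df, aes(x = split, y = AUC)) +
    geom_point(alpha = 0.65) +
    stat_summary(fun.data = mean_se, color = "black") +
    theme_classic()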
## Inspect model's performance

It was already shown how to obtain the performance of the model. However,
those results are valid for a particular 80/20 split of train and test
data(determined by the seed used), and thus it represents the performance of
I wonder if we should direct user to mikropml documentation instead of implementing this multimodel training to avoid duplication. Or does this include information that is not included in here: https://www.schlosslab.org/mikropml/articles/parallel.html
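For comparison, the approach in that vignette is roughly the following (a sketch; the dataset df, outcome column "diagnosis" and number of workers are assumptions, and the exact API may differ between package versions):

library(mikropml)
library(future)
library(future.apply)
plan(multisession, workers = 2)

# One run_ml() call per seed, executed in parallel; each seed gives a
# different train/test split, so the spread of AUCs reflects that variability
rf_list <- future_lapply(1:5, function(seed) {
    run_ml(df, method = "rf", outcome_colname = "diagnosis", seed = seed)
}, future.seed = TRUE)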
data that lacks essential information. The dataset do not have sufficient
amount of data on carcinoma to build an accurate predictive model.
This plot shows the features with the highest effect on model's performance,
or in other words, which features have most of the information for
predicting the outcome. Notice that within the most influential features
there are some grouped variables.
We could emphasize more the interpretation of results. What are the most important features? What this means?
Also for the ROC plot: explain what the ROC plot shows and what we can say based on it.
Added new content, code and references to the chapter. Removed a library that is no longer used from DESCRIPTION.