Feature importance p-values #10

Open
lang-benjamin opened this issue Dec 15, 2023 · 2 comments

Comments

@lang-benjamin

In addition to permutation-based feature importance, there are permutation-based p-values for feature importance (Altmann, A., Tolosi, L., Sander, O. & Lengauer, T. (2010). Permutation importance: a corrected feature importance measure, Bioinformatics 26:1340-1347). As far as I know, essentially only the ranger package implements this, via its importance_pvalues function. Do you think such a function would be helpful? I could imagine it helping to judge whether a feature is relevant or not.
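For reference, the ranger call mentioned above could look roughly like the sketch below. It assumes the ranger package is installed; the built-in iris data and the number of permutations are illustrative choices, not part of the original question.

```r
# Minimal sketch: Altmann-style permutation p-values with ranger
# (iris and num.permutations = 50 are illustrative assumptions)
library(ranger)

set.seed(1)

# Fit a random forest with permutation importance enabled
rf <- ranger(Species ~ ., data = iris, importance = "permutation")

# The Altmann method refits the forest on permuted responses,
# so the formula and data must be supplied again
pv <- importance_pvalues(
  rf,
  method = "altmann",
  num.permutations = 50,
  formula = Species ~ .,
  data = iris
)
print(pv)
```

The result holds one row per predictor, giving its importance alongside the permutation p-value.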

@brian-j-smith
Owner

brian-j-smith commented Dec 16, 2023

Thanks for asking about permutation p-values. I hadn't thought about those before, but they can be calculated with the varimp() function. Below is an example of permutation-based calculations for variable importance followed by p-values. This method may differ from the one in the Altmann et al. paper but is permutation-based nonetheless. Like variable importance, these p-values can be computed for any model and with any appropriate performance metric supplied by the package.

# Load analytic packages
library(MachineShop)
library(ggplot2)

# Set up a parallel backend for faster permutations
library(doParallel)
registerDoParallel()

# Fit any MachineShop model
mdl_fit <- fit(sale_amount ~ ., data = ICHomes, model = GLMModel)

# Permutation variable importance

vi <- varimp(mdl_fit, samples = 1000)
plot(vi)

# Permutation p-values

## Custom varimp() stats function to compute permutation p-values
## Argument x is the difference between permuted and observed model performances
## for a variable
## The p-value is twice the proportion of non-positive differences, capped at 1
pval <- function(x) {
  c("pvalue" = min(2 * mean(x <= 0), 1))
}

## Call varimp() with the p-value function
permpval <- varimp(
  mdl_fit,
  scale = FALSE,
  samples = 1000,
  stats = pval
)
plot(permpval) + labs(y = "Permutation p-value")
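To see how the pval() function behaves in isolation, here is a small base R sketch that applies it to simulated differences. The simulated vectors are hypothetical stand-ins for what varimp() would pass in, and the sketch assumes that positive differences mean the permuted model performs worse than the observed one.

```r
# Same p-value function as in the example above
pval <- function(x) {
  c("pvalue" = min(2 * mean(x <= 0), 1))
}

set.seed(123)  # simulated data, for illustration only

# Differences centered well above zero:
# permuting the variable consistently degrades performance
informative <- rnorm(1000, mean = 2, sd = 1)

# Differences centered at zero:
# permuting the variable makes no systematic difference
uninformative <- rnorm(1000, mean = 0, sd = 1)

pval(informative)    # small: few differences fall at or below zero
pval(uninformative)  # close to 1: about half the differences are <= 0
```

Doubling the one-sided proportion and capping it at 1 keeps the result a valid probability even when most differences are non-positive.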

@lang-benjamin
Author

Thank you for the comment. I really like the flexibility of the package.
