Incorrect standard errors in predictions() of clm() with scale effects #718
Comments
Thanks a lot for the detailed example. I really appreciate it. There are two issues.

First, insight::find_variables() does not return the variables used in the scale component:

library(ordinal)
library(marginaleffects)

dat <- transform(mtcars, cyl = factor(cyl), vs2 = vs)
mod <- clm(cyl ~ hp, scale = ~ vs, data = dat)
insight::find_variables(mod)
# $response
# [1] "cyl"
#
# $conditional
# [1] "hp"

This is a bug which should probably be reported to the insight developers. The workaround is implemented in the development version of marginaleffects:

packageVersion("marginaleffects")
# [1] '0.11.0.9000'

We now get:

# removing the response from the prediction dataset forces `predict.clm()` to return all levels
nd <- subset(dat, select = -cyl)
mod1 <- clm(cyl ~ hp + vs, scale = ~ vs, data = dat)
mod2 <- clm(cyl ~ hp + vs, scale = ~ vs2, data = dat)
pre1 <- predictions(mod1, newdata = nd)
pre2 <- predictions(mod2, newdata = nd)
pre3 <- predict(mod1, newdata = nd, type = "prob", se.fit = TRUE)
# identical results
subset(pre1, rowid %in% 1:2)
#
# Group Estimate Std. Error z Pr(>|z|) CI low CI high
# 4 0.2003 0.206 0.975 0.32973 -0.203 0.603
# 4 0.2003 0.206 0.975 0.32973 -0.203 0.603
# 6 0.7469 0.239 3.130 0.00175 0.279 1.215
# 6 0.7469 0.239 3.130 0.00175 0.279 1.215
# 8 0.0527 0.104 0.507 0.61214 -0.151 0.257
# 8 0.0527 0.104 0.507 0.61214 -0.151 0.257
#
# Columns: rowid, group, estimate, std.error, statistic, p.value, conf.low, conf.high, mpg, disp, hp, drat, wt, qsec, vs, am, gear, carb, vs2, cyl
subset(pre2, rowid %in% 1:2)
#
# Group Estimate Std. Error z Pr(>|z|) CI low CI high
# 4 0.2003 0.206 0.975 0.32973 -0.203 0.603
# 4 0.2003 0.206 0.975 0.32973 -0.203 0.603
# 6 0.7469 0.239 3.130 0.00175 0.279 1.215
# 6 0.7469 0.239 3.130 0.00175 0.279 1.215
# 8 0.0527 0.104 0.507 0.61214 -0.151 0.257
# 8 0.0527 0.104 0.507 0.61214 -0.151 0.257
#
# Columns: rowid, group, estimate, std.error, statistic, p.value, conf.low, conf.high, mpg, disp, hp, drat, wt, qsec, vs, am, gear, carb, vs2, cyl
head(pre3$fit, 2)
# 4 6 8
# 1 0.2003397 0.7469119 0.05274834
# 2 0.2003397 0.7469119 0.05274834
head(pre3$se.fit, 2)
# 4 6 8
# 1 0.2055757 0.2387547 0.1041193
# 2 0.2055757 0.2387547 0.1041193 |
Hi @vincentarelbundock, using your updated development version, I can confirm that the SE calculation is corrected even when scale = ~ ... contains terms in common with those in the regular formula = ~ .... However, the associated hypotheses() could also use an update to its variable-naming scheme. Currently, whenever scale = ~ ... shares a term name with one in formula = ~ ... AND nominal = ~ ... shares a term name with one in formula = ~ ..., the SE and all inferential statistics of the scale effects in scale = ~ ... are NA. See the example below:

library(dplyr) # for %>% and mutate()

summary(Model <- clm( # nominal + scale (also in location) effects
  cyl ~ vs + hp, scale = ~ vs, nominal = ~ hp,
  data = mtcars %>% mutate(cyl = factor(cyl))))
coef(summary(Model))
" Estimate Std. Error z value Pr(>|z|)
4|6.(Intercept) 1.111073e+05 1.841641e+06 0.06033059 0.95189234
6|8.(Intercept) 9.176839e+00 5.674996e+00 1.61706527 0.10586419
4|6.hp -1.220842e+03 2.023763e+04 -0.06032534 0.95189652
6|8.hp -6.522225e-02 3.608747e-02 -1.80733775 0.07070963
vs -2.279935e+04 3.780999e+05 -0.06029979 0.95191687
hp NA NA NA NA
vs 8.617176e+00 1.657530e+01 0.51988047 0.60314689"
hypotheses(Model) # Note: the last row has only a point estimate
" Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
b1 1.11e+05 1.84e+06 0.0603 0.9519 -3.50e+06 3.72e+06
b2 9.18e+00 5.67e+00 1.6171 0.1059 -1.95e+00 2.03e+01
b3 -1.22e+03 2.02e+04 -0.0603 0.9519 -4.09e+04 3.84e+04
b4 -6.52e-02 3.61e-02 -1.8073 0.0707 -1.36e-01 5.51e-03
b5 -2.28e+04 3.78e+05 -0.0603 0.9519 -7.64e+05 7.18e+05
b6 NA NA NA NA NA NA
b7 8.62e+00 NA NA NA NA NA"
hypotheses(Model, hypothesis = "b7 = 8")
" Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
b7 = 8 0.617 NA NA NA NA NA"

The error disappears once the redundant term is removed from the regular formula = ~ ... when it is already in nominal = ~ ...:

summary(Model <- clm( # identical model as above; hp removed from regular formula
  cyl ~ vs, scale = ~ vs, nominal = ~ hp,
  data = mtcars %>% mutate(cyl = factor(cyl))))
coef(summary(Model)) # None changed, only the aliased hp row removed
" Estimate Std. Error z value Pr(>|z|)
4|6.(Intercept) 1.111073e+05 1.841641e+06 0.06033059 0.95189234
6|8.(Intercept) 9.176839e+00 5.674996e+00 1.61706527 0.10586419
4|6.hp -1.220842e+03 2.023763e+04 -0.06032534 0.95189652
6|8.hp -6.522225e-02 3.608747e-02 -1.80733775 0.07070963
vs -2.279935e+04 3.780999e+05 -0.06029979 0.95191687
vs 8.617176e+00 1.657530e+01 0.51988047 0.60314689"
hypotheses(Model) # Now SE of scale vs is fine
" Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
b1 1.11e+05 1.84e+06 0.0603 0.9519 -3.50e+06 3.72e+06
b2 9.18e+00 5.67e+00 1.6171 0.1059 -1.95e+00 2.03e+01
b3 -1.22e+03 2.02e+04 -0.0603 0.9519 -4.09e+04 3.84e+04
b4 -6.52e-02 3.61e-02 -1.8073 0.0707 -1.36e-01 5.51e-03
b5 -2.28e+04 3.78e+05 -0.0603 0.9519 -7.64e+05 7.18e+05
b6 8.62e+00 1.66e+01 0.5199 0.6031 -2.39e+01 4.11e+01"
hypotheses(Model, hypothesis = "b6 = 8")
" Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
b6 = 8 0.617 16.6 0.0372 0.97 -31.9 33.1"
Just a thought: could this variable-naming issue be the cause of the malfunction of predictions() with lme() models reported in #99? If a data frame is supplied to the newdata = ... argument of predictions(), maybe that resolves some of the issues there.
I opened an issue reporting the left-out variables to {insight}: easystats/insight#727
In principle this could work, but I do not want to add support for a model if it requires hacks on the user's part. To add a model, I'd like things to work out of the box.

Re: the NA standard errors, I believe that this is because your model is ill-defined. See the warning and the message about "1 not defined because of singularities". For internal reasons, it would be very tricky for me to work around this problem. It is much easier for the user to fit a proper model instead.

library(ordinal)
mod <- clm(
cyl ~ vs + hp, scale = ~ vs, nominal = ~ hp,
data = transform(mtcars, cyl = factor(cyl)))
# Warning: (-3) not all thresholds are increasing: fit is invalid
# In addition: Absolute and relative convergence criteria were met
summary(mod)
# formula: cyl ~ vs + hp
# scale: ~vs
# nominal: ~hp
# data: transform(mtcars, cyl = factor(cyl))
#
# link threshold nobs logLik AIC niter max.grad cond.H
# logit flexible 32 -8.01 28.02 95(20) 1.10e-13 1.0e+17
#
# Coefficients: (1 not defined because of singularities)
# Estimate Std. Error z value Pr(>|z|)
# vs -22799 378100 -0.06 0.952
# hp NA NA NA NA
#
# log-scale coefficients:
# Estimate Std. Error z value Pr(>|z|)
# vs 8.617 16.575 0.52 0.603
#
# Threshold coefficients:
# Estimate Std. Error z value
# 4|6.(Intercept) 1.111e+05 1.842e+06 0.060
# 6|8.(Intercept) 9.177e+00 5.675e+00 1.617
# 4|6.hp -1.221e+03 2.024e+04 -0.060
# 6|8.hp -6.522e-02 3.609e-02 -1.807
No, the missing SEs in hypotheses() are due to exactly this variable-naming issue, not to a bad fit. For a model without warnings such as "(-3) not all thresholds are increasing", hypotheses() will still have missing SEs for the scale = ~ ... coefficients if nominal = ~ ... contains some variables that are already in the regular formula.
Please show me an example of a model where this happens.
I take back my suggestion above, as adding "a suffix or prefix" to variable names in a data frame is a dangerous move. Consider the case where formula = ~ ..., scale = ~ ..., and nominal = ~ ... share some common variables, such as:

summary(Model <- clm( # nominal + scale (also in location) effects
  cyl ~ vs + hp, scale = ~ vs, nominal = ~ hp,
  data = mtcars %>% mutate(cyl = factor(cyl))))

When making adjusted predictions, we do want to ensure that if vs = 0 becomes vs = 1 when evaluated in the regular formula = ~ ..., the same change happens in scale = ~ ... and nominal = ~ .... Therefore, assigning different names to the same variable in different formula components (e.g., renaming vs to s.vs in scale = ~ ...) while building a datagrid() is unwise and would lead to mistakes. Could you confirm that the latest version of {marginaleffects} builds counterfactual data correctly, so that an incremental change in a variable takes effect in every term that uses the variable, in every formula component? (However, this behavior is undesired for some other packages such as mlogit, where an incremental change in a variable should follow different schemes in different formula components, as described in #719.)
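For concreteness, a minimal sketch of the propagation in question; comparisons() and datagrid() are real marginaleffects functions, but this particular call, the 0-to-1 contrast, and hp = 120 are illustrative choices (output not shown):

library(ordinal)
library(marginaleffects)

# vs appears in both the location and the scale component under one name
m <- clm(cyl ~ hp + vs, scale = ~ vs,
         data = transform(mtcars, cyl = factor(cyl)))

# Because the data hold a single `vs` column, contrasting vs = 0 against
# vs = 1 changes vs simultaneously in every formula component that uses it.
comparisons(m, variables = list(vs = c(0, 1)), newdata = datagrid(hp = 120))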
There will be no examples that match the occasion you described, because coef(clm()) will have NA for the point estimate, SE, and so on of an "aliased" coefficient in the location effect (a term in the regular formula = ~ ...) if nominal = ~ ... contains the same term. These NAs are not due to convergence problems or singularity, but arise because the aliased coefficient is redundant and necessarily 0. This is like using K dummy variables to represent a K-category factor in lm(): coef() produces NA for the last dummy variable due to redundancy.

summary(Model <- lm( # okay to have a rank-deficient model matrix
  mpg ~ I(vs == 0) + I(vs == 1), data = mtcars))
"Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.557 1.224 20.058 < 2e-16 ***
I(vs == 0)TRUE -7.940 1.632 -4.864 3.42e-05 ***
I(vs == 1)TRUE NA NA NA NA "
hypotheses(Model, "b3 = 0") # correct
" Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
b3 = 0 NA NA NA NA NA NA"

The problem with hypotheses() is NOT in the coef() rows with NA. In certain scenarios (nominal = ~ ... shares a term with formula = ~ ..., so some coefficients in formula = ~ ... are NA, AND scale = ~ ... shares a term with formula = ~ ...), all inferential statistics such as the SE are NA (although the point estimate is retained) for the coefficients in scale = ~ ... ONLY, even though coef(summary(clm())) has no such problem. My previous examples in #718 (comment) show that this strange behavior of hypotheses() is simply due to variable naming. The bad news is that with the latest development version, I discovered that this strange SE behavior is not simply a display issue in hypotheses(): it spills over to predictions() (see the test example below) and, I suspect, comparisons().

summary(Model <- clm( # nominal + scale (also in location) effects
  cyl ~ vs + carb, scale = ~ vs, nominal = ~ carb,
  data = mtcars %>% mutate(cyl = factor(cyl))))
coef(summary(Model))
" Estimate Std. Error z value Pr(>|z|)
4|6.(Intercept) -2.5494136 1.0727656 -2.3764871 0.017478373
6|8.(Intercept) -2.5571012 1.7069284 -1.4980717 0.134114621
4|6.carb -0.0981885 0.3095422 -0.3172055 0.751087646
6|8.carb 0.3369959 0.3950073 0.8531385 0.393582510
vs -2.8264950 1.0428324 -2.7104020 0.006720171
carb NA NA NA NA
vs -2.3381336 3.0079436 -0.7773196 0.436970217"
hypotheses(Model) # Note: the last row has only a point estimate
" Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
b1 -2.5494 1.073 -2.376 0.01748 -4.652 -0.447
b2 -2.5571 1.707 -1.498 0.13411 -5.903 0.788
b3 -0.0982 0.310 -0.317 0.75109 -0.705 0.509
b4 0.3370 0.395 0.853 0.39358 -0.437 1.111
b5 -2.8265 1.043 -2.710 0.00672 -4.870 -0.783
b6 NA NA NA NA NA NA
b7 -2.3381 NA NA NA NA NA"
hypotheses(Model, hypothesis = "b7 = -2") # SE is NA, but should be 3.01
" Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
b7 = -2 -0.338 NA NA NA NA NA"
sapply(predict( # Pr{cyl == 4} for first five cars, using predict.clm()
Model, newdata = mtcars[ , -2], se.fit = T, interval = T),
function(x) x[1:5, 1])
" fit se.fit lwr upr
1 0.05010800 0.05846027 0.004728286 0.3693789
2 0.05010800 0.05846027 0.004728286 0.3693789
3 0.86455708 0.10809045 0.511116106 0.9749829
4 0.86455708 0.10809045 0.511116106 0.9749829
5 0.06032482 0.05514626 0.009447379 0.3017337"
predictions(Model, newdata = mtcars)[1:5, ] # Very different SEs on rows 3 and 4
" Group Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 % mpg disp hp
4 0.0501 0.0585 0.857 0.391 -0.0645 0.165 21.0 160 110
4 0.0501 0.0585 0.857 0.391 -0.0645 0.165 21.0 160 110
4 0.8646 0.6713 1.288 0.198 -0.4511 2.180 22.8 108 93
4 0.8646 0.6713 1.288 0.198 -0.4511 2.180 21.4 258 110
4 0.0603 0.0551 1.094 0.274 -0.0478 0.168 18.7 360 175"

As in #718 (comment), my current resolution is to remove the redundant term from the regular formula = ~ ... when it is already in nominal = ~ .... Once done, coef() has no aliased rows (no NA), and predictions() and hypotheses() work correctly.

summary(Model <- clm( # identical model as above
  cyl ~ vs, scale = ~ vs, nominal = ~ carb,
  data = mtcars %>% mutate(cyl = factor(cyl))))
coef(summary(Model)) # None changed, only the aliased carb row removed
" Estimate Std. Error z value Pr(>|z|)
4|6.(Intercept) -2.5494136 1.0727656 -2.3764871 0.017478373
6|8.(Intercept) -2.5571012 1.7069284 -1.4980717 0.134114621
4|6.carb -0.0981885 0.3095422 -0.3172055 0.751087646
6|8.carb 0.3369959 0.3950073 0.8531385 0.393582510
vs -2.8264950 1.0428324 -2.7104020 0.006720171
vs -2.3381336 3.0079436 -0.7773196 0.436970217"
hypotheses(Model) # Now SE of scale = ~ vs is fine
" Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
b1 -2.5494 1.073 -2.376 0.01748 -4.652 -0.447
b2 -2.5571 1.707 -1.498 0.13411 -5.903 0.788
b3 -0.0982 0.310 -0.317 0.75109 -0.705 0.509
b4 0.3370 0.395 0.853 0.39358 -0.437 1.111
b5 -2.8265 1.043 -2.710 0.00672 -4.870 -0.783
b6 -2.3381 3.008 -0.777 0.43697 -8.234 3.557"
hypotheses(Model, hypothesis = "b6 = -2") # Now SE = 3.01 as expected
" Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
b6 = -2 -0.338 3.01 -0.112 0.91 -6.23 5.56"
sapply(predict(
Model, newdata = mtcars[ , -2], se.fit = T, interval = T),
function(x) x[1:5, 1]) # same as before
" fit se.fit lwr upr
1 0.05010800 0.05846027 0.004728286 0.3693789
2 0.05010800 0.05846027 0.004728286 0.3693789
3 0.86455708 0.10809045 0.511116106 0.9749829
4 0.86455708 0.10809045 0.511116106 0.9749829
5 0.06032482 0.05514626 0.009447379 0.3017337"
predictions(Model, newdata = mtcars)[1:5, ] # Same as predict()
" Group Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 % mpg disp hp
4 0.0501 0.0585 0.857 0.391 -0.0645 0.165 21.0 160 110
4 0.0501 0.0585 0.857 0.391 -0.0645 0.165 21.0 160 110
4 0.8646 0.1081 7.999 <0.001 0.6527 1.076 22.8 108 93
4 0.8646 0.1081 7.999 <0.001 0.6527 1.076 21.4 258 110
4 0.0603 0.0551 1.094 0.274 -0.0478 0.168 18.7 360 175"

Would you update the SE calculation to account for the possibility that formula = ~ ... holds redundant NA coefficients when the same term is already in nominal = ~ ...? As this is also a variable-naming issue, it should call for an approach similar to the one you took in #718 (comment).
@MrJerryTAO can you try the latest version from GitHub?
I just did, using the examples in #718 (comment). I found that in the latest version of marginaleffects, the SEs of the scale effects look correct, even with redundant terms in the formula and aliased NA coefficients. Thanks, @vincentarelbundock. Now I feel safe describing the marginal effects in my dissertation :)
Yay!! |
Hi @vincentarelbundock, I examined {marginaleffects} with ordinal::clm objects and found incorrect SE estimation when clm() has both location and scale effects from the same variable. See the mtcars example below:
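A minimal sketch of the setup; the object names are illustrative, and the full reproduction appears at the top of this thread:

library(ordinal)
library(marginaleffects)

dat <- transform(mtcars, cyl = factor(cyl))
# vs enters both the location and the scale formula under the same name
mod <- clm(cyl ~ hp + vs, scale = ~ vs, data = dat)

# Compare: SEs from predict.clm() ...
p_clm <- predict(mod, newdata = subset(dat, select = -cyl), se.fit = TRUE)
# ... against SEs from marginaleffects::predictions()
p_me  <- predictions(mod, newdata = subset(dat, select = -cyl))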
It shows that predictions() produces very different SEs for the probability mass than predict() does whenever the scale effects differ from exp(0) = 1. Such large differences are not caused by the delta-method approximation or by the SEs belonging to different estimands. I suspect that the SE extractor in predictions() cannot recover the vcov() information completely when one or more scale-effect variables share an identical term name with a location-effect variable. The following update shows that variable naming is exactly the cause.
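A sketch of why a purely name-based parameter extractor is fragile here; that the two entries share the label "vs" is an inference from the duplicated vs rows of coef(summary(clm())) shown earlier in this thread:

# Reusing `mod` from the sketch above, where vs is a location and a scale
# term: the coefficient vector and covariance matrix then carry two entries
# with the same label, which name-based matching cannot tell apart.
names(coef(mod))
rownames(vcov(mod))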
Once the location and scale effect terms are named uniquely, the disparity between predictions() and predict() is gone. I think an easy fix on the user's side is to copy and rename a variable, so that the variable in the scale effect has a different name from its counterpart in the location effect even though both hold identical values; see the sketch after the next paragraph.
The above example also reveals a data-frame-building issue: by default, predictions() cannot correctly retrieve variables used in scale = ~ ... and nominal = ~ ... if they are not present in the regular formula. Explicitly specifying newdata = ... resolves this "object not found" error. If a term is in both the location and the scale formula, the default data-frame building is fine, but then the mistaken-SE problem in predictions() emerges. I checked and confirmed that common term names in nominal = ~ ... and formula = ~ ... do not cause a problem in predictions(), because clm() prefixes the variable names in nominal = ~ ... with the threshold names.
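A minimal sketch of both workarounds; the vs2 copy mirrors the construction used at the top of this thread:

# Workaround 1: copy-and-rename so each formula component has a unique name.
dat2 <- transform(mtcars, cyl = factor(cyl), vs2 = vs) # vs2 duplicates vs
mod2 <- clm(cyl ~ hp + vs, scale = ~ vs2, data = dat2)

# Workaround 2: supply newdata explicitly so every formula component
# (location, scale, nominal) can locate its variables.
predictions(mod2, newdata = subset(dat2, select = -cyl))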
Will you fix the variable-naming scheme in predictions() so that it treats a variable in formula = ~ ..., scale = ~ ..., and nominal = ~ ... separately despite common term names? Maybe add a suffix or prefix. Or throw an error when term labels from multiple formulas conflict. The current predictions() output in such a scenario, with no warning or error message, can be misleading if the user takes it at face value without realizing the possible mistakes. As the example above shows, the SE disparity and its consequences for inference are not trivial.
Furthermore, the interval estimates from predictions() are very different from those of predict(): the former uses a normal approximation on the probability scale, while the latter derives an SE on the linear predictor and transforms it back to probability intervals. As a result, some probability CIs made by predictions() fall far outside [0, 1]. Is it possible for predictions() to give asymmetric confidence intervals based on the model link and the linear predictor? The current transform = function() argument of predictions() cannot salvage the situation: it transforms the already-generated normal CI (which may lie outside [0, 1]) instead of transforming the estimate before building the CI.
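To illustrate the request, a generic base-R sketch of the link-scale construction (not marginaleffects code; the numbers are made up):

# Build the CI on the linear-predictor (logit) scale, then map it through
# the inverse link: the bounds stay inside [0, 1] and the interval becomes
# asymmetric around the point estimate on the probability scale.
eta    <- 1.2                                    # hypothetical linear predictor
se_eta <- 0.8                                    # hypothetical SE, logit scale
ci_eta <- eta + c(-1, 1) * qnorm(0.975) * se_eta
plogis(eta)    # point estimate on the probability scale
plogis(ci_eta) # asymmetric CI, guaranteed within [0, 1]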