Does FSelectorRcpp produce the same results as FSelector? #51

Open
larskotthoff opened this issue Mar 15, 2017 · 16 comments

@larskotthoff

Do you guys have any tests to check this? We're thinking of replacing FSelector with FSelectorRcpp in mlr, but we'd like to be sure that we remain reproducible.

@berndbischl

@zzawadz
Member

zzawadz commented Mar 15, 2017

For some functions - yes - see e.g. https://github.com/mi2-warsaw/FSelectorRcpp/blob/master/tests/testthat/test-information_gain.R

This is part of the code:

test_that("Comparsion with FSelector", {
  expect_equal(information.gain(Species ~ ., data = iris)$attr_importance,
               information_gain(formula = Species ~ ., data = iris)$importance)

  expect_equal(gain.ratio(Species ~ ., data = iris)$attr_importance,
               information_gain(formula = Species ~ ., data = iris,
                                type = "gainratio")$importance)

  expect_equal(symmetrical.uncertainty(Species ~ .,
                                       data = iris)$attr_importance,
               information_gain(formula = Species ~ ., data = iris,
                                type = "symuncert")$importance)
})

For the other functions, please send us a list of the functionality that must be checked against FSelector, and we will prepare the required tests to convince you that everything is fine :)

@larskotthoff
Author

I'd love to see tests for all of the functions that users can call, ideally on a range of different inputs. Maybe using quickcheck (https://github.com/RevolutionAnalytics/quickcheck)?
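
A rough sketch of what such a randomized comparison could look like, without committing to quickcheck. The helper names `random_dataset` and `max_importance_diff` are hypothetical; the only package calls used are the ones shown in the test excerpt above, and the comparison is skipped (returning `NA`) when either package is not installed.

```r
# Hypothetical generator: a factor target `y` plus numeric (double)
# predictors whose means shift with the class, so that the information
# gain is not trivially zero.
random_dataset <- function(n_rows = 100, n_cols = 3, n_classes = 3) {
  y <- factor(sample(letters[seq_len(n_classes)], n_rows, replace = TRUE))
  x <- lapply(seq_len(n_cols), function(j) {
    rnorm(n_rows) + as.integer(y) * runif(1)
  })
  names(x) <- paste0("x", seq_len(n_cols))
  data.frame(x, y = y)
}

# Hypothetical check: largest absolute difference between the importance
# values reported by the two packages on one data set.
# Returns NA when either package cannot be loaded.
max_importance_diff <- function(df) {
  if (!requireNamespace("FSelector", quietly = TRUE) ||
      !requireNamespace("FSelectorRcpp", quietly = TRUE)) {
    return(NA_real_)
  }
  old <- FSelector::information.gain(y ~ ., data = df)$attr_importance
  new <- FSelectorRcpp::information_gain(formula = y ~ ., data = df)$importance
  max(abs(old - new))
}

set.seed(1)
diffs <- vapply(1:5, function(i) max_importance_diff(random_dataset()),
                numeric(1))
print(diffs)
```

Any value clearly above floating-point noise would point at an input on which the two implementations disagree.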

@larskotthoff
Author

Oh and once I'm convinced I'm willing to officially deprecate FSelector in favour of FSelectorRcpp.

@zzawadz
Member

zzawadz commented Mar 15, 2017

Ok. We will work on this.

Thanks!

@MarcinKosinski
Contributor

@zzawadz another amazing challenge for FSelectorRcpp : )

Maybe the easiest way would be to include FSelectorRcpp in FSelector itself.

@zzawadz
Member

zzawadz commented Mar 15, 2017

@MarcinKosinski
Good idea. We can replace the internal implementations in FSelector step by step until the two packages converge. @larskotthoff What do you think?

@larskotthoff
Author

Sounds good. Pull requests welcome!

@MarcinKosinski
Contributor

So this can be closed, see #27 :) @larskotthoff is aware that we will propose replacing the internal implementation.

@MarcinKosinski
Contributor

Getting back to this thread.
FSelectorRcpp will be available on CRAN again soon (it was removed because information about a C++ dependency was missing), see #69.

To make FSelectorRcpp part of the FSelector engine, I think we could try replacing the FSelector:::information.gain.body() function with FSelectorRcpp::information_gain(). We need to polish FSelectorRcpp so that it produces the same results as FSelector, and also add alternative approaches to handling NAs and to discretizing the dependent variable.

2 tasks should be finished then

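# Sketch of the proposed change inside FSelector's own source.
# `params` is a placeholder for the real arguments, so this is not runnable as-is.
information.gain.body <- function(params, equal = TRUE) {
    FSelectorRcpp::information_gain(params, equal = equal)
}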

@RandomGuessR

Hi, I am struggling to get the same results from FSelectorRcpp and FSelector - posted under this issue: mlr-org/mlr#1677 (comment).
The results I get are actually very different, and the impact on an end model is large.
Would appreciate your help if I am doing anything wrong.
Thanks!

@zzawadz
Member

zzawadz commented Oct 19, 2018

@RandomGuessR

FSelectorRcpp treats integer columns as factors rather than as numerics, so it does not discretize them before calculating the information gain. To get the same result as FSelector, you need to cast the integer columns to numeric.

See the code below:

library(FSelectorRcpp)
library(FSelector)
dt <- read.csv("~/Downloads/all/train.csv")

dt2 <- data.frame(
  yy = dt$target,
  X0deb4b6a8 = dt$X0deb4b6a8,
  X0deb4b6a8Numeric = as.numeric(dt$X0deb4b6a8)
)

information_gain(yy ~ ., dt2, equal = TRUE)
#          attributes  importance
# 1        X0deb4b6a8 0.001443917
# 2 X0deb4b6a8Numeric 0.000000000


information.gain(yy ~ ., dt2)
#                   attr_importance
# X0deb4b6a8                      0
# X0deb4b6a8Numeric               0
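
When a data frame has many integer columns, the same cast can be applied to all of them at once. A minimal base-R sketch; `cast_integers_to_numeric` is a hypothetical helper name, not a function from either package, and the toy data only stands in for the real training set.

```r
# Hypothetical helper (not part of FSelector or FSelectorRcpp):
# convert every integer column of a data frame to double, leaving
# factors, doubles and other column types untouched.
cast_integers_to_numeric <- function(df) {
  df[] <- lapply(df, function(col) {
    # is.integer() is FALSE for factors, so factors are not converted.
    if (is.integer(col)) as.numeric(col) else col
  })
  df
}

# Toy data standing in for the real training set.
toy <- data.frame(
  y = factor(c("a", "b", "a", "b")),
  count = c(1L, 2L, 3L, 4L),
  score = c(0.5, 1.5, 2.5, 3.5)
)

toy_num <- cast_integers_to_numeric(toy)
str(toy_num)
```

The converted data frame can then be passed to information_gain() in place of the original one.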

@RandomGuessR

RandomGuessR commented Oct 19, 2018

Thanks for helping with this so quickly!
It might be good to document this difference somewhere in the package(s).

@MarcinKosinski
Contributor

MarcinKosinski commented Oct 19, 2018 via email

@zzawadz
Member

zzawadz commented Oct 19, 2018

@RandomGuessR @MarcinKosinski

I found inconsistent behavior in FSelectorRcpp :( information_gain() does not discretize integers, but discretize() does :( I consider this a bug, and I'll fix it.

@RandomGuessR

Thanks @zzawadz.

After changing the data from integer to numeric, FSelectorRcpp works like a treat; really happy with the performance.
The RWeka-based implementation was too slow for most real-world practical purposes.

@zzawadz
Member

zzawadz commented Nov 10, 2018

We (I should say I) decided that FSelectorRcpp will try to mimic the behavior of FSelector, so from 0.3.0 onward integers will be treated as numerics by default, not as factors.
