serialization-deserialization bug #143
The fact that you cannot print the de-serialized preprocessor is not necessarily an issue, I think? It seems to behave the same before and after (de)serialization. I was looking into why you cannot transform after de-serializing, and I think some information is lost in the (de-)serialization process. To give an example: the _bins_by_column element is not visible when just looking at the _discretizer, but it is still there.
This is why (for me) I cannot directly transform new data after de-serializing. I wanted to leave some information here already, but will investigate this further. I can imagine the same happens in the categorical data processor. A way forward would probably be to make sure the full information gets (de-)serialized. A rough check for what goes missing is sketched below.
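A minimal sketch of such a check, assuming the attribute names mentioned above (_discretizer on the PreProcessor, _bins_by_column on its discretizer) and the preprocessor/new_preprocessor objects from the reproduction steps below; the attribute-dict comparison is an illustration, not part of Cobra's API:

# repr() does not show private attributes, but they are present on the fitted object:
print(hasattr(preprocessor._discretizer, "_bins_by_column"))  # True after fitting
# Compare attribute dicts across the serialize/deserialize round trip:
lost = set(vars(preprocessor._discretizer)) - set(vars(new_preprocessor._discretizer))
print("attributes lost in (de-)serialization:", lost)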
Patrick's logged issue #176 might be a duplicate; @patrickleonardy, please check whether yours was on the model and Joost's on the preprocessor.
#176 relates to the serialization/de-serialization of the LinearRegressionModel and maybe the LogisticRegressionModel, so no duplicates here.
For me the issue is solved now. The main issue was in target_encoder.py. At line 126 there is a check on a parameter of the target encoder, _global_mean. This is a floating-point number, in my case of type np.float64. The if statement only checked whether the type was float; this check failed, so the variable was left empty during deserialization, and Cobra therefore assumed the target encoder was not fitted. I extended the check to take different kinds of floating-point numbers into account (see the sketch below). I have tested the entire flow with continuous and categorical variables and everything seems to work fine now. The debugging is documented in a notebook, which is pushed to git at the moment as well.
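A sketch of an extended check along these lines, assuming the value may be a plain Python float or a NumPy floating scalar; the helper name _is_float_like is hypothetical, and this is an illustration rather than the actual patch:

import numpy as np

def _is_float_like(value):
    # Hypothetical helper: accept plain Python floats as well as
    # NumPy floating scalars such as np.float64 or np.float32.
    return isinstance(value, (float, np.floating))

# Hypothetical use during deserialization of the target encoder:
# if _is_float_like(params["_global_mean"]):
#     self._global_mean = params["_global_mean"]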
Bug Report
After serializing and de-serializing a PreProcessor fitted on continuous variables only, the new PreProcessor can neither be printed nor used to transform new data (still to check whether the same happens when categorical variables are present).
Description
For the first point: the problem seems to be a mismatch between the names of the attributes and the names of the parameters in the function definition. self._get_param_names() returns "categorical_data_processor", but getattr() only knows "_categorical_data_processor".
Changing the naming resolves this problem, but is there no other way? A sketch of the mismatch is given below.
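A sketch of the mismatch, assuming scikit-learn-style parameter introspection via _get_param_names(); the fallback to the underscore-prefixed attribute is one possible workaround, not Cobra's implementation:

for name in preprocessor._get_param_names():  # yields e.g. "categorical_data_processor"
    value = getattr(preprocessor, name, None)  # misses: attribute is stored as "_categorical_data_processor"
    if value is None:
        value = getattr(preprocessor, "_" + name, None)  # fall back to the private attribute name
    print(name, "->", type(value).__name__)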
For the second point: there is a problem when creating the pipeline_dictionary; it seems that some keywords are empty even though they should have a value. A quick check for this is sketched below.
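An illustrative way to spot such empty entries, assuming serialize_pipeline() returns a flat dictionary of JSON-compatible values:

pipeline_serialized = preprocessor.serialize_pipeline()
empty_keys = [key for key, value in pipeline_serialized.items()
              if value in (None, "", [], {})]
print("keys with empty values:", empty_keys)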
Steps to Reproduce
from sklearn.datasets import load_iris
import pandas as pd

X, y = load_iris(return_X_y=True, as_frame=True)
# Concatenate column-wise so features and target stay aligned;
# with as_frame=True, y is already a Series named "target".
df = pd.concat([X, y], axis=1)

from cobra.preprocessing import PreProcessor

preprocessor = PreProcessor.from_params()
continuous_vars = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
discrete_vars = []

preprocessor.fit(df, continuous_vars=continuous_vars, discrete_vars=discrete_vars, target_column_name="target")

# Serialize the fitted pipeline and build a new PreProcessor from it.
pipeline_serialized = preprocessor.serialize_pipeline()
new_preprocessor = PreProcessor.from_pipeline(pipeline_serialized)

new_preprocessor  # first point: printing the de-serialized preprocessor fails

new_preprocessor.transform(df, continuous_vars=continuous_vars, discrete_vars=[])  # second point: transform fails
Actual Results
I got ...