Object Data Type with Sparse Matrix #51

Open
bmweiner opened this issue Jan 10, 2016 · 1 comment

Comments

@bmweiner

Per #37, scipy.sparse.hstack is called whenever a sparse matrix is among the extracted features. However, scipy.sparse.hstack cannot upcast dtype=object, so even if sparse=False is set on the mapper, the hstack fails whenever an np.ndarray of dtype=object is involved.
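The underlying limitation can be reproduced without sklearn-pandas. The snippet below is a minimal sketch, assuming only NumPy and SciPy are installed; the names `onehot` and `labels` are made up for illustration and stand in for an encoder's sparse output and a passthrough string column. The exception type may vary between SciPy versions, so both TypeError and ValueError are caught.

```python
import numpy as np
from scipy import sparse

# Stand-in for a one-hot encoded column: a float64 sparse block.
onehot = sparse.csr_matrix(np.eye(3))

# Stand-in for a passthrough string column: a dense object array.
labels = np.array([['r'], ['w'], ['b']], dtype=object)

try:
    # scipy has to find a common numeric dtype for all blocks,
    # and there is none for float64 and object.
    sparse.hstack([onehot, labels])
    failed = False
except (TypeError, ValueError) as exc:
    failed = True
    print(type(exc).__name__, exc)
```

Nothing here depends on the mapper's sparse flag, which is why the failure shows up even when a dense result was requested.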

Passing example. Note that int64/float64 is upcast to float64.

In [432]:

df = pd.DataFrame({'int':[1,2,3],
                   'flt':[2.,3,4],
                   'obj':['r','w','b']})
mapper = sklearn_pandas.DataFrameMapper([
        (['int'],[sklearn.preprocessing.OneHotEncoder()]),        
        (['flt'],[sklearn.preprocessing.OneHotEncoder()])
        ], sparse=True)
mapper.fit_transform(df)

Out[432]:
<3x6 sparse matrix of type '<type 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>

Failing example: scipy is unable to upcast int64/object (see scipy\sparse\sputils.pyc for the upcast code).

In [434]:

mapper = sklearn_pandas.DataFrameMapper([
        (['int'],[sklearn.preprocessing.OneHotEncoder()]),
        ('obj', None)])
mapper.fit_transform(df)

TypeError: no supported conversion for types: (dtype('float64'), dtype('O'))

I think it's ok if an error is thrown when sparse=True and an array of type object is involved, but not if sparse=False.

I'll submit a pull request with a recommended fix.
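For discussion, here is a rough sketch of the behaviour described above. It is not the code from the pull request: `stack_columns` and `want_sparse` are hypothetical names, and the real mapper internals may differ. The idea is to call scipy.sparse.hstack only when sparse output was requested, and otherwise densify any sparse blocks and let NumPy do the upcasting.

```python
import numpy as np
from scipy import sparse

def stack_columns(blocks, want_sparse=False):
    """Hypothetical helper, not the sklearn-pandas implementation."""
    if want_sparse:
        # Sparse output requested: scipy must upcast every block,
        # so an object-dtype block still raises here.
        return sparse.hstack(blocks).tocsr()
    # Dense output requested: convert sparse blocks to ndarrays first,
    # then let np.hstack upcast (object dtype is allowed there).
    dense = [b.toarray() if sparse.issparse(b) else np.asarray(b)
             for b in blocks]
    return np.hstack(dense)

onehot = sparse.csr_matrix(np.eye(3))                    # sparse float64 block
labels = np.array([['r'], ['w'], ['b']], dtype=object)   # dense object block

combined = stack_columns([onehot, labels], want_sparse=False)
print(combined.dtype, combined.shape)
```

The trade-off is that the dense path materialises every sparse block in memory, which seems acceptable when the caller has explicitly asked for a dense result.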

bmweiner added a commit to bmweiner/sklearn-pandas that referenced this issue Jan 10, 2016
@bmweiner
Author

Submitted pull request #52.

With the modified code, the sparse output from the int64 column and the dense object column combine into a dense array correctly:

In [21]:

df = pd.DataFrame({'int':[1,2,3],
                   'flt':[2.,3,4],
                   'obj':['r','w','b']})
mapper = sklearn_pandas.DataFrameMapper([
        (['int'],[sklearn.preprocessing.OneHotEncoder()]),        
        ('obj', None)])
mapper.fit_transform(df)

Out[21]:
array([[1.0, 0.0, 0.0, 'r'],
       [0.0, 1.0, 0.0, 'w'],
       [0.0, 0.0, 1.0, 'b']], dtype=object)

But it still fails, as expected, when trying to combine them as sparse:

In [23]:

mapper = sklearn_pandas.DataFrameMapper([
        (['int'],[sklearn.preprocessing.OneHotEncoder()]),        
        ('obj', None)],
        sparse=True)
mapper.fit_transform(df)   

TypeError: no supported conversion for types: (dtype('float64'), dtype('O'))
