Skip to content
This repository has been archived by the owner on Jul 16, 2019. It is now read-only.

add test for timeseries index #135

Open
wants to merge 4 commits into
base: branch-0.8
Choose a base branch
from

Conversation

quasiben
Copy link
Member

@quasiben quasiben commented Mar 6, 2019

Copy link
Collaborator

@mrocklin mrocklin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works already with the upstream changes I take it?

dask_cudf/tests/test_core.py Outdated Show resolved Hide resolved
dask_cudf/tests/test_core.py Outdated Show resolved Hide resolved
@quasiben
Copy link
Member Author

quasiben commented Mar 6, 2019

I'm seeing something odd with dtype handling in dask-cudf and not cudf. cudf.set_index(...) does the right thing with dtype handling and the newly created index has type <M8[ms], whereas, the equivalent call: dask_cudf.set_index(...) has an index of dtype int64. Also note, when calling dask_cudf.head` the correct dtype is returned

In [2]: import dask.dataframe as dd                                                                                                                                                     

In [3]: import cudf 
   ...: import pandas as pd 
   ...: import numpy as np 
   ...: gdf = cudf.DataFrame() 
   ...: gdf['date'] = pd.date_range('11/20/2018', periods=72, freq='D') 
   ...: gdf['value'] = np.random.sample(len(gdf)) 
   ...:  
   ...: ddf = dd.from_pandas(gdf, npartitions=2) 
   ...:  
   ...: ddf_ts_idx = ddf.set_index('date')                                                                                                                                              

In [4]: gdf_idx = gdf.set_index('date')                                                                                                                                                 

In [5]: gdf_idx.index.dtype                                                                                                                                                             
Out[5]: dtype('<M8[ms]')

In [6]: ddf_ts_idx.index.dtype                                                                                                                                                          
Out[6]: dtype('int64')

In [7]: ddf_ts_idx.index.head()                                                                                                                                                         
Out[7]: DatetimeIndex([numpy.datetime64('2018-11-20T00:00:00.000'), numpy.datetime64('2018-11-21T00:00:00.000'), numpy.datetime64('2018-11-22T00:00:00.000'), numpy.datetime64('2018-11-23T00:00:00.000'), numpy.datetime64('2018-11-24T00:00:00.000')], dtype=datetime64[ms])

In [10]: ddf_ts_idx.index.head().dtype                                                                                                                                                  
Out[10]: dtype('<M8[ms]')

@mrocklin
Copy link
Collaborator

mrocklin commented Mar 6, 2019

You would want to track the ._meta attribute in the various stages of the set_index operation in dask-cudf and see when it goes wrong. I have to admit that I don't know much about dask-cudf's set_index operation. It relies on a bit of code in sort_values that is unique to dask-cudf that I haven't invested the time in understanding yet.

@quasiben
Copy link
Member Author

quasiben commented Mar 6, 2019

I think i've got it in cudf from exactly where you said. In the map_partitions method we call self._meta.set_index(colname) which eventually calls to https://github.com/rapidsai/cudf/blob/branch-0.6/python/cudf/dataframe/index.py#L541 . as_index comes back with a RangeIndex instead of DateTimeIndex. @kkraus14 do you have thoughts here ?

If I amend as_index to use:

isinstance(arbitrary, DatetimeColumn) or np.issubdtype(arbitrary.dtype, np.datetime64):

the correct index is used but I get assertion failures in the as_column calls in the Series constructor.

@mike-wendt mike-wendt changed the base branch from master to branch-0.6 March 14, 2019 05:07
@mike-wendt
Copy link
Contributor

If not targeted for v0.6 release please change the base to branch-0.7, thanks!

@kkraus14 kkraus14 changed the base branch from branch-0.6 to branch-0.8 May 14, 2019 20:46
@kkraus14 kkraus14 self-assigned this May 14, 2019
@kkraus14
Copy link
Contributor

rerun tests

@kkraus14
Copy link
Contributor

I think i've got it in cudf from exactly where you said. In the map_partitions method we call self._meta.set_index(colname) which eventually calls to https://github.com/rapidsai/cudf/blob/branch-0.6/python/cudf/dataframe/index.py#L541 . as_index comes back with a RangeIndex instead of DateTimeIndex. @kkraus14 do you have thoughts here ?

If I amend as_index to use:

isinstance(arbitrary, DatetimeColumn) or np.issubdtype(arbitrary.dtype, np.datetime64):

the correct index is used but I get assertion failures in the as_column calls in the Series constructor.

Looks like we have a cudf bug in handling empty columns where it's casting them to RangeIndex objects inadvertantly, will open an issue in cudf.

@kkraus14
Copy link
Contributor

rerun tests

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants