add test for timeseries index #135

quasiben · 2019-03-06T02:24:01Z

adds a test for https://github.com/rapidsai/dask-cudf/issues/129

mrocklin

This works already with the upstream changes I take it?

dask_cudf/tests/test_core.py

quasiben · 2019-03-06T14:51:07Z

I'm seeing something odd with dtype handling in dask-cudf and not cudf. cudf.set_index(...) does the right thing with dtype handling and the newly created index has type <M8[ms], whereas, the equivalent call: dask_cudf.set_index(...) has an index of dtype int64. Also note, when calling dask_cudf.head` the correct dtype is returned

In [2]: import dask.dataframe as dd                                                                                                                                                     

In [3]: import cudf 
   ...: import pandas as pd 
   ...: import numpy as np 
   ...: gdf = cudf.DataFrame() 
   ...: gdf['date'] = pd.date_range('11/20/2018', periods=72, freq='D') 
   ...: gdf['value'] = np.random.sample(len(gdf)) 
   ...:  
   ...: ddf = dd.from_pandas(gdf, npartitions=2) 
   ...:  
   ...: ddf_ts_idx = ddf.set_index('date')                                                                                                                                              

In [4]: gdf_idx = gdf.set_index('date')                                                                                                                                                 

In [5]: gdf_idx.index.dtype                                                                                                                                                             
Out[5]: dtype('<M8[ms]')

In [6]: ddf_ts_idx.index.dtype                                                                                                                                                          
Out[6]: dtype('int64')

In [7]: ddf_ts_idx.index.head()                                                                                                                                                         
Out[7]: DatetimeIndex([numpy.datetime64('2018-11-20T00:00:00.000'), numpy.datetime64('2018-11-21T00:00:00.000'), numpy.datetime64('2018-11-22T00:00:00.000'), numpy.datetime64('2018-11-23T00:00:00.000'), numpy.datetime64('2018-11-24T00:00:00.000')], dtype=datetime64[ms])

In [10]: ddf_ts_idx.index.head().dtype                                                                                                                                                  
Out[10]: dtype('<M8[ms]')

mrocklin · 2019-03-06T15:12:51Z

You would want to track the ._meta attribute in the various stages of the set_index operation in dask-cudf and see when it goes wrong. I have to admit that I don't know much about dask-cudf's set_index operation. It relies on a bit of code in sort_values that is unique to dask-cudf that I haven't invested the time in understanding yet.

quasiben · 2019-03-06T15:26:50Z

I think i've got it in cudf from exactly where you said. In the map_partitions method we call self._meta.set_index(colname) which eventually calls to https://github.com/rapidsai/cudf/blob/branch-0.6/python/cudf/dataframe/index.py#L541 . as_index comes back with a RangeIndex instead of DateTimeIndex. @kkraus14 do you have thoughts here ?

If I amend as_index to use:

isinstance(arbitrary, DatetimeColumn) or np.issubdtype(arbitrary.dtype, np.datetime64):

the correct index is used but I get assertion failures in the as_column calls in the Series constructor.

mike-wendt · 2019-03-14T05:07:35Z

If not targeted for v0.6 release please change the base to branch-0.7, thanks!

kkraus14 · 2019-05-15T16:20:40Z

rerun tests

kkraus14 · 2019-05-15T18:38:57Z

I think i've got it in cudf from exactly where you said. In the map_partitions method we call self._meta.set_index(colname) which eventually calls to https://github.com/rapidsai/cudf/blob/branch-0.6/python/cudf/dataframe/index.py#L541 . as_index comes back with a RangeIndex instead of DateTimeIndex. @kkraus14 do you have thoughts here ?

If I amend as_index to use:
isinstance(arbitrary, DatetimeColumn) or np.issubdtype(arbitrary.dtype, np.datetime64):
the correct index is used but I get assertion failures in the as_column calls in the Series constructor.

Looks like we have a cudf bug in handling empty columns where it's casting them to RangeIndex objects inadvertantly, will open an issue in cudf.

kkraus14 · 2019-05-16T23:04:01Z

rerun tests

quasiben added 2 commits March 5, 2019 18:17

add test for timeseries index

d5ca5ae

fix test and lint

8f2dc34

mrocklin reviewed Mar 6, 2019

View reviewed changes

dask_cudf/tests/test_core.py Outdated Show resolved Hide resolved

dask_cudf/tests/test_core.py Outdated Show resolved Hide resolved

mike-wendt changed the base branch from master to branch-0.6 March 14, 2019 05:07

kkraus14 changed the base branch from branch-0.6 to branch-0.8 May 14, 2019 20:46

kkraus14 self-assigned this May 14, 2019

Use dask dataframe api and pass dask dataframe into assert_eq

0eefafc

Use dask.dataframe, not dask-cudf.dataframe

69c3e42

kkraus14 mentioned this pull request May 15, 2019

[BUG] Empty columns are converted to RangeIndex in df.set_index(...) function rapidsai/cudf#1753

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add test for timeseries index #135

add test for timeseries index #135

quasiben commented Mar 6, 2019

mrocklin left a comment

quasiben commented Mar 6, 2019

mrocklin commented Mar 6, 2019

quasiben commented Mar 6, 2019 •

edited

Loading

mike-wendt commented Mar 14, 2019

kkraus14 commented May 15, 2019

kkraus14 commented May 15, 2019

kkraus14 commented May 16, 2019

add test for timeseries index #135

Are you sure you want to change the base?

add test for timeseries index #135

Conversation

quasiben commented Mar 6, 2019

mrocklin left a comment

Choose a reason for hiding this comment

quasiben commented Mar 6, 2019

mrocklin commented Mar 6, 2019

quasiben commented Mar 6, 2019 • edited Loading

mike-wendt commented Mar 14, 2019

kkraus14 commented May 15, 2019

kkraus14 commented May 15, 2019

kkraus14 commented May 16, 2019

quasiben commented Mar 6, 2019 •

edited

Loading