Commit
* Fix #28
* Fix #26, #29, #31
* Fix #38
* Add `str_dtype` argument to `as_character()` to partially fix #36
* 0.3.2
* Delete grouped2.py
Showing 19 changed files with 243 additions and 159 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -4,4 +4,4 @@ | |
from .core import _frame_format_patch | ||
from .core.defaults import f | ||
|
||
__version__ = "0.3.1" | ||
__version__ = "0.3.2" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
|
||
- dtype | ||
|
||
`NA` in datar is set to `numpy.nan`, which is a float. This causes problems for data of other dtypes, because assigning `NA` (a float) to an array of a different dtype is not compatible. Unlike R, python does not have a missing value type for other dtypes. | ||
|
||
pandas has introduced its own `NA` and some `NA`-compatible dtypes. However, `numpy` is still not aware of them, which causes problems for internal computations. | ||
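To make the incompatibility concrete, here is a minimal sketch using plain `numpy` and `pandas` rather than `datar` itself, with `numpy.nan` standing in for `NA`. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# An integer array cannot hold a float NaN.
ints = np.array([1, 2, 3])
try:
    ints[0] = np.nan
    assigned = True
except ValueError:
    # numpy refuses to convert float NaN to an integer
    assigned = False

# Converting to float first makes room for the missing value.
floats = ints.astype(float)
floats[0] = np.nan

# pandas' nullable integer dtype keeps integers alongside pandas.NA.
nullable = pd.array([1, None, 3], dtype="Int64")
```

The float conversion is the price paid for a `numpy`-level missing value, while the `Int64` array only works as long as the computation stays inside pandas.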
|
||
- string | ||
|
||
When initializing a string array with `NA` explicitly, as in `numpy.array(['a', NA])`, the `NA` is converted to the string `'nan'`, which may not be what we want. To avoid that, use `None` or `NULL` instead: | ||
|
||
```python | ||
>>> numpy.array(['a', None]) | ||
array(['a', None], dtype=object) | ||
``` | ||
|
||
Note that the dtype falls back to `object` in this case. | ||
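The two behaviors can be compared side by side. This is a small sketch in plain `numpy`, assuming `numpy.nan` as the stand-in for `NA`:

```python
import numpy as np

# numpy.nan stands in for NA here.
with_nan = np.array(['a', np.nan])
with_none = np.array(['a', None])

second_item = with_nan[1]    # a string, no longer a float
kind = with_nan.dtype.kind   # 'U' means a unicode string dtype
```

Checking `with_none.dtype` afterwards confirms the fallback to `object` mentioned above.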
|
||
|
||
- `NaN` | ||
|
||
Since `NA` is already a float, `NaN` here is equivalent to `NA`. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
|
||
Most APIs from tidyverse packages ignore/reset the index (row names) of data frames, and so do the APIs from `datar`. So when selecting rows, row indices are always used. With most APIs, the index of the data frame is dropped, so it actually ranges from 0 to `nrow(df) - 1`. | ||
|
||
!!! Note | ||
|
||
When using 1-based indexing (default), 1 selects the first row, even though the first row shows index 0 when it's printed. | ||
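As a rough illustration of the idea in plain pandas (this is not how `datar` implements it, and the names below are made up for the example), dropping the index and translating a 1-based position could look like this:

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30]}, index=["a", "b", "c"])

# Dropping the index leaves a plain 0..n-1 range, as described above.
reset = df.reset_index(drop=True)

# With 1-based selection, position 1 maps to the row printed with index 0.
position = 1
first_x = reset["x"].iloc[position - 1]
```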
|
||
`MultiIndex` indices/column names are not supported by the APIs that select or manipulate data frames, and the data frames generated by the APIs will not have `MultiIndex` indices/column names. However, since the result is still a pandas DataFrame, you can always do it the pandas way: | ||
|
||
```python | ||
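import pandas as pd | ||
from datar.all import f, tibble, mutate | ||
df = tibble(x=1, y=2) | ||
df2 = df >> mutate(z=f.x+f.y) | ||
# pandas way to select (label-based, so the column name is a string) | ||
df2.loc[0, 'z'] # 3 | ||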
# add multiindex to it: | ||
df.columns = pd.MultiIndex.from_product([df.columns, ['C']]) | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
|
||
`datar` doesn't use `pandas`' `DataFrameGroupBy`/`SeriesGroupBy` classes. Instead, we have our own `DataFrameGroupBy` class, which is actually a subclass of `DataFrame`, with 3 extra properties: `_group_data`, `_group_vars` and `_group_drop`, carrying the grouping data, the grouping variables/columns and whether to drop the unobserved values. This is very similar to `grouped_df` from `dplyr`. | ||
|
||
The reasons that we implement this are: | ||
|
||
1. Pandas DataFrameGroupBy cannot handle multiple categorical columns as | ||
groupby variables with unobserved values | ||
2. It is very hard to retrieve group indices and data when doing apply | ||
3. NAs in grouping variables are not matched |
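One way to read the third point, shown here as a hedged sketch in plain pandas: by default, `groupby` silently leaves out rows whose grouping key is missing. Whether this is exactly the case the list refers to is an assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"g": ["a", np.nan, "a", "b"], "v": [1, 2, 3, 4]})

# Default behavior: the row with a missing key does not land in any group.
default_sizes = df.groupby("g").size()

# Opting in with dropna=False keeps a separate group for the missing key.
with_na_sizes = df.groupby("g", dropna=False).size()
```

Here the default result counts three rows across two groups, while `dropna=False` accounts for all four rows.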
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,80 @@ | ||
`%in%` in R is a shortcut for `is.element()` to test if the elements are in a container. | ||
|
||
```r | ||
r$> c(1,3,5) %in% 1:4 | ||
[1] TRUE TRUE FALSE | ||
|
||
r$> is.element(c(1,3,5), 1:4) | ||
[1] TRUE TRUE FALSE | ||
``` | ||
|
||
However, `in` in python acts differently: | ||
|
||
```python | ||
>>> import numpy as np | ||
>>> | ||
>>> arr = np.array([1,2,3,4]) | ||
>>> elts = np.array([1,3,5]) | ||
>>> | ||
>>> elts in arr | ||
/.../bin/bpython:1: DeprecationWarning: elementwise comparison failed; this will raise an error in the future. | ||
#!/.../bin/python | ||
False | ||
>>> [1,2] in [1,2,3] | ||
False | ||
``` | ||
|
||
It simply tests whether the element on the left side of `in` is equal to any of the elements on the right side, regardless of whether the element on the left side is a scalar or not. | ||
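The same point can be seen with plain lists, without `numpy`. In this small sketch (variable names are illustrative), the left operand is treated as a single item, and an element-wise test has to be written out explicitly:

```python
haystack = [1, 2, 3, 4]
needles = [1, 3, 5]

# The whole list is compared against each item of haystack: no item equals it.
whole = needles in haystack

# It only matches when the container holds an equal list as one of its items.
nested = needles in [needles, 7]

# An element-wise test needs an explicit loop.
elementwise = [needle in haystack for needle in needles]
```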
|
||
Yes, we can redefine this behavior by writing our own `__contains__()` method for the object on the right side. For example: | ||
|
||
```python | ||
>>> class MyList(list): | ||
... def __contains__(self, key): | ||
... # Just an example to let it return the reversed result | ||
... return not super().__contains__(key) | ||
... | ||
>>> 1 in MyList([1,2,3]) | ||
False | ||
>>> 4 in MyList([1,2,3]) | ||
True | ||
``` | ||
|
||
But the problem is that the result of `__contains__()` is forced to be a scalar bool by python. In this sense, we cannot let `x in y` be evaluated as a bool array or even a pipda `Expression` object. | ||
```python | ||
>>> class MyList(list): | ||
... def __contains__(self, key): | ||
... # Just an example | ||
... return [True, False, True] # logically True in python | ||
... | ||
>>> 1 in MyList([1,2,3]) | ||
True | ||
>>> 4 in MyList([1,2,3]) | ||
True | ||
``` | ||
|
||
So instead, we ported `is.element()` from R: | ||
|
||
```python | ||
>>> import numpy as np | ||
>>> from datar.base import is_element | ||
>>> | ||
>>> arr = np.array([1,2,3,4]) | ||
>>> elts = np.array([1,3,5]) | ||
>>> | ||
>>> is_element(elts, arr) | ||
array([ True, True, False]) | ||
``` | ||
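For comparison, plain `numpy` offers `numpy.isin()` for the same element-wise test. The sketch below does not claim that `is_element()` is implemented this way. It only shows an equivalent result on the same inputs:

```python
import numpy as np

arr = np.array([1, 2, 3, 4])
elts = np.array([1, 3, 5])

# One boolean per element of elts, telling whether it occurs in arr.
mask = np.isin(elts, arr)

# The mask can be used directly to keep only the matching elements.
selected = elts[mask]
```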
|
||
Also, as @rleyvasal pointed out in https://github.com/pwwang/datar/issues/31#issuecomment-877499212, if the left element is a pandas `Series`, its `isin()` method can be used: | ||
```python | ||
>>> import pandas as pd | ||
>>> pd.Series(elts).isin(arr) | ||
0 True | ||
1 True | ||
2 False | ||
dtype: bool | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
|
||
R's list is actually a name-value pair container. When there is a need for it, we use python's dict instead, since python's list doesn't support names. | ||
|
||
For example: | ||
```python | ||
>>> names({'a':1}, 'x') | ||
{'x': 1} | ||
``` | ||
|
||
We have `base.c()` to mimic `c()` in R, which concatenates and flattens anything passed into it. Unlike `list()` in python, it accepts multiple arguments, so you can do `c(1,2,3)`, but you cannot do `list(1,2,3)` in python. |
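The concatenate-and-flatten behavior can be sketched in a few lines of pure python. `c_sketch` below is a hypothetical helper written for illustration. It is not the implementation of `base.c()`, and it only flattens lists and tuples:

```python
def c_sketch(*args):
    """Concatenate all arguments, flattening nested lists and tuples."""
    out = []
    for arg in args:
        if isinstance(arg, (list, tuple)):
            # Recurse so that arbitrarily nested sequences are flattened.
            out.extend(c_sketch(*arg))
        else:
            out.append(arg)
    return out


flat = c_sketch(1, 2, 3)
nested = c_sketch(1, [2, [3, 4]], (5,))

# list() takes at most one argument, so this call fails.
try:
    list(1, 2, 3)
    list_accepts_many = True
except TypeError:
    list_accepts_many = False
```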
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
|
||
pandas DataFrame doesn't support nested data frames. However, some R packages do, especially `tidyr`. | ||
|
||
Here we use fake nested data frames: | ||
|
||
```python | ||
>>> df = tibble(x=1, y=tibble(a=2, b=3)) | ||
>>> df | ||
x y$a y$b | ||
<int64> <int64> <int64> | ||
0 1 2 3 | ||
``` | ||
|
||
Now `df` is a fake nested data frame, with an inner data frame as column `y` in `df`. | ||
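Judging from the printed column names `y$a` and `y$b`, the inner frame appears as ordinary columns sharing a prefix. Assuming that layout (how `datar` stores it internally is not confirmed here), the inner frame could be recovered in plain pandas roughly like this:

```python
import pandas as pd

# A flat frame mimicking the printed output above.
df = pd.DataFrame({"x": [1], "y$a": [2], "y$b": [3]})

prefix = "y$"
inner_cols = [col for col in df.columns if col.startswith(prefix)]

# Select the prefixed columns and strip the prefix to get the inner frame.
inner = df[inner_cols].rename(columns=lambda col: col[len(prefix):])
```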
|
||
!!! Warning | ||
|
||
This is fully supported by the APIs from `tidyr` that tidy nested data frames, but pay attention when you operate on them in the pandas way. For other APIs, this feature is still experimental. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
|
||
Unlike some APIs from `tidyverse` packages that use a data frame as the `ptypes` template, here we use dtypes directly, or a dict with name-dtype pairs for the columns. |
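The closest plain pandas analogue of these two forms is `DataFrame.astype()`, which accepts either a single dtype or a dict of name-dtype pairs. This sketch uses pandas only. Which `datar` functions take such an argument is not shown here:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2], "y": [3, 4]})

# A single dtype applies to every column.
same_for_all = df.astype(float)

# A dict with name-dtype pairs sets the dtype per column.
per_column = df.astype({"x": "float64", "y": "int32"})
```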
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
|
||
`datar` introduces a `tibble` package as well. | ||
|
||
However, unlike in R, where `tidyverse`'s `tibble` is a different class from base R's `data.frame`, the data frame created by `datar.tibble.tibble()` and family is actually a pandas `DataFrame`. It's just a wrapper around the constructor. | ||
|
||
So you can do anything with it using the pandas API after creation. |