diff --git a/README.md b/README.md
index 5f6004ff..b4e2c712 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,8 @@
 # datar
-Port of R data packages ([tidyr][1], [dplyr][2], [tibble][4], etc) in python, using [pipda][3].
+Port of R data packages (especially from the tidyverse): [tidyr][1], [dplyr][2], [tibble][4] and so on, in python, using [pipda][3].
+
+Unlike other similar python packages that just mimic the piping sign, `datar` follows the API designs of the original packages as closely as possible, so that users who are familiar with those R packages need almost no extra effort to transition to python.
@@ -12,10 +14,6 @@ Port of R data packages ([tidyr][1], [dplyr][2], [tibble][4], etc) in python, us
 pip install -U datar
 ```
 
-## Philosophy
-- Try to keep APIs with the original ones from those R packages
-- Try not to change python's default behaviors (i.e, 0-based indexing)
-
 ## Example usage
 
 ```python
diff --git a/docs/TODO.md b/docs/TODO.md
new file mode 100644
index 00000000..eae26f46
--- /dev/null
+++ b/docs/TODO.md
@@ -0,0 +1,3 @@
+
+- Add tests for tidyr from original tidyverse/tidyr cases
+- Add more tests for base/core
diff --git a/docs/datasets.md b/docs/datasets.md
new file mode 100644
index 00000000..0d95a1f1
--- /dev/null
+++ b/docs/datasets.md
@@ -0,0 +1,82 @@
+
+Datasets have to be imported individually:
+```python
+from datar.datasets import iris
+
+# or
+from datar import datasets
+
+iris = datasets.iris
+```
+
+To list all available datasets:
+
+```python
+from datar import datasets
+print(datasets.all_datasets())
+
+# {'airquality': {'file': PosixPath('/path/to/datar/datasets/airquality.csv.gz'),
+#  'index': False},
+# 'anscombe': {'file': PosixPath('/path/to/datar/datasets/anscombe.csv.gz'),
+#  'index': False},
+# 'band_instruments': {'file': PosixPath('/path/to/datar/datasets/band_instruments.csv.gz'),
+#  'index': False},
+# 'band_instruments2': {'file': PosixPath('/path/to/datar/datasets/band_instruments2.csv.gz'),
+#  'index': False},
+# 
'band_members': {'file': PosixPath('/path/to/datar/datasets/band_members.csv.gz'), +# 'index': False}, +# 'billboard': {'file': PosixPath('/path/to/datar/datasets/billboard.csv.gz'), +# 'index': False}, +# 'construction': {'file': PosixPath('/path/to/datar/datasets/construction.csv.gz'), +# 'index': False}, +# 'diamonds': {'file': PosixPath('/path/to/datar/datasets/diamonds.csv.gz'), +# 'index': False}, +# 'fish_encounters': {'file': PosixPath('/path/to/datar/datasets/fish_encounters.csv.gz'), +# 'index': False}, +# 'iris': {'file': PosixPath('/path/to/datar/datasets/iris.csv.gz'), +# 'index': False}, +# 'mtcars': {'file': PosixPath('/path/to/datar/datasets/mtcars.indexed.csv.gz'), +# 'index': True}, +# 'population': {'file': PosixPath('/path/to/datar/datasets/population.csv.gz'), +# 'index': False}, +# 'relig_income': {'file': PosixPath('/path/to/datar/datasets/relig_income.csv.gz'), +# 'index': False}, +# 'smiths': {'file': PosixPath('/path/to/datar/datasets/smiths.csv.gz'), +# 'index': False}, +# 'starwars': {'file': PosixPath('/path/to/datar/datasets/starwars.csv.gz'), +# 'index': False}, +# 'state_abb': {'file': PosixPath('/path/to/datar/datasets/state_abb.csv.gz'), +# 'index': False}, +# 'state_division': {'file': PosixPath('/path/to/datar/datasets/state_division.csv.gz'), +# 'index': False}, +# 'state_region': {'file': PosixPath('/path/to/datar/datasets/state_region.csv.gz'), +# 'index': False}, +# 'storms': {'file': PosixPath('/path/to/datar/datasets/storms.csv.gz'), +# 'index': False}, +# 'table1': {'file': PosixPath('/path/to/datar/datasets/table1.csv.gz'), +# 'index': False}, +# 'table2': {'file': PosixPath('/path/to/datar/datasets/table2.csv.gz'), +# 'index': False}, +# 'table3': {'file': PosixPath('/path/to/datar/datasets/table3.csv.gz'), +# 'index': False}, +# 'table4a': {'file': PosixPath('/path/to/datar/datasets/table4a.csv.gz'), +# 'index': False}, +# 'table4b': {'file': PosixPath('/path/to/datar/datasets/table4b.csv.gz'), +# 'index': False}, +# 
'table5': {'file': PosixPath('/path/to/datar/datasets/table5.csv.gz'),
+#  'index': False},
+# 'us_rent_income': {'file': PosixPath('/path/to/datar/datasets/us_rent_income.csv.gz'),
+#  'index': False},
+# 'warpbreaks': {'file': PosixPath('/path/to/datar/datasets/warpbreaks.csv.gz'),
+#  'index': False},
+# 'who': {'file': PosixPath('/path/to/datar/datasets/who.csv.gz'),
+#  'index': False},
+# 'world_bank_pop': {'file': PosixPath('/path/to/datar/datasets/world_bank_pop.csv.gz'),
+#  'index': False}}
+```
+
+`file` shows the path to the dataset's csv file, and `index` shows whether it has an index (row names).
+
+!!! Note
+
+    The column names are altered by replacing `.` with `_`. For example, `Sepal.Width` becomes `Sepal_Width`.
diff --git a/docs/f.md b/docs/f.md
new file mode 100644
index 00000000..56cc8fc7
--- /dev/null
+++ b/docs/f.md
@@ -0,0 +1,39 @@
+## The `Symbolic` object `f`
+
+You can import it by `from datar.core import f`, or `from datar.all import *`.
+
+`f` is a universal `Symbolic` object, which does the magic of connecting the expressions in verb arguments so that their execution can be delayed.
+
+There are different uses of `f`:
+
+- Use as a proxy to refer to data frame columns (i.e. `f.x`, `f['x']`)
+- Use as a slice container. For example:
+    - `f[:3]` for `range(0, 3)`
+    - `f[f.x:f.z]` for columns from `x` to `z`, inclusive. 
If you want to exclude the `stop` column, use `f[f.x:f.z:0]`.
+- Use as the column name marker for `tribble`:
+    ```python
+    tribble(
+        f.x, f.y,
+        1,   2,
+        3,   4
+    )
+    ```
+
+Sometimes, when you have mixed verbs with piping and want to distinguish the proxies for different verbs:
+
+```python
+# you can just alias f under a different name
+g = f
+
+df = tibble(x=1, y=2)
+df >> left_join(df >> group_by(f.x), by=g.y)
+```
+
+Or you can instantiate a new `Symbolic` object:
+```python
+from pipda.symbolic import Symbolic
+
+g = Symbolic()
+
+# technically, f and g make no difference in execution
+```
diff --git a/docs/import.md b/docs/import.md
new file mode 100644
index 00000000..6d67e50f
--- /dev/null
+++ b/docs/import.md
@@ -0,0 +1,46 @@
+## Import submodules, verbs and functions from datar
+
+You can import everything (all verbs and functions) from datar by:
+```python
+from datar.all import *
+```
+
+which is not recommended. Instead, you can import individual verbs or functions:
+```python
+from datar.all import mutate
+```
+
+!!! Attention
+
+    When you use `from datar.all import *`, pay attention to the python builtin names that are shadowed by `datar`. For example, `slice` will be `datar.dplyr.slice` instead of `builtins.slice`. To refer to the builtin one:
+    ```python
+    import builtins
+
+    s = builtins.slice(None, 3, None)  # [:3]
+    ```
+
+Or, if you know which submodule the verb comes from, you can also do:
+```python
+from datar.dplyr import mutate
+```
+
+You can also keep the namespace:
+```python
+from datar import dplyr
+
+# df = tibble(x=1)
+# then use it with the dplyr namespace:
+df >> dplyr.mutate(y=2)
+```
+
+## Import datasets from datar
+
+Note that `from datar.all import *` will not import datasets.
+
+!!! note
+
+    Datasets have to be imported individually, so `from datar.datasets import *` won't work.
+
+You don't have to worry about other datasets being loaded and taking up memory when you import one. 
The dataset is only loaded into memory when you explicitly import it individually.
+
+See also [datasets](../datasets) for details about available datasets.
diff --git a/docs/piping_vs_regular.md b/docs/piping_vs_regular.md
new file mode 100644
index 00000000..87a28264
--- /dev/null
+++ b/docs/piping_vs_regular.md
@@ -0,0 +1,50 @@
+
+A verb can be called in a piping form:
+```python
+df >> verb(...)
+```
+
+Or in a regular way:
+```python
+verb(df, ...)
+```
+
+Piping is recommended, and is specially designed to enable the full features of `datar`.
+
+The regular form of verb calling is limited when an argument calls a function that is registered as requiring the data argument. For example:
+
+```python
+df >> head(n=10)
+head(df, n=10)  # same
+```
+
+However,
+```python
+df >> select(everything())  # works
+select(df, everything())    # does not work
+```
+This is because `everything` is registered as requiring its first argument to be a data frame. With the regular form, we are not able to obtain the data frame (or only with too much effort), while with the piping form, `pipda` passes the piped data to the verb and to every argument of it.
+
+The functions registered by `register_func` are supposed to be used as arguments of verbs. However, they have to be called with the right signature. For example, `everything`'s signature has `_data` as the first argument, so to call it regularly:
+```python
+everything(df)
+# everything() alone won't work: everything of what? 
+```
+
+When functions are registered by `register_func(None, ...)`, which means they do not require the data argument, they can be used in the regular form:
+
+```python
+from datar.core import f
+from datar.base import abs
+from datar.tibble import tibble
+from datar.dplyr import mutate
+
+df = tibble(x=[-1, -2, -3])
+df >> mutate(y=abs(f.x))
+#    x  y
+# 0 -1  1
+# 1 -2  2
+# 2 -3  3
+
+mutate(df, y=abs(f.x))  # works the same way
+```
diff --git a/mkdocs.yml b/mkdocs.yml
index cf40f679..38169e72 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -23,6 +23,10 @@ extra_css:
   - style.css
 nav:
   - 'Home': 'index.md'
+  - 'Import': 'import.md'
+  - 'The f': 'f.md'
+  - 'Piping vs regular calling': 'piping_vs_regular.md'
+  - 'Datasets': 'datasets.md'
   - 'API': 'mkapi/api/datar'
   - 'Examples':
       'across': 'notebooks/across.ipynb'
@@ -76,4 +80,5 @@ nav:
       'uncount': 'notebooks/uncount.ipynb'
       'unite': 'notebooks/unite.ipynb'
       'with_groups': 'notebooks/with_groups.ipynb'
+  - 'TODO': 'TODO.md'
   - 'Change Log': CHANGELOG.md