Commit

dask
luweizheng committed Feb 10, 2024
1 parent 245fb5b commit 59a0ec5
Showing 22 changed files with 1,076 additions and 705 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/deploy.yml
@@ -2,8 +2,8 @@ name: Deploy

on:
push:
branches:
- main
tags:
- '*'

jobs:
build:
14 changes: 7 additions & 7 deletions README.md
@@ -1,15 +1,15 @@
# Distributed Programming with Python

An open-source Python distributed programming book for next-generation artificial intelligence applications.
Open Source, Pythonic, Distributed Programming Book for Next-Generation AI Applications

## Overview
## Introduction

Python has become the leading programming language for data science and artificial intelligence, and data scientists often use Python to complete a wide range of tasks. This book is designed to address Python's inability to parallelize efficiently, with a focus on data science as the application domain.
Python has become the de facto programming language for data science and artificial intelligence. Data scientists often use Python to perform a wide range of tasks. This book focuses on addressing the limitations of Python's parallelism, with a particular emphasis on applications in data science.

## How to Contribute
## Contributing

If you would like to contribute, please refer to the two files below.
If you would like to contribute, please refer to the following two files.

* The [Build Guide](./contribute/info.md) page explains in detail how this book is written, including how to clone the code repository, set up the development environment, and deploy to GitHub Pages.
* The [Build Guide](./contribute/build.md) provides detailed instructions on how this book is written, including cloning the code repository, setting up the development environment, and deploying to GitHub Pages.

* The [Style Guide](./contribute/style.md) page details the file naming conventions, writing style, code conventions, diagramming tools, and more.
* The [Style Guide](./contribute/style.md) provides detailed guidelines on file naming conventions, writing style, code conventions, diagramming tools, and more.
20 changes: 10 additions & 10 deletions _toc.yml
@@ -14,16 +14,16 @@ subtrees:
- file: ch-data-science/data-science-lifecycle
- file: ch-data-science/machine-learning
- file: ch-data-science/python-ecosystem
# - file: ch-dask/index
# entries:
# - file: ch-dask/dask-intro
# - file: ch-dask/dask-dataframe-intro
# - file: ch-dask/dask-distributed
# - file: ch-dask/task-graph-partitioning
# - file: ch-dask-dataframe/index
# entries:
# - file: ch-dask-dataframe/dask-pandas
# - file: ch-dask-dataframe/read-write
- file: ch-dask/index
entries:
- file: ch-dask/dask-intro
- file: ch-dask/dask-dataframe-intro
- file: ch-dask/dask-distributed
- file: ch-dask/task-graph-partitioning
- file: ch-dask-dataframe/index
entries:
- file: ch-dask-dataframe/dask-pandas
- file: ch-dask-dataframe/read-write
# - file: ch-ray-core/index
# entries:
# - file: ch-ray-core/ray-intro
3 changes: 0 additions & 3 deletions ch-dask-dataframe/dask-pandas.md

This file was deleted.

2 changes: 2 additions & 0 deletions ch-dask-dataframe/index.md
@@ -1,4 +1,6 @@
# Dask DataFrame

While pandas has become the standard for DataFrames, it cannot harness the power of multiple cores or distributed clusters. Dask DataFrame aims to address the challenge of parallelizing pandas workloads. Although it strives to offer an API consistent with pandas, Dask DataFrame introduces several differences. This chapter assumes that readers are already familiar with pandas and focuses on the distinctions between Dask DataFrame and pandas.

```{tableofcontents}
```
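To make the pandas consistency described above concrete, here is a minimal illustrative sketch (not part of this commit); it assumes only that pandas and Dask are installed:

```python
import pandas as pd
import dask.dataframe as dd

# The same groupby reads nearly identically in both libraries
pdf = pd.DataFrame({"group": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
print(pdf.groupby("group").value.mean())            # pandas: eager

ddf = dd.from_pandas(pdf, npartitions=2)            # Dask: partitioned and lazy
print(ddf.groupby("group").value.mean().compute())  # compute() materializes
```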
843 changes: 511 additions & 332 deletions ch-dask-dataframe/read-write.ipynb

Large diffs are not rendered by default.

52 changes: 23 additions & 29 deletions ch-dask/dask-dataframe-intro.ipynb
@@ -5,18 +5,18 @@
"metadata": {},
"source": [
"(get-started-dask-dataframe)=\n",
"# Dask DataFrame 快速入门\n",
"# Getting Started with Dask DataFrame\n",
"\n",
"我们先使用 Dask DataFrame 演示一下如何使用 Dask 并行化 pandas DataFrame"
"In this section, we will demonstrate how to parallelize pandas DataFrame using Dask DataFrame."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 创建 Dask DataFrame\n",
"## Creating Dask DataFrame\n",
"\n",
"使用 Dask 内置的方法创建一个名为 `ddf` 的 `DataFrame`,这份数据是随机生成的,每秒钟生成一个数据样本,共计 4 天(从 2024-01-01 0:00 2024-01-05 0:00)。"
"We can generate a Dask DataFrame named `ddf`, which is a time series dataset that is randomly generated. Each data sample represents one second, totaling four days (from 2024-01-01 0:00 to 2024-01-05 0:00)."
]
},
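The built-in generator referred to here is presumably `dask.datasets.timeseries()`; a minimal sketch of such a cell, assuming that API and the dates above:

```python
import dask

# A random time series: one sample per second from 2024-01-01 0:00
# to 2024-01-05 0:00, partitioned by day
ddf = dask.datasets.timeseries(
    start="2024-01-01",
    end="2024-01-05",
    freq="1s",
    partition_freq="1d",
)
ddf  # lazy: only column metadata is displayed at this point
```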
{
@@ -129,14 +129,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"pandas 所有操作是立即(Eager)执行的。Dask 是延迟(Lazy)执行的,数据并没有开始计算,所以都用省略号 ... 表示。"
"All operations in pandas are executed immediately (i.e., Eager Execution). Dask, on the other hand, is executed lazily, and the above data has not been computed, hence represented by ellipsis (...)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"虽然 `ddf` 的数据还没有被计算,但 Dask 已经获取了数据的列名和数据类型,用 `dtypes` 查看列:"
"While the data in the Dask DataFrame (`ddf`) has not been computed yet, Dask has already retrieved the column names and data types. You can view this information using the `dtypes` attribute:"
]
},
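For example (the exact dtypes shown are illustrative and depend on the Dask version and configuration):

```python
ddf.dtypes
# name     object   (string[pyarrow] on newer versions)
# id        int64
# x       float64
# y       float64
# dtype: object
```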
{
@@ -167,9 +167,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## 执行计算\n",
"## Trigger Computation\n",
"\n",
"如果想计算并得到结果,必须使用 `compute()` 手动触发计算。"
"To compute and obtain results, it is necessary to manually trigger the computation using the `compute()` method."
]
},
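A sketch of triggering the computation, assuming the randomly generated `x` column of the time series above:

```python
mean_x = ddf.x.mean()      # still a lazy Dask object; nothing has run yet
result = mean_x.compute()  # triggers the actual computation
print(result)              # a plain float (near 0 for this random data)
```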
{
@@ -329,7 +329,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Dask DataFrame 有一个重要的内置变量 `npartitions`,它表示将数据切分成了多少份,或者说一共有多少个分区(Partition)。如 {numref}`dask-dataframe-partition` 所示,Dask DataFrame 是由多个 pandas DataFrame 组成的,每个 pandas DataFrame 又被称作一个 Partition。"
"Dask DataFrame has a crucial built-in variable called `npartitions`. It signifies the number of divisions or partitions the data has been split into. As illustrated in {numref}`dask-dataframe-partition`, a Dask DataFrame comprises multiple pandas DataFrames, with each pandas DataFrame referred to as a partition."
]
},
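For the four-day, day-partitioned `ddf` above, one would expect:

```python
ddf.npartitions
# 4  (one partition per day)
```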
{
@@ -361,10 +361,9 @@
"width: 400px\n",
"name: dask-dataframe-partition\n",
"---\n",
"Dask DataFrame 是由多个 pandas DataFrame 组成\n",
"A Dask DataFrame comprises multiple pandas DataFrames\n",
"```\n",
"\n",
"每个 Partition 有上界和下界。这个例子中 `ddf` 是根据时间列进行的切分,每天的数据组成一个 Partition。内置变量 `divisions` 存放着每个 Partition 的分界线:"
"Each partition in a Dask DataFrame is defined by upper and lower bounds. In this example, `ddf` is partitioned based on the time column, with each day's data forming a distinct partition. The built-in variable `divisions` holds the boundary lines for each partition:"
]
},
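For example (illustrative output; note that n partitions have n + 1 division boundaries):

```python
ddf.divisions
# (Timestamp('2024-01-01 00:00:00'), Timestamp('2024-01-02 00:00:00'),
#  Timestamp('2024-01-03 00:00:00'), Timestamp('2024-01-04 00:00:00'),
#  Timestamp('2024-01-05 00:00:00'))
```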
{
@@ -399,17 +398,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## 索引\n",
"## Index\n",
"\n",
":::{note}\n",
"\n",
"pandas `DataFrame` 有一列专门存放索引(Index),Index 可以是数字,比如行号;也可以是时间。Index 列通常只用于索引,不作为数据字段,在 `ddf.dtypes` 中看不到 Index 列。\n",
"\n",
"In a pandas DataFrame, there is a dedicated column for storing the index, which can be numeric, such as row numbers, or temporal. The index column is typically used solely for indexing purposes and is not considered a data field; hence, it is not visible in `ddf.dtypes`.\n",
":::\n",
"\n",
"本例中,`ddf` 的 Index 是时间。每个 Partition 基于 Index 列进行切分。整个 `ddf` 是四天的数据,每个 Partition 是一天的数据。\n",
"In this example, the index of `ddf` is temporal, and each partition is based on this index column. The entire `ddf` spans four days of data, with each partition representing a single day.\n",
"\n",
"现在我们选择 2024-01-01 0:00 2024-01-02 5:00 的数据,横跨了两天,横跨了两个 Partition。"
"Now, let's select data from 2024-01-01 0:00 to 2024-01-02 5:00, spanning two days and two partitions."
]
},
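A sketch of such a selection, assuming the time-indexed `ddf` from above:

```python
# Label-based slicing on the time index; the result spans two partitions
subset = ddf.loc["2024-01-01 00:00":"2024-01-02 05:00"]
subset.npartitions
# 2
```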
{
@@ -502,7 +499,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"还是需要使用 `compute()` 来触发计算,得到结果:"
"Use `compute()` to trigger the computation and obtain the results:"
]
},
{
@@ -658,12 +655,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pandas Compatibility\n",
"\n",
"## pandas 兼容\n",
"Most operations of Dask DataFrame and pandas are similar, allowing us to employ Dask DataFrame much like we would with pandas.\n",
"\n",
"Dask DataFrame 的大部分操作与 pandas 几乎一致,我们可以像使用 pandas 那样使用 Dask DataFrame。\n",
"\n",
"比如数据过滤和 `groupby`:"
"For instance, data filtering and groupby operations are conducted in a manner analogous to pandas:"
]
},
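A sketch of such a cell, assuming the `name`, `x`, and `y` columns of the generated time series:

```python
# Filtering and groupby, written exactly as in pandas
ddf2 = ddf[ddf.y > 0]
ddf3 = ddf2.groupby("name").x.std()
ddf3  # still lazy; displayed with ellipses until compute() is called
```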
{
@@ -697,7 +693,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"现在的结果仍然用省略号 ... 表示,因为计算被延迟执行,需要调用 `compute()` 触发执行。"
"The results are still represented by ellipsis (...) because the computation is deferred and requires invoking `compute()` to trigger execution."
]
},
{
@@ -752,11 +748,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## 计算图\n",
"\n",
"至此,我们知道 Dask DataFrame 将大数据切分成了 Partition,并且延迟执行的。Dask 构建了 Task Graph,来分别对每个 Partition 进行了计算。\n",
"## Computational Graph\n",
"\n",
"执行 `compute()` 之前,Dask 构建的是一个计算图 Task Graph,用 `visualize()` 可视化 Task Graph"
"Until now, we understand that Dask DataFrame divides large datasets into partitions and operates with a deferred execution manner. Before executing `compute()`, Dask has built is a computational Task Graph, and you can visualize this Task Graph using `visualize()`:"
]
},
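A sketch of visualizing the graph for the `ddf3` aggregation above (rendering requires the Graphviz system library and its Python bindings):

```python
ddf3.visualize(filename="task-graph", format="svg")  # writes task-graph.svg
```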
{
@@ -1552,7 +1546,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"计算图中,圆圈表示计算,长方形表示数据。对于 Dask DataFrame 来说,长方形就是 pandas `DataFrame`。"
"In the computational Task Graph, circles represent computations, and rectangles represent data. For Dask DataFrame, the rectangles correspond to pandas DataFrame instances."
]
}
],
