parallel programming & data science (#3)
luweizheng authored Aug 9, 2024
1 parent 7e7e1d7 commit 2120df9
Showing 78 changed files with 3,791 additions and 1,531 deletions.
14 changes: 8 additions & 6 deletions _toc.yml
@@ -2,17 +2,19 @@ root: index
subtrees:
- numbered: 2
entries:
- file: ch-intro/index
- file: ch-parallel-computing/index
entries:
- file: ch-intro/computer-architecture
- file: ch-intro/serial-parallel
- file: ch-intro/thread-process
- file: ch-intro/parallel-program-design
- file: ch-intro/performance-metrics
- file: ch-parallel-computing/computer-architecture
- file: ch-parallel-computing/serial-parallel
- file: ch-parallel-computing/thread-process
- file: ch-parallel-computing/parallel-program-design
- file: ch-parallel-computing/performance-metrics
- file: ch-data-science/index
entries:
- file: ch-data-science/data-science-lifecycle
- file: ch-data-science/machine-learning
- file: ch-data-science/deep-learning
- file: ch-data-science/hyperparameter
- file: ch-data-science/python-ecosystem
- file: ch-dask/index
entries:
6 changes: 4 additions & 2 deletions ch-dask-dataframe/read-write.ipynb
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"(dask-dataframe-read-write)=\n",
"(sec-dask-dataframe-read-write)=\n",
"# Reading and Writing Data\n",
"\n",
"Dask DataFrame supports nearly all data reading and writing operations available in pandas. This includes reading and writing text files, Parquet, HDF, JSON, and other formats from local, NFS, HDFS, or S3 storage. {numref}`dask-read-write-operations` illustrates some common reading and writing operations.\n",
@@ -5560,7 +5560,9 @@
"* Embedded schema\n",
"* Data compression\n",
"\n",
"As shown in {numref}`parquet-with-row-group, columnar storage organizes data by columns instead of rows. Data analysts are usually interested in specific columns rather than all of them. Parquet allows filtering of unnecessary columns during data reading, resulting in enhanced performance by reducing the amount of data to be read. Parquet is extensively used in the Apache Spark, Apache Hive, and Apache Flink ecosystems. Parquet inherently includes schema information, embedding metadata such as column names and data types for each column within each Parquet file. This feature eliminates inaccuracies in data types inferring of Dask DataFrame. Data in Parquet is compressed, making it more space-efficient compared to CSV.\n",
"As shown in {numref}`parquet-with-row-group, columnar storage organizes data by columns instead of rows. Data analysts are usually interested in specific columns rather than all of them. Parquet allows filtering of unnecessary columns during data reading. Selecting necessary columns is referred to as **column pruning**, a technique that reduces data processing overhead and is one of the optimization techniques commonly employed in data engineering. Apart from column pruning, **row pruning** is another technique utilized to reduce data processing overhead. Parquet has schema information, and it embeds metadata such as column names and data types for each column within each Parquet file. This feature eliminates inaccuracies in data types inferring of Dask DataFrame. Data in Parquet is compressed, making it more space-efficient compared to CSV.\n",
"\n",
"Parquet is extensively used in the Apache Spark, Apache Hive, and Apache Flink ecosystems.\n",
"\n",
"\n",
"```{figure} ../img/ch-dask-dataframe/parquet.svg\n",
2 changes: 1 addition & 1 deletion ch-dask/dask-distributed.ipynb
@@ -489,7 +489,7 @@
"\n",
"## Dask Nanny\n",
"\n",
"When Dask clusters are launched, in addition to starting the Dask Scheduler and Dask Worker, a monitoring service called Dask Nanny is also initiated. Just like its name, Dask Nanny monitors the CPU and memory usage of Dask Workers to prevent them from exceeding resource limits. If a Dask Worker crashes, Dask Nanny restarts it. If a Dask Worker is restarted by Dask Nanny, the computation tasks on that worker are re-executed. Other Dask Workers wait for this worker to recover to the point just before the crash. However, this can significantly burden other Dask Workers, as other Dask Workers must wait or recompute."
"When Dask clusters are launched, in addition to starting the Dask Scheduler and Dask Worker, a monitoring service called Dask Nanny is also initiated. Just like its name, Dask Nanny monitors the CPU and memory usage of Dask Workers to prevent them from exceeding resource limits. If a Dask Worker crashes, Dask Nanny restarts it. If a Dask Worker is restarted by Dask Nanny, the computation tasks on that worker are re-executed. Other Dask Workers hold on the data and wait for this worker to recover. This can significantly burden other Dask Workers. If Dask workers are frequently restarting, you should consider adjusting the data size of each partition using `rechunk()` or `repartition()`."
]
}
],
6 changes: 3 additions & 3 deletions ch-dask/task-graph-partitioning.ipynb
@@ -531,7 +531,7 @@
"\n",
"If data blocks are too large, Dask Workers are prone to running out of memory (OOM) because an individual Dask Worker cannot handle the large data block. When faced with OOM, Dask spills some data to disk. If the computation still cannot be completed after spilling, the Dask Worker may be restarted, potentially leading to repeated restarts.\n",
"\n",
"### Iterative Algorithms\n",
"## Iterative Algorithms\n",
"\n",
"Iterative algorithms typically use loops, i.e., the current iteration depends on the data from the previous iterations. Dask's Task Graph does not handle iterative algorithms well. Each data dependency adds a directed edge to the Task Graph, which can make the Task Graph very large and cause slow execution speed. For example, many machine learning algorithms and SQL JOIN operations are based on iterative algorithms.\n",
"\n",
@@ -561,7 +561,7 @@
"```\n",
"We should pay attention to the Task Stream column and avoid having a large amount of white space or a significant amount of red. White space indicates that there are no tasks on a Dask Worker, while red indicates substantial data exchange between Dask Workers.\n",
"\n",
"The comparison between {numref}`dask-good-partitions` and {numref}`dask-too-many-partitions` illustrates the point. Both images use the same code ({numref}`dask-read-write` example), but with different data block sizes. In {numref}`dask-too-many-partitions`, where the data blocks are too small, the Task Graph is excessively large, leading to a significant amount of red. This means time is not spent on computation but is instead wasted on tasks like data exchange.\n",
"The comparison between {numref}`dask-good-partitions` and {numref}`dask-too-many-partitions` illustrates the point. Both images use the same code ({numref}`sec-dask-dataframe-read-write` example), but with different data block sizes. In {numref}`dask-too-many-partitions`, where the data blocks are too small, the Task Graph is excessively large, leading to a significant amount of red. This means time is not spent on computation but is instead wasted on tasks like data exchange.\n",
"\n",
"```{figure} ../img/ch-dask/good-partitions.png\n",
"---\n",
@@ -620,7 +620,7 @@
"\n",
"Dask Array and Dask DataFrame both provide ways to set the data block size.\n",
"\n",
"You can specify the size of each data block during initialization, for example: `x = da.ones((10, 10), chunks=(5, 5))`. The `chunks` parameter is used to set the size of each data block. It's also possible to adjust it during program execution, using methods like Dask Array's [`rechunk()`](https://docs.dask.org/en/latest/generated/dask.array.rechunk.html) and Dask DataFrame's [`repartition()`](https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.repartition.html)."
"You can specify the size of each data block during initialization, for example: `x = da.ones((10, 10), chunks=(5, 5))`. The `chunks` parameter is used to set the size of each data block. It's also possible to adjust it during program execution, using methods like Dask Array's [`rechunk()`](https://docs.dask.org/en/latest/generated/dask.array.rechunk.html) and Dask DataFrame's [`repartition()`](https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.repartition.html). In Dask Array, you can use the `rechunk(chunks=...)` method to set the size of data chunks after the data is created, and the `chunks` parameter can be an `int` indicating the number of data chunks to split into, or it can be a tuple like `(5, 10, 20)`, representing the dimensions of a single data chunk. In Dask DataFrame, you can use the `repartition(npartitions=...)` method to set the number of data partitions."
]
},
{
116 changes: 116 additions & 0 deletions ch-data-science/deep-learning.md
@@ -0,0 +1,116 @@
(sec-deep-learning-intro)=
# Deep Learning

## Deep Neural Networks

Deep learning is shorthand for deep neural networks: a neural network is essentially composed of many instances of the following formula, and a deep neural network stacks many such layers.

$$
\begin{aligned}
\boldsymbol{z} &= \boldsymbol{W} \cdot \boldsymbol{x} + \boldsymbol{b} \\
\boldsymbol{a} &= f(\boldsymbol{z})
\end{aligned}
$$

Here, $\boldsymbol{x}$ is the input, $\boldsymbol{W}$ represents the parameters (also known as weights) of the neural network, and $\boldsymbol{b}$ is a bias term. Training a neural network means continuously updating the parameters $\boldsymbol{W}$ and $\boldsymbol{b}$. Once trained, the model can be used for inference, that is, for predicting unknown data.

$f$ represents the activation function. The multiplication of $\boldsymbol{W}$ and $\boldsymbol{x}$ is a linear transformation, and even a composition of several such multiplications remains a linear transformation. In other words, a multi-layer network without activation functions would degenerate into a single-layer linear model. Activation functions introduce non-linearity, allowing multi-layer neural networks to, in theory, approximate any input-output mapping. From a biological perspective, activation functions activate or deactivate certain neurons. Common activation functions include Sigmoid and ReLU. We visualized the Sigmoid function in {numref}`sec-machine-learning-intro`, and the formula for ReLU is $ f(x) = \max (0, x) $.
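
As a quick illustration, both activation functions take only a line or two of NumPy (a standalone sketch, independent of any deep learning framework):

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # keeps positive values and zeroes out negative ones
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z))  # approximately [0.119 0.5   0.953]
print(relu(z))     # [0. 0. 3.]
```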

## Forward Propagation

{numref}`fig-forward-pass` represents the simplest form of a neural network: stacking $\boldsymbol{z^{[n]}} = \boldsymbol{W^{[n]}} \cdot \boldsymbol{a^{[n-1]}} + \boldsymbol{b^{[n]}}$ and $\boldsymbol{a^{[n]}} = f(\boldsymbol{z^{[n]}})$, where the previous layer's output $\boldsymbol{a^{[n-1]}}$ becomes the input of the next layer. This type of network is known as a feedforward neural network (FFN) or a multilayer perceptron (MLP). To distinguish between layers, superscripts in square brackets are used: for example, $\boldsymbol{a^{[1]}}$ is the output of the first layer, and $\boldsymbol{W^{[1]}}$ contains the parameters of the first layer.

```{figure} ../img/ch-data-science/forward-pass.svg
---
width: 800px
name: fig-forward-pass
---
Forward propagation in a neural network
```

{numref}`fig-forward-pass` illustrates the process of forward propagation in a neural network. Assuming the input $\boldsymbol{x}$ is a 3-dimensional vector, each circle in {numref}`fig-forward-pass` represents an element (a scalar value) of the vector. The diagram also demonstrates the vectorized calculation of $\boldsymbol{a^{[1]}}$ in the first layer and the scalar calculation of $z^{[1]}_1$. In practice, modern processors' vectorized engines are often utilized for such computations.
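
A minimal NumPy sketch of this forward pass for a two-layer network; the layer sizes, weights, and input below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(3, 1))             # a 3-dimensional input, as in the figure

W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))   # layer 1: 4 neurons
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))   # layer 2: 1 output neuron

z1 = W1 @ x + b1                        # linear transformation of layer 1
a1 = np.maximum(0.0, z1)                # ReLU activation
z2 = W2 @ a1 + b2                       # layer 2 takes a1 as its input
a2 = 1.0 / (1.0 + np.exp(-z2))          # Sigmoid activation: the network output
```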

## Backpropagation

Training a neural network means iteratively updating the $\boldsymbol{W}$ and $\boldsymbol{b}$ of each layer.

First, initialize the $\boldsymbol{W}$ and $\boldsymbol{b}$ of each layer using some random initialization method, such as initializing them from a normal distribution.

Then, define a loss function $L$. The loss function measures the difference between the predicted values $\hat{y}$ of the neural network and the true values $y$. The goal of training is to minimize the loss function. For example, in a housing price prediction case, the squared error is commonly used as the loss function, where the loss function for a single sample is defined as $L = (y - \hat{y})^2$.
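
For instance, for a small batch of housing-price predictions (the numbers are made up), the loss averaged over the batch is:

```python
import numpy as np

y = np.array([3.0, 2.5, 4.1])        # true prices
y_hat = np.array([2.8, 2.7, 3.9])    # predicted prices
L = np.mean((y - y_hat) ** 2)        # squared error averaged over the batch
print(L)                             # approximately 0.04
```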

Next, calculate the derivatives of the loss function with respect to the parameters of each layer. The derivatives of $L$ with respect to the $\boldsymbol{W^{[l]}}$ and $\boldsymbol{b^{[l]}}$ of the $l$-th layer are denoted as $\frac{\partial L}{\partial \boldsymbol{W^{[l]}}}$ and $\frac{\partial L}{\partial \boldsymbol{b^{[l]}}}$, respectively. The $\boldsymbol{W^{[l]}}$ and $\boldsymbol{b^{[l]}}$ are then updated using the following formulas:

$$
\begin{aligned}
\boldsymbol{W^{[l]}} &= \boldsymbol{W^{[l]}}-\alpha\frac{\partial L}{\partial \boldsymbol{W^{[l]}}} \\
\boldsymbol{b^{[l]}} &= \boldsymbol{b^{[l]}}-\alpha\frac{\partial L}{\partial \boldsymbol{b^{[l]}}}
\end{aligned}
$$

Here, $\alpha$ is the learning rate, which controls the speed of parameter updates. If the learning rate is too large, the algorithm may oscillate and fail to converge. If the learning rate is too small, the convergence speed may be too slow.
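
The update itself is only a couple of lines once the gradients are available. Below is a sketch with made-up parameters and random placeholder gradients standing in for the values produced by backpropagation:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.01  # learning rate

# toy per-layer parameters (W, b) and placeholder gradients (dW, db)
params = [(rng.normal(size=(4, 3)), np.zeros((4, 1))),
          (rng.normal(size=(1, 4)), np.zeros((1, 1)))]
grads = [(rng.normal(size=(4, 3)), rng.normal(size=(4, 1))),
         (rng.normal(size=(1, 4)), rng.normal(size=(1, 1)))]

# gradient descent: move every parameter a small step against its gradient
for (W, b), (dW, db) in zip(params, grads):
    W -= alpha * dW
    b -= alpha * db
```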

The derivatives of each layer are also referred to as gradients. The parameters are updated in the descending direction of the gradients, which is known as gradient descent. The gradient computation starts from the loss function and works backward, calculating the gradients layer by layer using the chain rule. {numref}`fig-back-propagation` illustrates the process of backpropagation in a neural network.

```{figure} ../img/ch-data-science/back-propagation.svg
---
width: 800px
name: fig-back-propagation
---
Backpropagation in a neural network
```

## Hyperparameters

During the neural network training phase, several parameters need to be set before training the model. They cannot be learned automatically through backpropagation and require manual selection and tuning. These parameters are called hyperparameters, and their selection is usually based on experience or trial and error. Here are some examples of hyperparameters:

* Learning rate, which was mentioned earlier as $\alpha$, controls the step size of each parameter update.
* Network architecture: the number of layers in the model, the number of neurons in each layer, the choice of activation functions, and so on. Different network architectures may perform differently on different tasks.
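
In practice, trial and error often amounts to enumerating combinations of such settings and training one model per combination; the candidate values below are arbitrary examples:

```python
import itertools

learning_rates = [0.1, 0.01, 0.001]
hidden_layer_sizes = [(64,), (128, 64), (256, 128, 64)]

for lr, layers in itertools.product(learning_rates, hidden_layer_sizes):
    # each combination would be used to train and evaluate a separate model
    print(f"learning rate = {lr}, hidden layers = {layers}")
```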

## Implementation Details

Neural network training includes the following three steps:

1. Forward pass
2. Backward pass (backpropagation)
3. Update of the model's weights

{numref}`fig-model-training-input-output` illustrates the inputs and outputs of these three steps for the $i$-th layer of a neural network.

```{figure} ../img/ch-data-science/model-training-input-output.svg
---
width: 800px
name: fig-model-training-input-output
---
Forward, Backward, and Model Weight Update: Inputs and Outputs
```

### Forward Pass

- **Input:** The input for the forward pass consists of two parts: the output of layer $ i-1 $, denoted as $ \boldsymbol{a^{[i-1]}} $, and the model weights and biases for layer $ i $, denoted as $ \boldsymbol{W^{[i]}} $ and $ \boldsymbol{b^{[i]}} $, respectively.
- **Output:** The output, also known as the activation $ \boldsymbol{a^{[i]}} $, is produced by applying the weights and biases to the input and passing the result through the activation function.

### Backward Propagation

- **Input:** The input for backward propagation includes three parts: the output of layer $ i $, $ \boldsymbol{a^{[i]}} $; the model weights and biases for layer $ i $, $ \boldsymbol{W^{[i]}} $ and $ \boldsymbol{b^{[i]}} $; and the gradient of the loss with respect to the output of layer $ i $, given as $ \frac{\partial L}{\partial a^{[i]}} $.
- **Output:** By applying the chain rule, the output is the gradient of the loss with respect to the model weights and biases of layer $ i $, expressed as $ \frac{\partial L}{\partial W^{[i]}} $ and $ \frac{\partial L}{\partial b^{[i]}} $.
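
A NumPy sketch of these inputs and outputs for a single layer with a ReLU activation; the shapes, weights, and incoming gradient are made up, and `dA` stands for $ \frac{\partial L}{\partial a^{[i]}} $:

```python
import numpy as np

rng = np.random.default_rng(0)

# forward pass of layer i, keeping intermediate values for the backward pass
a_prev = rng.normal(size=(3, 5))              # output of layer i-1, batch of 5
W, b = rng.normal(size=(4, 3)), np.zeros((4, 1))
z = W @ a_prev + b
a = np.maximum(0.0, z)                        # ReLU activation

# backward pass of layer i, applying the chain rule
dA = rng.normal(size=(4, 5))                  # dL/da[i], received from layer i+1
dZ = dA * (z > 0)                             # gradient through ReLU
dW = dZ @ a_prev.T / a_prev.shape[1]          # dL/dW[i], averaged over the batch
db = dZ.mean(axis=1, keepdims=True)           # dL/db[i]
dA_prev = W.T @ dZ                            # dL/da[i-1], passed back to layer i-1
```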

### Model Weight Update

The input for updating the model weights includes the gradients calculated during backpropagation and the learning rate $ \alpha $, a hyperparameter that determines the step size taken in the direction opposite to the gradient. The simplest update is plain gradient descent: $\boldsymbol{W^{[l]}} = \boldsymbol{W^{[l]}} - \alpha \frac{\partial L}{\partial \boldsymbol{W^{[l]}}}$. More complex optimizers, such as Adam {cite}`kingma2015Adam`, introduce momentum, an exponentially weighted average of past gradients, and therefore keep an additional matrix that maintains this moving average. This matrix is the optimizer state.
Therefore, the optimizer state, the model weights, and the gradients together serve as inputs for producing the updated model weights.
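
As a sketch of why the optimizer state matters, a momentum-style update keeps one extra array per parameter; the loop below uses random placeholder gradients and illustrative coefficients rather than Adam's exact update rule:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 0.01, 0.9             # learning rate and momentum coefficient

W = rng.normal(size=(4, 3))         # model weights
m = np.zeros_like(W)                # optimizer state: moving average of the gradients

for step in range(100):
    dW = rng.normal(size=(4, 3))    # placeholder gradient from backpropagation
    m = beta * m + (1 - beta) * dW  # exponentially weighted average
    W -= alpha * m                  # the update uses the state, not the raw gradient
```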

### Neural Network Training Process

The training process of a neural network is depicted in {numref}`fig-model-training`, which illustrates a 3-layer network: the forward pass is denoted by FWD and the backward pass by BWD.

```{figure} ../img/ch-data-science/model-training.svg
---
width: 600px
name: fig-model-training
---
Forward (denoted by FWD) and Backward (denoted by BWD), and Model Weight Update.
```

## Inference

Model training involves both forward and backward propagation, while model inference only requires forward propagation, where the input is the data that needs to be predicted.