Merge pull request dmlc#307 from pommedeterresautee/master
cleaning Rmarkdown
tqchen committed May 10, 2015
2 parents d3564f3 + 3104f1f commit 6f56e0f
Showing 2 changed files with 31 additions and 28 deletions.
2 changes: 1 addition & 1 deletion R-package/README.md
@@ -17,4 +17,4 @@ install.packages('xgboost')
## Examples

* Please visit the [walk-through example](demo).
* See also the [example scripts](../demo/kaggle-higgs) for the Kaggle Higgs Challenge, including a [speedtest script](../demo/kaggle-higgs/speedtest.R) on this dataset.
* See also the [example scripts](../demo/kaggle-higgs) for the Kaggle Higgs Challenge, including a [speedtest script](../demo/kaggle-higgs/speedtest.R) on this dataset, and the ones for the [Otto challenge](../demo/kaggle-otto), including an [RMarkdown document](../demo/kaggle-otto/understandingXGBoostModel.Rmd).
57 changes: 30 additions & 27 deletions demo/kaggle-otto/understandingXGBoostModel.Rmd
@@ -42,17 +42,17 @@ Let's explore the dataset.
dim(train)
# Training content
train[1:6, 1:5, with =F]
train[1:6,1:5, with =F]
# Test dataset dimensions
dim(test)
# Test content
test[1:6, 1:5, with =F]
test[1:6,1:5, with =F]
```
> We display only the first 6 rows and the first 5 columns for convenience.
Each *column* represents a feature measured by an integer. Each *row* is an **Otto** product.
Each *column* represents a feature measured by an `integer`. Each *row* is an **Otto** product.

Obviously the first column (`ID`) doesn't contain any useful information.
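A minimal sketch of dropping that column with `data.table` syntax is shown below; the column name is an assumption (the raw Kaggle files call it `id`), so check `names(train)` before running it.

```r
# Sketch: remove the identifier column by reference before training.
# "id" is an assumed column name; adjust it to whatever names(train) shows.
train[, id := NULL]
test[, id := NULL]
```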

@@ -75,18 +75,19 @@ train[1:6, ncol(train), with = F]
nameLastCol <- names(train)[ncol(train)]
```

The classes are provided as character strings in the **`r ncol(train)`**th column, called **`r nameLastCol`**. As you may know, **XGBoost** doesn't support anything other than numbers, so we will convert the classes to integers. Moreover, according to the documentation, the class labels should start at 0.
The classes are provided as character strings in the `r ncol(train)`th column, called `r nameLastCol`. As you may know, **XGBoost** doesn't support anything other than numbers, so we will convert the classes to `integer`. Moreover, according to the documentation, the class labels should start at `0`.

For that purpose, we will:

* extract the target column
* remove "Class_" from each class name
* convert to integers
* subtract 1 from the new value
* remove `Class_` from each class name
* convert to `integer`
* subtract `1` from the new value

```{r classToIntegers}
# Convert from classes to numbers
y <- train[, nameLastCol, with = F][[1]] %>% gsub('Class_','',.) %>% {as.integer(.) -1}
# Display the first 5 converted labels
y[1:5]
```
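For readers less familiar with `magrittr` pipes, the same transformation can be written step by step. This is only a sketch of an equivalent form, not part of the original tutorial.

```r
# Sketch: the same conversion without the %>% pipe.
classes <- train[[nameLastCol]]          # extract the target column as a character vector
classes <- gsub("Class_", "", classes)   # strip the "Class_" prefix
y <- as.integer(classes) - 1             # convert to integer and make the labels start at 0
```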
@@ -97,7 +98,7 @@ We remove label column from training dataset, otherwise **XGBoost** would use it
train[, nameLastCol:=NULL, with = F]
```

`data.table` is an awesome implementation of data.frame; unfortunately, it is not a format supported natively by **XGBoost**. We need to convert both datasets (training and test) to numeric Matrix format.
`data.table` is an awesome implementation of data.frame; unfortunately, it is not a format supported natively by **XGBoost**. We need to convert both datasets (training and test) to `numeric` Matrix format.

```{r convertToNumericMatrix}
trainMatrix <- train[,lapply(.SD,as.numeric)] %>% as.matrix
@@ -107,12 +108,11 @@ testMatrix <- test[,lapply(.SD,as.numeric)] %>% as.matrix
Model training
==============

Before the learning, we will use cross-validation to evaluate the error rate.

Basically **XGBoost** will divide the training data into `nfold` parts, then **XGBoost** will retain the first part and use it as the test data. Then it will reintegrate the first part into the training dataset and retain the second part, do a training, and so on...
Before the learning, we will use cross-validation to evaluate our error rate.

Look at the function documentation for more information.
Basically **XGBoost** will divide the training data into `nfold` parts, then **XGBoost** will retain the first part to use it as the test data and perform a training. Then it will reintegrate the first part and retain the second part, do a training, and so on...

You can look at the function documentation for more information.

```{r crossValidation}
numberOfClasses <- max(y) + 1
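# The rest of this chunk is collapsed in the diff above; the lines below are an
# illustrative sketch (an assumption, not the original code) of a typical xgb.cv
# call for this multiclass setup. nfold and nrounds are arbitrary example values.
param <- list("objective" = "multi:softprob",  # output one probability per class
              "eval_metric" = "mlogloss",      # multiclass log-loss
              "num_class" = numberOfClasses)
cv.res <- xgb.cv(params = param, data = trainMatrix, label = y,
                 nfold = 3, nrounds = 5)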
@@ -144,21 +144,21 @@ Feature importance

So far, we have built a model made of **`r nround`** trees.

To build a *tree*, the dataset is divided recursively `max.depth` times. At the end of the process, you get groups of observations (here, each observation describes the properties of an **Otto** product).
To build a tree, the dataset is divided recursively several times. At the end of the process, you get groups of observations (here, each observation describes the properties of an **Otto** product).

Each division operation is called a *split*.

Each group at each division level is called a *branch* and the deepest level is called a *leaf*.
Each group at each division level is called a branch and the deepest level is called a *leaf*.

In the final model, these leaves are supposed to be as pure as possible for each tree, meaning in our case that each leaf should be made of only one class of **Otto** product (of course it is not entirely true, but that's what we try to achieve with a minimum number of splits).
In the final model, these *leaves* are supposed to be as pure as possible for each tree, meaning in our case that each *leaf* should be made of only one class of **Otto** product (of course it is not entirely true, but that's what we try to achieve with a minimum number of splits).

**Not all splits are equally important**. Basically the first split of a tree will have more impact on the purity than, for instance, the deepest split. Intuitively, we understand that the first split does most of the work, and the following splits focus on smaller parts of the dataset which have been misclassified by the first tree.
**Not all *splits* are equally important**. Basically the first *split* of a tree will have more impact on the purity than, for instance, the deepest *split*. Intuitively, we understand that the first *split* does most of the work, and the following *splits* focus on smaller parts of the dataset which have been misclassified by the first *tree*.

In the same way, in Boosting we try to optimize the misclassification at each round (it is called the *loss*). So the first tree will do most of the work and the following trees will focus on the remainder, the parts not correctly learned by the previous trees.
In the same way, in Boosting we try to optimize the misclassification at each round (it is called the *loss*). So the first *tree* will do the bulk of the work and the following trees will focus on the remainder, the parts not correctly learned by the previous *trees*.

The improvement brought by each split can be measured; it is called the *gain*.
The improvement brought by each *split* can be measured; it is called the *gain*.

Each split is done on one feature only, at one specific value.
Each *split* is done on one feature only, at one value.

Let's see what the model looks like.

@@ -170,11 +170,11 @@ model[1:10]
Clearly, it is not easy to understand what it means.

Basically each line represents a branch: there is the tree ID, the feature ID, the point where it splits, and information regarding the next branches (left, right, and the branch taken when the value for this feature is N/A).
Basically each line represents a *branch*: there is the *tree* ID, the feature ID, the point where it *splits*, and information regarding the next *branches* (left, right, and the branch taken when the value for this feature is N/A).
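For reference, a sketch of how such a text dump can be produced is shown below; it assumes `bst` is the booster trained earlier (the chunk that actually builds `model` is collapsed above), so treat it as an illustration rather than the original code.

```r
# Sketch: dump the trained booster to a character vector, one line per node,
# which is the kind of output model[1:10] displays.
model <- xgb.dump(bst)
model[1:10]
```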

Fortunately, **XGBoost** offers a better representation: **feature importance**.

Feature importance is about averaging the gain of each feature over all splits and all trees.
Feature importance is about averaging the *gain* of each feature over all *splits* and all *trees*.

Then we can use the function `xgb.plot.importance`.

@@ -189,18 +189,18 @@ importance_matrix <- xgb.importance(names, model = bst)
xgb.plot.importance(importance_matrix[1:10,])
```

> To make the graph understandable we first extract the column names from the `Matrix`.
> To make it understandable we first extract the column names from the `Matrix`.
Interpretation
--------------

In the feature importance plot above, we can see the 10 most important features.

This function assigns a color to each bar. Basically, K-means clustering is applied to group the features by importance.
This function assigns a color to each bar. These colors represent groups of features. Basically, K-means clustering is applied to group the features by importance.

From here you can take several actions. For instance, you can remove the less important features (feature selection process), or dig deeper into the interaction between the most important features and the labels.
From here you can take several actions. For instance, you can remove the less important feature (feature selection process), or dig deeper into the interaction between the most important features and the labels.

Or you can try to guess why these features are so important (in the **Otto** challenge we can't go this way because there is not enough information).
Or you can just reason about why these features are so important (in the **Otto** challenge we can't go this way because there is not enough information).
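As an illustration of the feature-selection option mentioned above, here is a small sketch that keeps only the 10 most important features. It is not part of the original tutorial and assumes that `importance_matrix` is sorted by decreasing gain, as `xgb.importance` returns it, and that `trainMatrix` kept its column names.

```r
# Sketch: keep the 10 most important features and rebuild a reduced training matrix.
keptFeatures <- importance_matrix[1:10, Feature]   # feature names, most important first
trainMatrixReduced <- trainMatrix[, keptFeatures]
dim(trainMatrixReduced)
```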

Tree graph
----------
@@ -209,20 +209,23 @@ Feature importance gives you feature weight information but not interaction between features.

The **XGBoost R** package has another useful function for that.

Please scroll to the right to see the trees.

```{r treeGraph, dpi=1500, fig.align='left'}
xgb.plot.tree(feature_names = names, model = bst, n_first_tree = 2)
```

We are just displaying the first two trees here.

On simple models, the first two trees may be enough. Here, it might not be the case. We can see from the size of the trees that the interaction between features is complicated.
Besides, **XGBoost** generates `K` trees at each round for a `K`-class classification problem. Therefore, the two trees illustrated here are trying to classify data into different classes.
Besides, **XGBoost** generates `k` trees at each round for a `k`-class classification problem. Therefore, the two trees illustrated here are trying to classify data into different classes.

Going deeper
============

There are 3 documents you may be interested in:
There are 4 documents you may also be interested in:

* [xgboostPresentation.Rmd](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/xgboostPresentation.Rmd): general presentation
* [discoverYourData.Rmd](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/discoverYourData.Rmd): explaining feature analysis
* [Feature Importance Analysis with XGBoost in Tax audit](http://fr.slideshare.net/MichaelBENESTY/feature-importance-analysis-with-xgboost-in-tax-audit): use case
* [The Elements of Statistical Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/): a very good book for gaining a solid understanding of the model
