---
title: "Springboard Foundations of Data Science - Capstone Final Report"
author: "Arun Bharadwaj"
date: "September 7, 2017"
output:
  html_document: default
  pdf_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyr)
library(dplyr)
library(ggplot2)
library(reshape2)
library(corrplot)
library(caTools)
library(rpart)
library(rpart.plot)
library(caret)
library(arules)
library(arulesViz)
library(datasets)
library(e1071)
```

# Dota 2 Capstone 

## Understanding the effect of team selection, hero selection and item purchase on Dota 2 gameplay

# Introduction

The Dota 2 dataset was downloaded from Kaggle. It is being used for the Springboard Foundations of Data Science course capstone project. The dataset has 18 csv files, with the data of 50,000 ranked Dota matches. 

# Approaching the Capstone

The capstone project begins by identifying the problem statement. Next, the most important csv files are identified. Data wrangling is done on these files first, followed by the remaining files. After this, exploratory data analysis and some basic statistical correlations are performed. Finally, regressions and machine learning algorithms are applied to the dataset.

# Problem Statement

The goal of the capstone is to identify potential biases in the gameplay structure of Dota 2. By gameplay structure, we refer to the more than 100 heroes and items that can be picked by players. It is important that any game, board game or computer game, measures the true skill level of a player accurately. It is possible that some specific combinations of pre-game choices affect a player's ability to win. If such a situation were to arise, the game outcome is unlikely to be an accurate estimate of the player's true skill level. 

# Why and for whom is this problem statement important

Valve Corporation is the developer of Dota 2. For Valve Corporation, it is of utmost importance that its games are accurate predictors of player skill level. If players manage to find loopholes or shortcuts that allow them to win, this will have a negative effect on Valve's reputation. Ensuring fairness and accuracy of gameplay and game mechanics is very important for any game developer.

# What specific questions must be answered, with respect to the dataset, to solve this problem statement

To solve this problem statement, we focus on heroes, combinations of heroes in a team and inventory items.

At the start of a game, each player is allowed to choose a hero. Heroes are divided into three types: strength (STR), agility (AGI) and intelligence (INT). As the names suggest, each type of hero has certain advantages. We can try finding if certain types of heroes allow players to win more easily.

Many Dota games are 5 on 5 multiplayer matches. There are two teams: radiant and dire. Players are allowed to choose any hero from the list of 113 heroes. A hero already chosen by another player cannot be chosen again. So, all ten heroes in a 5 on 5 multiplayer game are unique. For a team, there are no restrictions on the types of heroes they can choose, other than the requirement that all heroes in the game must be unique. A team is free to choose any number of strength, agility and intelligence heroes. 

Finally, each player can buy inventory items using the gold they get from killing creeps, destroying buildings and killing opponent heroes. The player and their hero can use this gold to buy items that improve the attack and defense ratings of the hero. Understanding if specific combinations of heroes and items give a player undue advantage can also help solve the problem statement.

# Important CSV files

match_csv, player_ratings_csv and players_csv are the 3 most important csv files. 

match_csv contains high level information about each match including duration, time to first blood, status of tower and barracks of two teams at end of game, geographical region where the game is taking place and the team that won the game (radiant or dire).
```{r 1,echo=FALSE}
match_csv <- read.csv("Data/match.csv")
str(match_csv)
```

player_ratings_csv has information about all the human players who played the 50,000 games. The information includes player account_id, total games they have played, total games in which they were part of the winning team, mean of player skill and standard deviation of player skill. The player skill mean and standard deviation have been calculated using Microsoft Research's TrueSkill algorithm, which is widely used in the gaming world. Valve Corporation is not involved in the way player skill is calculated. So, the player skill mean and standard deviation columns have to be cross-checked for accuracy.
```{r 2,echo=FALSE}
player_ratings_csv <- read.csv("Data/player_ratings.csv")
str(player_ratings_csv)
```

players_csv has specific information about each match, player account_id that participated in the match, hero_id of the hero used by the player, gold earned and spent, xp earned, deaths, kills, assists, stuns and last hits per hero. It also gives detailed information about gold extracted from killing creeps, killing other heroes and destroying buildings.  
```{r 3,echo=FALSE}
players_csv <- read.csv("Data/players.csv")
str(players_csv)
```

# Data Wrangling

## Identify most important csv files
The first part of data wrangling is to identify the most important csv files out of the 18 csv files. The 3 most important files have already been identified as match_csv, player_ratings_csv and players_csv. 

## Eliminate columns that do not add value to analysis
The hist function was run on the game_mode, negative_votes and positive_votes columns of the match_csv file. The results showed that two of these variables each had only a single value. While game_mode had two values, the column is still considered insignificant to our analysis due to the nature of game_mode. Since the three columns do not add value to our analysis, they are removed.

## Check if data points of some important variables make sense 
In match_csv, the mean of first_blood_time returned a value of 93. Having played Dota, it is obvious that 93 minutes is too long for the mean of first blood time. A cursory internet search shows that in a sample tournament, first blood time was 3 minutes. It is therefore reasonable to assume that the unit of first_blood_time is seconds, which puts the mean at roughly 1.5 minutes. Close to 25,000 observations have a first blood time of less than 100 seconds. This seems improbable since, on average, a player takes at least 1.5 minutes to buy some inventory and head to the center of the map. While first_blood_time has a unit of seconds, there may be some errors in the way it is measured.

## Convert incorrect negative values to 0
In multiple columns, there are negative value observations. For instance, some time and account id columns have negative values. This is incorrect and must be modified. Such incorrect negative values are made equal to 0.

### Why not convert incorrect negative values to NA
The majority of incorrect negative values are seen in the time columns of different csv files and the account_id column of the player_ratings_csv file. In the chat_csv file's time column, which is arranged in ascending order of time per match id, the negative values are very low single-digit numbers. It is likely that these actions occurred at the very beginning of the game. Instead of being counted as time 0, the data was incorrectly recorded as low negative values. So, for the sake of simplicity, all negative time values are converted to 0. For account_id, we know that players can choose to not reveal their id and are counted as anonymous. These anonymous players are counted as 0 in the account_id column. Again, for the sake of simplicity, negative account_id values are also considered anonymous and converted to 0. 
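
The conversion described above can be sketched in base R as follows. This is a minimal illustration rather than the project's actual wrangling code: the toy data frame, its values and the helper name `clamp_negatives` are all assumptions made for the example.

```r
# Toy stand-in for a table with a time column and an account_id column
# (all values are invented for illustration).
chat_toy <- data.frame(
  match_id   = c(0, 0, 0, 1),
  time       = c(-3, -1, 12, 40),
  account_id = c(-1, 0, 57, 91)
)

# Hypothetical helper: set negative entries of the named columns to 0.
clamp_negatives <- function(df, cols) {
  for (col in cols) {
    df[[col]] <- pmax(df[[col]], 0)
  }
  df
}

chat_clean <- clamp_negatives(chat_toy, c("time", "account_id"))
chat_clean
```

Using `pmax()` leaves non-negative values and any existing NA values untouched, so only the incorrectly recorded negatives change.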

## Nullify columns with very high percentage of NA values
For the sake of simplicity, columns with a very high percentage of NA values are dropped (set to NULL). These columns are considered not significant for the purpose of our analysis. This is done with the help of the summarise_at() function of dplyr.
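
As a rough sketch of this step, the share of NA values per column can be computed and used to drop columns. The example below uses base R instead of summarise_at(), and the 50% cutoff is an assumption, since the report does not state the threshold that was used.

```r
# Toy data: column b is mostly NA (values invented for illustration).
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(NA, NA, NA, 7),
  c = c("x", NA, "y", "z"),
  stringsAsFactors = FALSE
)

# Share of NA values in each column.
na_share <- colMeans(is.na(toy))

# Assumed cutoff; the report does not state the threshold it used.
na_cutoff <- 0.5

# Keep only the columns at or below the cutoff.
toy_reduced <- toy[, na_share <= na_cutoff, drop = FALSE]
names(toy_reduced)
```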

# Feature Engineering

## Check if some small csv files can be combined with larger csv files to make it easier to visualize data
In match_csv, a variable called cluster is stored as a number. This variable shows the geographical region where the match is taking place. A cluster_csv file holds key-value pairs, where the key is the numeric cluster variable and the value is the name of the geographical region. left_join is used to merge the match_csv and cluster_csv files to help with data visualization. Multiple such operations are performed to combine small csv files with larger ones.
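
A minimal sketch of such a merge with dplyr is shown below. The toy tables, the cluster numbers and the `region` column name are invented for illustration and may not match the columns in the real cluster file.

```r
library(dplyr)

# Toy stand-ins for match_csv and the cluster lookup table
# (cluster numbers and region names are invented).
match_toy <- data.frame(match_id = c(0, 1, 2), cluster = c(155, 151, 999))
cluster_toy <- data.frame(
  cluster = c(151, 155),
  region  = c("REGION A", "REGION B"),
  stringsAsFactors = FALSE
)

# left_join keeps every match; a cluster missing from the lookup gets NA.
match_with_region <- left_join(match_toy, cluster_toy, by = "cluster")
match_with_region
```

A left join is a reasonable choice here because no match rows are lost even if a cluster value has no entry in the lookup table.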

## Classify hero_id column in players_csv to STR, AGI and INT 
For the ease of classification, analysis and predictive modelling, heroes are classified into STR (strength), AGI (agility) and INT (intelligence). The 113 heroes have already been classified by a former Springboard student, Louis Montague, who has posted this specific csv file on his github profile.

https://github.com/DMzMin/LMS_Dota2/blob/master/hero_names_mod.csv

## Remove rows in players_csv and player_ratings_csv where account_id = 0
Rows with account_id = 0 are rows where the player has chosen to remain anonymous and not disclose their identity. With account_id being a categorical variable, having too many rows with an id of 0 tends to distort our analysis, especially when combining csv files using account_id. So, while the players_csv file is kept as it is, a new players_csv1 is created after removing rows with account_id equal to 0.

## Replace 0-4 and 128-132 in the player_slot column of players_csv1 with Radiant and Dire
A cursory internet search shows that the player_slot column in players_csv1 encodes the two Dota teams: radiant and dire. While 0-4 is team radiant, 128-132 is team dire. These numbers are replaced with team names in the player_slot column of players_csv1.
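
The recoding can be sketched as below. The vector of slot values is a toy example, and values outside the two documented ranges are mapped to NA as an assumption.

```r
# Toy player_slot values covering both documented ranges.
player_slot <- c(0, 1, 4, 128, 130, 132)

# Slots 0-4 are Radiant and 128-132 are Dire, as described above.
# Anything else is treated as NA (an assumption for this sketch).
team <- ifelse(player_slot >= 0 & player_slot <= 4, "Radiant",
               ifelse(player_slot >= 128 & player_slot <= 132, "Dire", NA))
team
```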

## Convert observations in columns item_0 to item_5 in players_csv3 from item IDs to item classes to reduce the number of factors
The item_0 to item_5 columns contain the IDs of the items bought by players for their respective heroes. Heroes have six slots on their dashboard to keep items. The 0 to 5 stands for the six slots. While the original dataset populates the item_0 to item_5 columns with item IDs, we classify items by item class instead. The items are converted to STR (strength), AGI (agility), INT (intelligence), MOB (mobility), MISC (miscellaneous) etc. Also, some items combine multiple classes. These items are simply classified as 2 Class (an item that combines 2 classes), 3 Class and so on up to 5 Class. This type of classification reduces the number of factors to manageable limits.

item_class_1 is a newly created csv file, which duplicates item_ids_csv and adds an item_class column. The item_class column is populated with the item classes mentioned earlier. This is done using information gathered from the internet.
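
One way to apply such a lookup to the item slot columns is sketched below. The item ids, classes and column names in the toy lookup are invented, and only two of the six slots are shown.

```r
# Toy lookup in the spirit of item_class_1 (ids and classes are invented).
item_class_toy <- data.frame(
  item_id    = c(1, 2, 3),
  item_class = c("MOB", "STR", "2 Class"),
  stringsAsFactors = FALSE
)

# Toy player rows with two item slots holding item ids.
players_toy <- data.frame(item_0 = c(3, 1, 3), item_1 = c(2, 2, 1))

# Replace each item id with its class by looking it up on item_id.
for (slot in c("item_0", "item_1")) {
  idx <- match(players_toy[[slot]], item_class_toy$item_id)
  players_toy[[slot]] <- item_class_toy$item_class[idx]
}
players_toy
```

After the replacement, each slot column has at most as many distinct values as there are item classes, which is what keeps the number of factor levels manageable.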

## Create player_ratings_csv2, which only includes account_ids that have played 5 games or more
Human players that have played fewer than 5 games of Dota are poor predictors of skill and gameplay fairness. player_ratings_csv2 is created to contain only players who have played 5 or more games.
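
A minimal sketch of this filter is given below, assuming the games played are stored in a total_matches column. The toy rows are invented.

```r
# Toy ratings table (values invented for illustration).
ratings_toy <- data.frame(
  account_id    = c(11, 12, 13, 14),
  total_matches = c(1, 5, 9, 4),
  total_wins    = c(1, 2, 6, 0)
)

# Keep players with at least 5 games played.
min_games <- 5
ratings_5plus <- subset(ratings_toy, total_matches >= min_games)
ratings_5plus$account_id
```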

# Exploratory Data Analysis

## Hero Stats
players_csv has detailed information about the player, the hero they chose and their in-game performance. This information is grouped by hero class (STR/INT/AGI). Further, all stats are normalized such that, for a particular stat, the STR, AGI and INT values sum to one. This operation allows us to view all stats on a single heatmap. The heatmap is shown below.
```{r 4,echo=FALSE}
Scaled_Stats_by_hero_type_m <- read.csv("Data/Scaled_Stats_by_hero_type_m.csv")

ggplot(Scaled_Stats_by_hero_type_m, aes(x=Var1,y=Var2,fill=value)) + geom_tile() + labs(x="Type of Hero",y="Stats") + scale_fill_gradient(low="white",high="red")

```

From the heatmap, it can be seen that, on an absolute basis, INT heroes heal the most, AGI heroes cause maximum tower damage, STR heroes cause highest stuns and AGI heroes have highest last hits.
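
The normalization behind the heatmap can be sketched as follows. The per-class numbers are invented, and the real scaled file may have been produced differently.

```r
# Toy per-class values of two stats (numbers invented for illustration).
stats_toy <- rbind(
  hero_healing = c(STR = 200, AGI = 100, INT = 700),
  tower_damage = c(STR = 300, AGI = 500, INT = 200)
)

# Divide each row by its total so STR + AGI + INT = 1 for every stat.
stats_scaled <- stats_toy / rowSums(stats_toy)
stats_scaled
rowSums(stats_scaled)
```

Because every stat is rescaled to sum to one, stats measured on very different scales can share a single color gradient.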

## Percentage of Wins
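While total_wins is available for each account_id, simply looking at total_wins is inadequate, since it depends on how many games a player has played. It is more informative to look at percentage_wins, the ratio of total_wins to total_matches. This is possible since we know both total_wins and total_matches.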
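
The metric can be sketched as below, with invented rows. It is stored as a fraction between 0 and 1, so a value of 0.5 means half of the games were won.

```r
# Toy ratings rows (values invented for illustration).
ratings_toy <- data.frame(
  account_id    = c(12, 13),
  total_matches = c(5, 8),
  total_wins    = c(2, 6)
)

# Fraction of games won by each player.
ratings_toy$percentage_wins <- ratings_toy$total_wins / ratings_toy$total_matches
ratings_toy
```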

Using this metric, we plot a histogram of the percentage of wins. The most frequent value of percentage_wins is 0.5, showing that the largest group of players wins half of its games.

```{r 5,echo=FALSE}

player_ratings_csv2 <- read.csv("Data/player_ratings_csv2.csv")

# Histogram line plot of percentage_wins statistic by player account_id. Most players win half the games they play.

ggplot(player_ratings_csv2,aes(percentage_wins)) +geom_histogram(aes(y=..density..)) + geom_density(aes(y=..density..))

```

Moreover, an even larger proportion of players win between 40-60% of the games they play.

## Correlation between trueskill_mu and percentage_wins
trueskill_mu is the mean player skill estimated by the Microsoft Research TrueSkill algorithm. To check its accuracy, we compare it with percentage_wins. The graph shows that a higher skill level goes with a higher win percentage. Since player skill is hard to measure, the algorithm also reports a standard deviation of player ability, trueskill_sigma. The colors of the dots below show the different standard deviations. 

```{r 6,echo=FALSE}


# Scatter plot of trueskill_mu and percentage_wins in player_ratings_csv2.
# trueskill_sigma is assigned to the color attribute. trueskill_mu indicates the
# player's ability, with a higher value implying a better player, and
# trueskill_sigma is the uncertainty in the trueskill_mu measure.
# As expected, skill and win percentage are correlated.

ggplot(player_ratings_csv2,aes(x=trueskill_mu,y=percentage_wins,col=trueskill_sigma)) +geom_point() + geom_jitter(shape=1)
```

While there is a definite correlation between trueskill_mu and percentage_wins, trueskill_sigma does not have a definite pattern, except for the dark blue shades in the center of the plot and in the top right of the plot. 

## Boxplot of item types and hero levels at end of game

The plot below, drawn between item_0 and hero level, shows the Dam (damage) item type to have the highest median level, followed by Disable. Int item types have the lowest median level.


```{r 7,echo=FALSE}
players_csv2 <- read.csv("Data/players_csv2.csv")
function_boxplot_level_itemclass <- function(a){ggplot(players_csv2,aes(x=a,y=level)) + geom_boxplot() + theme(axis.text.x=element_text(angle=-90)) + labs(x="Item types bought by Hero",y="Hero level at game end")}

function_boxplot_level_itemclass(players_csv2$item_0)

```

Similar plots were made for item_1 to item_5 versus hero level. All plots had very similar final outputs.

Based on this, one reading is that Damage and Disable class items lead to the highest median hero level while Int item types lead to the lowest. The reverse could also be true: Damage and Disable items may be late-game items that can only be bought by heroes that have substantially levelled up. The next few sections try to infer whether higher hero levels are caused by specific items or whether heroes are able to buy these specific items only after reaching those higher levels.

## Lineplot of the percentage of items acquired by each hero type for each item slot location

Below line plot shows percentage of item classes acquired by each hero type. For instance, more than 35% of AGI heroes acquire 2 Class items in the item_0 inventory slot. The trend for different item slots is similar, as seen from the shape of the different line graphs. 

```{r 8,echo=FALSE}
item_and_hero_m <- read.csv("Data/item_and_hero_m.csv")

ggplot(item_and_hero_m,aes(x=new,y=value,colour=variable,group=1)) + geom_point() + geom_line(aes(group=variable)) + theme(axis.text.x=element_text(angle=-90)) + labs(x="Item class of the respective hero type",y="Percentage of observations") 

```

From the line plot, it can be seen that 2 Class items are most preferred by all hero types. STR heroes have a high preference for Mob (mobility) items. This is plausible since STR heroes tend to be slow moving.


## Does high median hero level lead to purchase of damage/disable items or does purchase of damage/disable items lead to high median hero level?

The above line graph was obtained after excluding item classes that constituted less than 5% of total observations per hero. Since this filter excluded damage and disable items, heroes bought standalone damage or disable items less than 5% of the time. If the strategy of buying solely damage or disable items were successful, it is likely that more players would have chosen to buy them. The graph shows that only very few players did. So, it is concluded that mostly heroes that have already attained high levels buy damage or disable items, and buying these items is unlikely to be the reason for a high hero level.
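
The 5% filter mentioned above can be sketched as follows. The counts are invented and stand in for the item classes seen in one slot for one hero type.

```r
# Toy counts of item classes in one slot for one hero type (invented).
item_counts <- c("2 Class" = 400, "Mob" = 250, "Misc" = 310,
                 "Dam" = 25, "Disable" = 15)

# Share of observations per item class.
item_share <- item_counts / sum(item_counts)

# Keep only classes that make up at least 5% of observations.
kept <- item_share[item_share >= 0.05]
names(kept)
```

In this toy case the Dam and Disable classes fall below the cutoff, which mirrors why they are absent from the line plot.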

## Identify heroes that win most games

On an absolute basis, INT heroes win the maximum number of games, followed by STR and then AGI heroes. This is true for both team types.

```{r 9,echo=FALSE}
players_csv3_winning_heroes_radiant <- read.csv("Data/players_csv3_winning_heroes_radiant.csv")

players_csv3_winning_heroes_dire <- read.csv("Data/players_csv3_winning_heroes_dire.csv")

ggplot(na.omit(players_csv3_winning_heroes_radiant),aes(x=winning_heroes_radiant)) + geom_histogram(stat="count")

ggplot(na.omit(players_csv3_winning_heroes_dire),aes(x=winning_heroes_dire)) + geom_histogram(stat="count")
```


INT heroes have won maximum number of games followed by STR heroes. This trend is the same for both radiant and dire teams.

## Identify heroes that have highest probability of winning

STR heroes have highest probability of winning, followed by AGI and then INT. The difference in winning probabilities between the three heroes is negligible. 

```{r 10,echo=FALSE}
Prob_Hero_Winning <- read.csv("Data/Prob_Hero_Winning.csv")

barplot(Prob_Hero_Winning$x,main="Probability of winning for each hero type",names.arg = c("INT","AGI","STR"))

```


## Team hero combinations and winnability

```{r 11,echo=FALSE}
Team_Hero_Winning <- read.csv("Data/Team_Hero_Winning.csv")

barplot(Team_Hero_Winning$x,main="Total team wins where team had >=3 heroes of same type",names.arg = c("STR","AGI","INT"))

```


Teams that have a combination of >= 3 INT heroes win more games, followed by teams that have a combination of >= 3 STR heroes. It is inadequate to look only at the absolute number of games won by teams with multiple heroes of the same type, since the higher win count for teams with >= 3 INT heroes may simply be due to many more teams choosing >= 3 INT heroes. So, the probability of a team winning when it has >= 3 INT, >= 3 STR or >= 3 AGI heroes is calculated.
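
The difference between a win count and a win probability can be illustrated with a small sketch. The rows below are invented, with one row per team per match, and the column names are assumptions.

```r
# Toy data: did the team field >= 3 INT heroes, and did it win? (invented)
teams_toy <- data.frame(
  has_3_int = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE),
  won       = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)
)

# Absolute number of wins by teams with >= 3 INT heroes.
wins_3_int <- sum(teams_toy$won & teams_toy$has_3_int)

# Win probability: wins divided by the number of teams that made this pick.
prob_3_int <- mean(teams_toy$won[teams_toy$has_3_int])

wins_3_int
prob_3_int
```

A popular pick can lead the win count while having an unremarkable win probability, which is why both are reported.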

```{r 11.1, echo=FALSE}
Prob_Hero_Winning_Greaterthan3 <- read.csv("Data/Prob_Hero_Winning_greaterthan3.csv")

barplot(Prob_Hero_Winning_Greaterthan3$x,main="Prob of winning where team had >=3 heroes of same type",names.arg = c("STR","AGI","INT"))

```

Above plot shows that teams which pick >= 3 STR heroes have a higher probability of winning, followed by teams that pick >= 3 AGI heroes. This topic is further explored in the logistic regression area of the project.

## Association rules between hero type and item purchase

### Item Frequency Plot

The plot below shows that 2 Class items are the most commonly purchased items. This is followed by miscellaneous, mobility and health. That 2 Class items are bought most often matches the earlier line plot, which showed that all three hero types have a very high preference for 2 Class items.

```{r 12,echo=FALSE}

association_data1 <- read.csv("Data/association_data1.csv")

association_data1$Class <- NULL

    # Convert characters to factors
association_data1 <- sapply(association_data1,as.factor)

    # Convert to itemMatrix after removing Class column 
    # since we do not want Class column in frequency plot
association_data1 <- as.data.frame(association_data1)
association_data1 <- as(association_data1,"transactions")

    # Plot item frequency
itemFrequencyPlot(association_data1,topN=15,type="absolute")

```


### General rule output

Association rules are most commonly used by the retail industry. They show the probability of a person buying item X, if they previously bought item Y in the same trip. For the capstone project, association rules are used to analyze the items purchased by STR, AGI and INT heroes.
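
To make the two measures used in the rule output concrete, the sketch below computes support and confidence by hand for a single rule in base R. The toy rows are invented and the rule is chosen only for illustration.

```r
# Toy rows: hero class plus the item class in one slot (invented).
tx <- data.frame(
  Class  = c("AGI", "AGI", "AGI", "STR", "STR", "INT"),
  item_0 = c("2 Class", "2 Class", "Mob", "Mob", "2 Class", "Misc"),
  stringsAsFactors = FALSE
)

# Rule: {item_0 = 2 Class} => {Class = AGI}
lhs <- tx$item_0 == "2 Class"
rhs <- tx$Class == "AGI"

# Support: share of all rows that match both sides of the rule.
rule_support <- mean(lhs & rhs)

# Confidence: share of rows matching the left side that also match the right.
rule_confidence <- sum(lhs & rhs) / sum(lhs)

rule_support
rule_confidence
```

The supp and confidence parameters passed to apriori() act as lower bounds on these two quantities.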

```{r 13,echo=FALSE}
# New itemMatrix which includes Class column since for 
    # rules, we want to include Class
association_data2 <- players_csv2[,-c(1:44)]
association_data2 <- as.data.frame(sapply(association_data2,as.factor))
association_data2 <- as(association_data2,"transactions")

  # Generate rules
rules <- apriori(association_data2,parameter = list(supp=0.001,confidence=0.8))

    # Sort and inspect rules
options(digits=2)
rules <- sort(rules,by="confidence",decreasing=TRUE)
inspect(rules[1:10])

```

Rule output shows that AGI heroes have highest rule confidence. INT and STR heroes do not even feature in the top 10 rules. 

### STR specific rule output

```{r 14, echo=FALSE}
    # Find rules that have Class=STR on right side. But use lower 
    # confidence values since there are no high confidence rules with STR
rules1 <- apriori(association_data2,parameter=list(supp=0.001,confidence=0.4),appearance=list(default="lhs",rhs="Class=STR"),control=list(verbose=F))
rules1 <- sort(rules1,by="confidence",decreasing=TRUE)
inspect(rules1[1:10])

```

STR hero rules have highest rule confidence of only 0.75.

# Machine Learning

Linear regression, logistic regression and a decision tree are used in the machine learning part of the capstone project. Linear regression is used to predict numeric values of trueskill_mu. Logistic regression is used to predict the probability of the radiant or dire team winning. Finally, a decision tree (CART) is used to perform the same prediction as the logistic regression.  

## Linear Regression (LR)

### Correlated Variables

Before performing the linear regression, a correlation plot was generated.

```{r 15,echo=FALSE}
Combined_LR_1 <- read.csv("Data/Combined_LR_1.csv")

Combined_LR_1_cor <- cor(Combined_LR_1[,-(23:27)])

corrplot(Combined_LR_1_cor,method="color",insig="blank",addCoef.col = "grey",order="AOE",cl.cex=0.5,tl.cex = 0.5,addCoefasPercent = TRUE,number.cex = 0.5)
```

### First LR model

trueskill_mu was predicted after removing all correlated variables. Following summary output was obtained.

```{r 16,echo=FALSE}
Combined_LR_2 <- read.csv("Data/Combined_LR_2.csv")
LR_model_1 <- lm(trueskill_mu ~ trueskill_sigma + gold + kills + deaths + assists + denies + last_hits + hero_healing + tower_damage + level + xp_other + gold_destroying_structure + player_slot_Radiant + player_slot_Dire + Class_STR + Class_AGI + Class_INT,data=Combined_LR_2)
summary(LR_model_1)
```

The output shows a very low R^2 value for the model. Also, the player's team choice and hero choice are not statistically significant. 

### Second LR model 

All variables that were statistically not significant in the first LR model were removed and the second model was built. Following output was obtained. 

```{r 17,echo=FALSE}
LR_model_2 <- lm(trueskill_mu ~ trueskill_sigma + gold + deaths + assists + last_hits + hero_healing  + xp_other,data=Combined_LR_2)
summary(LR_model_2)
```

Now we have obtained a model where most variables are significant. 

```{r 18, echo=FALSE}
plot(LR_model_2)
```

The Residuals versus Fitted plot shows a parabolic curve, so the residuals are not independent of the fitted values. The Normal Q-Q plot follows a straight line, indicating that the residuals are approximately normally distributed. The Scale-Location plot (square root of standardized residuals versus fitted values) also shows some dependence on the fitted values.

It can be concluded that the linear regression is neither a great nor a bad model. It is a mediocre model with some flaws.

## Logistic Regression (LogR)

Logistic regression is used to predict the probability of radiant or dire team winning. To do so, the data is grouped by match_id and player_slot and all the columns are summarised by taking the sum of all column observations.

### Correlated Variables

Before performing logistic regression, a correlation plot was generated.

```{r 19,echo=FALSE}
Combined_LogR_2 <- read.csv("Data/Combined_LogR_2.csv")

    # Find variables that are correlated
Combined_LogR_2_cor <- cor(Combined_LogR_2)

        # Below code was taken from https://codedump.io/share/BSdmR40dKSWs/1/how-to-change-font-size-of-the-correlation-coefficient-in-corrplot 
        # since font size of corrplot without these changes was too big and not clear 

corrplot(Combined_LogR_2_cor,method="color",insig="blank",addCoef.col = "grey",order="AOE",cl.cex=0.5,tl.cex = 0.5,addCoefasPercent = TRUE,number.cex = 0.5)

```

### First LogR model

radiant_win was predicted after removing all correlated variables. Following output was obtained.

```{r 20,echo=FALSE}
LogR_Train <- read.csv("Data/LogR_Train.csv")
LogR_model <- glm(radiant_win ~ gold + kills + deaths + denies + last_hits + stuns + hero_healing + tower_damage + xp_other + gold_other + trueskill_mu + trueskill_var + Class_STR + Class_AGI + Class_INT + duration + first_blood_time + cluster,LogR_Train,family=binomial)
summary(LogR_model)
```

Output again shows that hero choice is not statistically significant. Even player skill is not significant in determining probability of radiant or dire winning according to the logistic regression. 

### Second LogR model

All variables that were not statistically significant in the first logistic regression model were removed and the second model was built.

```{r 21, echo=FALSE}
LogR_model_1 <- glm(radiant_win ~ gold + kills + deaths + last_hits + hero_healing + tower_damage + duration,LogR_Train,family=binomial)
summary(LogR_model_1)
```

The AIC is very high and almost unchanged from the first model, which shows that this model is also not very useful for drawing larger conclusions.

#### Accuracy of second LogR model

```{r 22,echo=FALSE}
pred_LogR_model_Train <- predict(object=LogR_model_1,type="response")
tapply(pred_LogR_model_Train,LogR_Train$radiant_win,mean)
table(LogR_Train$radiant_win,pred_LogR_model_Train>=0.5)
```

Model accuracy using the training set equals 53.5%.
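
Accuracy is read off a confusion table like the one above as the share of observations on its diagonal. The sketch below shows the calculation on invented outcomes and predicted probabilities, using the same 0.5 threshold.

```r
# Toy actual outcomes and predicted probabilities (invented).
actual    <- c(1, 1, 0, 0, 1, 0, 1, 0)
predicted <- c(0.9, 0.4, 0.2, 0.7, 0.6, 0.1, 0.8, 0.3)

# Confusion table at a 0.5 threshold: rows are actual, columns are predicted.
conf_mat <- table(actual, predicted >= 0.5)

# Accuracy: correctly classified observations over all observations.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

conf_mat
accuracy
```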

### Third LogR model

A third logistic regression model uses radiant_win as the dependent variable and >= 3 STR (radiant), >= 3 STR (dire), >= 3 AGI (radiant), >= 3 AGI (dire), >= 3 INT (radiant), >= 3 INT (dire) as binary independent variables. The dataset covers all 50,000 matches, where radiant_win is 1 if radiant wins and 0 if dire wins. If the radiant team has used >= 3 STR heroes, the n_R_STR variable is assigned 1, else 0. The other 5 variables are defined similarly.
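
The construction of these indicator variables can be sketched as below. The hero picks are invented, the helper name `flag_3plus` is hypothetical, and the real dataset may have been built with different steps.

```r
# Toy data: one row per hero pick with match, team and hero class (invented).
picks <- data.frame(
  match_id = rep(c(0, 1), each = 10),
  team     = rep(rep(c("Radiant", "Dire"), each = 5), times = 2),
  Class    = c("STR", "STR", "STR", "AGI", "INT",   # match 0, Radiant
               "INT", "INT", "AGI", "AGI", "STR",   # match 0, Dire
               "AGI", "AGI", "AGI", "AGI", "INT",   # match 1, Radiant
               "INT", "INT", "INT", "STR", "STR"),  # match 1, Dire
  stringsAsFactors = FALSE
)

# Hypothetical helper: 1 if the team has >= 3 heroes of the class in a match.
flag_3plus <- function(df, team_name, class_name) {
  sel <- df$team == team_name & df$Class == class_name
  n <- tapply(sel, df$match_id, sum)  # heroes of that class per match
  as.integer(n >= 3)
}

n_R_STR <- flag_3plus(picks, "Radiant", "STR")
n_R_AGI <- flag_3plus(picks, "Radiant", "AGI")
n_D_INT <- flag_3plus(picks, "Dire", "INT")
```

Each resulting vector has one entry per match, in order of match_id.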

```{r 22.1,echo=FALSE}

Combined_LogR_Greaterequal3_4 <- read.csv("Data/Combined_LogR_Greaterequal3_4.csv")

LogR_Greaterequal3_model <- glm(radiant_win~n_R_STR+n_R_AGI+n_R_INT+n_D_STR+n_D_AGI+n_D_INT,data=Combined_LogR_Greaterequal3_4,family = binomial)

summary(LogR_Greaterequal3_model)

```

The logistic regression shows that all 6 variables are statistically significant. The positive coefficient of n_R_STR shows that a radiant team having >= 3 STR heroes increases its chances of winning. n_D_STR has a negative coefficient, implying that when dire teams choose >= 3 STR heroes, the chances of the radiant team winning decrease. 
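
The sign of a logistic regression coefficient can be read more directly after exponentiating it, which gives an odds ratio. The sketch below fits a one-variable model to invented data, so the numbers say nothing about the real dataset.

```r
# Toy data: radiant win flag and the >= 3 STR indicator (invented).
toy <- data.frame(
  radiant_win = c(1, 1, 1, 0, 1, 0, 0, 0, 1, 0),
  n_R_STR     = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
)

fit <- glm(radiant_win ~ n_R_STR, data = toy, family = binomial)

# exp(coefficient) is the odds ratio: above 1 raises the odds of a
# radiant win, below 1 lowers them.
odds_ratio <- exp(coef(fit))[["n_R_STR"]]
odds_ratio
```

In this toy case the odds of winning are 3 to 1 with the indicator set and 1 to 2 without it, so the odds ratio is about 6.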

It may be noted that the third logistic regression model had a substantially lower AIC (69096) compared to the first (87963) and second (87952) logistic regression models. A lower AIC implies a better model, although AIC values are strictly comparable only between models fitted to the same observations, and the third model is fitted to a different csv file. With that caveat, the variables relating to >= 3 hero types seem to predict the probability of winning better than the variables used in the first and second logistic regression models.


## Decision Tree

Decision tree model is used to predict the probability of radiant(1) or dire(0) teams winning. The distinctive branches can be interpreted as follows: if a dire team has gold >= 11000 and if the game duration did not exceed 2054, the dire team is more likely to win the game. This rule stays the same for a radiant team also. 

If a radiant team had tower damage < 2082, did not have gold >= 11000, game duration was not >= 2054, then team dire is more likely to win that game. 

All branches to the left of a splitter variable are YES and those to the right are NO. This matches the YES and NO shown on the duration splitter at the very top of the decision tree.

```{r 23,echo=FALSE}
Combined_Tree <- read.csv("Data/Combined_Tree.csv")

# Decision tree to predict probability of winning of radiant or dire teams
set.seed(200)

Combined_Tree$match_id <- NULL
Combined_Tree$radiant_win <- as.factor(Combined_Tree$radiant_win)

     # Split dataset
Combined_Tree_split <- sample.split(Combined_Tree$radiant_win,SplitRatio = 0.7)

     # Assign split data into training and test datasets
Combined_Tree_Train <- subset(Combined_Tree,Combined_Tree_split == TRUE)
Combined_Tree_Test <- subset(Combined_Tree,Combined_Tree_split == FALSE)


     # Build model
Combined_Tree_Model <- rpart(radiant_win ~  .,Combined_Tree_Train,method="class",control=rpart.control(minbucket = 25))

     # View tree
prp(Combined_Tree_Model)

```


#### Cross Validate decision tree

```{r 24,echo=FALSE}
# Cross validate tree
    # First find optimal cp value
fitControl <- trainControl(method="cv",number=10)
cartGrid <- expand.grid(.cp=(1:50)*0.01)
train(radiant_win ~ .,data=Combined_Tree_Train,method="rpart",trControl=fitControl,tuneGrid=cartGrid) 

    # Use this cp in new tree
Combined_Tree_Model_CP <- rpart(radiant_win ~ .,Combined_Tree_Train,method="class",control=rpart.control(cp=0.01))
```

```{r 25,echo=FALSE}
Combined_Tree_Model_CP_Predict <- predict(object=Combined_Tree_Model_CP,newdata=Combined_Tree_Test,type="class")
table(Combined_Tree_Test$radiant_win,Combined_Tree_Model_CP_Predict)

```
Final decision tree model gives an accuracy of 87.7%, substantially higher than the logistic regression.

# Conclusions

Results from the capstone project show that 

1) The act of hero selection makes no measurable difference to gameplay and winnability. This is supported by the fact that hero variables are not statistically significant in the linear and logistic regression models. 

2) The act of team selection alone makes no difference to the probability of winning. This is inferred from the fact that the radiant and dire variables are not statistically significant in the first linear regression model.

3) Teams with >= 3 INT heroes win the highest absolute number of games, followed by teams with >= 3 STR heroes. In the logistic regression, the collective act of team selection is statistically significant when >= 3 heroes of the same type are chosen. This is especially true for STR heroes.

4) trueskill_sigma variable is not used as a splitter variable in the decision tree. Similarly, in the scatter plot between trueskill_mu and percentage_wins where trueskill_sigma is used as color variable, the trueskill_sigma variable has some prominence in the lower left corner and upper right corner of the plot. Overall, there is no perceivable shape for trueskill_sigma in the scatter plot. 

5) There is a very high probability of AGI heroes buying 2 Class items. In fact, the rule confidence for AGI heroes is higher than for both STR and INT heroes. Similarly, there is a high probability of STR heroes buying Mob (mobility) items. This makes sense since STR heroes are slow moving. 

# Some recommendations

1) Given that duration is a significant variable in the logistic regression and is used as a splitter variable in the decision tree, it is important to ensure that game duration has no impact on game outcome. The logistic regression shows that duration has a negative coefficient, so the probability of a radiant win falls as games get longer, giving dire teams an advantage in longer games. A balancing adjustment for radiant teams in longer games could help offset this bias.

2) trueskill_sigma is not a splitter variable in the decision tree. trueskill_sigma does not show a clear pattern in the scatter plot between trueskill_mu and percentage_wins. trueskill_sigma is only significant in the linear regression, where trueskill_mu is the response variable. The TrueSkill algorithm requires review in the context of Dota 2. 

3) Teams with >= 3 STR heroes seem to be at a significant advantage. Revisiting this aspect might be useful for building a fairer Dota 2.