-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathSpringboard_Foundations of Data Science_Capstone_MilestoneReport.Rmd
516 lines (316 loc) · 32.1 KB
/
Springboard_Foundations of Data Science_Capstone_MilestoneReport.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
---
title: "Springboard Foundations of Data Science - Capstone Final Report"
author: "Arun Bharadwaj"
date: "September 7, 2017"
output:
html_document: default
pdf_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyr)
library(dplyr)
library(ggplot2)
library(reshape2)
library(corrplot)
library(caTools)
library(rpart)
library(rpart.plot)
library(caret)
library(arules)
library(arulesViz)
library(datasets)
library(e1071)
library(caret)
```
# Dota 2 Capstone
## Understanding the effect of team selection, hero selection and item purchase on Dota 2 gameplay
# Introduction
The Dota 2 dataset was downloaded from Kaggle. It is being used for the Springboard Foundations of Data Science course capstone project. The dataset has 18 csv files, with the data of 50,000 ranked Dota matches.
# Approaching the Capstone
The capstone project is approached by first identifying the problem statement. Next, the most important csv`s files are identified. Data wrangling is done on the most important files followed by other files. After this, exploratory data analysis and some basic statistical correlations are performed. Finally, regressions and machine learning algorithms are used on the dataset.
# Problem Statement
The goal of the capstone is to identify potential biases in the gameplay structure of Dota 2. By gameplay structure, we refer to the more than 100 heroes and items that can be picked by players. It is important that any game, board game or computer game, measures the true skill level of a player accurately. It is possible that some specific combinations of pre-game choices affects the winning ability of a player. If such a situation were to arise, the game is unlikely to be an accurate estimate of the player`s true skill level.
# Why and for whom is this problem statement important
Valve Corporation is the developer of Dota 2. For Valve Corporation, it is of utmost importance that their games are accurate predictors of player skill level. If a player manages to find loopholes or shortcuts that allow them to win, this will have a negative effect on their reputation. Ensuring fairness and accuracy of gameplay and game mechanic is very important for any game developer.
# What specific questions must be answered, with respect to the dataset, to solve this problem statement
To solve this problem statement, we focus on heroes, combination of heroes in a team and inventory items.
At the start of a game, each player is allowed to choose a hero. Heroes are divided into three types: strength (STR), agility (AGI) and intelligence (INT). As the names suggest, each type of hero has certain advantages. We can try finding if certain types of heroes allow players to win more easily.
Many Dota games are 5 on 5 multiplayer matches. There are two teams: radiant and dire. Players are allowed to choose any hero from the list of 113 heroes. A hero already chosen by another player cannot be chosen again. So, all ten heroes in a 5 on 5 multiplayer game are unique. For a team, there are no restrictions on the types of heroes they can choose, other than the requirement that all heroes in the game must be unique. A team is free to choose any number of strength, agility and intelligence heroes.
Finally, each player can buy inventory items using the gold they get from killing creeps, destroying buildings and killing opponent heroes. The player and their hero can use this gold to buy items that improve the attack and defense ratings of the hero. Understanding if specific combinations of heroes and items give a player undue advantage can also help solve the problem statement.
# Important CSV files
match_csv, player_rating_csv and players_csv are the 3 most important csv files.
match_csv contains high level information about each match including duration, time to first blood, status of tower and barracks of two teams at end of game, geographical region where the game is taking place and the team that won the game (radiant or dire).
```{r 1,echo=FALSE}
match_csv <- read.csv("Data/match.csv")
str(match_csv)
```
player_ratings_csv has information about all the human players who played the 50,000 games. The information includes player account_id, total games they have played, total games in which they were a part of the winning team, mean of player skill and standard deviation of player skill. Player skill level and their standard deviation have been calculated using Microsoft Research`s trueskill algorithm, which is widely used in the gaming world. Valve Corporation is not involved in the way player skill is calculated. So, the player skill mean and standard deviation columns have to be cross checked for accuracy.
```{r 2,echo=FALSE}
player_ratings_csv <- read.csv("Data/player_ratings.csv")
str(player_ratings_csv)
```
players_csv has specific information about each match, player account_id that participated in the match, hero_id of the hero used by the player, gold earned and spent, xp earned, deaths, kills, assists, stuns and last hits per hero. It also gives detailed information about gold extracted from killing creeps, killing other heroes and destroying buildings.
```{r 3,echo=FALSE}
players_csv <- read.csv("Data/players.csv")
str(players_csv)
```
# Data Wrangling
## Identify most important csv files
First part of data wrangling is to identify the most important csv files out of the 18 csv files. The 3 most important files have already been identified as match_csv, player_ratings_csv and players_csv.
## Eliminate columns that do not add value to analysis
Some hist functions were run on game_mode, negative_votes and positive_votes columns of match_csv file. Results from these hist functions showed that two of these variables had only one respective value. While game_mode had two values, the column is still considered insignificant to our analysis due to the nature of game_mode. Since the three columns do not add value to our analysis, they are removed.
## Check if data points of some important variables make sense
In match_csv, the mean of first__blood_time returned a value of 93. Having played Dota, it is obvious that 93 minutes is too long for the mean of first blood time. A cursory internet search shows that in a sample tournament, first blood time was 3 minutes. It is safe to assume that the unit of first_blood_time is seconds. Close to 25000 observations have a first blood time less than 100 seconds. This seems improbable since on average, a player takes atleast 1.5 minutes to buy some inventory and head to center of the map. While first_blood_time has a unit of seconds, there may be some errors in the way it is measured.
## Convert incorrect negative values to 0
In multiple columns, there are negative value observations. For instance, some time and account id columns have negative values. This is incorrect and must be modified. Such incorrect negative values are made equal to 0.
### Why not convert incorrect negative values to NA
Majority of incorrect negative values are seen in time columns of different csv files and the account_id column of player_ratings_csv file. In the chat_csv file`s time column, which is arranged in ascending of time per match id, the negative values are very low single digit numbers. It is likely that these actions occurred at the very beginning of the game. Instead of counting them as 0 time, the data was incorrectly recorded as low negative values. So, for the sake of simplicity, all negative time values are converted to 0. For account_id, we know that players can choose to not reveal their id and are counted as anonymous. These anonymous players are counted as 0 in the account_id column. Again, for the sake of simplicity, negative account_id are also considered anonymous and converted to 0 value.
## Nullify columns with very high percentage of NA values
For the sake of simplicity, columns with very high percentage of NA values are nullified. These columns are considered to be not significant for the purpose of our analysis. This is done using the summarise_at() function of dplyr.
# Feature Engineering
## Check if some small csv files can be combined with larger csv files to make it easier to visualize data
In match_csv, a variable called cluster is shown in numeric. This variable shows the geographical region where the match is taking place. A cluster_csv file uses a key value pair, where key is numeric cluster variable and value is name of geographical region. Left_join is used to merge the match_csv and cluster_csv files to help with data visualization. Multiple such operations are performed to combine small csv`s with larger ones.
## Classify hero_id column in players_csv to STR, AGI and INT
For the ease of classification, analysis and predictive modelling, heroes are classified into STR (strength), AGI (agility) and INT (intelligence). The 113 heroes have already been classified by a former Springboard student, Louis Montague, who has posted this specific csv file on his github profile.
https://github.com/DMzMin/LMS_Dota2/blob/master/hero_names_mod.csv
## Remove rows in players_csv and players_ratings_csv where account_id = 0
account_id = 0 are rows where the player has chosen to remain anonymous and not disclose their identity. With account_id being a categorical variable, having too many rows with 0 id`s tends to distort our analysis, especially when combining csv files using account_id. So, while players_csv file is kept as it is, a new players_csv1 is created after removing rows with account_id as 0.
## Replace 0 to 4 and 128-132 in players_csv1 to Radiant and Dire
A cursory internet search shows that the player_slot column in players_csv1 actually stands for the two dota teams: radiant and dire. While 0-4 is team radiant, 128-132 is team dire. These numbers are replaced with team names in the player_slot column of players_csv1.
## Convert observations in columns item_0 to item_5 in players_csv3, from numbers to class of items to reduce number of factors
item_0 to item_5 columns contain the ID`s of the items bought by players and their respective hero. Heroes have six slots on their dashboard to keep items. The 0 to 5 stands for the six slots. While the original dataset populates item_0 to item_5 columns using the item ID, we will classify items based on class of items. The items will be converted to STR (strength), AGI (agility), INT (intelligence), MOB (mobility), MISC (miscellaneous) etc. Also, some items combine multiple classes. These items are simply classified as 2 class (item that contains combination of 2 classes), 3 class upto 5 class. This type of classification reduces number of factors to manageable limits.
item_class_1 is a newly created csv file, which duplicates item_ids_csv and adds a item_class column. The item_class column is populated with the class of items mentioned earlier. This is done using information extracted from the internet.
## player_ratings_csv2 created which only includes account_id`s that have played 5 games or more
Human players that have played less than 5 games of Dota are poor predictors of skill and gameplay fairness. player_ratings_csv2 is created that only contains players who have played more than 5 games.
# Exploratory Data Analysis
## Hero Stats
players_csv has detailed information about the player, the hero they chose and their in-game performance. This information is grouped by hero class, which are STR/INT/AGI. Further, all stats are normalized such that a particular stat of STR hero plus AGI hero plus INT hero equals one. This operation allows us to view all stats on a single heatmap. The heatmap is shown below
```{r 4,echo=FALSE}
Scaled_Stats_by_hero_type_m <- read.csv("Data/Scaled_Stats_by_hero_type_m.csv")
ggplot(Scaled_Stats_by_hero_type_m, aes(x=Var1,y=Var2,fill=value)) + geom_tile() + labs(x="Type of Hero",y="Stats") + scale_fill_gradient(low="white",high="red")
```
From the heatmap, it can be seen that, on an absolute basis, INT heroes heal the most, AGI heroes cause maximum tower damage, STR heroes cause highest stuns and AGI heroes have highest last hits.
## Percentage of Wins
While total_wins is available for each account_id, simply looking at total_wins is inadequate. More accurate is to look at percentage_wins. This is possible since we know both total_wins and total_matches.
Using this metric, we plot a histogram of percentage of wins. Highest occurence of percentage_wins is 0.5, showing that maximum players win half their games.
```{r 5,echo=FALSE}
player_ratings_csv2 <- read.csv("Data/player_ratings_csv2.csv")
# Histogram line plot of percentage_wins statistic by player account_id. Most players win half the games they play.
ggplot(player_ratings_csv2,aes(percentage_wins)) +geom_histogram(aes(y=..density..)) + geom_density(aes(y=..density..))
```
Moreover, an even larger proportion of players win between 40-60% of the games they play.
## Correlation between trueskill_mu and percentage_wins
trueskill_mu is a Microsoft Research algorithm that measures player skill. To check its accuracy, we correlate it with percentage_wins. Graph shows that higher skill level implies higher win percentage. Since player skill is hard to measure, a standard deviation measure of player ability is included. The different colors of the below dots gives the different standard deviations.
```{r 6,echo=FALSE}
# Scatter plot of trueskill_mu and percentage_wins in players_ratings_csv2. trueskill_sigma is assigned to color attribute. While trueskill_mu is indicator of player`s ability, with higher value implying better player, trueskill_sigma is the uncertainty in the trueskill_mu measure. As expected, skill and win percentage are correlated.
ggplot(player_ratings_csv2,aes(x=trueskill_mu,y=percentage_wins,col=trueskill_sigma)) +geom_point() + geom_jitter(shape=1)
```
While there is a definite correlation between trueskill_mu and percentage_wins, trueskill_sigma does not have a definite pattern, except for the dark blue shades in the center of the plot and in the top right of the plot.
## Boxplot of item types and hero levels at end of game
Below plot, drawn between item_0 and hero level, shows the Dam (Damage) item type to have highest level median, followed by Disable. Int item types have the lowest median level.
```{r 7,echo=FALSE}
players_csv2 <- read.csv("Data/players_csv2.csv")
function_boxplot_level_itemclass <- function(a){ggplot(players_csv2,aes(x=a,y=level)) + geom_boxplot() + theme(axis.text.x=element_text(angle=-90)) + labs(x="Item types bought by Hero",y="Hero level at game end")}
function_boxplot_level_itemclass(players_csv2$item_0)
```
Similar plots were made for item_1 to item_5 versus hero level. All plots had very similar final outputs.
Based on this, it can be inferred that Damage and Disable class items lead to the highest hero median level while Int item types lead to the lowest hero median level. The opposite of this statement could also be true. It is possible that Damage and Disable items are late game items that can only be bought by heroes that have substantially levelled up. The next few sections will try to infer if higher hero levels are caused by specific items or if heroes are able to buy these specific items only after reaching those higher levels.
## Lineplot of the percentage of items acquired by each hero type for each item slot location
Below line plot shows percentage of item classes acquired by each hero type. For instance, more than 35% of AGI heroes acquire 2 Class items in the item_0 inventory slot. The trend for different item slots is similar, as seen from the shape of the different line graphs.
```{r 8,echo=FALSE}
item_and_hero_m <- read.csv("Data/item_and_hero_m.csv")
ggplot(item_and_hero_m,aes(x=new,y=value,colour=variable,group=1)) + geom_point() + geom_line(aes(group=variable)) + theme(axis.text.x=element_text(angle=-90)) + labs(x="Item class of the respective hero type",y="Percentage of observations")
```
From the line plot, it can be seen that 2 Class items are most preferred by all hero types. STR heroes have a high preference for Mob (mobility) items. This is obvious since STR heroes tend to be slow moving.
## Does high median hero level lead to purchase of damage/disable items or does purchase of damage/disable items lead to high median hero level?
The above line graph was obtained after excluding item classes that constituted less than 5% of total observations per hero. Since this filter has excluded damage or disable items, it is obvious that heroes bought standalone damage or disable items less than 5% of the time. If the strategy of buying solely damage or disable items were successfull, it is likely that more players would have chosen to buy standalone damage or disable items. This graph shows that only very few players chose to buy these items. So, it is concluded that only heroes that have already attained high levels buy damage or disable items. Buying damage or disable items cannot be attributed as a reason for high hero level.
## Identify heroes that win most games
On an absolute basis, INT heroes win the maximum number of games, followed by STR and then AGI heroes. This is true for both team types.
```{r 9,echo=FALSE}
players_csv3_winning_heroes_radiant <- read.csv("Data/players_csv3_winning_heroes_radiant.csv")
players_csv3_winning_heroes_dire <- read.csv("Data/players_csv3_winning_heroes_dire.csv")
ggplot(na.omit(players_csv3_winning_heroes_radiant),aes(x=winning_heroes_radiant)) + geom_histogram(stat="count")
ggplot(na.omit(players_csv3_winning_heroes_dire),aes(x=winning_heroes_dire)) + geom_histogram(stat="count")
```
INT heroes have won maximum number of games followed by STR heroes. This trend is the same for both radiant and dire teams.
## Identify heroes that have highest probability of winning
STR heroes have highest probability of winning, followed by AGI and then INT. The difference in winning probabilities between the three heroes is negligible.
```{r 10,echo=FALSE}
Prob_Hero_Winning <- read.csv("Data/Prob_Hero_Winning.csv")
barplot(Prob_Hero_Winning$x,main="Probability of winning for each hero type",names.arg = c("INT","AGI","STR"))
```
## Team hero combinations and winnability
```{r 11,echo=FALSE}
Team_Hero_Winning <- read.csv("Data/Team_Hero_Winning.csv")
barplot(Team_Hero_Winning$x,main="Total team wins where team had >=3 heroes of same type",names.arg = c("STR","AGI","INT"))
```
Teams that have combination of >=3 INT heroes win more games, followed by teams that have combination of >= 3 STR heroes. It is inadequate to look at the absolute number of games won by teams that have multiple heroes of same type since it is possible that higher wins for teams with >=3 INT heroes maybe due to the fact that many more teams choose combination of >= 3 INT heroes. So, the probability of a team winning, when it has >= 3 INT, >= 3 STR or >= 3 AGI heroes is calculated.
```{r 11.1, echo=FALSE}
Prob_Hero_Winning_Greaterthan3 <- read.csv("Data/Prob_Hero_Winning_greaterthan3.csv")
barplot(Prob_Hero_Winning_Greaterthan3$x,main="Prob of winning where team had >=3 heros of same type",names.arg = c("STR","AGI","INT"))
```
Above plot shows that teams which pick >= 3 STR heroes have a higher probability of winning, followed by teams that pick >= 3 AGI heroes. This topic is further explored in the logistic regression area of the project.
## Association rules between hero type and item purchase
### Item Frequency Plot
Below plot shows that 2 Class items are the most commonly purchased items. This is followed by miscellaneous, mobility and health. The fact that 2 Class items are bought highest matches the earlier line plot, which showed that all three hero types have very high preference for 2 Class items.
```{r 12,echo=FALSE}
association_data1 <- read.csv("Data/association_data1.csv")
association_data1$Class <- NULL
# Convert characters to factors
association_data1 <- sapply(association_data1,as.factor)
# Convert to itemMatrix after removing Class column
# since we do not want Class column in frequency plot
association_data1 <- as.data.frame(association_data1)
association_data1 <- as(association_data1,"transactions")
# Plot item frequency
itemFrequencyPlot(association_data1,topN=15,type="absolute")
```
### General rule output
Association rules are most commonly used by the retail industry. They show the probability of a person buying item X, if they previously bought item Y in the same trip. For the capstone project, association rules are used to analyze the items purchased by STR, AGI and INT heroes.
```{r 13,echo=FALSE}
# New itemMatrix which includes Class column since for
# rules, we want to include Class
association_data2 <- players_csv2[,-c(1:44)]
association_data2 <- as.data.frame(sapply(association_data2,as.factor))
association_data2 <- as(association_data2,"transactions")
# Generate rules
rules <- apriori(association_data2,parameter = list(supp=0.001,confidence=0.8))
# Sort and inspect rules
options(digits=2)
rules <- sort(rules,by="confidence",decreasing=TRUE)
inspect(rules[1:10])
```
Rule output shows that AGI heroes have highest rule confidence. INT and STR heroes do not even feature in the top 10 rules.
### STR specific rule output
```{r 14, echo=FALSE}
# Find rules that have Class=STR on right side. But use lower
# confidence values since there are no high confidence rules with STR
rules1 <- apriori(association_data2,parameter=list(supp=0.001,confidence=0.4),appearance=list(default="lhs",rhs="Class=STR"),control=list(verbose=F))
rules1 <- sort(rules1,by="confidence",decreasing=TRUE)
inspect(rules1[1:10])
```
STR hero rules have highest rule confidence of only 0.75.
# Machine Learning
Linear regression, logistic regression and decision tree is used in the machine learning part of the capstone project. Linear regression was used to predict numeric values of trueskill_mu. Logistic regression was used to predict the probability of radiant or dire teams winning. Finally, a decision tree or CART is used to perform the same prediction as the logistic regression.
## Linear Regression (LR)
### Correlated Variables
Before performing the linear regression, a correlation plot was generated.
```{r 15,echo=FALSE}
Combined_LR_1 <- read.csv("Data/Combined_LR_1.csv")
Combined_LR_1_cor <- cor(Combined_LR_1[,-(23:27)])
corrplot(Combined_LR_1_cor,method="color",insig="blank",addCoef.col = "grey",order="AOE",cl.cex=0.5,tl.cex = 0.5,addCoefasPercent = TRUE,number.cex = 0.5)
```
### First LR model
trueskill_mu was predicted after removing all correlated variables. Following summary output was obtained.
```{r 16,echo=FALSE}
Combined_LR_2 <- read.csv("Data/Combined_LR_2.csv")
LR_model_1 <- lm(trueskill_mu ~ trueskill_sigma + gold + kills + deaths + assists + denies + last_hits + hero_healing + tower_damage + level + xp_other + gold_destroying_structure + player_slot_Radiant + player_slot_Dire + Class_STR + Class_AGI + Class_INT,data=Combined_LR_2)
summary(LR_model_1)
```
Output shows a very low R^2 value for the model. Also, the player`s team choice and hero choice are not statistically significant.
### Second LR model
All variables that were statistically not significant in the first LR model were removed and the second model was built. Following output was obtained.
```{r 17,echo=FALSE}
LR_model_2 <- lm(trueskill_mu ~ trueskill_sigma + gold + deaths + assists + last_hits + hero_healing + xp_other,data=Combined_LR_2)
summary(LR_model_2)
```
Now we have obtained a model where most variables are significant.
```{r 18, echo=FALSE}
plot(LR_model_2)
```
Residuals versus Fitted plot has some correlation between the two variables, shown by the parabolic shape of the curve.Normal Q-Q plot shows that residuals are normally distributed, based on the plot`s straight line curve.Square root standardized residuals versus fitted plot also has some correlation between the two variables.
It can be concluded that the linear regression is neither a great nor a bad model. It is a mediocre model with some flaws.
## Logistic Regression (LogR)
Logistic regression is used to predict the probability of radiant or dire team winning. To do so, the data is grouped by match_id and player_slot and all the columns are summarised by taking the sum of all column observations.
### Correlated Variables
Before performing logistic regression, a correlation plot was generated.
```{r 19,echo=FALSE}
Combined_LogR_2 <- read.csv("Data/Combined_LogR_2.csv")
# Find variables that are correlated
Combined_LogR_2_cor <- cor(Combined_LogR_2)
# Below code was taken from https://codedump.io/share/BSdmR40dKSWs/1/how-to-change-font-size-of-the-correlation-coefficient-in-corrplot
# since font size of corrplot without these changes was too big and not clear
corrplot(Combined_LogR_2_cor,method="color",insig="blank",addCoef.col = "grey",order="AOE",cl.cex=0.5,tl.cex = 0.5,addCoefasPercent = TRUE,number.cex = 0.5)
```
### First LogR model
radiant_win was predicted after removing all correlated variables. Following output was obtained.
```{r 20,echo=FALSE}
LogR_Train <- read.csv("Data/LogR_Train.csv")
LogR_model <- glm(radiant_win ~ gold + kills + deaths + denies + last_hits + stuns + hero_healing + tower_damage + xp_other + gold_other + trueskill_mu + trueskill_var + Class_STR + Class_AGI + Class_INT + duration + first_blood_time + cluster,LogR_Train,family=binomial)
summary(LogR_model)
```
Output again shows that hero choice is not statistically significant. Even player skill is not significant in determining probability of radiant or dire winning according to the logistic regression.
### Second LogR model
All variables that were not statistically significant in the first logistic regression model were removed and the second model was built.
```{r 21, echo=FALSE}
LogR_model_1 <- glm(radiant_win ~ gold + kills + deaths + last_hits + hero_healing + tower_damage + duration,LogR_Train,family=binomial)
summary(LogR_model_1)
```
Very high AIC shows that this model is also not very useful for drawing larger conclusions.
#### Accuracy of second LogR model
```{r 22,echo=FALSE}
pred_LogR_model_Train <- predict(object=LogR_model_1,type="response")
tapply(pred_LogR_model_Train,LogR_Train$radiant_win,mean)
table(LogR_Train$radiant_win,pred_LogR_model_Train>=0.5)
```
Model accuracy using the training set equals 53.5%.
### Third LogR model
A third logistic regression model uses radiant_win as the dependant variable and >=3 STR (radiant), >=3 STR (dire), >= 3 AGI (radiant), >= 3 AGI (dire), >= 3 INT (radiant), >= 3 INT (dire) as binary independant variables. This dataset is taken for all 50000 matches where radiant_win is 1, for radiant winning and 0, for dire winning. If the radiant team has used >=3 STR heroes, the n_R_STR variable is assigned 1, else 0. Similarly for other 5 variables.
```{r 22.1,echo=FALSE}
Combined_LogR_Greaterequal3_4 <- read.csv("Data/Combined_LogR_Greaterequal3_4.csv")
LogR_Greaterequal3_model <- glm(radiant_win~n_R_STR+n_R_AGI+n_R_INT+n_D_STR+n_D_AGI+n_D_INT,data=Combined_LogR_Greaterequal3_4,family = binomial)
summary(LogR_Greaterequal3_model)
```
Logistic regression shows that all 6 variables are statistically significant. n_R_STR`s positive co-efficient shows that a radiant team having >=3 STR heroes increases its chances of winning. n_D_STR has a negative co-efficient, implying that when dire teams choose >=3 STR heroes, the chances of radiant team winning decreases.
It maybe noted that the third logistic regression model had a substantially lower AIC (69096), compared to the first (87963) and second (87952) logistic regression models. Lower AIC implies better model. So, the variables relating to >= 3 hero types seem to predict probability of winning better than the variables used in the first and second logistic regression models.
## Decision Tree
Decision tree model is used to predict the probability of radiant(1) or dire(0) teams winning. The distinctive branches can be interpreted as follows: if a dire team has gold >= 11000 and if the game duration did not exceed 2054, the dire team is more likely to win the game. This rule stays the same for a radiant team also.
If a radiant team had tower damage < 2082, did not have gold >= 11000, game duration was not >= 2054, then team dire is more likely to win that game.
All leaves to the left of a splitter variable are YES and those to the right of a splitter variables are NO. This is similar to the YES and NO shown on the duration splitter at the very top of the decision tree.
```{r 23,echo=FALSE}
Combined_Tree <- read.csv("Data/Combined_Tree.csv")
# Decision tree to predict probability of winning of radiant or dire teams
set.seed(200)
Combined_Tree$match_id <- NULL
Combined_Tree$radiant_win <- as.factor(Combined_Tree$radiant_win)
# Split dataset
Combined_Tree_split <- sample.split(Combined_Tree$radiant_win,SplitRatio = 0.7)
# Assign split data into training and test datasets
Combined_Tree_Train <- subset(Combined_Tree,Combined_Tree_split == TRUE)
Combined_Tree_Test <- subset(Combined_Tree,Combined_Tree_split == FALSE)
# Build model
Combined_Tree_Model <- rpart(radiant_win ~ .,Combined_Tree_Train,method="class",control=rpart.control(minbucket = 25))
# View tree
prp(Combined_Tree_Model)
```
#### Cross Validate decision tree
```{r 24,echo=FALSE}
# Cross validate tree
# First find optimal cp value
fitControl <- trainControl(method="cv",number=10)
cartGrid <- expand.grid(.cp=(1:50)*0.01)
train(radiant_win ~ .,data=Combined_Tree_Train,method="rpart",trControl=fitControl,tuneGrid=cartGrid)
# Use this cp in new tree
Combined_Tree_Model_CP <- rpart(radiant_win ~ .,Combined_Tree_Train,method="class",control=rpart.control(cp=0.01))
```
```{r 25,echo=FALSE}
Combined_Tree_Model_CP_Predict <- predict(object=Combined_Tree_Model_CP,newdata=Combined_Tree_Test,type="class")
table(Combined_Tree_Test$radiant_win,Combined_Tree_Model_CP_Predict)
```
Final decision tree model gives an accuracy of 87.7%, substantially higher than the logistic regression.
# Conclusions
Results from the capstone project show that
1) The act of hero selection makes no difference to gameplay and winnability. This is proven by the fact that hero variables are not statistically significant in the linear and logistic regression models.
2) The act of team selection alone makes no difference to the probability of winning. This is inferred from the fact that radiant and dire variables are not statistically significant in the logistic regression model.
<<<<<<< HEAD:Capstone_Dota2_MilestoneReport.Rmd
3) Teams with >= 3 INT heroes win substantially higher number of games, followed by STR heroes.
=======
3) The collective act of team selection is statistically significant when >= 3 heroes of the same type are chosen. This is especially true for STR heroes.
>>>>>>> 6ebd08015d07867c8ad5579eeb605741dd093472:Springboard_Foundations of Data Science_Capstone_MilestoneReport.Rmd
4) trueskill_sigma variable is not used as a splitter variable in the decision tree. Similarly, in the scatter plot between trueskill_mu and percentage_wins where trueskill_sigma is used as color variable, the trueskill_sigma variable has some prominence in the lower left corner and upper right corner of the plot. Overall, there is no perceivable shape for trueskill_sigma in the scatter plot.
5) There is a very high probability of AGI heroes buying 2 Class items. As a matter of fact, rule association for AGI heroes is higher than both STR and INT heroes. Similarly, there is a very high probability of STR heroes buying Mob (mobility) items. This makes sense since STR heroes are slow moving.
# Some recommendations
1) Given that duration is a significant variable in the logistic regression and is used as a splitter variable in the decision tree, it is important to ensure that game duration has no impact on game outcome. Logistic regression shows that duration has a negative co-efficient. So, lower duration games decrease radiant team`s probability of winning, giving dire teams an advantage. Instituting a bonus for radiant teams in shorter duration games can help overcome this bias.
2) trueskill_sigma is not a splitter variable in the decision tree. trueskill_sigma does not have an impact on the scatter plot between trueskill_mu and percentage_wins. trueskill_sigma is only significant in the linear regression where trueskill_mu is the predictor. The trueskill algorithm requires review in the context of Dota 2.
3) Teams with >=3 STR heroes seem to be at a significant advantage. A relook into this aspect might be useful to building a fairer Dota 2.