fixing typo #43

Open
wants to merge 1 commit into master
15 changes: 12 additions & 3 deletions .ipynb_checkpoints/classification-checkpoint.ipynb
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Explore Categorical Variables to Classify Company Status"
"# Exploratory Analysis #6: Explore Categorical Variables to Classify Company Status"
]
},
{
@@ -1606,7 +1606,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Remember I picked the hyperparameters of this model based on the AUC score, so let's look at our model performance bu looking at the ROC curve and AUC statistic."
"Remember I picked the hyperparameters of this model based on the AUC score, so let's look at our model performance but looking at the ROC curve and AUC statistic."
]
},
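A minimal sketch of the ROC/AUC evaluation this cell points to, assuming a fitted binary classifier `model` and held-out test data `X_test` / `y_test` (assumed names, not taken from this diff):

```python
# Sketch of an ROC/AUC evaluation cell; `model`, `X_test`, and
# `y_test` are assumed names, not taken from this notebook.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_score = model.predict_proba(X_test)[:, 1]   # probability of the positive class
fpr, tpr, _ = roc_curve(y_test, y_score)      # ROC curve points
auc = roc_auc_score(y_test, y_score)          # area under the ROC curve

plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
plt.plot([0, 1], [0, 1], linestyle="--", label="chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```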
{
@@ -1693,7 +1693,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"I spent a lot of time reading sklearn documentation for this project on models I was familiar with, and several time I just theme suggest reading the ExtraTreesClassifier. I had not heard of an Extra Trees model before, so I did some research and read some of [this](http://www.montefiore.ulg.ac.be/~ernst/uploads/news/id63/extremely-randomized-trees.pdf) paper from 2006 introducing Extremeley Randomized Trees, or in sklearn speak ExtraTrees. \n",
"I spent a lot of time reading sklearn documentation for this project on models I was familiar with, and several times I came across something called an ExtraTreesClassifier. I had not heard of an Extra Trees model before, so I did some research and read some of [this](http://www.montefiore.ulg.ac.be/~ernst/uploads/news/id63/extremely-randomized-trees.pdf) paper from 2006 introducing Extremeley Randomized Trees, or in sklearn speak ExtraTrees. \n",
"\n",
"Extremeley Randomized Trees are very similar to Random Forests, and sklearn sets up the user input up in a very similar way. Extremeley Randomized Trees are similar to Random Forests in that they take a random rubsample of features, but drops the idea of bootstraping many trees samples in order to find optimal cut off points for feature node splits, and instead randomizes the picks a decision boundary at random for these node splits. This is why they are \"extremeley\" random, and as far as the bias-variance tradeoff is concerned, the model's increase in randomness seeks to further lower the variance of a model. Based on what I read, the performance of Extremley Randomized Trees can be similar, if not usually better, than that of a Random Forest."
]
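To make the comparison above concrete, here is a minimal, self-contained sketch of the two sklearn estimators side by side; the synthetic dataset and the hyperparameter values are illustrative assumptions, not values from the notebook:

```python
# Sketch: Random Forest vs. Extremely Randomized Trees in sklearn.
# Both subsample candidate features at each node split; ExtraTrees also
# draws split thresholds at random instead of searching for the best one,
# and by default skips bootstrap resampling of the training rows.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)  # bootstrap=True by default
et = ExtraTreesClassifier(n_estimators=100, random_state=0)    # bootstrap=False by default

print("Random Forest AUC:", cross_val_score(rf, X, y, cv=5, scoring="roc_auc").mean())
print("Extra Trees   AUC:", cross_val_score(et, X, y, cv=5, scoring="roc_auc").mean())
```

With all else equal, the extra randomness typically trades a little bias for lower variance, which is the effect the notebook is after.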
@@ -1705,6 +1705,15 @@
"I am going to use GridSearchCV to tune the hyperparameters of the model again, however, this time I am not tuning the number of trees, or n_estimators, of the model. There are several reasons for this, for one, in general as the number of trees increases, generally model accurately increases at the sake of increases runtime, and I this notebook already has several cells that can take a few minutes to run. In addition, as the number of estimators increases, so does generally the chance of overfitting, and I am purposely using ExtraTrees for its ability to reduce model variance. "
]
},
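A rough sketch of the tuning setup described above, holding n_estimators fixed while searching other hyperparameters; the parameter grid is an illustrative assumption, not the notebook's actual grid:

```python
# Sketch: tune an ExtraTreesClassifier with GridSearchCV while holding
# n_estimators fixed; the grid values are illustrative assumptions.
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": [None, 5, 10],
    "max_features": ["sqrt", "log2"],
    "min_samples_split": [2, 5, 10],
}

search = GridSearchCV(
    ExtraTreesClassifier(n_estimators=100, random_state=0),  # tree count fixed
    param_grid,
    scoring="roc_auc",  # consistent with the AUC-based selection above
    cv=5,
)
# search.fit(X_train, y_train)  # X_train / y_train are assumed names
# print(search.best_params_, search.best_score_)
```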
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": true
+ },
+ "outputs": [],
+ "source": []
+ },
{
"cell_type": "code",
"execution_count": 85,
13 changes: 11 additions & 2 deletions classification.ipynb
@@ -1606,7 +1606,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Remember I picked the hyperparameters of this model based on the AUC score, so let's look at our model performance bu looking at the ROC curve and AUC statistic."
"Remember I picked the hyperparameters of this model based on the AUC score, so let's look at our model performance but looking at the ROC curve and AUC statistic."
]
},
{
@@ -1693,7 +1693,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"I spent a lot of time reading sklearn documentation for this project on models I was familiar with, and several time I just theme suggest reading the ExtraTreesClassifier. I had not heard of an Extra Trees model before, so I did some research and read some of [this](http://www.montefiore.ulg.ac.be/~ernst/uploads/news/id63/extremely-randomized-trees.pdf) paper from 2006 introducing Extremeley Randomized Trees, or in sklearn speak ExtraTrees. \n",
"I spent a lot of time reading sklearn documentation for this project on models I was familiar with, and several times I came across something called an ExtraTreesClassifier. I had not heard of an Extra Trees model before, so I did some research and read some of [this](http://www.montefiore.ulg.ac.be/~ernst/uploads/news/id63/extremely-randomized-trees.pdf) paper from 2006 introducing Extremeley Randomized Trees, or in sklearn speak ExtraTrees. \n",
"\n",
"Extremeley Randomized Trees are very similar to Random Forests, and sklearn sets up the user input up in a very similar way. Extremeley Randomized Trees are similar to Random Forests in that they take a random rubsample of features, but drops the idea of bootstraping many trees samples in order to find optimal cut off points for feature node splits, and instead randomizes the picks a decision boundary at random for these node splits. This is why they are \"extremeley\" random, and as far as the bias-variance tradeoff is concerned, the model's increase in randomness seeks to further lower the variance of a model. Based on what I read, the performance of Extremley Randomized Trees can be similar, if not usually better, than that of a Random Forest."
]
@@ -1705,6 +1705,15 @@
"I am going to use GridSearchCV to tune the hyperparameters of the model again, however, this time I am not tuning the number of trees, or n_estimators, of the model. There are several reasons for this, for one, in general as the number of trees increases, generally model accurately increases at the sake of increases runtime, and I this notebook already has several cells that can take a few minutes to run. In addition, as the number of estimators increases, so does generally the chance of overfitting, and I am purposely using ExtraTrees for its ability to reduce model variance. "
]
},
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": true
+ },
+ "outputs": [],
+ "source": []
+ },
{
"cell_type": "code",
"execution_count": 85,