Adding weka+classifiers description. References #11
unintendedbear committed Aug 11, 2017
1 parent b5fb4dd commit 1cbe02a
Showing 2 changed files with 118 additions and 1 deletion.
19 changes: 18 additions & 1 deletion Chapters/03-softc.tex
@@ -13,7 +13,7 @@

Therefore, in this chapter we will go over the techniques that take part in the process, from selecting the data in the database to actually obtaining the set of rules and visualising them, via data preprocessing and transformation. Furthermore, the soft computing techniques that have been applied in research to the problem of classification -- extraction of rules -- and to visualisation are also detailed.

It is important to note that this process, known as ``Knowledge Discovery in Databases'' (KDD), is not the proposed methodology itself, but part of it. These are the necessary steps and methods to obtain the rules, but the proposed algorithms can be used or not depending on the use case, which will be defined later.

\section{The Knowledge Discovery in Databases process}

@@ -27,6 +27,23 @@ \section{Data transformation}

\section{Soft computing techniques applied to data mining and visualisation}

Until now, for the other parts of the process, we have focused on some characteristics of the dataset, such as the number of attributes, whether it has missing values, or the imbalance in the number of cases belonging to each class. But the type of data of every attribute is also important, because it will determine the kind of algorithms that can be used.

Thus, an attribute can be \textit{numeric} or \textit{nominal} \cite{witten2016data}. Numeric attributes measure quantities such as integers, real numbers, or boolean values, whilst nominal attributes -- also named \textit{categorical} -- take their value from a predefined, finite set of possibilities. In what follows we will overview the algorithms that can be used in cases where, like ours, the data is mostly nominal.
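As a minimal illustration of this distinction, the following sketch builds a tiny dataset with Weka's \texttt{weka.core} API, mixing one nominal attribute, one numeric attribute and a nominal class; the attribute names, values and the dataset itself are made up for the example:

\begin{verbatim}
import java.util.ArrayList;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;

public class AttributeTypesExample {
    public static void main(String[] args) {
        // Nominal attribute: its value comes from a predefined, finite set.
        ArrayList<String> deviceValues = new ArrayList<>();
        deviceValues.add("laptop");
        deviceValues.add("smartphone");
        Attribute device = new Attribute("device", deviceValues);

        // Numeric attribute: measures a quantity, e.g. the hour of the day.
        Attribute hour = new Attribute("hour");

        // Nominal class attribute: the decision to be learnt.
        ArrayList<String> decisions = new ArrayList<>();
        decisions.add("allow");
        decisions.add("deny");
        Attribute decision = new Attribute("decision", decisions);

        ArrayList<Attribute> schema = new ArrayList<>();
        schema.add(device);
        schema.add(hour);
        schema.add(decision);
        Instances data = new Instances("byod-example", schema, 0);
        data.setClassIndex(data.numAttributes() - 1);

        // One example instance: a smartphone connecting at 14h, labelled "allow".
        DenseInstance inst = new DenseInstance(data.numAttributes());
        inst.setDataset(data);
        inst.setValue(device, "smartphone");
        inst.setValue(hour, 14);
        inst.setValue(decision, "allow");
        data.add(inst);

        System.out.println(data);
    }
}
\end{verbatim}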

\subsection{Classification in the data mining process}

Inside KDD, the process of classification, or application of classifying algorithms, helps to build a model of the data set and to understand the relationships therein. As previously said, the data coming from BYOD practices usually contains both numerical and nominal attributes; thus, only classification algorithms that support both types of data can be considered. Weka \cite{weka:site} is a collection of state-of-the-art machine learning algorithms and data preprocessing tools that are key for data mining processes \cite{witten2016data}. For our purposes, it is important to focus on rule-based and decision-tree-based algorithms. A decision-tree algorithm is a group of conditions organised in a top-down, recursive manner, so that a class is assigned by following a path of conditions from the root of the tree to one of its leaves. Generally speaking, the possible classes to choose from are mutually exclusive. These algorithms are also called ``divide-and-conquer'' algorithms. On the other hand, there are the ``separate-and-conquer'' algorithms, which create rules one at a time; the instances covered by the created rule are then removed, and the next rule is generated from the remaining instances. The most important characteristic of these algorithms is that the model built from the dataset is expressed in the form of a set of rules.
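As a minimal sketch of the difference between both strategies, the example below (assuming a hypothetical ARFF file \texttt{byod.arff} whose last attribute is the class) builds one ``divide-and-conquer'' model with J48 and one ``separate-and-conquer'' model with JRip, Weka's implementation of RIPPER -- used here only as an illustration of a rule learner, since it is not among the classifiers selected below -- and prints the resulting tree and rule set:

\begin{verbatim}
import weka.classifiers.rules.JRip;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TreeVersusRules {
    public static void main(String[] args) throws Exception {
        // Load the dataset; the last attribute is assumed to be the class.
        Instances data = DataSource.read("byod.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // "Divide-and-conquer": a C4.5-style decision tree.
        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree);   // prints the pruned decision tree

        // "Separate-and-conquer": rules learnt one at a time.
        JRip rules = new JRip();
        rules.buildClassifier(data);
        System.out.println(rules);  // prints the learnt rule set
    }
}
\end{verbatim}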

Among the rule-based and decision-tree-based algorithms there is a great number of possible algorithms to work with, so we have conducted a preselection phase, trying to choose those which would yield better results in the experiments. A reference to each Weka classifier can be found in \cite{witten2016data}. The top five techniques, obtained from the best results of the experiments done at this stage, are described below, along with more specific bibliography; a short sketch comparing them in Weka follows the list. The Naïve Bayes method \cite{Bayesian_Classifier_97}, normally used in text categorisation problems, has been included as a baseline. According to the results, the five selected classifiers are much better than this method.

\begin{description}
\item[Naïve Bayes] This is the classification technique that we have used as a reference, for both its simplicity and its ease of understanding. It relies on Bayes' theorem and on the possibility of representing the relationship between two random variables as a Bayesian network \cite{rish2001empirical}. Then, by assigning values to the variables' probabilities, the probabilities of the occurrences between them can be obtained. Thus, assuming that the attributes are independent from one another, and using Bayes' theorem, patterns can be classified without the need of trees or rule creation, just by calculating probabilities.
\item[J48] This classifier generates a pruned or unpruned C4.5 decision tree. Described for the first time in 1993 by \cite{Quinlan1993}, this machine learning method builds a decision tree by selecting, for each node, the best attribute for splitting and creating the next nodes. An attribute is selected as `the best' by evaluating the difference in entropy (information gain) resulting from choosing that attribute for splitting the data. In this way, the tree continues to grow until there are no attributes left for further splitting, meaning that the resulting nodes are instances of single classes.
\item[Random Forest] This manner of building a decision tree can be seen as a randomisation of the previous C4.5 process. It was proposed by \cite{Breiman2001}: instead of choosing `the best' attribute, the algorithm randomly chooses one from a group of the top-ranked attributes. The size of this group is customisable in Weka.
\item[REP Tree] This is another kind of decision tree; its name stands for Reduced Error Pruning Tree. Originally proposed by \cite{Quinlan1987}, this method builds a decision tree using information gain, like C4.5, and then prunes it using reduced-error pruning. That means that the training dataset is divided in two parts: one devoted to growing the tree and another to pruning it. Every subtree (not a class/leaf) is replaced by the best possible leaf with respect to the pruning set, and it is then tested whether the pruned tree improves the results on that set. A deep analysis of this technique and its variants can be found in \cite{Elomaa2001}.
\item[NNge] A nearest-neighbour machine learning method that generates rules using non-nested generalised exemplars, i.e., the so-called `hyperrectangles', which are multidimensional rectangular regions of the attribute space \cite{Martin1995}. The NNge algorithm builds a rule set from the creation of these hyperrectangles. They are non-nested (overlapping is not permitted), which means that the algorithm checks, whenever a new hyperrectangle is created from a new generalisation, whether it conflicts with any region of the attribute space. This is done to avoid that an example is covered by more than one rule.
\item[PART] Its name comes from `partial' decision trees, for it builds its rule set from them \cite{Frank1998}. The way of generating a partial decision tree combines the two aforementioned strategies, ``divide-and-conquer'' and ``separate-and-conquer'', thereby gaining flexibility and speed. When a tree begins to grow, the node with the lowest information gain is the one chosen to start expanding. When a subtree is complete (it has reached its leaves), its substitution by a single leaf is considered. In the end the algorithm obtains a partial decision tree instead of a fully explored one, because the leaves with the largest coverage become rules and some subtrees are thus discarded.
\end{description}
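As announced above, the following sketch illustrates how such a comparison can be run in Weka. It assumes a hypothetical ARFF file \texttt{byod.arff} whose last attribute is the class; NNge is left out of the array because it is distributed as an optional Weka package rather than with the core classifiers, but it would be added analogously once installed:

\begin{verbatim}
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.rules.PART;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.REPTree;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClassifierComparison {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("byod.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Baseline plus the selected tree- and rule-based classifiers.
        // NNge would be appended here once its package is installed.
        Classifier[] classifiers = {
            new NaiveBayes(), new J48(), new RandomForest(),
            new REPTree(), new PART()
        };

        for (Classifier c : classifiers) {
            // 10-fold cross-validation with a fixed seed for reproducibility.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.printf("%-15s %.2f%% correct%n",
                    c.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}
\end{verbatim}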

\subsection{Data visualisation and interpretation}
100 changes: 100 additions & 0 deletions tesis.bib
@@ -425,3 +425,103 @@ @article{fayyad1996data
pages={37},
year={1996}
}

@inproceedings{Frank1998,
author = {Eibe Frank and Ian H. Witten},
booktitle = {Fifteenth International Conference on Machine Learning},
editor = {J. Shavlik},
pages = {144-151},
publisher = {Morgan Kaufmann},
title = {Generating Accurate Rule Sets Without Global Optimization},
year = {1998},
PS = {http://www.cs.waikato.ac.nz/\~eibe/pubs/ML98-57.ps.gz}
}

@article{Bayesian_Classifier_97,
author = {Domingos, Pedro and Pazzani, Michael},
title = {On the optimality of the simple Bayesian classifier under zero-one loss},
journal = {Machine Learning},
year = {1997},
volume = {29},
pages = {103-137}
}

@article{Breiman2001,
author = {Leo Breiman},
journal = {Machine Learning},
number = {1},
pages = {5-32},
title = {Random Forests},
volume = {45},
year = {2001}
}

@book{Quinlan1993,
address = {San Mateo, CA},
author = {J. R. Quinlan},
publisher = {Morgan Kaufmann Publishers},
title = {C4.5: Programs for Machine Learning},
year = {1993}
}

@article{Quinlan1987,
author = {J. R. Quinlan},
journal = {International Journal of Man-Machine Studies},
number = {3},
pages = {221-234},
title = {Simplifying decision trees},
volume = {27},
year = {1987}
}

@article{Elomaa2001,
author = {T. Elomaa and M. K\"a\"ari\"ainen},
journal = {Journal of Artificial Intelligence Research},
pages = {163-187},
title = {An Analysis of Reduced Error Pruning},
volume = {15},
year = {2001}
}

@mastersthesis{Martin1995,
address = {Hamilton, New Zealand},
author = {Brent Martin},
school = {University of Waikato},
title = {Instance-Based Learning: Nearest Neighbor With Generalization},
year = {1995}
}

@book{witten2016data,
title={Data Mining: Practical machine learning tools and techniques},
author={Witten, Ian H and Frank, Eibe and Hall, Mark A and Pal, Christopher J},
year={2016},
publisher={Morgan Kaufmann}
}

@book{Frank2011,
author = {Eibe Frank and Ian H. Witten},
edition = {Third},
howpublished = {Paperback},
isbn = {978-0-12-374856-0},
month = {February},
publisher = {Morgan Kaufmann Publishers},
title = {Data Mining: Practical Machine Learning Tools and Techniques},
year = 2011
}

@techreport{oreilly_perl07,
author = {O'Reilly, T. and Smith, B.},
title = {The Importance Of Perl},
year = {2007},
webpage = {http://oreillynet.com/pub/a/oreilly/perl/news/importance_0498.html},
lastaccess = {September, 2014}
}

@misc{weka:site,
author = {University of Waikato},
title = {Weka},
year = {1993},
webpage = {http://www.cs.waikato.ac.nz/~ml/weka/},
lastaccess = {September, 2014}
}
