Finished SC techniques for classification (GP included). References #11
unintendedbear committed Aug 28, 2017
1 parent 0153e03 commit 1b49289
Showing 1 changed file with 14 additions and 0 deletions.
14 changes: 14 additions & 0 deletions Chapters/03-softc.tex
@@ -93,5 +93,19 @@ \subsection{Classification in the data mining process}
\item[NNge] Nearest-Neighbor machine learning method that generates rules using non-nested generalised exemplars, the so-called `hyperrectangles', because they are multidimensional rectangular regions of attribute space \cite{Martin1995}. The NNge algorithm builds a rule set by creating these hyperrectangles. They are non-nested (overlapping is not permitted), meaning that whenever a new generalisation proposes a new hyperrectangle, the algorithm checks whether it conflicts with any existing region of the attribute space. This is done to avoid an example being covered by more than one rule.
\item[PART] Its name comes from `partial' decision trees, from which it builds its rule set \cite{Frank1998}. A partial decision tree is generated by combining the two aforementioned strategies, ``divide-and-conquer'' and ``separate-and-conquer'', thereby gaining both flexibility and speed. As the tree grows, the node with the lowest information gain is chosen as the one to expand first. When a subtree is complete (it has reached its leaves), replacing it with a single leaf is considered. In the end the algorithm obtains a partial decision tree instead of a fully explored one, because the leaves with the largest coverage become rules and some subtrees are thus discarded (an illustrative example of such a rule is shown right after this list).
\end{description}
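
As an illustration (the attribute and class names below are hypothetical and not taken from the cited works), the rules produced by this family of learners take an IF--THEN form, where the left-hand side is a conjunction of conditions over attribute values and the right-hand side is the assigned class:
\begin{center}
\texttt{IF outlook = sunny AND humidity > 83 THEN play = no}
\end{center}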

Another SC technique that can be used for classification is Genetic Programming (GP), which has been proposed in the literature for discovering novel, interesting knowledge and rules from large amounts of data \cite{freitas2002data}, given that current approaches generally rely on pre-defined or manually defined rules \cite{ali2015analysis}. Considered part of the so-called \emph{Evolutionary Algorithms} \cite{back1996evolutionary}, GP is an optimisation technique inspired by natural evolution. One of the advantages of GP is that, since candidate solutions are internally encoded as trees, they can themselves be seen as decision tree classifiers \cite{safavian1990survey} and expressed as a set of rules.
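
As a minimal sketch, with hypothetical attributes and actions of our own (they do not come from the cited works), an evolved tree whose internal nodes test \texttt{hour} and \texttt{device} and whose leaves are the actions \texttt{allow} and \texttt{deny} unfolds directly into an equivalent rule set:
\begin{itemize}
\item \texttt{IF hour > 19 THEN deny}
\item \texttt{IF hour <= 19 AND device = laptop THEN allow}
\item \texttt{IF hour <= 19 AND device != laptop THEN deny}
\end{itemize}
Each path from the root to a leaf becomes one rule, so the whole tree can be read as a classifier.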

The GP approach also has the advantage of novelty. To the best of our knowledge, there is still no tool that helps CSOs to develop new security rules via GP, even though this method has indeed been applied to classification, as described by Espejo et al. in \cite{espejo2010survey}. In fact, their survey theoretically supports our decision to apply GP to obtain security rules in a BYOD environment.

In our case, the assigned classes or actions to take -- the right-hand part of the security rule -- would be the leaves of the tree, whilst the nodes are the conditions that have to be met to apply the class -- the left-hand part of the rule. Taking this into account, GP can be used to generate these classification trees, optimising an objective function called {\em fitness}. Here the fitness can be defined as the accuracy of a rule or set of rules, which, along with the classification error, is the most widely used metric in classification \cite{witten2016data}. However, since there are other metrics that influence ``how good'' a rule or set of rules is, such as the depth of the created tree, the number of nodes it has, or the number of false positives obtained \cite{back1996evolutionary}, it would be convenient to include them in the definition of the fitness, for instance as penalty terms as sketched below.
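
As a purely illustrative sketch (the penalty terms and weights below are assumptions of ours, not a formulation taken from the cited works), such a composite fitness for an evolved tree $T$ could be written as
\[
\mathit{fitness}(T) = \mathit{accuracy}(T) - w_{d}\,\mathit{depth}(T) - w_{n}\,\mathit{nodes}(T) - w_{fp}\,\mathit{FP}(T),
\]
where $\mathit{FP}(T)$ is the number of false positives and $w_{d}$, $w_{n}$ and $w_{fp}$ are non-negative weights that control the trade-off between classification quality and the size of the tree.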

The encoding of the individual can follow two approaches, named \textit{Pittsburgh} and \textit{Michigan} \cite{freitas2002data}. The Pittsburgh approach uses GP to create an individual tree that models a set of different rules, given that the problem can be seen as a classification one and the model can therefore be a decision tree \cite{safavian1990survey}. The second, the Michigan approach, assigns a single rule to every individual. The rule can thus be expressed as a list of conditions with a fixed class, obtaining just one rule per execution. That means we are not actually using GP in this case, because the generated individual is not a tree but a vector, so we are applying a regular Genetic Algorithm (GA) instead.
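
To make the difference concrete (again with hypothetical attributes and actions of ours, used only for illustration), the two encodings could look as follows:
\begin{itemize}
\item \textit{Pittsburgh}: one individual is a whole tree, e.g.\ \texttt{IF hour > 19 THEN deny ELSE (IF device = laptop THEN allow ELSE deny)}, which already encodes a complete rule set covering every class.
\item \textit{Michigan}: one individual is a flat vector of conditions with a fixed class, e.g.\ \texttt{IF hour > 19 AND device = smartphone THEN deny}, so a single run yields a single rule and several runs are needed to cover all the classes.
\end{itemize}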

Indeed, each approach has its advantages and disadvantages. The Pittsburgh approach makes it possible to directly obtain a set of rules able to classify instances of every existing class, whereas the Michigan approach encodes the solution as a single rule, so that we obtain as many rules as classes are defined. The possibility of having many rules for every class, instead of just one more general rule per class, might better help the CSO to detect specific dangerous situations.
At the same time, obtaining a set of rules as the solution is more computationally expensive, since the evaluations take longer. Lastly, evaluating a single rule without taking into account how it interacts with what the other rules cover \cite{freitas2002data} can lead to massive overlapping, with the consequent loss of efficiency.

\subsection{Data visualisation and interpretation}
