What is ML?
- Instead of writing a program by hand, we learn one from examples.
- Represent data with features.
- All models are wrong, but some are useful.
Model Learning and Prediction:
- Tasks are predictive or non-predictive (the latter is e.g. classification of something already present).
- Tomorrow's weather is predictive; interpreting a cancer MRI is not predictive (the condition is already there).
Machine learning terms:
- Start with training data.
- Data is made of examples or instances.
- Target class or value.
- Each example has attributes.
- The attributes that we use are features.
- Generate a model = trained algorithm.
- Evaluate using test data.
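A minimal sketch mapping these terms onto code (scikit-learn and the Iris data are illustrative choices, not from the lecture):

    from sklearn.datasets import load_iris
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)   # examples: rows of X (features); y: target class
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    model = DecisionTreeClassifier().fit(X_train, y_train)   # model = trained algorithm
    print(accuracy_score(y_test, model.predict(X_test)))     # evaluate on test data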
Classification:
- Given a name, is it male or female?
Regression:
- The target domain is continuous, i.e. placing a data point on a scale.
- Example: given the (size, cut, breadth, color, ...) of a gem, what is its value?
Classification vs. regression:
- If the answer is a discrete label (a yes/no question), it is classification.
- If it is a continuous value (the EUR:USD rate), it is regression.
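A sketch of the contrast, with made-up gem data (all numbers invented for illustration):

    from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

    X = [[1.0, 3], [2.0, 1], [0.5, 2], [3.0, 3]]   # e.g. (size, cut) of four gems

    clf = DecisionTreeClassifier().fit(X, ["cheap", "dear", "cheap", "dear"])  # discrete target
    reg = DecisionTreeRegressor().fit(X, [120.0, 950.0, 80.0, 2300.0])         # continuous target

    print(clf.predict([[2.5, 3]]))   # a class label
    print(reg.predict([[2.5, 3]]))   # a value on a scale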
Single and structured prediction:
- Single prediction: independent elements (image classification).
- Structured prediction: outputs with intrinsic structure (the words of a sentence; a person's height, age, gender are related).
Structured prediction:
- PoS tagging.
- Items are assigned jointly.
- The result of one item affects the decisions made about another.
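A toy joint tagger in the Viterbi style; every probability table here is invented for illustration:

    import math

    tags = ["N", "V"]
    trans = {("<s>", "N"): 0.7, ("<s>", "V"): 0.3,   # P(tag | previous tag)
             ("N", "N"): 0.3, ("N", "V"): 0.7,
             ("V", "N"): 0.8, ("V", "V"): 0.2}
    emit = {("fish", "N"): 0.6, ("fish", "V"): 0.4,  # P(word | tag)
            ("swim", "N"): 0.1, ("swim", "V"): 0.9}

    def viterbi(words):
        # best[t] = (log-prob of the best path ending in tag t, that path)
        best = {t: (math.log(trans[("<s>", t)] * emit[(words[0], t)]), [t]) for t in tags}
        for w in words[1:]:
            best = {t: max((lp + math.log(trans[(prev, t)] * emit[(w, t)]), path + [t])
                           for prev, (lp, path) in best.items())
                    for t in tags}
        return max(best.values())[1]

    print(viterbi(["fish", "swim"]))  # the tag chosen for "fish" depends on what follows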
Bayesian learning:
- Bayes' theorem: for a hypothesis h and data D: P(h | D) = P(D | h) * P(h) / P(D).
- Combines prior knowledge about hypotheses with a model of the observed data.
Bayes MAP:
- Most probable hypothesis.
- hMAP = argmax {h in H} P(h | D).
- It can happen that all hypotheses are equally likely a priori; then the prior drops out and we only care about P(D | h) (maximum likelihood).
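A toy numeric example (the hypothesis space and priors are invented): which coin bias best explains 8 heads in 10 flips?

    hypotheses = {0.5: 0.6, 0.7: 0.3, 0.9: 0.1}   # h -> prior P(h)

    def likelihood(h, heads=8, flips=10):          # P(D | h); the binomial coefficient
        return h**heads * (1 - h)**(flips - heads) # cancels in the argmax

    h_map = max(hypotheses, key=lambda h: likelihood(h) * hypotheses[h])  # argmax P(D|h)P(h)
    h_ml  = max(hypotheses, key=likelihood)        # uniform prior: only P(D | h) matters
    print(h_map, h_ml)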
Naïve Bayes classifier:
- Assume a target function f: X → V.
- X has examples with attributes.
- vMAP = argmax {vj in V} P(a1, ..., an | vj) * P(vj).
- Assume that the attributes don't depend on each other: P(a1, ..., an | vj) = PROD_i P(ai | vj).
- Independence assumption: all features are assumed independent. What if features actually work together?
- What if nothing in D has a certain feature value? Then P(ai | vj) = 0, so P(vj) * PROD P(ai | vj) = 0; the usual fix is smoothing (sketched below).
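A minimal hand-rolled naive Bayes with add-one (Laplace) smoothing, which avoids the zero-probability problem; the toy data is invented:

    from collections import Counter, defaultdict

    train = [("sunny hot", "beach"), ("sunny mild", "beach"),
             ("rainy cold", "home"), ("rainy mild", "home")]

    class_count = Counter(v for _, v in train)
    word_count = defaultdict(Counter)          # word_count[v][a] = count of attribute a in class v
    for text, v in train:
        word_count[v].update(text.split())
    vocab = {w for text, _ in train for w in text.split()}

    def predict(text):
        def score(v):
            p = class_count[v] / len(train)    # P(vj)
            total = sum(word_count[v].values())
            for a in text.split():             # product of smoothed P(ai | vj)
                p *= (word_count[v][a] + 1) / (total + len(vocab))
            return p
        return max(class_count, key=score)

    print(predict("sunny cold"))  # "cold" never co-occurs with "beach"; smoothing saves us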
Decision trees:
- Analogy: the Akinator guessing game.
- Decisions are made about one attribute at a time.
- Each path through a decision tree forms a conjunction of attribute tests.
- The idea is to reduce entropy with every question, so we become more certain (see the sketch below).
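The entropy computation behind a split choice, as a sketch (toy labels):

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    labels = ["yes", "yes", "no", "no"]
    left, right = ["yes", "yes"], ["no", "no"]   # a candidate attribute test splits the data

    gain = entropy(labels) - (len(left) / len(labels) * entropy(left)
                              + len(right) / len(labels) * entropy(right))
    print(gain)   # 1.0 bit: this split removes all uncertainty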
Representations:
- Real world → data: the mapping we choose largely determines how the algorithm will behave.
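For example, the same raw value can be represented in ways that imply different things to the algorithm (both encodings are illustrative):

    color = "blue"

    as_category = {"red": 0, "green": 1, "blue": 2}[color]             # imposes a false order
    as_one_hot  = [int(color == c) for c in ("red", "green", "blue")]  # implies no order
    print(as_category, as_one_hot)   # 2 [0, 0, 1]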
Models:
- Most ML algorithms are parameterised.
- MDL (Minimum Description Length): prefer the model that describes the data most compactly.
Evaluation:
- How good is the model?
- Accuracy = |correctly classified examples| / |D|, measured on held-out test data.
- Precision: of the things we said were class A, how many really were class A? The % of things found that were correct: P = TP / (TP + FP).
- Recall: of the things actually in class A, how many did we find? The % of things we should have found that we did find: R = TP / (TP + FN).
- Neither alone is enough (each can be trivially maximised at the other's expense); solution: the harmonic mean.
- F-score: F = 2 (P * R) / (P + R).
- There is the variant F_beta, which weights recall beta times as much as precision.
- ROC curve. Two axes: true positive rate vs. false positive rate.
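Computing these metrics from raw counts (the counts are made up):

    tp, fp, fn = 8, 2, 4          # true positives, false positives, false negatives

    precision = tp / (tp + fp)    # of what we called class A, how much really was
    recall    = tp / (tp + fn)    # of the real class A, how much we found
    f1 = 2 * precision * recall / (precision + recall)

    beta = 2                      # beta > 1 weights recall more heavily
    f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    print(precision, recall, f1, f_beta)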
Cross-validation:
- Split the data into k folds; train on k-1 folds and test on the held-out one, rotating through all folds.
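A minimal k-fold sketch (assumes scikit-learn; the model and data choices are arbitrary):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=5)  # 5 folds
    print(scores.mean(), scores.std())  # accuracy averaged over the held-out folds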
Overfitting:
- Regularization.
- Keep some data out as a validation set.
- Don't make strong hypotheses from small datasets.
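A small illustration of regularization taming an overfit-prone setup (the data and the alpha value are arbitrary choices):

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 20))            # few examples, many features: easy to overfit
    y = X[:, 0] + 0.1 * rng.normal(size=10)  # only the first feature actually matters

    plain = LinearRegression().fit(X, y)
    ridge = Ridge(alpha=1.0).fit(X, y)       # penalises large weights
    print(abs(plain.coef_).sum(), abs(ridge.coef_).sum())  # ridge weights are smaller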
Test/train sanitation:
- With test data: don't even look at it during development.
- Scalar features have monotonicity or an order (integers, reals).
- Errors can be made on both sides of the decision boundary; hinge loss is one way to penalise them.
- Models trained on human data can mimic human biases.