Introduction to Machine Learning

What is ML?

  • Instead of writing a program by hand, learn the behaviour from given examples.
  • Represent data with features.
  • All models are wrong

Model Learning and Prediction:

  • Tasks can be predictive or non-predictive (e.g. classifying/interpreting existing data).
  • Tomorrow's weather: predictive / interpreting cancer MRIs: not predictive.

Machine learning terms:

  • Start with training data.
  • Data is made of examples or instances.
  • Target class or value.
  • Each example has attributes.
  • The attributes that we use are features.
  • Generate a model = trained algorithm.
  • Evaluate using test data (sketched below).
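
A minimal sketch of these terms in code; the dataset, feature values, and the trivial "majority class" model are all made up for illustration:

```python
# Training data: a list of examples (instances). Each example has
# attribute values used as features and a target class to predict.
examples = [
    {"features": [5.1, 3.5], "target": "A"},
    {"features": [6.2, 2.9], "target": "B"},
    {"features": [5.9, 3.0], "target": "B"},
    {"features": [4.8, 3.4], "target": "A"},
]

# Split into training data (used to generate the model) and
# test data (used only for evaluation).
train_data, test_data = examples[:3], examples[3:]

def train(data):
    """Generate a model = trained algorithm (here just the majority class)."""
    targets = [ex["target"] for ex in data]
    return max(set(targets), key=targets.count)

model = train(train_data)

# Evaluate the model on the held-out test data.
correct = sum(1 for ex in test_data if model == ex["target"])
print(f"test accuracy: {correct / len(test_data):.2f}")
```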

Classification:

  • Given a name, is it male or female?

Regression:

  • The target domain is continuous, i.e. a data point is placed on a scale.
  • Example: given (size, cut, breadth, color, ...) of a gem, what is its value? (See the sketch below.)
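
A minimal regression sketch under assumed numbers: the sizes and prices below are invented, and only one feature (size) is used.

```python
# Predict a continuous target (a gem's value) from one feature (its size).
sizes = [0.5, 1.0, 1.5, 2.0, 2.5]                  # carats (made up)
prices = [600.0, 1200.0, 1900.0, 2500.0, 3200.0]   # EUR (made up)

# Fit y = a*x + b by ordinary least squares (closed form for one feature).
n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices)) \
    / sum((x - mean_x) ** 2 for x in sizes)
b = mean_y - a * mean_x

print(f"predicted value of a 1.8 carat gem: {a * 1.8 + b:.0f} EUR")
```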

Classification vs. regression:

  • For yes/no questions, it is classification.
  • For a continuous value (e.g. the EUR:USD rate), it is regression.

Single and structured prediction:

  • Independent elements (image classification).
  • Elements with intrinsic structure (the words of a sentence; a person's height, age, gender).

Structured prediction:

  • PoS tagging.
  • Items are assigned jointly.
  • The result of one item affects the decisions made about another.

Bayesian learning:

  • Bayes' theorem: for a hypothesis h and data D: P(h | D) = P(D | h) * P(h) / P(D).
  • It combines prior knowledge (the prior) with a model of the observed data (the likelihood); a numeric sketch follows below.
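
A small numeric sketch of Bayes' theorem above; all probabilities are invented for illustration:

```python
# Bayes' theorem: P(h | D) = P(D | h) * P(h) / P(D)
p_h = 0.3            # prior P(h) (assumed)
p_d_given_h = 0.8    # likelihood P(D | h) (assumed)
p_d_given_not_h = 0.2

# P(D) by the law of total probability over h and not-h.
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

p_h_given_d = p_d_given_h * p_h / p_d
print(f"posterior P(h | D) = {p_h_given_d:.3f}")  # ~0.632
```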

Bayes MAP:

  • Most probable hypothesis.
  • hMAP = argmax {h in H} P(h | D).
  • It can happen that all hypotheses are equally likely a priori; then we only care about P(D | h) (maximum likelihood). See the sketch below.
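
A sketch of picking the MAP hypothesis; the hypothesis space, priors, and likelihoods are made up:

```python
# hMAP = argmax_{h in H} P(h | D); since P(D) is the same for every h,
# it is enough to maximise P(D | h) * P(h).
hypotheses = {
    "h1": {"prior": 0.6, "likelihood": 0.3},
    "h2": {"prior": 0.1, "likelihood": 0.9},
    "h3": {"prior": 0.3, "likelihood": 0.4},
}

h_map = max(hypotheses, key=lambda h: hypotheses[h]["likelihood"] * hypotheses[h]["prior"])
print("MAP hypothesis:", h_map)  # h1

# If all hypotheses are equally likely a priori, only P(D | h) matters.
h_ml = max(hypotheses, key=lambda h: hypotheses[h]["likelihood"])
print("maximum-likelihood hypothesis:", h_ml)  # h2
```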

Naïve Bayes classifier:

  • Assume a target function f: X → V.
  • X has examples with attributes.
  • vMAP = argmax_{vj in V} P(a1, a2, ..., an | vj) * P(vj).
  • Assume that the attributes don't depend on each other: P(a1, ..., an | vj) = PROD_i P(ai | vj).
  • This independence assumption is naive: what if features interact?
  • What if no example in D has a certain feature value? Then P(ai | vj) = 0, so P(vj) * PROD_i P(ai | vj) = 0 (see the smoothing sketch below).
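
A minimal naive Bayes sketch over categorical attributes; add-one (Laplace) smoothing is the usual fix for the zero-count problem above. The weather-style data and all names are made up:

```python
from collections import Counter, defaultdict

# Toy training data: each example is (attribute values, class label).
train = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "cool"), "yes"),
    (("sunny", "cool"), "yes"),
]

class_counts = Counter(label for _, label in train)
# attr_counts[class][attribute index][value] = count
attr_counts = defaultdict(lambda: defaultdict(Counter))
for attrs, label in train:
    for i, value in enumerate(attrs):
        attr_counts[label][i][value] += 1

def predict(attrs, alpha=1.0):
    """vMAP = argmax_v P(v) * PROD_i P(a_i | v), with add-one smoothing."""
    best, best_score = None, -1.0
    for label, c in class_counts.items():
        score = c / len(train)                  # P(v)
        for i, value in enumerate(attrs):
            counts = attr_counts[label][i]
            n_values = len(counts) + 1          # rough size of the value set
            score *= (counts[value] + alpha) / (c + alpha * n_values)
        if score > best_score:
            best, best_score = label, score
    return best

print(predict(("rainy", "hot")))  # "hot" never seen with "yes": smoothing avoids 0
```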

Decision trees:

  • Akinator
  • Decisions are made about one attribute at a time.
  • Each path through a decision tree forms a conjunction of attribute tests.
  • The idea is to reduce entropy at each split, i.e. to become more certain (see the information-gain sketch below).
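
A small sketch of entropy and information gain (the entropy reduction achieved by splitting on one attribute); the toy examples are made up:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy H = -sum p * log2(p) over the class distribution."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(examples, attribute):
    """Entropy before the split minus the weighted entropy after it."""
    labels = [label for _, label in examples]
    before = entropy(labels)
    after = 0.0
    for value in {attrs[attribute] for attrs, _ in examples}:
        subset = [label for attrs, label in examples if attrs[attribute] == value]
        after += len(subset) / len(examples) * entropy(subset)
    return before - after

# Made-up examples: ({attribute: value}, class)
data = [
    ({"outlook": "sunny"}, "no"),
    ({"outlook": "sunny"}, "no"),
    ({"outlook": "rainy"}, "yes"),
    ({"outlook": "rainy"}, "yes"),
]
print(information_gain(data, "outlook"))  # 1.0: this split removes all uncertainty
```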

Representations:

  • How the real world is mapped to data largely determines how the algorithm will behave.

Models:

  • Most ML algorithms are parameterised.
  • MDL (minimum description length).

Evaluation:

  • How good is the model?
  • Accuracy = |correctly classified examples| / |D|.
  • Precision: of the things we said were class A, how many were actually class A?
    • % of the things found that were correct.
  • Recall: of the things in class A, how many did we find?
    • % of the things we should have found that we did find.
  • To combine precision and recall into one number: take their harmonic mean (see the sketch after this list).
  • F-score: F = 2 (P * R) / (P + R).
  • There is the variant F_beta, which weights precision and recall differently.
  • ROC curve. Two axes: true positive rate vs. false positive rate.
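
A sketch of computing precision, recall, and the F-score from predicted and true labels; the labels below are made up:

```python
def precision_recall_f1(true_labels, predicted_labels, positive="A"):
    """Precision, recall, and F-score for one class, from TP/FP/FN counts."""
    tp = sum(1 for t, p in zip(true_labels, predicted_labels)
             if p == positive and t == positive)
    fp = sum(1 for t, p in zip(true_labels, predicted_labels)
             if p == positive and t != positive)
    fn = sum(1 for t, p in zip(true_labels, predicted_labels)
             if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels for illustration.
true = ["A", "A", "B", "A", "B"]
pred = ["A", "B", "B", "A", "A"]
print(precision_recall_f1(true, pred))  # (0.666..., 0.666..., 0.666...)
```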

Cross-validation:

  • Split data into k folds; train on k-1 folds and evaluate on the held-out fold, rotating so each fold is used as test data once (see the sketch below).
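
A minimal k-fold splitting sketch; here folds are taken by striding through the data, and real implementations usually shuffle first:

```python
def k_fold_splits(examples, k=5):
    """Yield (train, test) splits; each fold is held out exactly once."""
    folds = [examples[i::k] for i in range(k)]
    for i in range(k):
        test_fold = folds[i]
        train_folds = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        yield train_folds, test_fold

data = list(range(10))  # stand-in for real examples
for train_part, test_part in k_fold_splits(data, k=5):
    print(test_part)  # every example appears in exactly one test fold
```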

Overfitting:

  • Regularization.
  • Keep some data out for validation.
  • Don't make strong hypotheses from small datasets.

Test/train sanitation:

  • With test data: don't even look at it (until the final evaluation).
  • Scalar = monotonicity or order (integers, reals).
  • Errors can be made on both sides: hinge loss.
  • Mimics human biases.