Overview

Definition

  • Field of study that gives computers the ability to learn without being explicitly programmed. Arthur Samuel (1959)
  • A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. Tom Mitchell (1998)

Type of problem

|           | Regression        | Classification           |
|-----------|-------------------|--------------------------|
| Outcome   | Continuous value  | Discrete value           |
| Example   | Housing price     | Breast cancer            |
| Algorithm | Linear regression | Logistic regression, SVM |
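
The contrast is easiest to see in code. Below is a minimal sketch using scikit-learn on synthetic toy data; the feature choices (living area, two generic features) are illustrative assumptions, not part of the table above:

```python
# Regression vs. classification, following the algorithms in the table.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# Regression: predict a continuous value (e.g., housing price).
X_reg = rng.uniform(50, 200, size=(100, 1))           # living area (toy)
y_reg = 3000 * X_reg[:, 0] + rng.normal(0, 1e4, 100)  # price (toy)
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[120.0]]))   # continuous output

# Classification: predict a discrete label (e.g., benign vs. malignant).
X_clf = rng.normal(size=(100, 2))                     # two generic features
y_clf = (X_clf[:, 0] + X_clf[:, 1] > 0).astype(int)   # binary label
clf = LogisticRegression().fit(X_clf, y_clf)
print(clf.predict([[0.5, -0.1]]))  # discrete output: 0 or 1
```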

Type of model

|                | Discriminative model            | Generative model                               |
|----------------|---------------------------------|------------------------------------------------|
| Goal           | Estimate $P(y\mid x)$ directly  | Estimate $P(x\mid y)$ then deduce $P(y\mid x)$ |
| What's learned | Decision boundary               | Probability distribution of the data           |
| Algorithm      | Logistic regression, SVM        | Naive Bayes, Latent Dirichlet Allocation (LDA) |
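
By Bayes' rule, the generative approach recovers the posterior from its estimates of $P(x|y)$ and the prior $P(y)$:

\[ P(y|x) = \frac{P(x|y)\,P(y)}{P(x)} \]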

Notations

  • Hypothesis: The hypothesis is denoted $h_{\theta}$ and is the model we want to select. For a given input $x^{(i)}$, the model prediction is $h_{\theta}(x^{(i)})$.

  • Loss function: A function $L(z, y)$ that measures the difference between the value $z$ predicted by the selected model and the real data value $y$. Common loss functions include least squared error, logistic loss, hinge loss, and cross-entropy loss; standard forms are shown below.
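
In one common convention (with $y \in \{-1, 1\}$ for the logistic and hinge losses, and $y \in \{0, 1\}$ for the cross-entropy loss), these four losses are, respectively:

\[ \frac{1}{2}(z - y)^{2} \qquad \log(1 + e^{-yz}) \qquad \max(0,\, 1 - yz) \qquad -\left[ y \log(z) + (1 - y) \log(1 - z) \right] \]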

  • Cost function: The cost function $J$ is used to measure the performance of a model, and is defined in terms of the loss function as follows:

\[ J(\theta) = \sum_{i=1}^{m} L(h_{\theta}(x^{(i)}), y^{(i)}) \]

  • Gradient descent: The update rule that optimizes the model parameters using a chosen learning rate $\alpha$ and the gradient of the cost function with respect to each parameter:

\[\theta \gets \theta - \alpha\nabla J(\theta)\]

and more explicitly:

\[ \text{repeat until convergence} \{ \theta_{j} := \theta_{j} - \alpha \frac{\partial}{\partial \theta_{j}} J(\theta) \} \]
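
Putting the notations together, here is a minimal sketch of batch gradient descent for linear regression with the least squared error loss; the data is synthetic and all names are illustrative:

```python
import numpy as np

def cost(theta, X, y):
    """J(theta): sum of per-example least squared error losses."""
    residuals = X @ theta - y
    return 0.5 * np.sum(residuals ** 2)

def gradient(theta, X, y):
    """Gradient of J(theta) with respect to theta."""
    return X.T @ (X @ theta - y)

def gradient_descent(X, y, alpha=0.01, tol=1e-8, max_iters=10_000):
    theta = np.zeros(X.shape[1])
    for _ in range(max_iters):
        step = alpha * gradient(theta, X, y)
        theta -= step                   # theta := theta - alpha * grad J(theta)
        if np.linalg.norm(step) < tol:  # "repeat until convergence"
            break
    return theta

# Toy data: y = 1 + 2x plus noise, with a bias column prepended to X.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(0, 0.1, size=50)

theta = gradient_descent(X, y)
print(theta, cost(theta, X, y))  # theta is approximately [1.0, 2.0]
```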