Overview
Definition
- Field of study that gives computers the ability to learn without being explicitly programmed. Arthur Samuel (1959)
- A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. Tom Mitchell (1998)
Type of problem
| | Regression | Classification |
|---|---|---|
| Outcome | Continuous value | Discrete value |
| Example | Housing price | Breast cancer |
| Algorithm | Linear regression | Logistic regression, SVM |
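As a quick illustration of the two problem types, here is a minimal sketch using scikit-learn; the toy housing and tumor-size data below are invented purely for illustration.

```python
# Minimal sketch contrasting regression and classification with scikit-learn.
# All data below is made up for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a continuous value (e.g. a housing price).
X_reg = np.array([[50.0], [80.0], [120.0], [200.0]])   # house size in m^2
y_reg = np.array([150.0, 240.0, 360.0, 600.0])         # price in $1000s
price_model = LinearRegression().fit(X_reg, y_reg)
print(price_model.predict([[100.0]]))                  # continuous output

# Classification: predict a discrete label (e.g. malignant vs. benign tumor).
X_clf = np.array([[1.0], [2.0], [6.0], [8.0]])         # tumor size in cm
y_clf = np.array([0, 0, 1, 1])                         # 0 = benign, 1 = malignant
tumor_model = LogisticRegression().fit(X_clf, y_clf)
print(tumor_model.predict([[5.0]]))                    # discrete output (0 or 1)
```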
Type of model
| | Discriminative model | Generative model |
|---|---|---|
| Goal | Estimate $P(y \mid x)$ directly | Estimate $P(x \mid y)$ then deduce $P(y \mid x)$ |
| What's learned | Decision boundary | Probability distribution of the data |
| Algorithm | Logistic regression, SVM | Naive Bayes, Latent Dirichlet Allocation (LDA) |
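To make the distinction concrete, here is a minimal sketch with scikit-learn and made-up data: logistic regression learns a decision boundary for $P(y \mid x)$ directly, while Gaussian Naive Bayes models $P(x \mid y)$ and the class prior, then applies Bayes' rule.

```python
# Minimal sketch of discriminative vs. generative models (toy data invented here).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X = np.array([[0.5], [1.0], [3.0], [3.5]])
y = np.array([0, 0, 1, 1])

# Discriminative: logistic regression models P(y|x) directly via a decision boundary.
disc = LogisticRegression().fit(X, y)

# Generative: Gaussian Naive Bayes models P(x|y) and the prior P(y),
# then applies Bayes' rule to obtain P(y|x).
gen = GaussianNB().fit(X, y)

print(disc.predict_proba([[2.0]]), gen.predict_proba([[2.0]]))
```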
Notations
Hypothesis: The hypothesis is denoted $h_{\theta}$ and is the model we want to select. For a given input $x^{(i)}$, the model prediction is $h_{\theta}(x^{(i)})$.
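For instance, a linear hypothesis $h_{\theta}(x) = \theta^{T} x$ could be evaluated as in the sketch below; the parameter and input values are made up for illustration.

```python
# Sketch of one possible hypothesis: a linear model h_theta(x) = theta^T x.
# The values below are illustrative only.
import numpy as np

theta = np.array([2.0, -1.0, 0.5])   # model parameters theta
x_i = np.array([1.0, 3.0, 4.0])      # one training example x^(i)
h = theta @ x_i                      # prediction h_theta(x^(i))
print(h)                             # -> 1.0
```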
Loss function: A loss function $L(z, y)$ measures the difference between the value $z$ predicted by the selected model and the real data value $y$. Common loss functions include least-squares error, logistic loss, hinge loss, and cross-entropy loss.
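The losses named above can be written out as in the NumPy sketch below; the label conventions (e.g. $y \in \{-1, +1\}$ for logistic and hinge loss) are one common choice, not the only one.

```python
# Sketch of the loss functions named above.
# z is the model's prediction (a score or probability), y the true value/label.
import numpy as np

def least_squares(z, y):
    return 0.5 * (z - y) ** 2              # regression loss

def logistic_loss(z, y):
    # y in {-1, +1}, z a real-valued score
    return np.log(1 + np.exp(-y * z))

def hinge_loss(z, y):
    # y in {-1, +1}, z a real-valued score (used by SVMs)
    return np.maximum(0, 1 - y * z)

def cross_entropy(z, y):
    # y in {0, 1}, z a predicted probability in (0, 1)
    return -(y * np.log(z) + (1 - y) * np.log(1 - z))
```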
Cost function: The cost function $J$ is used to measure the performance of a model and is defined in terms of the loss function as follows:
\[ J(\theta) = \sum_{i=1}^{m} L(h_{\theta}(x^{(i)}), y^{(i)}) \]
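In code, $J(\theta)$ is simply the chosen loss summed over the $m$ training examples. The sketch below assumes a linear hypothesis and takes the loss function as an argument; the data and parameters are illustrative only.

```python
# Sketch of the cost function J(theta): the chosen loss summed over all m
# training examples, assuming a linear hypothesis h_theta(x) = theta^T x.
import numpy as np

def J(theta, X, y, loss):
    predictions = X @ theta                              # h_theta(x^(i)) for every example
    return sum(loss(z, y_i) for z, y_i in zip(predictions, y))

X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 5.0]])       # m = 3 examples, with a bias feature
y = np.array([5.0, 7.0, 11.0])
theta = np.array([1.0, 2.0])

print(J(theta, X, y, lambda z, y_i: 0.5 * (z - y_i) ** 2))   # least-squares cost -> 0.0
```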
Gradient descent: The update rule that optimizes the model parameters using the learning rate $\alpha$ and the computed gradient of the cost function with respect to each parameter:
\[\theta \gets \theta - \alpha\nabla J(\theta)\]
and more explicitly:
\[ \text{repeat until convergence} \{ \theta_{j} := \theta_{j} - \alpha \frac{\partial}{\partial \theta_{j}} J(\theta) \} \]
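Putting the pieces together, here is a minimal batch gradient descent sketch for linear regression with the least-squares cost; the learning rate, tolerance, and iteration cap are illustrative choices rather than recommended settings.

```python
# Minimal sketch of batch gradient descent for linear regression with the
# least-squares cost J(theta) = 1/2 * sum_i (theta^T x^(i) - y^(i))^2.
import numpy as np

def gradient_descent(X, y, alpha=0.01, tol=1e-8, max_iters=10000):
    theta = np.zeros(X.shape[1])
    for _ in range(max_iters):
        grad = X.T @ (X @ theta - y)                  # gradient of J(theta)
        new_theta = theta - alpha * grad              # theta <- theta - alpha * grad(J)
        if np.linalg.norm(new_theta - theta) < tol:   # "repeat until convergence"
            return new_theta
        theta = new_theta
    return theta

# Usage on tiny made-up data: the exact solution here is theta = [1, 2].
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 5.0]])
y = np.array([5.0, 7.0, 11.0])
print(gradient_descent(X, y))
```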