Regularization
Overfitting
If we have too many features, the learned hypothesis may fit the training set very well but fail to generalize to new examples.
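A quick numeric sketch of this failure mode, with made-up synthetic data (the degrees and sample sizes are illustrative): a degree-9 polynomial has one parameter per training point, so it can interpolate the noise.

```python
# A sketch of overfitting: the degree-9 fit drives training error to
# near zero but does much worse on fresh test points than the degree-3 fit.
import numpy as np

rng = np.random.default_rng(0)
x_tr = np.linspace(0, 1, 10)
y_tr = np.sin(2 * np.pi * x_tr) + rng.normal(0, 0.2, 10)   # 10 noisy samples
x_te = rng.uniform(0, 1, 200)
y_te = np.sin(2 * np.pi * x_te) + rng.normal(0, 0.2, 200)  # fresh examples

for deg in (3, 9):  # a modest model vs. one with too many parameters
    c = np.polyfit(x_tr, y_tr, deg)
    tr = np.mean((np.polyval(c, x_tr) - y_tr) ** 2)
    te = np.mean((np.polyval(c, x_te) - y_te) ** 2)
    print(f"deg {deg}: train MSE {tr:.3f}, test MSE {te:.3f}")
```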
How to address?
Options:
- Reduce the number of features
  - Manually select which features to keep
  - Use a model selection algorithm
- Regularization
  - Keep all the features, but reduce the magnitude/values of the parameters $\theta_{j}$ (see the sketch after this list).
  - Works well when we have a lot of features, each of which contributes a bit to predicting $y$.
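As a concrete illustration of the second option, here is a minimal sketch of L2-regularized linear regression trained by gradient descent, assuming NumPy and made-up synthetic data; the names (`cost`, `step`, `lam`, `lr`) are illustrative, with `lam` playing the role of the regularization strength $\lambda$.

```python
# A sketch of L2-regularized (ridge) linear regression via gradient
# descent. The penalty keeps all parameters but shrinks their
# magnitudes; by convention the intercept theta_0 is not penalized.
import numpy as np

def cost(theta, X, y, lam):
    m = len(y)
    r = X @ theta - y
    return (r @ r + lam * theta[1:] @ theta[1:]) / (2 * m)

def step(theta, X, y, lam, lr=0.1):
    m = len(y)
    grad = X.T @ (X @ theta - y) / m       # gradient of the squared error
    grad[1:] += (lam / m) * theta[1:]      # gradient of the L2 penalty
    return theta - lr * grad

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 5))])  # leading 1s = intercept
y = X @ np.array([1.0, 4.0, -3.0, 0.0, 0.0, 0.0]) + rng.normal(0, 0.3, 50)

theta = np.zeros(6)
for _ in range(500):
    theta = step(theta, X, y, lam=10.0)
print(np.round(theta, 2), f"cost={cost(theta, X, y, 10.0):.3f}")
```

Compared with an unregularized fit (`lam=0.0`), the learned $\theta_{j}$ come out smaller in magnitude, which is exactly the "reduce magnitude/values" effect described above.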
Regularization types
| LASSO | Ridge | Elastic Net |
| --- | --- | --- |
| Shrinks coefficients to 0; good for variable selection | Makes coefficients smaller | Trade-off between variable selection and small coefficients |
| $\ldots + \lambda \lVert \theta \rVert_{1}$ | $\ldots + \lambda \lVert \theta \rVert_{2}^{2}$ | $\ldots + \lambda \left[ (1-\alpha) \lVert \theta \rVert_{1} + \alpha \lVert \theta \rVert_{2}^{2} \right]$ |
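The behavioral difference is easy to see empirically. Below is a minimal sketch using scikit-learn (an assumption; the notes name no library). Note that scikit-learn's `alpha` parameter plays the role of $\lambda$ above, and `ElasticNet`'s `l1_ratio` uses its own mixing convention (the fraction of L1 penalty) rather than the $\alpha$ in the table.

```python
# A sketch comparing the three penalties on synthetic data where only
# 3 of 10 features matter. Data and hyperparameters are illustrative.
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
theta_true = np.array([3.0, -2.0, 1.5] + [0.0] * 7)
y = X @ theta_true + rng.normal(0, 0.5, 100)

for model in (Lasso(alpha=0.1), Ridge(alpha=0.1),
              ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    zeros = int(np.sum(model.coef_ == 0))
    print(f"{type(model).__name__:>10}: {zeros} coefficients driven exactly to 0")
```

Typically LASSO zeroes out most of the seven uninformative coefficients, Ridge zeroes none (it only shrinks them), and Elastic Net lands in between, matching the first row of the table.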