The training error always improves as more variables are added. Therefore we need a metric that balances data fit against model complexity.
**AIC:** For a model with $k$ parameters, the $\text{AIC}$ ("Akaike Information Criterion"):
$ AIC=\underbrace{2k}_{\text{complexity}} \underbrace{-2\log \text{likelihood}}_{\text{data fit}} $
**BIC:** For a model with $k$ parameters and $n$ observations, the $\text{BIC}$ ("Bayesian Information Criterion"):
$ BIC=-2\log \text{likelihood} +k \ln(n) $
The [[Identities of Log-Likelihood|Log-Likelihood]] captures how likely the observed data is, given the estimated model parameters $(\hat \phi_1, \dots, \hat \phi_k)$. We seek to minimize $\text{AIC}$ (and $\text{BIC}$): the negative log-likelihood measures the goodness of fit to the data, and the term in $k$ adds the penalty for model complexity.
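As a minimal sketch of how these criteria are computed, the snippet below evaluates $\text{AIC}$ and $\text{BIC}$ for a Gaussian linear regression fit by least squares on hypothetical simulated data; the closed-form log-likelihood in terms of the residual sum of squares and the parameter count (coefficients plus the error variance) are assumptions of this example, not part of the note above.

```python
import numpy as np

def aic_bic(log_likelihood: float, k: int, n: int) -> tuple[float, float]:
    """AIC and BIC from a model's maximized log-likelihood.

    k: number of estimated parameters, n: number of observations.
    """
    aic = 2 * k - 2 * log_likelihood
    bic = k * np.log(n) - 2 * log_likelihood
    return aic, bic

# Hypothetical example: Gaussian linear regression fit by least squares.
# For a Gaussian model the maximized log-likelihood can be written via
# the residual sum of squares (RSS):
#   log L = -n/2 * (log(2*pi) + log(RSS/n) + 1)
rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=(n, 3))
y = x @ np.array([1.5, 0.0, -2.0]) + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x])      # design matrix with intercept
beta, rss, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = float(rss[0])
k = X.shape[1] + 1                        # coefficients + error variance
log_lik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)

print(aic_bic(log_lik, k, n))
```

Comparing these values across candidate models (e.g. with and without the irrelevant second predictor) illustrates the trade-off: adding a variable raises the log-likelihood slightly but also raises the complexity penalty, and BIC's $k\ln(n)$ term penalizes extra parameters more heavily than AIC's $2k$ once $n > e^2 \approx 7$.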