Training error always improves as more variables are added, so we need a metric that balances data fit against model complexity.

**AIC:** For a model with $k$ parameters, the $\text{AIC}$ ("Akaike Information Criterion") is

$ AIC=\underbrace{2k}_{\text{complexity}} \underbrace{-\,2\log \text{likelihood}}_{\text{data fit}} $

**BIC:** For a model with $k$ parameters and $n$ observations, the $\text{BIC}$ ("Bayesian Information Criterion") is

$ BIC=-2\log \text{likelihood} + k \ln(n) $

The [[Identities of Log-Likelihood|Log-Likelihood]] captures how likely the observed data is, given the estimated model parameters $(\hat \phi_1, \dots, \hat \phi_k)$. We seek to minimize $\text{AIC}$: take the negative log-likelihood (the goodness-of-fit term) and add the penalty for the $k$ parameters.
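
A minimal sketch of computing both criteria for an ordinary-least-squares fit with Gaussian errors; the helper name `gaussian_aic_bic`, the parameter count convention (coefficients plus the error variance), and the simulated data are illustrative assumptions, not from the source:

```python
import numpy as np

def gaussian_aic_bic(y, X):
    """Fit y ~ X by OLS and return (AIC, BIC), assuming Gaussian errors.
    Hypothetical helper for illustration."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS estimate
    resid = y - X @ beta
    sigma2 = np.mean(resid**2)                     # MLE of the error variance
    # Gaussian log-likelihood evaluated at the MLE
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = p + 1                                      # coefficients + error variance
    aic = 2 * k - 2 * log_lik
    bic = k * np.log(n) - 2 * log_lik
    return aic, bic

# Example: compare a smaller and a larger model on the same data
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=(n, 3))
y = 1.5 * x[:, 0] - 2.0 * x[:, 1] + rng.normal(scale=0.5, size=n)

X_small = np.column_stack([np.ones(n), x[:, :2]])   # only the true predictors
X_large = np.column_stack([np.ones(n), x])          # adds an irrelevant predictor
print(gaussian_aic_bic(y, X_small))
print(gaussian_aic_bic(y, X_large))
```

The extra, irrelevant predictor barely improves the likelihood, so both criteria typically favor the smaller model; BIC penalizes the additional parameter more strongly once $\ln(n) > 2$.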