We can estimate the parameters of an [[Autoregressive Model|autoregressive]] $\mathrm{AR}(p)$ model through three main approaches:
1. [[Linear Regression with LSE|Ordinary Least Squares]] (OLS)
2. [[Maximum Likelihood Estimation]] (MLE)
3. [[Yule-Walker Equations]]
For this, we assume an $\mathrm{AR}(p)$ model:
$ X_t=c+\phi_1 X_{t-1}+\dots+\phi_pX_{t-p}+W_t $
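As a quick worked example, here is a minimal NumPy sketch that simulates such a process (the coefficients $c$, $\phi_1$, $\phi_2$, the noise scale and the series length are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical AR(2): X_t = c + phi_1 * X_{t-1} + phi_2 * X_{t-2} + W_t
c_true, phi_true = 0.5, np.array([0.6, -0.3])
sigma_true, T = 1.0, 500

X = np.zeros(T)
W = rng.normal(0.0, sigma_true, size=T)
for t in range(2, T):
    X[t] = c_true + phi_true[0] * X[t - 1] + phi_true[1] * X[t - 2] + W[t]
```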
## Ordinary Least Squares
To estimate $\phi$ we can treat it as a regular [[Multivariate Linear Regression|Linear Regression]] problem and solve via least squares estimation. This becomes apparent when we re-express the model in matrix form:
$
\underbrace{\begin{bmatrix} X_{p+1} \\ X_{p+2} \\ \vdots \\X_T \end{bmatrix}}_{y} =
\underbrace{\begin{bmatrix}
1 & X_{p} & X_{p-1}& \cdots &X_{1}\\
1 & X_{p+1} & X_{p}& \cdots &X_{2}\\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & X_{T-1} & X_{T-2} & \cdots & X_{T-p}
\end{bmatrix}}_{\mathbb X}
\underbrace{\begin{bmatrix} c \\ \phi_1 \\ \phi_2 \\ \vdots \\ \phi_p\end{bmatrix}}_{\beta}+
\underbrace{\begin{bmatrix} W_{p+1} \\ W_{p+2} \\ \vdots \\ W_T\end{bmatrix}}_{\epsilon}
$
Following linear regression notation, where:
- $y$ as the vector of observations to be predicted, $X_{p+1}, \dots, X_T$
- $\mathbb X$ as the design matrix containing lagged values
- $\beta$ as the coefficient vector $(c, \phi_1, \dots, \phi_p)^T$
- $\epsilon$ as the noise terms
**OLS Estimator:**
We have shown how [[Linear Regression with LSE]] is solved. The estimate $\hat \beta$ is obtained from the following closed-form solution:
$ \hat \beta_{\text{OLS}} = (\mathbb X^T\mathbb X)^{-1} \mathbb X^T y $
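A minimal NumPy sketch of this closed-form solution, assuming the simulated series `X` from the example above and order $p=2$:

```python
import numpy as np

p = 2
y = X[p:]                                                        # responses X_{p+1}, ..., X_T
lags = np.column_stack([X[p - k:-k] for k in range(1, p + 1)])   # lag-k columns
Xmat = np.column_stack([np.ones(len(y)), lags])                  # prepend intercept column

# Closed-form OLS; solving the normal equations is preferable to forming the inverse
beta_hat = np.linalg.solve(Xmat.T @ Xmat, Xmat.T @ y)
c_hat, phi_hat = beta_hat[0], beta_hat[1:]
```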
> [!note]
> However there are two [[Time Series as Stochastic Process|Time Series]] specific effects that need to be considered.
**Correlation Between Independent Variables:**
Consecutive terms of $X_t$ are likely to be autocorrelated, which makes the columns of the design matrix $\mathbb X$ highly collinear. This increases the variance of the OLS estimates.
Mathematically this can be seen from:
$ \mathrm{Var}(\hat \beta) = \sigma^2 (\mathbb X^T \mathbb X)^{-1} $
When the columns of the design matrix are "similar", the product $\mathbb X^T \mathbb X$ becomes nearly singular, which inflates the entries of its inverse $(\mathbb X^T \mathbb X)^{-1}$ and thus the variance of $\hat \beta$.
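This can be checked numerically; a small sketch (reusing `y`, `Xmat` and `beta_hat` from the OLS sketch above) that computes the residual variance estimate, the implied coefficient covariance, and the condition number of $\mathbb X^T \mathbb X$:

```python
# Residual variance estimate and Var(beta_hat) = sigma^2 (X'X)^{-1}
resid = y - Xmat @ beta_hat
sigma2_hat = resid @ resid / (len(y) - Xmat.shape[1])
var_beta = sigma2_hat * np.linalg.inv(Xmat.T @ Xmat)

print(np.linalg.cond(Xmat.T @ Xmat))   # large values signal near-collinear lag columns
print(np.sqrt(np.diag(var_beta)))      # standard errors of [c_hat, phi_1, ..., phi_p]
```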
**Correlation Between Error Terms:**
For a time series we expect autocorrelation by design. However, once a time series model (e.g. $\mathrm{AR}$) is well specified and fitted, the remaining error terms $W_t$ should be uncorrelated.
If residual autocorrelation persists:
- OLS can still be used to estimate coefficients, but
- The [[Linear Regression with LSE#^540ccb|classical variance formula]] for $\hat \beta$, which assumes that $\mathrm{Var}(W)=\sigma^2 \mathbf I$ is no longer valid.
> [!note]
> In reality the error covariance is $\sigma^2 \Omega$ (with off-diagonal elements) capturing the autocorrelation. Therefore, to get correct standard errors and confidence intervals, we need heteroskedasticity and autocorrelation consistent ("HAC") estimators.
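One way to obtain such standard errors in practice is Newey-West / HAC covariance estimation; a sketch using `statsmodels` (assuming `y` and `Xmat` from the OLS sketch above, with an arbitrarily chosen lag truncation):

```python
import statsmodels.api as sm

# Same OLS coefficients, but HAC (Newey-West) standard errors that stay valid
# when the residuals are autocorrelated and/or heteroskedastic
ols_hac = sm.OLS(y, Xmat).fit(cov_type="HAC", cov_kwds={"maxlags": 4})
print(ols_hac.params)  # [c_hat, phi_1, ..., phi_p]
print(ols_hac.bse)     # HAC standard errors
```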
## MLE
We can also apply [[Maximum Likelihood Estimation|MLE]] for an $\mathrm{AR}(p)$ model. Under the assumption that error terms are [[Independence and Identical Distribution|i.i.d.]] [[Gaussian Distribution|Gaussian]], both OLS and MLE will yield the same result.
$ W_t \stackrel{iid}{\sim} \mathcal N(0, \sigma^2)$
In general, however, OLS only requires zero-mean, finite-variance errors, whereas MLE needs a full distributional assumption for the noise.
**Conditional Approach:**
The only AR-specific issue when applying MLE is the treatment of the first $p$ observations. In the "conditional approach", we treat them as fixed, instead of modeling their likelihood.
This is a simplification that:
- Does not make a big difference in large data sets
- Avoids complexity, as it is unclear how to model observations whose lagged values do not exist (i.e. how to model the first $p$ terms of an autoregressive series).
$ f(X_1, \dots, X_T \big \vert \phi, \sigma^2) = \underbrace{f(X_1, \dots,X_p)}_{\text{initial conditions}} * \prod_{t=p+1}^T f(X_t \big \vert X_{t-1}, \dots,X_{t-p}; \phi,\sigma^2) $
The likelihood function $f(X_1, \dots, X_T \vert \phi, \sigma^2)$ contains:
- Initial conditions
- Product of individual likelihoods
**Initial Conditions:**
Since the initial conditions are treated as fixed, they do not interact with the parameters $\phi, \sigma^2$. Therefore they do not play a role in the maximization of the likelihood function.
**Individual Likelihoods:**
The $X_t$ terms exhibit autocorrelation by design. However, once we condition on the lagged values and a chosen coefficient vector $\phi$, only the randomness of the noise term remains.
$ X_t \,\big \vert (X_{t-1}, \dots, X_{t-p}) \sim \mathcal N(\underbrace{c + \phi_1 X_{t-1}+ \cdots + \phi_p X_{t-p}}_{\text{linear predictor}}, \sigma^2) $
Hence, each likelihood term follows the distribution of the noise term $W_t$, shifted by the linear predictor of the $\mathrm{AR}(p)$ model.
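In code, a single conditional likelihood term is just a Gaussian density evaluated at the observation, centred on the linear predictor; a small sketch reusing `X`, `p` and the estimates from the sketches above:

```python
from scipy.stats import norm

# Conditional density of X_t given its p lags: Gaussian centred at the linear predictor
t = 10                                        # any index with t >= p
mean_t = c_hat + phi_hat @ X[t - p:t][::-1]   # phi_1 * X_{t-1} + ... + phi_p * X_{t-p}
loglik_t = norm.logpdf(X[t], loc=mean_t, scale=np.sqrt(sigma2_hat))
```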
**Joint Log-Likelihood:**
$ LL(\phi, \sigma^2) =\sum_{t=p+1}^T \ln \left (\frac{1}{\sqrt{2 \pi \sigma^2}} *\exp \left(-\frac{(X_t - c - \phi_1 X_{t-1}-\cdots-\phi_pX_{t-p})^2}{2 \sigma^2}\right)\right) $
Maximizing the log-likelihood $LL$ w.r.t. $\phi$ and $\sigma^2$ is equivalent to minimizing the sum of squared residuals, i.e. the OLS objective.
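To see this equivalence numerically, here is a sketch that maximizes the conditional log-likelihood with `scipy.optimize.minimize` (reusing `X`, `p` and `phi_hat` from the sketches above); up to optimizer tolerance, the recovered coefficients should match the OLS estimates:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_cond_loglik(params, X, p):
    """Negative conditional log-likelihood of a Gaussian AR(p)."""
    c, phi, log_sigma = params[0], params[1:p + 1], params[-1]
    sigma = np.exp(log_sigma)                 # optimize log(sigma) so sigma stays positive
    y = X[p:]
    lags = np.column_stack([X[p - k:-k] for k in range(1, p + 1)])
    mean = c + lags @ phi                     # linear predictor for each t = p+1, ..., T
    return -norm.logpdf(y, loc=mean, scale=sigma).sum()

x0 = np.zeros(p + 2)                          # start at [c, phi_1, ..., phi_p, log_sigma] = 0
res = minimize(neg_cond_loglik, x0, args=(X, p), method="BFGS")
c_mle, phi_mle = res.x[0], res.x[1:p + 1]
print(phi_mle, phi_hat)                       # conditional MLE vs. OLS coefficients
```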