We can estimate the parameters of an [[Autoregressive Model|autoregressive]] $\mathrm{AR}(p)$ model through three main approaches:
1. [[Linear Regression with LSE|Ordinary Least Squares]] (OLS)
2. [[Maximum Likelihood Estimation]] (MLE)
3. [[Yule-Walker Equations]]
For this, we assume an $\mathrm{AR}(p)$ model:
$ X_t=c+\phi_1 X_{t-1}+\dots+\phi_pX_{t-p}+W_t $
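As a quick worked example, here is a minimal NumPy sketch that simulates such a process (the coefficients $c$, $\phi_1$, $\phi_2$, the noise scale and the series length are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical AR(2): X_t = c + phi_1 * X_{t-1} + phi_2 * X_{t-2} + W_t
c_true, phi_true = 0.5, np.array([0.6, -0.3])
sigma_true, T = 1.0, 500

X = np.zeros(T)
W = rng.normal(0.0, sigma_true, size=T)
for t in range(2, T):
    X[t] = c_true + phi_true[0] * X[t - 1] + phi_true[1] * X[t - 2] + W[t]
```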
## Ordinary Least Squares
To estimate $\phi$ we can treat it as a regular [[Multivariate Linear Regression|Linear Regression]] problem and solve via least squares estimation. This becomes apparent when we re-express the model in matrix form:
$
\underbrace{\begin{bmatrix} X_{p+1} \\ X_{p+2} \\ \vdots \\X_T \end{bmatrix}}_{y} =
\underbrace{\begin{bmatrix}
1 & X_{p} & X_{p-1}& \cdots &X_{1}\\
1 & X_{p+1} & X_{p}& \cdots &X_{2}\\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & X_{T-1} & X_{T-2} & \cdots & X_{T-p}
\end{bmatrix}}_{\mathbb X}
\underbrace{\begin{bmatrix} c \\ \phi_1 \\ \phi_2 \\ \vdots \\ \phi_p\end{bmatrix}}_{\beta}+
\underbrace{\begin{bmatrix} W_{p+1} \\ W_{p+2} \\ \vdots \\ W_T\end{bmatrix}}_{\epsilon}
$
Following linear regression notation, where:
- $y$ as the vector of observations to be predicted, $X_{p+1}, \dots, X_T$
- $\mathbb X$ as the design matrix containing lagged values
- $\beta$ as the coefficient vector $(c, \phi_1, \dots, \phi_p)^T$
- $\epsilon$ as the noise terms
**OLS Estimator:**
We have shown how [[Linear Regression with LSE]] is solved. The estimate $\hat \beta$ is obtained from the following closed-form solution:
$ \hat \beta_{\text{OLS}} = (\mathbb X^T\mathbb X)^{-1} \mathbb X^T y $
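A minimal NumPy sketch of this closed-form solution, assuming the simulated series `X` from the example above and order $p=2$:

```python
import numpy as np

p = 2
y = X[p:]                                                        # responses X_{p+1}, ..., X_T
lags = np.column_stack([X[p - k:-k] for k in range(1, p + 1)])   # lag-k columns
Xmat = np.column_stack([np.ones(len(y)), lags])                  # prepend intercept column

# Closed-form OLS; solving the normal equations is preferable to forming the inverse
beta_hat = np.linalg.solve(Xmat.T @ Xmat, Xmat.T @ y)
c_hat, phi_hat = beta_hat[0], beta_hat[1:]
```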
> [!note]
> However there are two [[Time Series as Stochastic Process|Time Series]] specific effects that need to be considered.
**Correlation Between Independent Variables:**
Consecutive terms of $X_t$ are likely to be autocorrelated, which makes the columns of the design matrix $\mathbb X$ highly collinear. This increases the variance of the OLS estimates.
Mathematically this can be seen from:
$ \mathrm{Var}(\hat \beta) = \sigma^2 (\mathbb X^T \mathbb X)^{-1} $
When the columns of the design matrix are "similar", the product $\mathbb X^T \mathbb X$ becomes nearly singular, which inflates the entries of its inverse $(\mathbb X^T \mathbb X)^{-1}$ and thus the variance of $\hat \beta$.
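This can be checked numerically; a small sketch (reusing `y`, `Xmat` and `beta_hat` from the OLS sketch above) that computes the residual variance estimate, the implied coefficient covariance, and the condition number of $\mathbb X^T \mathbb X$:

```python
# Residual variance estimate and Var(beta_hat) = sigma^2 (X'X)^{-1}
resid = y - Xmat @ beta_hat
sigma2_hat = resid @ resid / (len(y) - Xmat.shape[1])
var_beta = sigma2_hat * np.linalg.inv(Xmat.T @ Xmat)

print(np.linalg.cond(Xmat.T @ Xmat))   # large values signal near-collinear lag columns
print(np.sqrt(np.diag(var_beta)))      # standard errors of [c_hat, phi_1, ..., phi_p]
```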
**Correlation Between Error Terms:**
For a time series we expect autocorrelation by design. However, once a time series model (e.g. $\mathrm{AR}$) is well specified and fitted, the remaining error terms $W_t$ should be uncorrelated.
If residual autocorrelation persists:
- OLS can still be used to estimate coefficients, but
- The [[Linear Regression with LSE#^540ccb|classical variance formula]] for $\hat \beta$, which assumes that $\mathrm{Var}(W)=\sigma^2 \mathbf I$ is no longer valid.
> [!note]
> In reality the error covariance is $\sigma^2 \Omega$ (with off-diagonal elements) capturing the autocorrelation. Therefore, to get correct standard errors and confidence intervals, we need heteroskedasticity and autocorrelation consistent ("HAC") estimators.
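One way to obtain such standard errors in practice is Newey-West / HAC covariance estimation; a sketch using `statsmodels` (assuming `y` and `Xmat` from the OLS sketch above, with an arbitrarily chosen lag truncation):

```python
import statsmodels.api as sm

# Same OLS coefficients, but HAC (Newey-West) standard errors that stay valid
# when the residuals are autocorrelated and/or heteroskedastic
ols_hac = sm.OLS(y, Xmat).fit(cov_type="HAC", cov_kwds={"maxlags": 4})
print(ols_hac.params)  # [c_hat, phi_1, ..., phi_p]
print(ols_hac.bse)     # HAC standard errors
```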
## MLE
We can also apply [[Maximum Likelihood Estimation|MLE]] for an $\mathrm{AR}(p)$ model. Under the assumption that error terms are [[Independence and Identical Distribution|i.i.d.]] [[Gaussian Distribution|Gaussian]], both OLS and MLE will yield the same result.
$ W_t \stackrel{iid}{\sim} \mathcal N(0, \sigma^2)$
In general, however, OLS only requires zero-mean, finite-variance errors, whereas MLE needs a full distributional assumption for the noise.
**Conditional Approach:**
The only AR-specific issue when applying MLE is the treatment of the first $p$ observations. In the "conditional approach", we treat them as fixed, instead of modeling their likelihood.
This is a simplification that:
- Does not make a big difference in large data sets
- Avoids complexity, as it is unclear how to model observations whose lagged values do not exist (i.e. how to model the first $p$ terms of an autoregressive series).
$ f(X_1, \dots, X_T \big \vert \phi, \sigma^2) = \underbrace{f(X_1, \dots,X_p)}_{\text{initial conditions}} * \prod_{t=p+1}^T f(X_t \big \vert X_{t-1}, \dots,X_{t-p}; \phi,\sigma^2) $
The likelihood function $f(X_1, \dots, X_T \vert \phi, \sigma^2)$ contains:
- Initial conditions
- Product of individual likelihoods
**Initial Conditions:**
Since the initial conditions are treated as fixed, they do not interact with the parameters $\phi, \sigma^2$. Therefore they do not play a role in the maximization of the likelihood function.
**Individual Likelihoods:**
The $X_t$ terms exhibit autocorrelation by design. However, once we condition on the lagged values and a chosen coefficient vector $\phi$, only the randomness of the noise term remains.
$ X_t \,\big \vert (X_{t-1}, \dots, X_{t-p}) \sim \mathcal N(\underbrace{c + \phi_1 X_{t-1}+ \cdots + \phi_p X_{t-p}}_{\text{linear predictor}}, \sigma^2) $
Hence, each likelihood term follows the distribution of the noise term $W_t$, shifted by the linear predictor of the $\mathrm{AR}(p)$ model.
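In code, a single conditional likelihood term is just a Gaussian density evaluated at the observation, centred on the linear predictor; a small sketch reusing `X`, `p` and the estimates from the sketches above:

```python
from scipy.stats import norm

# Conditional density of X_t given its p lags: Gaussian centred at the linear predictor
t = 10                                        # any index with t >= p
mean_t = c_hat + phi_hat @ X[t - p:t][::-1]   # phi_1 * X_{t-1} + ... + phi_p * X_{t-p}
loglik_t = norm.logpdf(X[t], loc=mean_t, scale=np.sqrt(sigma2_hat))
```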
**Joint Log-Likelihood:**
$ LL(\phi, \sigma^2) =\sum_{t=p+1}^T \ln \left (\frac{1}{\sqrt{2 \pi \sigma^2}} *\exp \left(-\frac{(X_t - c - \phi_1 X_{t-1}-\cdots-\phi_pX_{t-p})^2}{2 \sigma^2}\right)\right) $
Maximizing the log-likelihood $LL$ w.r.t. $\phi$ and $\sigma^2$ is equivalent to minimizing the sum of squared residuals, i.e. the OLS objective.
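To see this equivalence numerically, here is a sketch that maximizes the conditional log-likelihood with `scipy.optimize.minimize` (reusing `X`, `p` and `phi_hat` from the sketches above); up to optimizer tolerance, the recovered coefficients should match the OLS estimates:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_cond_loglik(params, X, p):
    """Negative conditional log-likelihood of a Gaussian AR(p)."""
    c, phi, log_sigma = params[0], params[1:p + 1], params[-1]
    sigma = np.exp(log_sigma)                 # optimize log(sigma) so sigma stays positive
    y = X[p:]
    lags = np.column_stack([X[p - k:-k] for k in range(1, p + 1)])
    mean = c + lags @ phi                     # linear predictor for each t = p+1, ..., T
    return -norm.logpdf(y, loc=mean, scale=sigma).sum()

x0 = np.zeros(p + 2)                          # start at [c, phi_1, ..., phi_p, log_sigma] = 0
res = minimize(neg_cond_loglik, x0, args=(X, p), method="BFGS")
c_mle, phi_mle = res.x[0], res.x[1:p + 1]
print(phi_mle, phi_hat)                       # conditional MLE vs. OLS coefficients
```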