The multinomial distribution is a generalization of the [[Binomial Distribution]]. We have an experiment with $N$ trials, where each trial can have $k$ different outcomes (modalities), denoted by $a_j$.
Thus, the key difference between Binomial and Multinomial is the distribution of the single trial:
- *Binomial:* Many independent [[Bernoulli Distribution|Bernoulli]] trials.
- *Multinomial:* Many independent trials from a categorical distribution.
$
\begin{align}
\mathrm{Ber}(p)=
\begin{cases} 0 & \text{w.p. } (1-p),\\ 1 & \text{w.p. } p, \end{cases} \qquad
\mathrm{Categorical}(a_1,\dots,a_k)=
\begin{cases} a_1 & \text{w.p. } p_1,\\ \;\;\vdots & \;\;\vdots \\ a_k & \text{w.p. } p_k \end{cases}
\end{align}
$
Both PMFs can be written in compact form using indicator exponents.
$
\begin{align}
\mathrm{Ber}(p)&= p^x\,(1-p)^{(1-x)}\\[2pt]
\mathrm{Categorical}(a_1, \dots,a_k)&= p_1^{\mathbf 1_{(X=a_1)}}\cdot\ldots\cdot p_k^{\mathbf 1_{(X=a_k)}} =\prod_{j=1}^k p_j^{\mathbf 1_{(X=a_j)}}
\end{align}
$
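As a quick sanity check, the compact exponent form can be evaluated directly. The following NumPy sketch (the outcome encoding and variable names are assumptions for illustration) shows that the indicator-exponent product simply picks out the probability of the observed category.

```python
import numpy as np

# Hypothetical categorical distribution with k = 3 outcomes a_1, a_2, a_3,
# encoded as the indices 0, 1, 2.
p = np.array([0.2, 0.5, 0.3])  # p_1, ..., p_k, must sum to 1

def categorical_pmf(x, p):
    """P(X = a_x) written as prod_j p_j ** 1{x = j}, the compact exponent form."""
    indicators = (np.arange(len(p)) == x).astype(float)  # one-hot indicator vector
    return np.prod(p ** indicators)

print(categorical_pmf(1, p))  # 0.5, identical to simply reading off p[1]
```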
## Maximum Likelihood Estimator
To construct the [[Maximum Likelihood Estimation#Estimator|MLE Estimator]] for a multinomial distribution, follow the usual steps: find the likelihood function, transform it to the log-likelihood, and take the [[Partial Derivative]] of the log-likelihood w.r.t. each parameter $p_j$.
*Step 1: Find likelihood function*
$
\begin{align}
L_N(X, \overrightarrow p)&= \prod_{i=1}^N \left(p_1^{\mathbf 1_{(X_i=a_1)}}\cdot\ldots\cdot p_k^{\mathbf 1_{(X_i=a_k)}} \right) \tag{1}\\[4pt]
&=p_1^{\sum_i \mathbf 1(X_i=a_1)}\cdot \ldots \cdot p_k^{\sum_i \mathbf 1(X_i=a_k)} \tag{2}\\[6pt]
&=p_1^{N_1}\cdot\ldots\cdot p_k^{N_k} \tag{3}
\end{align}
$
where:
- (1) The [[Likelihood Functions|likelihood]] of a multinomial from $N$ [[Independence and Identical Distribution|i.i.d.]] trials with $k$ possible output classes is the product of the individual PMFs.
- (2) We can write the product of exponentials with the same base as [[Algebra of Exponents|sum in exponents]].
- (3) The sum of indicators over all $X_i$ is the count of outcome $a_j$ within the data. We denote it by $N_j$ (see the sketch below).
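To make steps (1) to (3) concrete, here is a minimal NumPy sketch, assuming some simulated data and hypothetical variable names, that computes the counts $N_j$ and evaluates the likelihood in the form $p_1^{N_1}\cdot\ldots\cdot p_k^{N_k}$.

```python
import numpy as np

rng = np.random.default_rng(0)
k, N = 3, 20
p_true = np.array([0.2, 0.5, 0.3])      # assumed true parameter vector

# N i.i.d. categorical trials, encoded as indices 0, ..., k-1
X = rng.choice(k, size=N, p=p_true)

# Counts N_j = sum_i 1{X_i = a_j}
counts = np.bincount(X, minlength=k)

def likelihood(p, counts):
    """L_N = prod_j p_j ** N_j (can underflow for large N; the log form is safer)."""
    return np.prod(p ** counts)

print(counts, likelihood(p_true, counts))
```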
*Step 2: Compute log-likelihood*
For the log-likelihood, we take the natural logarithm to bring down the exponents. Furthermore, we can express $p_k$ in terms of the other probabilities, $p_k = 1-\sum_{j=1}^{k-1}p_j$, since all $p_j$ must sum to $1$.
$
\begin{align}
\ell_N(X, \overrightarrow p)&=N_1\ln(p_1)+\dots+N_k\ln(p_k) \\
&=N_1\ln(p_1)+\dots+N_k\ln(1-\sum_{j=1}^{k-1}p_j)
\end{align}
$
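As a sketch of this substitution, the log-likelihood can be written as a function of the $k-1$ free parameters only; the counts and variable names below are assumptions.

```python
import numpy as np

def log_likelihood_free(p_free, counts):
    """ell_N as a function of the k-1 free parameters p_1, ..., p_{k-1} only;
    p_k is recovered as 1 - sum(p_free), matching the substitution above."""
    p = np.append(p_free, 1.0 - np.sum(p_free))
    return np.sum(counts * np.log(p))

counts = np.array([4, 10, 6])                    # hypothetical counts N_1, N_2, N_3
print(log_likelihood_free(np.array([0.2, 0.5]), counts))
```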
*Step 3: Take derivatives*
Since the log-likelihood depends on a [[Vector Operations|Vector]] of parameters $[p_1, \dots, p_{k-1}]$, we take the [[Partial Derivative]] w.r.t. each parameter $p_j$ separately and set it to $0$.
$
\frac{\partial \ell_N}{\partial p_j} =\frac{N_j}{p_j}-\frac{N_k}{1-\sum_{l=1}^{k-1}p_l}=0, \qquad j=1,\dots,k-1
$
This creates a system of $(k-1)$ equations where the right side is always the same. For simplicity, we abbreviate it with $\gamma$.
$ \begin{cases} \frac{N_1}{p_1}=\frac{N_k}{1-\sum_{l=1}^{k-1}p_l} = \gamma \\[10pt]
\quad\vdots \\[10pt]
\frac{N_{k-1}}{p_{k-1}}=\frac{N_k}{1-\sum_{l=1}^{k-1}p_l} = \gamma
\end{cases} $
## Equivalence to Relative Frequencies
Using the [[Probability Axioms|Probability Axiom]] of Normalization, we can show that the MLE estimator is equal to the relative sample frequency $\frac{N_j}{N}$ for such a distribution. Rearranging each equation gives $p_j = \frac{N_j}{\gamma}$, which also holds for $j=k$ because $\gamma = \frac{N_k}{1-\sum_{l=1}^{k-1}p_l}=\frac{N_k}{p_k}$.
$
\begin{rcases}
p_j = \frac{N_j}{\gamma} \\[2pt]
\displaystyle \sum_{j=1}^k p_j=1
\end{rcases} \implies \sum_{j=1}^k \frac{N_j}{\gamma}=1
$
Since the counts $N_j$ sum to $N$, it must be true that $\gamma=N$. Thus it follows that:
$ p_j = \frac{N_j}{\gamma} =\frac{N_j}{N}$
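As a numeric sanity check (not part of the derivation), the sketch below compares the log-likelihood at the relative-frequency estimate $\frac{N_j}{N}$ against many random probability vectors; the estimate should never be beaten. Counts and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
counts = np.array([4, 10, 6])               # hypothetical counts N_1, ..., N_k
N = counts.sum()

def log_likelihood(p, counts):
    return np.sum(counts * np.log(p))

p_mle = counts / N                          # relative frequencies N_j / N

# Random probability vectors from a flat Dirichlet as competing candidates
candidates = rng.dirichlet(np.ones(len(counts)), size=10_000)
best_candidate = max(log_likelihood(p, counts) for p in candidates)

print(log_likelihood(p_mle, counts) >= best_candidate)  # expected: True
```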