The multinomial distribution is a generalization of the [[Binomial Distribution]]. We have an experiment with $N$ trials, where each trial can have $k$ different outcomes (modalities), denoted by $a_j$.
Thus, the key difference between Binomial and Multinomial is the distribution of the single trial:
- *Binomial:* Many independent [[Bernoulli Distribution|Bernoulli]] trials.
- *Multinomial:* Many independent trials from a categorical distribution.
$
\begin{align}
\mathrm{Ber}(p)=
\begin{cases} 0 & \text{w.p. } (1-p),\\ 1 & \text{w.p. } p, \end{cases} \qquad
\mathrm{Categorical}(a_1,\dots,a_k)=
\begin{cases} a_1 & \text{w.p. } p_1,\\ \;\;\vdots & \;\;\vdots \\ a_k & \text{w.p. } p_k \end{cases}
\end{align}
$
Both PMFs can be written in compact form using indicator exponents.
$
\begin{align}
\mathrm{Ber}(p)&= p^x\,(1-p)^{(1-x)}\\[2pt]
\mathrm{Categorical}(a_1, \dots,a_k)&= p_1^{\mathbf 1_{(X=a_1)}}\cdot\ldots\cdot p_k^{\mathbf 1_{(X=a_k)}} =\prod_{j=1}^k p_j^{\mathbf 1_{(X=a_j)}}
\end{align}
$
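As a quick sanity check, the compact exponent form can be evaluated directly. The following NumPy sketch (the outcome encoding and variable names are assumptions for illustration) shows that the indicator-exponent product simply picks out the probability of the observed category.

```python
import numpy as np

# Hypothetical categorical distribution with k = 3 outcomes a_1, a_2, a_3,
# encoded as the indices 0, 1, 2.
p = np.array([0.2, 0.5, 0.3])  # p_1, ..., p_k, must sum to 1

def categorical_pmf(x, p):
    """P(X = a_x) written as prod_j p_j ** 1{x = j}, the compact exponent form."""
    indicators = (np.arange(len(p)) == x).astype(float)  # one-hot indicator vector
    return np.prod(p ** indicators)

print(categorical_pmf(1, p))  # 0.5, identical to simply reading off p[1]
```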
## Maximum Likelihood Estimator
To construct the [[Maximum Likelihood Estimation#Estimator|MLE Estimator]] for a multinomial distribution, follow the usual steps: find the likelihood function, transform it to the log-likelihood, and take the [[Partial Derivative]] of the log-likelihood w.r.t. each parameter $p_j$.
*Step 1: Find likelihood function*
$
\begin{align}
L_N(X, \overrightarrow p)&= \prod_{i=1}^N \left(p_1^{\mathbf 1_{(X_i=a_1)}}\cdot\ldots\cdot p_k^{\mathbf 1_{(X_i=a_k)}} \right) \tag{1}\\[4pt]
&=p_1^{\sum_i \mathbf 1(X_i=a_1)}\cdot \ldots \cdot p_k^{\sum_i \mathbf 1(X_i=a_k)} \tag{2}\\[6pt]
&=p_1^{N_1}\cdot\ldots\cdot p_k^{N_k} \tag{3}
\end{align}
$
where:
- (1) The [[Likelihood Functions|likelihood]] of a multinomial from $N$ [[Independence and Identical Distribution|i.i.d.]] trials with $k$ possible output classes is the product of the individual PMFs.
- (2) We can write the product of exponentials with the same base as [[Algebra of Exponents|sum in exponents]].
- (3) The sum of indicators over all $X_i$ is the count of outcome $a_j$ within the data. We denote it by $N_j$ (see the sketch below).
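To make steps (1) to (3) concrete, here is a minimal NumPy sketch, assuming some simulated data and hypothetical variable names, that computes the counts $N_j$ and evaluates the likelihood in the form $p_1^{N_1}\cdot\ldots\cdot p_k^{N_k}$.

```python
import numpy as np

rng = np.random.default_rng(0)
k, N = 3, 20
p_true = np.array([0.2, 0.5, 0.3])      # assumed true parameter vector

# N i.i.d. categorical trials, encoded as indices 0, ..., k-1
X = rng.choice(k, size=N, p=p_true)

# Counts N_j = sum_i 1{X_i = a_j}
counts = np.bincount(X, minlength=k)

def likelihood(p, counts):
    """L_N = prod_j p_j ** N_j (can underflow for large N; the log form is safer)."""
    return np.prod(p ** counts)

print(counts, likelihood(p_true, counts))
```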
*Step 2: Compute log-likelihood*
For the log-likelihood, we take the natural logarithm to bring down the exponents. Furthermore, we can express $p_k$ in terms of the other probabilities, $p_k = 1-\sum_{j=1}^{k-1}p_j$, since all $p_j$ must sum to $1$.
$
\begin{align}
\ell_N(X, \overrightarrow p)&=N_1\ln(p_1)+\dots+N_k\ln(p_k) \\
&=N_1\ln(p_1)+\dots+N_k\ln(1-\sum_{j=1}^{k-1}p_j)
\end{align}
$
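As a sketch of this substitution, the log-likelihood can be written as a function of the $k-1$ free parameters only; the counts and variable names below are assumptions.

```python
import numpy as np

def log_likelihood_free(p_free, counts):
    """ell_N as a function of the k-1 free parameters p_1, ..., p_{k-1} only;
    p_k is recovered as 1 - sum(p_free), matching the substitution above."""
    p = np.append(p_free, 1.0 - np.sum(p_free))
    return np.sum(counts * np.log(p))

counts = np.array([4, 10, 6])                    # hypothetical counts N_1, N_2, N_3
print(log_likelihood_free(np.array([0.2, 0.5]), counts))
```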
*Step 3: Take derivatives*
Since the log-likelihood depends on a [[Vector Operations|Vector]] of parameters $[p_1, \dots, p_{k-1}]$, we take the [[Partial Derivative]] w.r.t. each parameter $p_j$ separately and set it to $0$.
$
\frac{\partial \ell_N}{\partial p_j} =\frac{N_j}{p_j}-\frac{N_k}{1-\sum_{l=1}^{k-1}p_l}=0, \qquad j=1,\dots,k-1
$
This creates a system of $(k-1)$ equations where the right side is always the same. For simplicity, we abbreviate it with $\gamma$.
$ \begin{cases} \frac{N_1}{p_1}=\frac{N_k}{1-\sum_{l=1}^{k-1}p_l} = \gamma \\[10pt]
\quad\vdots \\[10pt]
\frac{N_{k-1}}{p_{k-1}}=\frac{N_k}{1-\sum_{l=1}^{k-1}p_l} = \gamma
\end{cases} $
## Equivalence to Relative Frequencies
Using the [[Probability Axioms|Probability Axiom]] of Normalization, we can show that the MLE estimator is equal to the relative sample frequency $\frac{N_j}{N}$ for such a distribution. Rearranging each equation gives $p_j = \frac{N_j}{\gamma}$, which also holds for $j=k$ because $\gamma = \frac{N_k}{1-\sum_{l=1}^{k-1}p_l}=\frac{N_k}{p_k}$.
$
\begin{rcases}
p_j = \frac{N_j}{\gamma} \\[2pt]
\displaystyle \sum_{j=1}^k p_j=1
\end{rcases} \implies \sum_{j=1}^k \frac{N_j}{\gamma}=1
$
Since the counts $N_j$ sum to $N$, it must be true that $\gamma=N$. Thus it follows that:
$ p_j = \frac{N_j}{\gamma} =\frac{N_j}{N}$
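As a numeric sanity check (not part of the derivation), the sketch below compares the log-likelihood at the relative-frequency estimate $\frac{N_j}{N}$ against many random probability vectors; the estimate should never be beaten. Counts and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
counts = np.array([4, 10, 6])               # hypothetical counts N_1, ..., N_k
N = counts.sum()

def log_likelihood(p, counts):
    return np.sum(counts * np.log(p))

p_mle = counts / N                          # relative frequencies N_j / N

# Random probability vectors from a flat Dirichlet as competing candidates
candidates = rng.dirichlet(np.ones(len(counts)), size=10_000)
best_candidate = max(log_likelihood(p, counts) for p in candidates)

print(log_likelihood(p_mle, counts) >= best_candidate)  # expected: True
```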