Maximum likelihood estimator of categorical distribution

Question

The task is:

Population of the students has been divided into the following three groups:

Students with the mean of grades below 3.5

Students with the mean of grades between 3.5 and 4.5

Students with the mean of grades above 4.5

Each student in the population is described by a vector of random variables $x= (x^1\ x^2\ x^3)^T$, taking one of three possible states: $(1\ 0\ 0)^T$ if the student belongs to the first group, $(0\ 1\ 0)^T$ if the student belongs to the second group,
and $(0\ 0\ 1)^T$ if the student belongs to the third group. The distribution of $x$ is categorical distribution (also known as generalized Bernoulli distribution or Multinoulli distribution) with parameters $\theta= (\theta_1\ \theta_2\ \theta_3)^T$. From the population of the students N examples were drawn. Calculate the maximum likelihood estimator of $\theta$.

I tried to do it similarly to Bernoulli case, but I'm stuck. The idea was to find $\theta^*$ by finding the maximum of probability distribution function. So my try was

$$ M(x\mid\theta)=\prod_{d=0}^D \theta_d^{x_d}=\theta_1^{x_1} \theta_2^{x_2} \theta_3^{x_3}\\ \theta^* = \operatorname*{argmax}_\theta M(x\mid\theta) = \operatorname*{argmax}_\theta \ln(M(x\mid\theta))\\ \ln(M(x\mid\theta))= \ln(\theta_1^{x_1} \theta_2^{x_2} \theta_3^{x_3}) = x_1\ln\theta_1 + x_2\ln\theta_2 + x_3\ln\theta_3 = x^T (\ln\theta_1\ \ln\theta_2\ \ln\theta_3)^T $$

Next step would be calculating derivative with respect to $\theta$ and finding it's zero, but we don't have $\theta$ in the function.

I'm not sure where is my mistake. Or perhaps there is no mistake and it is possible to convert $(\ln\theta_1\ \ln\theta_2\ \ln\theta_3)^T$ to some form with $\theta$?

"Multinoulli" isn't a word. I think you want "multinomial". But it's a nice not-a-word and perhaps ought to be one so I haven't edited to change it. — Ethan Bolker, Apr 06 '18 at 21:04
Well, that's what's in our task :p I believe the most common name is categorical distribution. Edited. — Yksisarvinen, Apr 06 '18 at 21:26
The next step would NOT be calculating the derivative with respect to $\theta$ and finding it's zero. This is constrained maximization. You might try Lagrange multipliers, although that's not the only way. It is unfortunate that many have made the technique of finding zeros of the derivative into a knee-jerk reflex rather than something to be used consciously. — Michael Hardy, Apr 06 '18 at 21:28

score 7 · Accepted Answer · answered Apr 06 '18 at 23:44

7

Since $(\theta_1,\theta_2,\theta_3)$ must satisfy the constraint $$\theta_1+\theta_2+\theta_3 = 1,\tag 0$$ one way to do this is by Lagrange multipliers. You have $$ \operatorname{grad} (\theta_1+\theta_2+\theta_3) = (1,1,1) \tag 1 $$ and $$ \operatorname{grad} (x_1\log\theta_1 + x_2\log\theta_2 + x_3\log\theta_3) = \left( \frac{x_1}{\theta_1}, \frac{x_2}{\theta_2}, \frac{x_3}{\theta_3} \right). \tag 2 $$ So you want a value of $(\theta_1,\theta_2,\theta_3)$ for which $(2)$ is a scalar multiple of $(1).$ That happens only if the ratio $\theta_1:\theta_2:\theta_3$ is equal to the ratio $x_1:x_2:x_3.$ But the constraint $(0)$ must also hold. Consequently you get $$ \theta_1 = \frac{x_1}{x_1+x_2+x_3} $$ and similarly for the other two values of the subscript.

answered Apr 06 '18 at 23:44

Michael Hardy

1

1

I'm trying to understand the answer and I wonder if there is some notational confusion here? I believe the question is on finding the ML estimate of $\theta$ given all the examples $X = (\vec{x}^{(1)}, \ldots, \vec{x}^{(N)})$. Any single example $\vec{x}^{(i)} \in {(1, 0, 0), (0,1,0), (0,0,1)}$ so in the answer above, $\theta_1$ always becomes either 0 or 1, which doesn't make sense to me. – lemonad Oct 17 '18 at 09:54
@lemonad : By $(x_1,x_2,x_3)$ I meant the sum over all of the students. – Michael Hardy Dec 09 '19 at 16:33
@MichaelHardy How did you get the last step (Consequently you get ...)? I still can't see how the above implies the result. – Petr Peller Oct 22 '20 at 20:43
1

@PetrPeller : We seek three numbers whose ratio is $x_1:x_2:x_3$ and whose sum is $1.$ So we get: $$ \frac{x_1}{x_1+x_2+x_3}, \quad \frac{x_2}{x_1+x_2+x_3}, \quad \frac{x_3}{x_1+x_2+x_3}. $$ – Michael Hardy Oct 23 '20 at 19:55

score 5 · Answer 2 · edited Mar 22 '22 at 09:15

The accepted answer does not infer the solution from the MLE approach rigorously, so for the completeness sake, I will write the full path to the solutions $$\theta_1 = \frac{1}{n} \sum_{i=1}^n x_{i1} \\ \theta_2 = \frac{1}{n} \sum_{i=1}^n x_{i2}$$ ($\theta_3 = 1 - \theta_1 - \theta_2$ is not needed) in the following without the use of a Langrange multiplier:

Assume $X_1, \dots, X_n \overset{iid}\sim M(\theta_1, \theta_2)$ with $X_i = (X_{i1}, X_{i2}, 1 - X_{i1} - X_{i2})^T$.
The tricky part is to see that each $X_i$ only consists of two random variables (consequently $M$ also only has two parameters, i.e. we can ignore $\theta_3$).
The log likelihood function of the observations $x_1, \dots, x_n$ is $$l(\theta) = \ln P(X_1 = x_1, \dots, X_n = x_n; \theta_1, \theta_2) \\ \overset{iid} = \ln \prod_{i=1}^n P(X_i = x_i; \theta_1, \theta_2) \\ = \sum_{i=1}^n \ln P(X_i = x_i; \theta_1, \theta_2) \\ = \sum_{i=1}^n x_{i1} \ln \theta_1 + x_{i2} \ln \theta_2 + (1 - x_{i1} - x_{i2}) \ln (1 - \theta_1 - \theta_2).$$ Then, the score function is $$ s(\theta) = \nabla_\theta l(\theta) = (\sum_{i=1}^n (\frac{x_{i1}}{\theta_1} - \frac{1 - x_{i1} - x_{i2}}{1 - \theta_1 - \theta_2}), \sum_{i=1}^n (\frac{x_{i2}}{\theta_2} - \frac{1 - x_{i1} - x_{i2}}{1 - \theta_1 - \theta_2}))^T.$$ Setting it to zero gives $$\sum_{i=1}^n \frac{x_{i1}}{\theta_1} = \sum_{i=1}^n \frac{1 - x_{i1} - x_{i2}}{1 - \theta_1 - \theta_2}$$ $$ \sum_{i=1}^n (x_{i1} - \theta_1 x_{i1} - \theta_2 x_{i1}) = \sum_{i=1}^n (\theta_1 - \theta_1 x_{i1} - \theta_1 x_{i2})$$ $$\theta_1 = \frac{\sum_{i=1}^n (x_{i1} - \theta_2 x_{i1})}{\sum_{i=1}^n (1 - x_{i2})} $$ and analogous with inserted $\theta_1$ $$\theta_2 = \frac{\sum_{j=1}^n (x_{j1} - \theta_2 x_{j1})}{\sum_{i=1}^n (1 - x_{i2})} \\ = \frac{\sum_{j=1}^n (x_{j2} - \frac{\sum_{i=1}^n (x_{i1} - \theta_2 x_{i1})}{\sum_{i=1}^n (1 - x_{i2})} x_{j2})}{\sum_{i=1}^n (1 - x_{i1})} \\ = \frac{\sum_{j=1}^n x_{j2}}{\sum_{i=1}^n (1 - x_{i1})} - \frac{ \sum_{i=1}^n \sum_{j=1}^n x_{i1} x_{j2}}{\sum_{i=1}^n (1 - x_{i1}) \sum_{j=1}^n (1 - x_{j2})} + \frac{\theta_2 \sum_{i=1}^n \sum_{j=1}^n x_{i1} x_{j2}}{\sum_{i=1}^n (1 - x_{i1}) \sum_{j=1}^n (1 - x_{j2})}$$ minus the last fraction gives $$\theta_2 \frac{\sum_{k=1}^n (1 - x_{k1}) \sum_{l=1}^n (1 - x_{l2}) - \sum_{i=1}^n \sum_{j=1}^n x_{i1} x_{j2}}{\sum_{k=1}^n (1 - x_{k1}) \sum_{l=1}^n (1 - x_{l2})} = \frac{\sum_{j=1}^n x_{j2}}{\sum_{k=1}^n (1 - x_{k1})} - \frac{\sum_{i=1}^n \sum_{j=1}^n x_{i1} x_{j2}}{\sum_{k=1}^n (1 - x_{k1}) \sum_{l=1}^n (1 - x_{l2})}$$ dividing by the fraction of the left side gives $$\theta_2 = \\ \frac{\sum_{j=1}^n x_{j2} \sum_{l=1}^n(1-x_{l2})}{ \sum_{k=1}^n (1-x_{k1}) \sum_{l=1}^n (1-x_{l2}) - \sum_{i=1}^n \sum_{j=1}^n x_{i1} x_{j2}} - \frac{\sum_{i=1}^n \sum_{j=1}^n x_{i1} x_{j2}}{ \sum_{k=1}^n (1-x_{k1}) \sum_{l=1}^n (1-x_{l2}) - \sum_{i=1}^n \sum_{j=1}^n x_{i1} x_{j2}} \\ = \frac{\sum_{j=1}^n x_{j2} \sum_{l=1}^n(1-x_{l2}) - \sum_{i=1}^n \sum_{j=1}^n x_{i1} x_{j2}}{n^2 - n \sum_{k=1}^n x_{k1} - n \sum_{l=1}^n x_{l2}} \\ = \frac{\sum_{j=1}^n x_{j2} (n - \sum_{l=1}^n x_{l2} - \sum_{i=1}^n x_{i1})}{n(n - \sum_{k=1}^n x_{k1} - \sum_{l=1}^n x_{l2}} \\ = \frac{1}{n} \sum_{j=1}^n x_{j2} $$

and inserting back into $\theta_1$ gives $$\theta_1 = \frac{\sum_{i=1}^n (x_{i1} - \frac{1}{n} \sum_{j=1}^n x_{j2}x_{i1})}{\sum_{i=1}^n (1 - x_{i2})} = \frac{\sum_{i=1}^n x_{i1} (1 - \frac{1}{n} \sum_{j=1}^n x_{j2})}{n(1 - \frac{1}{n} \sum_{i=1}^n x_{i2})} = \frac{1}{n} \sum_{i=1}^n x_{i1}.$$

Maximum likelihood estimator of categorical distribution

2 Answers2

Linked