When addressing an X validated question on the use of the EM algorithm when estimating a Normal mean, my first comment was that it was inappropriate since there is no missing data structure to anchor by (right preposition?). However I then reflected upon the infinite number of ways to demarginalise the normal density into a joint density
∫ f(x,z;μ)dz = φ(x–μ)
from the (slice sampler) call to an indicator function for f(x,z;μ) to a joint Normal distribution with an arbitrary correlation. While the joint Normal representation produces a sequence converging to the MLE, the slice representation utterly fails as the indicator functions make any starting value of μ a fixed point for EM.
Finding a maximum likelihood solution typically requires taking the derivatives of the likelihood function with respect to all the unknown values, the parameters and the latent variables, and simultaneously solving the resulting equations.
struck me as confusing and possibly wrong since it seems to suggest to seek a maximum in both the parameter and the latent variables. Which does not produce the same value as the observed likelihood maximisation.