Expectation-Maximization (EM) Algorithm for the Linear Mixed-Effect Model (LMM)

Model Description

The relation between the ordinary linear model (OLM) and the LMM can be found in The Introduction of Linear Mixed-Effect Model.

The linear mixed-effect model (LMM) is a statistical model that accounts for both fixed and random effects in a linear regression. It is used to model data whose observations are not independent and identically distributed.

Consider a dataset $\{\mathbf{y}, \mathbf{X}, \mathbf{Z}\}$ with $n$ samples, where $\mathbf{y} \in \mathbb{R}^n$ is the vector of the response variable, $\mathbf{X} \in \mathbb{R}^{n \times p}$ is the matrix of $p$ independent variables, and $\mathbf{Z} \in \mathbb{R}^{n \times c}$ is another matrix of $c$ variables. The linear mixed model assumes a linear relationship between $\mathbf{y}$ and $(\mathbf{X}, \mathbf{Z})$:
\begin{equation} \mathbf{y} = \underbrace{\mathbf{Z}\mathbf{\omega}}_{\text{fixed}} + \underbrace{\mathbf{X}\mathbf{\beta}}_{\text{random}} + \underbrace{\mathbf{e}}_{\text{error}}, \end{equation}
where $\mathbf{\omega} \in \mathbb{R}^c$ is the vector of fixed effects, $\mathbf{\beta} \in \mathbb{R}^p$ is the vector of random effects with $\mathbf{\beta} \sim \mathcal{N}(\mathbf{0}, \sigma_\beta^2 \mathbf{I}_p)$, and $\mathbf{e} \sim \mathcal{N}(\mathbf{0}, \sigma_e^2 \mathbf{I}_n)$ is the independent noise term.
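
To make the model concrete, here is a minimal NumPy sketch that simulates a dataset from the model above; the dimensions and true parameter values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary sizes and true parameters, for illustration only.
n, p, c = 200, 50, 3
sigma2_beta_true, sigma2_e_true = 0.5, 1.0
omega_true = np.array([1.0, -2.0, 0.5])

Z = rng.normal(size=(n, c))                                   # fixed-effect design matrix
X = rng.normal(size=(n, p))                                   # random-effect design matrix
beta = rng.normal(scale=np.sqrt(sigma2_beta_true), size=p)    # random effects
e = rng.normal(scale=np.sqrt(sigma2_e_true), size=n)          # independent noise

y = Z @ omega_true + X @ beta + e                             # y = Z*omega + X*beta + e
```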

Let $\mathbf{\Theta}$ denote the set of unknown parameters, $\mathbf{\Theta} = \{\mathbf{\omega}, \sigma_\beta^2, \sigma_e^2\}$. Under the framework of the EM algorithm, we treat $\mathbf{\beta}$ as a latent variable. The directed acyclic graph of our model is shown below.

[Figure: directed acyclic graph]
Figure 1. The directed acyclic graph for the linear mixed-effect model. The shaded nodes are observed variables and the unshaded nodes are latent variables. The arrows indicate conditional dependencies, the points indicate parameters, and the plate indicates that the variables inside it are replicated $n$ times.

The EM algorithm is an iterative algorithm that alternates between the E-step and the M-step until convergence. In the following sections, we first derive the posterior distribution of $\mathbf{\beta}$ given $\mathbf{y}$ and $\hat{\mathbf{\Theta}}$ and compute the $Q$ function used in the E-step. We then derive the maximizers of the $Q$ function, which give the parameter updates used in the M-step. Finally, we compute and track the marginal data log-likelihood $\ell(\mathbf{\Theta})$ to check the convergence of the algorithm.

Statistical Inference in LMM

Complete Data Log-Likelihood

The complete data log-likelihood is given by
\begin{equation} \begin{split} \ell(\mathbf{\Theta}, \mathbf{\beta}) &= \log p(\mathbf{y}, \mathbf{\beta}| \mathbf{\Theta})\\ &= \log p(\mathbf{y}| \mathbf{\beta}, \mathbf{\Theta}) + \log p(\mathbf{\beta}| \mathbf{\Theta})\\ &= -\frac{n}{2} \log (2\pi) -\frac{n}{2} \log \sigma_e^2 - \frac{1}{2\sigma_e^2} \|\mathbf{y} - \mathbf{Z}\mathbf{\omega} - \mathbf{X}\mathbf{\beta}\|^2\\ &\quad -\frac{p}{2} \log (2\pi) -\frac{p}{2} \log \sigma_\beta^2 - \frac{1}{2\sigma_\beta^2} \mathbf{\beta}^T \mathbf{\beta} \end{split}\end{equation}

Statistical Inference in E-Step

The E-step computes the expectation of the complete data log-likelihood with respect to the conditional distribution of $\mathbf{\beta}$ given $\mathbf{y}$ and $\mathbf{\Theta}^{\text{old}}$:
\begin{equation} Q(\mathbf{\Theta}|\mathbf{\Theta}^{\text{old}}) = \mathbb{E}_{\mathbf{\beta}|\mathbf{y}, \mathbf{\Theta}^{\text{old}}} \ell(\mathbf{\Theta}, \mathbf{\beta}) \end{equation}

Therefore, we need the posterior distribution of $\mathbf{\beta}$ given $\mathbf{y}$ and $\mathbf{\Theta}$. It is not given directly by the model, but it can be derived by applying Bayes' rule and completing the square:
\begin{equation} \begin{split} p(\mathbf{\beta}| \mathbf{y}, \mathbf{\Theta}) &= \frac{p(\mathbf{y}| \mathbf{\beta}, \mathbf{\Theta}) p(\mathbf{\beta}| \mathbf{\Theta})}{p(\mathbf{y}| \mathbf{\Theta})}\\ &\propto p(\mathbf{y}| \mathbf{\beta}, \mathbf{\Theta}) p(\mathbf{\beta}| \mathbf{\Theta})\\ &\propto \exp \left\{-\frac{1}{2}(\mathbf{y} - \mathbf{Z}\mathbf{\omega} - \mathbf{X}\mathbf{\beta})^T (\sigma_e^2\mathbf{I}_n)^{-1} (\mathbf{y} - \mathbf{Z}\mathbf{\omega} - \mathbf{X}\mathbf{\beta}) \right\} \\ &\quad \times \exp \left\{-\frac{1}{2} \mathbf{\beta}^T (\sigma_\beta^2 \mathbf{I}_p)^{-1} \mathbf{\beta} \right\} \\ &\propto \exp \left\{-\frac{1}{2}\left( \mathbf{\beta} - \mathbf{\mu}\right)^T \mathbf{\Gamma}^{-1} \left( \mathbf{\beta} - \mathbf{\mu}\right) \right\} \end{split}\end{equation}

where $\mathbf{\Gamma} = \left(\frac{\mathbf{X}^T \mathbf{X}}{\sigma_e^2} + \frac{\mathbf{I}_p}{\sigma_\beta^2} \right)^{-1}$ and $\mathbf{\mu} = \mathbf{\Gamma} \mathbf{X}^T (\mathbf{y}-\mathbf{Z}\mathbf{\omega})/\sigma_e^2$. Hence the posterior distribution of $\mathbf{\beta}$ is $\mathbf{\beta}|\mathbf{y}, \mathbf{\Theta} \sim \mathcal{N}(\mathbf{\mu}, \mathbf{\Gamma})$.[1]

Thus, we have
\begin{equation} \begin{split} Q(\mathbf{\Theta}|\mathbf{\Theta}^{\text{old}}) &= \mathbb{E}_{\mathbf{\beta}|\mathbf{y}, \mathbf{\Theta}^{\text{old}}} \left[ \log p(\mathbf{y}, \mathbf{\beta}|\mathbf{\Theta}) \right]\\ &= \mathbb{E}_{\mathbf{\beta}|\mathbf{y}, \mathbf{\Theta}^{\text{old}}} \left[ \log p(\mathbf{y}| \mathbf{\beta}, \mathbf{\Theta}) + \log p(\mathbf{\beta}| \mathbf{\Theta}) \right]\\ &= \mathbb{E}_{\mathbf{\beta}|\mathbf{y}, \mathbf{\Theta}^{\text{old}}} \left[ -\frac{n}{2} \log (2\pi) -\frac{n}{2} \log \sigma_e^2 - \frac{1}{2\sigma_e^2} \|\mathbf{y} - \mathbf{Z}\mathbf{\omega} - \mathbf{X}\mathbf{\beta}\|^2 \right.\\ &\quad \left. -\frac{p}{2} \log (2\pi) -\frac{p}{2} \log \sigma_\beta^2 - \frac{1}{2\sigma_\beta^2} \mathbf{\beta}^T \mathbf{\beta} \right]\\ &= -\frac{n+p}{2} \log (2\pi) -\frac{n}{2} \log \sigma_e^{2}-\frac{p}{2} \log \sigma_\beta^2 - \frac{1}{2\sigma_\beta^2} \text{tr}(\mathbf{\Gamma}) - \frac{1}{2\sigma_\beta^2} \mathbf{\mu}^{T} \mathbf{\mu}\\ &\quad - \frac{1}{2\sigma_e^{2}} \left(\|\mathbf{y} - \mathbf{Z}\mathbf{\omega}\|^2 + \text{tr}(\mathbf{X}\mathbf{\Gamma} \mathbf{X}^T) + \mathbf{\mu}^T\mathbf{X}^T \mathbf{X}\mathbf{\mu}- 2(\mathbf{y} - \mathbf{Z}\mathbf{\omega})^T \mathbf{X}\mathbf{\mu} \right) \end{split}\end{equation}

Note that $\mathbb{E}[\mathbf{\beta}^T \mathbf{\beta}] = \text{tr}(\mathbb{V}[\mathbf{\beta}]) + \mathbb{E}[\mathbf{\beta}]^T \mathbb{E}[\mathbf{\beta}] = \text{tr}(\mathbf{\Gamma}) + \mathbf{\mu}^T \mathbf{\mu}$ and $\mathbb{E}[(\mathbf{X}\mathbf{\beta})^T (\mathbf{X}\mathbf{\beta})] = \text{tr}(\mathbf{X}\mathbf{\Gamma} \mathbf{X}^T) + (\mathbf{X}\mathbf{\mu})^T (\mathbf{X}\mathbf{\mu})$, where both expectations are taken with respect to the posterior $\mathbf{\beta}|\mathbf{y}, \mathbf{\Theta}^{\text{old}}$.
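
The posterior moments derived above translate directly into NumPy; here is a minimal sketch (the function name and argument order are mine):

```python
import numpy as np

def posterior_moments(y, X, Z, omega, sigma2_beta, sigma2_e):
    """Posterior mean and covariance of beta given y and the current parameters."""
    p = X.shape[1]
    # Gamma = (X^T X / sigma_e^2 + I_p / sigma_beta^2)^{-1}
    Gamma = np.linalg.inv(X.T @ X / sigma2_e + np.eye(p) / sigma2_beta)
    # mu = Gamma X^T (y - Z omega) / sigma_e^2
    mu = Gamma @ X.T @ (y - Z @ omega) / sigma2_e
    return mu, Gamma
```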

Statistical Inference in M-Step

The M-step maximizes the $Q$ function with respect to $\mathbf{\Theta}$, which is equivalent to setting the partial derivatives of $Q$ with respect to each parameter to zero. Thus, we have
\begin{equation} \begin{split} \frac{\partial Q}{\partial \mathbf{\omega}} &= \frac{1}{\sigma_e^2} \mathbf{Z}^T (\mathbf{y} - \mathbf{Z}\mathbf{\omega} - \mathbf{X}\mathbf{\mu}) = 0\\ \Rightarrow \hat{\mathbf{\omega}} &= (\mathbf{Z}^T \mathbf{Z})^{-1} \mathbf{Z}^T (\mathbf{y} - \mathbf{X}\mathbf{\mu})\\ \frac{\partial Q}{\partial \sigma_\beta^2} &= -\frac{p}{2\sigma_\beta^2} + \frac{1}{2\sigma_\beta^4} \left(\text{tr}(\mathbf{\Gamma}) + \mathbf{\mu}^T \mathbf{\mu} \right) = 0\\ \Rightarrow \hat{\sigma}_\beta^2 &= \frac{\text{tr}(\mathbf{\Gamma}) + \mathbf{\mu}^T \mathbf{\mu}}{p}\\ \frac{\partial Q}{\partial \sigma_e^2} &= -\frac{n}{2\sigma_e^2} + \frac{1}{2\sigma_e^4} \left(\|\mathbf{y} - \mathbf{Z}\mathbf{\omega}\|^2 + \text{tr}(\mathbf{X}\mathbf{\Gamma} \mathbf{X}^T) + \mathbf{\mu}^T\mathbf{X}^T \mathbf{X}\mathbf{\mu}- 2(\mathbf{y} - \mathbf{Z}\mathbf{\omega})^T \mathbf{X}\mathbf{\mu} \right) = 0\\ \Rightarrow \hat{\sigma}_e^2 &= \frac{1}{n} \left( \|\mathbf{y}-\mathbf{Z}\mathbf{\omega}\|^2 + \text{tr}\left(\mathbf{\Gamma} \mathbf{X}^T\mathbf{X}\right)+ \mathbf{\mu}^T\mathbf{X}^T\mathbf{X}\mathbf{\mu} - 2(\mathbf{y}-\mathbf{Z}\mathbf{\omega})^T \mathbf{X}\mathbf{\mu}\right) \end{split}\end{equation}
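
A direct translation of these update rules, assuming `mu` and `Gamma` come from the posterior computed in the E-step; the $\sigma_e^2$ update uses the completed-square form, and all names are mine:

```python
import numpy as np

def m_step(y, X, Z, mu, Gamma):
    """Closed-form M-step updates for omega, sigma_beta^2, and sigma_e^2."""
    n, p = X.shape
    omega = np.linalg.solve(Z.T @ Z, Z.T @ (y - X @ mu))       # (Z^T Z)^{-1} Z^T (y - X mu)
    sigma2_beta = (np.trace(Gamma) + mu @ mu) / p              # (tr(Gamma) + mu^T mu) / p
    resid = y - Z @ omega - X @ mu
    # ||y - Z omega - X mu||^2 + tr(Gamma X^T X): the completed-square form of the update
    sigma2_e = (resid @ resid + np.trace(Gamma @ (X.T @ X))) / n
    return omega, sigma2_beta, sigma2_e
```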

Statistical Inference in Incomplete Data Log-Likelihood

Let $\mathbf{\Sigma} = \sigma_\beta^2 \mathbf{X}\mathbf{X}^T + \sigma_e^2 \mathbf{I}_n$ be the covariance matrix of $\mathbf{y}$. Then the distribution of $\mathbf{y}$ given $\mathbf{\Theta}$ (with $\mathbf{\beta}$ marginalized out) is $\mathcal{N}(\mathbf{Z}\mathbf{\omega}, \mathbf{\Sigma})$, and the incomplete data log-likelihood is given by
\begin{equation} \begin{split} \ell(\mathbf{\Theta}) &=\log p(\mathbf{y}| \mathbf{\Theta})\\ &= -\frac{n}{2} \log(2\pi)-\frac{1}{2} \log |\mathbf{\Sigma}| - \frac{1}{2} (\mathbf{y} - \mathbf{Z}\mathbf{\omega})^T \mathbf{\Sigma}^{-1} (\mathbf{y} - \mathbf{Z}\mathbf{\omega}) \end{split}\end{equation}

When $|\Delta \ell| = |\ell(\mathbf{\Theta}) - \ell(\mathbf{\Theta}^{\text{old}})| < \varepsilon$, the algorithm is considered to have converged, where $\varepsilon$ is a small tolerance.
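
As a baseline, $\ell(\mathbf{\Theta})$ can be evaluated directly with `numpy.linalg.slogdet` for a stable log-determinant; the eigenvalue-based version used later is cheaper inside the EM loop because the decomposition of $\mathbf{X}\mathbf{X}^T$ is reused across iterations. A sketch (names are mine):

```python
import numpy as np

def marginal_loglik(y, X, Z, omega, sigma2_beta, sigma2_e):
    """Incomplete-data log-likelihood: log N(y | Z omega, Sigma)."""
    n = y.shape[0]
    Sigma = sigma2_beta * (X @ X.T) + sigma2_e * np.eye(n)
    r = y - Z @ omega
    _, logdet = np.linalg.slogdet(Sigma)       # stable log-determinant
    quad = r @ np.linalg.solve(Sigma, r)       # r^T Sigma^{-1} r without an explicit inverse
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)
```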

The EM Algorithm: n >= p

E-step

In the E-step, we compute the posterior mean of $\mathbf{\beta}$ under the current parameter estimates:

\begin{equation} \hat{\mathbf{\beta}}= \mathbf{\mu}^{\text{old}} \end{equation}

where $\mathbf{\Gamma} = \left(\frac{\mathbf{X}^T \mathbf{X}}{\sigma_e^2} + \frac{\mathbf{I}_p}{\sigma_\beta^2} \right)^{-1}$ and $\mathbf{\mu} = \mathbf{\Gamma} \mathbf{X}^T (\mathbf{y}-\mathbf{Z}\mathbf{\omega})/\sigma_e^2$, both evaluated at $\mathbf{\Theta}^{\text{old}}$.

We use the eigenvalue decomposition of $\mathbf{X}^T \mathbf{X}$ to accelerate the computation of $\mathbf{\Gamma}$. Let $\mathbf{X}^T \mathbf{X} = \mathbf{Q} \mathbf{\Lambda} \mathbf{Q}^T$, where $\mathbf{Q}$ is an orthogonal matrix and $\mathbf{\Lambda}= \text{diag} \{\lambda_1, \dots, \lambda_p\}$ is a diagonal matrix with the eigenvalues of $\mathbf{X}^T \mathbf{X}$ on the diagonal. Since $\mathbf{X}$ does not change across iterations, this decomposition only needs to be computed once. Then we have
\begin{equation} \begin{split} \mathbf{\Gamma} &= \left(\frac{\mathbf{X}^T \mathbf{X}}{\sigma_e^2} + \frac{\mathbf{I}_p}{\sigma_\beta^2} \right)^{-1}\\ &= \left(\frac{\mathbf{Q} \mathbf{\Lambda} \mathbf{Q}^T}{\sigma_e^2} + \frac{\mathbf{I}_p}{\sigma_\beta^2} \right)^{-1}\\ &= \mathbf{Q} \left(\frac{\mathbf{\Lambda}}{\sigma_e^2} + \frac{\mathbf{I}_p}{\sigma_\beta^2} \right)^{-1} \mathbf{Q}^T \end{split}\end{equation}

where $\frac{\mathbf{\Lambda}}{\sigma_e^2} + \frac{\mathbf{I}_p}{\sigma_\beta^2}$ is a diagonal matrix, so its inverse is obtained simply by taking the reciprocals of its diagonal elements.
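
A sketch of this eigendecomposition-based E-step; `evals` and `Q` are assumed to be precomputed once from `numpy.linalg.eigh(X.T @ X)` (names are mine):

```python
import numpy as np

def e_step(y, X, Z, Q, evals, omega, sigma2_beta, sigma2_e):
    """E-step using X^T X = Q diag(evals) Q^T, which is computed only once."""
    d = 1.0 / (evals / sigma2_e + 1.0 / sigma2_beta)   # diagonal of (Lambda/sigma_e^2 + I/sigma_beta^2)^{-1}
    Gamma = (Q * d) @ Q.T                              # Q diag(d) Q^T
    mu = Gamma @ X.T @ (y - Z @ omega) / sigma2_e
    return mu, Gamma

# Precompute once before the EM loop:
# evals, Q = np.linalg.eigh(X.T @ X)
```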

M-step

The M-step is to maximize the expectation of the complete data log-likelihood with respect to Θ\mathbf{\Theta}:
\begin{equation} \mathbf{\Theta} = \arg\max_\mathbf{\Theta} Q(\mathbf{\Theta}|\mathbf{\Theta}^{\text{old}}) \end{equation}
We have
\begin{equation} \begin{split} \hat{\mathbf{\omega}} &= (\mathbf{Z}^T \mathbf{Z})^{-1} \mathbf{Z}^T (\mathbf{y} - \mathbf{X}\hat{\mathbf{\beta}})\\ \hat{\sigma}_\beta^{2} &= \frac{1}{p} \left(\text{tr}(\mathbf{\Gamma}) + \| \hat{\mathbf{\beta}}\|^2 \right)\\ \hat{\sigma}_e^{2} &= \frac{1}{n}\left( \|\mathbf{y}-\mathbf{Z}\hat{\mathbf{\omega}}\|^2 + \text{tr}\left(\mathbf{\Gamma}\mathbf{X}^T\mathbf{X}\right)+ \|\mathbf{X}\hat{\mathbf{\beta}} \|^2 - 2(\mathbf{y}-\mathbf{Z}\hat{\mathbf{\omega}})^T \mathbf{X}\hat{\mathbf{\beta}}\right) \end{split}\end{equation}

where
\begin{equation} \begin{split} \text{tr}\left(\mathbf{\Gamma}\right)&= \text{tr}\left(\mathbf{Q} \left(\frac{\mathbf{\Lambda}}{\sigma_e^2} + \frac{\mathbf{I}_p}{\sigma_\beta^2} \right)^{-1} \mathbf{Q}^T\right)\\ &= \text{tr}\left( \left(\frac{\mathbf{\Lambda}}{\sigma_e^2} + \frac{\mathbf{I}_p}{\sigma_\beta^2} \right)^{-1} \right)\\ &= \sum_{i=1}^p \frac{\sigma_\beta^2 \sigma_e^2}{\sigma_e^2 + \sigma_\beta^2 \lambda_i}\\ \text{tr}\left(\mathbf{\Gamma}\mathbf{X}^T\mathbf{X}\right) &= \text{tr}\left(\mathbf{Q} \left(\frac{\mathbf{\Lambda}}{\sigma_e^2} + \frac{\mathbf{I}_p}{\sigma_\beta^2} \right)^{-1} \mathbf{Q}^T \mathbf{Q} \mathbf{\Lambda} \mathbf{Q}^T\right)\\ &= \text{tr}\left(\mathbf{Q} \left(\frac{\mathbf{\Lambda}}{\sigma_e^2} + \frac{\mathbf{I}_p}{\sigma_\beta^2} \right)^{-1} \mathbf{\Lambda} \mathbf{Q}^T\right)\\ &= \sum_{i=1}^p \frac{\lambda_i \sigma_\beta^2 \sigma_e^2}{\sigma_e^2 + \sigma_\beta^2 \lambda_i} \end{split}\end{equation}

Incomplete Data Log-Likelihood

However, we cannot compute $\log |\mathbf{\Sigma}|$ and $\mathbf{\Sigma}^{-1}$ in the likelihood naively, since the determinant can overflow or underflow numerically[2]. Instead, we use the eigenvalue decomposition of $\mathbf{X}\mathbf{X}^T$ to compute the log-determinant and the inverse of $\mathbf{\Sigma}$[3]. Let $\mathbf{X}\mathbf{X}^T = \tilde{\mathbf{Q}} \tilde{\mathbf{\Lambda}} \tilde{\mathbf{Q}}^T$, where $\tilde{\mathbf{Q}}$ is an orthogonal matrix and $\tilde{\mathbf{\Lambda}}= \text{diag} \{ \tilde{\lambda}_1, \dots, \tilde{\lambda}_n\}$ is a diagonal matrix with the eigenvalues of $\mathbf{X}\mathbf{X}^T$ on the diagonal. Then we have
\begin{equation} \begin{split} \log |\mathbf{\Sigma}| &= \log |\sigma_\beta^2 \mathbf{X}\mathbf{X}^T + \sigma_e^2 \mathbf{I}_n|\\ &= \sum_{i=1}^n \log (\sigma_\beta^2 \tilde{\lambda}_i + \sigma_e^2)\\ \mathbf{\Sigma}^{-1} &= \left(\sigma_\beta^2 \mathbf{X}\mathbf{X}^T + \sigma_e^2 \mathbf{I}_n\right)^{-1}\\ &= \left(\sigma_\beta^2 \tilde{\mathbf{Q}} \tilde{\mathbf{\Lambda}} \tilde{\mathbf{Q}}^T + \sigma_e^2 \mathbf{I}_n\right)^{-1}\\ &= \tilde{\mathbf{Q}} \left(\sigma_\beta^2 \tilde{\mathbf{\Lambda}} + \sigma_e^2 \mathbf{I}_n\right)^{-1} \tilde{\mathbf{Q}}^T \end{split}\end{equation}

where $\sigma_\beta^2 \tilde{\mathbf{\Lambda}} + \sigma_e^2 \mathbf{I}_n$ is a diagonal matrix, so its inverse is obtained simply by taking the reciprocals of its diagonal elements.

Therefore, the incomplete data log-likelihood is given by
\begin{equation} \begin{split} \ell(\mathbf{\Theta}) &= -\frac{n}{2} \log(2\pi)-\frac{1}{2} \sum_{i=1}^n \log \left(\sigma_\beta^2\tilde{\lambda}_i + \sigma_e^2 \right) \\&\quad- \frac{1}{2} (\mathbf{y} - \mathbf{Z}\mathbf{\omega})^T \tilde{\mathbf{Q}} \left(\sigma_\beta^2 \tilde{\mathbf{\Lambda}} + \sigma_e^2 \mathbf{I}_n\right)^{-1} \tilde{\mathbf{Q}}^T (\mathbf{y} - \mathbf{Z}\mathbf{\omega}) \end{split}\end{equation}

where $\tilde{\mathbf{Q}}$ is the orthogonal matrix in the eigenvalue decomposition of $\mathbf{X}\mathbf{X}^T$ and $\tilde{\mathbf{\Lambda}}$ is the diagonal matrix with the eigenvalues of $\mathbf{X}\mathbf{X}^T$ on the diagonal.
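
The same likelihood written against the eigendecomposition of $\mathbf{X}\mathbf{X}^T$, so no determinant or explicit inverse of $\mathbf{\Sigma}$ is ever formed; `Q_t` and `evals_t` would be computed once with `numpy.linalg.eigh(X @ X.T)` (names are mine):

```python
import numpy as np

def marginal_loglik_eig(y, Z, omega, Q_t, evals_t, sigma2_beta, sigma2_e):
    """ell(Theta) via X X^T = Q_t diag(evals_t) Q_t^T; no determinant or inverse of Sigma."""
    n = y.shape[0]
    d = sigma2_beta * evals_t + sigma2_e     # eigenvalues of Sigma
    r = Q_t.T @ (y - Z @ omega)              # residual rotated into the eigenbasis
    return -0.5 * (n * np.log(2 * np.pi) + np.sum(np.log(d)) + np.sum(r**2 / d))

# Precompute once before the EM loop:
# evals_t, Q_t = np.linalg.eigh(X @ X.T)
```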

Pseudocode

The EM algorithm alternates between the E-step and the M-step until convergence. For $n \ge p$, it is summarized as follows; a Python sketch of the full loop follows the list.

  1. Initialize $\mathbf{\Theta}^{(0)}$:
    $\mathbf{\omega}^{(0)}=\left(\mathbf{Z}^T \mathbf{Z}\right)^{-1} \mathbf{Z}^T \mathbf{y}$, $\sigma_\beta^{2(0)} = \sigma_e^{2(0)} = \mathbb{V}(\mathbf{y}-\mathbf{Z}\mathbf{\omega}^{(0)})/2$.
  2. For $t = 0, 1, \dots$, MAX_ITERATION - 1:
    1. E-step: Estimate $\mathbf{\beta}$.
      \begin{equation}\begin{split} \mathbf{\Gamma}^{(t)} &= \mathbf{Q} \left(\frac{\mathbf{\Lambda}}{\sigma_e^{2(t)}} + \frac{\mathbf{I}_p}{\sigma_\beta^{2(t)}} \right)^{-1} \mathbf{Q}^T\\ \hat{\mathbf{\beta}}^{(t+1)} &= \mathbf{\Gamma}^{(t)} \mathbf{X}^T (\mathbf{y}-\mathbf{Z}\mathbf{\omega}^{(t)})/\sigma_e^{2(t)} \end{split}\end{equation}
    2. M-step: Estimate $\hat{\mathbf{\Theta}}=\{\hat{\mathbf{\omega}}, \hat{\sigma}_\beta^{2}, \hat{\sigma}_e^{2}\}$.
      \begin{equation} \begin{split} \hat{\mathbf{\omega}}^{(t+1)} &= (\mathbf{Z}^T \mathbf{Z})^{-1} \mathbf{Z}^T (\mathbf{y} - \mathbf{X}\hat{\mathbf{\beta}}^{(t+1)})\\ \hat{\sigma}_\beta^{2(t+1)} &= \frac{1}{p} \left(\sum_{i=1}^p \left({\lambda}_i/\sigma_e^{2(t)} +1/\sigma_\beta^{2(t)}\right)^{-1} + \|\hat{\mathbf{\beta}}^{(t+1)}\|^2 \right)\\ \hat{\sigma}_e^{2(t+1)} &= \frac{1}{n}\left( \|\mathbf{y}-\mathbf{Z}\hat{\mathbf{\omega}}^{(t+1)}\|^2 + \sum_{i=1}^p \frac{{\lambda}_i \sigma_e^{2(t)} \sigma_\beta^{2(t)}}{\sigma_e^{2(t)} + \sigma_\beta^{2(t)} {\lambda}_i}\right.\\ &\left.\quad +\|\mathbf{X}\hat{\mathbf{\beta}}^{(t+1)} \|^2 - 2(\mathbf{y}-\mathbf{Z}\hat{\mathbf{\omega}}^{(t+1)})^T \mathbf{X}\hat{\mathbf{\beta}}^{(t+1)}\right) \end{split}\end{equation}
    3. Calculate $\ell(\mathbf{\Theta}^{(t+1)})$:
      \begin{equation} \begin{split} \ell(\mathbf{\Theta}^{(t+1)}) &= -\frac{n}{2} \log(2\pi)-\frac{1}{2} \sum_{i=1}^n \log\left(\sigma_\beta^{2(t+1)} \tilde{\lambda}_i + \sigma_e^{2(t+1)} \right) \\ &\quad-\frac{1}{2} (\mathbf{y} - \mathbf{Z}\hat{\mathbf{\omega}}^{(t+1)})^T \tilde{\mathbf{Q}} \left(\sigma_\beta^{2(t+1)} \tilde{\mathbf{\Lambda}} + \sigma_e^{2(t+1)} \mathbf{I}_n\right)^{-1} \tilde{\mathbf{Q}}^T (\mathbf{y} - \mathbf{Z}\hat{\mathbf{\omega}}^{(t+1)}) \end{split} \end{equation}
    4. Check $|\Delta \ell|=|\ell(\mathbf{\Theta}^{(t+1)}) - \ell(\mathbf{\Theta}^{(t)})| < \varepsilon$ for convergence. If converged, stop; otherwise, continue.
  3. Return results.
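
Putting the pieces together, here is a minimal sketch of the $n \ge p$ loop under the assumptions above (all names, the maximum iteration count, and the tolerance are mine):

```python
import numpy as np

def em_lmm(y, X, Z, max_iter=200, tol=1e-6):
    """Minimal EM for the LMM y = Z omega + X beta + e with beta ~ N(0, sigma_beta^2 I)."""
    n, p = X.shape
    # Eigendecompositions are computed once, since X does not change.
    lam, Q = np.linalg.eigh(X.T @ X)          # X^T X = Q diag(lam) Q^T
    lam_t, Q_t = np.linalg.eigh(X @ X.T)      # X X^T = Q_t diag(lam_t) Q_t^T

    # Initialization.
    omega = np.linalg.solve(Z.T @ Z, Z.T @ y)
    s2b = s2e = np.var(y - Z @ omega) / 2.0
    ll_old = -np.inf

    for _ in range(max_iter):
        # E-step: posterior mean of beta; Gamma is diagonal in the eigenbasis of X^T X.
        d = 1.0 / (lam / s2e + 1.0 / s2b)
        Gamma = (Q * d) @ Q.T
        beta = Gamma @ X.T @ (y - Z @ omega) / s2e

        # M-step: closed-form updates.
        omega = np.linalg.solve(Z.T @ Z, Z.T @ (y - X @ beta))
        s2b = (d.sum() + beta @ beta) / p                     # (tr(Gamma) + ||beta||^2) / p
        resid = y - Z @ omega - X @ beta
        s2e = (resid @ resid + (d * lam).sum()) / n           # tr(Gamma X^T X) = sum(d * lam)

        # Marginal log-likelihood via the eigendecomposition of X X^T.
        dn = s2b * lam_t + s2e
        r = Q_t.T @ (y - Z @ omega)
        ll = -0.5 * (n * np.log(2 * np.pi) + np.sum(np.log(dn)) + np.sum(r**2 / dn))
        if abs(ll - ll_old) < tol:
            break
        ll_old = ll

    return omega, s2b, s2e, beta, ll
```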

The EM Algorithm: n < p

Since $\mathbf{X}^T \mathbf{X}$ is a $p \times p$ matrix and $\mathbf{X} \mathbf{X}^T$ is an $n \times n$ matrix, it is cheaper to invert the latter when $n < p$. Therefore, we use the Woodbury matrix identity to simplify the calculation of $\mathbf{\Gamma}$.

Woodbury Matrix Identity

\begin{equation} \begin{split} \left(\mathbf{A} + \mathbf{U} \mathbf{C} \mathbf{V}\right)^{-1} &= \mathbf{A}^{-1} - \mathbf{A}^{-1} \mathbf{U} \left(\mathbf{C}^{-1} + \mathbf{V} \mathbf{A}^{-1} \mathbf{U}\right)^{-1} \mathbf{V} \mathbf{A}^{-1}\\ \Rightarrow \left(\mathbf{I}+\mathbf{UV}\right)^{-1} \mathbf{U} &= \mathbf{U} \left(\mathbf{I} + \mathbf{VU}\right)^{-1} \quad \text{(push-through identity)} \end{split}\end{equation}

Letting $\mathbf{U} = \frac{\sigma_\beta^2\mathbf{X}^T}{\sigma_e^2}$ and $\mathbf{V} = \mathbf{X}$ in the push-through identity, we have
\begin{equation} \left(\frac{\mathbf{X}^T \mathbf{X}}{\sigma_e^2} + \frac{\mathbf{I}_p}{\sigma_\beta^2} \right)^{-1} \mathbf{X}^T = \mathbf{X}^T \left(\frac{\mathbf{X} \mathbf{X}^T}{\sigma_e^2} + \frac{\mathbf{I}_n}{\sigma_\beta^2} \right)^{-1} \end{equation}

Let
\begin{equation} \tilde{\mathbf{\Gamma}} = \left(\frac{\mathbf{X} \mathbf{X}^T}{\sigma_e^2} + \frac{\mathbf{I}_n}{\sigma_\beta^2} \right)^{-1} \end{equation}
then we can use the eigenvalue decomposition of $\mathbf{X} \mathbf{X}^T$ to accelerate the calculation of $\tilde{\mathbf{\Gamma}}$. Moreover, by the push-through identity above, $\mathbf{\Gamma}\mathbf{X}^T = \mathbf{X}^T \tilde{\mathbf{\Gamma}}$, and therefore
\begin{equation} \hat{\mathbf{\beta}}=\tilde{\mathbf{\mu}}=\mathbf{X}^T \tilde{\mathbf{\Gamma}} (\mathbf{y}-\mathbf{Z}\mathbf{\omega})/\sigma_e^2 \end{equation}
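
A quick numerical check that the two routes to $\hat{\mathbf{\beta}}$ agree; the matrices and parameter values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5, 20                      # n < p
s2b, s2e = 0.5, 1.0
X = rng.normal(size=(n, p))
r = rng.normal(size=n)            # stands in for y - Z omega

Gamma = np.linalg.inv(X.T @ X / s2e + np.eye(p) / s2b)        # p x p
Gamma_t = np.linalg.inv(X @ X.T / s2e + np.eye(n) / s2b)      # n x n

beta_p = Gamma @ X.T @ r / s2e          # p x p route
beta_n = X.T @ Gamma_t @ r / s2e        # n x n route via the push-through identity
print(np.allclose(beta_p, beta_n))      # True (up to floating-point error)
```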

Eigenvalue decomposition

Let $\mathbf{X} \mathbf{X}^T = \tilde{\mathbf{Q}} \tilde{\mathbf{\Lambda}} \tilde{\mathbf{Q}}^T$, where $\tilde{\mathbf{Q}}$ is an orthogonal matrix and $\tilde{\mathbf{\Lambda}}$ is a diagonal matrix with the eigenvalues of $\mathbf{X} \mathbf{X}^T$ on the diagonal. Then we have
\begin{equation} \tilde{\mathbf{\Gamma}} = \left(\frac{\tilde{\mathbf{Q}} \tilde{\mathbf{\Lambda}} \tilde{\mathbf{Q}}^T}{\sigma_e^2} + \frac{\mathbf{I}_n}{\sigma_\beta^2} \right)^{-1} = \tilde{\mathbf{Q}} \left(\frac{\tilde{\mathbf{\Lambda}}}{\sigma_e^2} + \frac{\mathbf{I}_n}{\sigma_\beta^2} \right)^{-1} \tilde{\mathbf{Q}}^T \end{equation}

where $\frac{\tilde{\mathbf{\Lambda}}}{\sigma_e^2} + \frac{\mathbf{I}_n}{\sigma_\beta^2}$ is a diagonal matrix, so its inverse is obtained simply by taking the reciprocals of its diagonal elements.

We can compute $\text{tr}(\mathbf{\Gamma}\mathbf{X}^T\mathbf{X}) = \text{tr}(\mathbf{X}\mathbf{\Gamma}\mathbf{X}^T) = \text{tr}(\mathbf{X}\mathbf{X}^T \tilde{\mathbf{\Gamma}})$ as follows:
\begin{equation}\begin{split} \text{tr}(\mathbf{X}\mathbf{X}^T \tilde{\mathbf{\Gamma}}) &= \text{tr}\left(\tilde{\mathbf{Q}} \tilde{\mathbf{\Lambda}} \tilde{\mathbf{Q}}^T \tilde{\mathbf{Q}} \left(\frac{\tilde{\mathbf{\Lambda}}}{\sigma_e^2} + \frac{\mathbf{I}_n}{\sigma_\beta^2} \right)^{-1} \tilde{\mathbf{Q}}^T\right)\\ &= \text{tr}\left(\tilde{\mathbf{Q}} \tilde{\mathbf{\Lambda}} \left(\frac{\tilde{\mathbf{\Lambda}}}{\sigma_e^2} + \frac{\mathbf{I}_n}{\sigma_\beta^2} \right)^{-1} \tilde{\mathbf{Q}}^T\right)\\ &= \sum_{i=1}^n \frac{\tilde{\lambda}_i \sigma_\beta^2 \sigma_e^2}{\sigma_e^2 + \sigma_\beta^2 \tilde{\lambda}_i} \end{split}\end{equation}

When $n < p$, the eigenvalues of $\mathbf{X}^T \mathbf{X}$ are the $n$ eigenvalues of $\mathbf{X} \mathbf{X}^T$ together with $p - n$ zeros, and each zero eigenvalue contributes $\sigma_\beta^2 \sigma_e^2 / \sigma_e^2 = \sigma_\beta^2$ to $\text{tr}(\mathbf{\Gamma})$. Hence
\begin{equation} \begin{split} \text{tr}(\mathbf{\Gamma})&=\sum_{i=1}^p \frac{\sigma_\beta^2 \sigma_e^2}{\sigma_e^2 + \sigma_\beta^2 \lambda_i}\\ &= \sum_{i=1}^n \frac{\sigma_\beta^2 \sigma_e^2}{\sigma_e^2 + \sigma_\beta^2 \tilde{\lambda}_i} + (p-n) \sigma_\beta^2\\ \Rightarrow\hat{\sigma}_\beta^2&=\frac{1}{p} \left(\text{tr}(\mathbf{\Gamma}) + \| \mathbf{\mu}\|^2 \right) \\ &= \frac{1}{p} \left(\text{tr}(\tilde{\mathbf{\Gamma}}) + \| \tilde{\mathbf{\mu}}\|^2 + (p-n) \sigma_\beta^2 \right) \end{split} \end{equation}
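
A small numerical check of this trace identity for arbitrary $n < p$ (names are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 5, 20
s2b, s2e = 0.5, 1.0
X = rng.normal(size=(n, p))

tr_full = np.trace(np.linalg.inv(X.T @ X / s2e + np.eye(p) / s2b))   # tr(Gamma), via the p x p matrix
lam_t = np.linalg.eigvalsh(X @ X.T)                                   # eigenvalues of X X^T
tr_small = np.sum(s2b * s2e / (s2e + s2b * lam_t)) + (p - n) * s2b    # n-term sum plus (p - n) * sigma_beta^2
print(np.allclose(tr_full, tr_small))   # True
```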

Pseudocode

The EM algorithm when $n<p$ is summarized as follows; a Python sketch of the modified steps follows the list.

  1. Initialize $\mathbf{\Theta}^{(0)}$:
    $\mathbf{\omega}^{(0)}=\left(\mathbf{Z}^T \mathbf{Z}\right)^{-1} \mathbf{Z}^T \mathbf{y}$, $\sigma_\beta^{2(0)} = \sigma_e^{2(0)} = \mathbb{V}(\mathbf{y}-\mathbf{Z}\mathbf{\omega}^{(0)})/2$.
  2. For $t = 0, 1, \dots$, MAX_ITERATION - 1:
    1. E-step: Estimate $\mathbf{\beta}$.
      \begin{equation}\begin{split} \tilde{\mathbf{\Gamma}}^{(t)} &= \tilde{\mathbf{Q}} \left(\frac{\tilde{\mathbf{\Lambda}}}{\sigma_e^{2(t)}} + \frac{\mathbf{I}_n}{\sigma_\beta^{2(t)}} \right)^{-1} \tilde{\mathbf{Q}}^T\\ \hat{\mathbf{\beta}}^{(t+1)} &= \mathbf{X}^T \tilde{\mathbf{\Gamma}}^{(t)}(\mathbf{y}-\mathbf{Z}\mathbf{\omega}^{(t)})/\sigma_e^{2(t)} \end{split}\end{equation}
    2. M-step: Estimate $\hat{\mathbf{\Theta}}=\{\hat{\mathbf{\omega}}, \hat{\sigma}_\beta^{2}, \hat{\sigma}_e^{2}\}$.
      \begin{equation} \begin{split} \hat{\mathbf{\omega}}^{(t+1)} &= (\mathbf{Z}^T \mathbf{Z})^{-1} \mathbf{Z}^T (\mathbf{y} - \mathbf{X}\hat{\mathbf{\beta}}^{(t+1)})\\ \hat{\sigma}_\beta^{2(t+1)} &= \frac{1}{p} \left(\sum_{i=1}^n \frac{\sigma_\beta^{2(t)} \sigma_e^{2(t)}}{\sigma_e^{2(t)} + \sigma_\beta^{2(t)} \tilde{\lambda}_i} + \|\hat{\mathbf{\beta}}^{(t+1)}\|^2+(p-n)\sigma_\beta^{2(t)} \right)\\ \hat{\sigma}_e^{2(t+1)} &= \frac{1}{n}\left( \|\mathbf{y}-\mathbf{Z}\hat{\mathbf{\omega}}^{(t+1)}\|^2 + \sum_{i=1}^n \frac{\tilde{\lambda}_i \sigma_e^{2(t)} \sigma_\beta^{2(t)}}{\sigma_e^{2(t)} + \sigma_\beta^{2(t)} \tilde{\lambda}_i}\right.\\ &\left.\quad+ \|\mathbf{X}\hat{\mathbf{\beta}}^{(t+1)} \|^2 - 2(\mathbf{y}-\mathbf{Z}\hat{\mathbf{\omega}}^{(t+1)})^T \mathbf{X}\hat{\mathbf{\beta}}^{(t+1)}\right) \end{split}\end{equation}
    3. Calculate $\ell(\mathbf{\Theta}^{(t+1)})$:
      \begin{equation} \begin{split} \ell(\mathbf{\Theta}^{(t+1)}) &= -\frac{n}{2} \log(2\pi)-\frac{1}{2} \sum_{i=1}^n \log\left(\sigma_\beta^{2(t+1)} \tilde{\lambda}_i + \sigma_e^{2(t+1)} \right) \\&\quad- \frac{1}{2} (\mathbf{y} - \mathbf{Z}\hat{\mathbf{\omega}}^{(t+1)})^T \tilde{\mathbf{Q}} \left(\sigma_\beta^{2(t+1)} \tilde{\mathbf{\Lambda}} + \sigma_e^{2(t+1)} \mathbf{I}_n\right)^{-1} \tilde{\mathbf{Q}}^T (\mathbf{y} - \mathbf{Z}\hat{\mathbf{\omega}}^{(t+1)}) \end{split} \end{equation}
    4. Check $|\Delta \ell|=|\ell(\mathbf{\Theta}^{(t+1)}) - \ell(\mathbf{\Theta}^{(t)})| < \varepsilon$ for convergence. If converged, stop; otherwise, continue.
  3. Return results.
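
Only the E-step and the $\sigma_\beta^2$ update differ from the $n \ge p$ sketch; here is a hedged sketch of one iteration for $n < p$, reusing the naming of the earlier code:

```python
import numpy as np

def em_step_n_lt_p(y, X, Z, Q_t, lam_t, omega, s2b, s2e):
    """One EM iteration for n < p, using only the n x n eigendecomposition X X^T = Q_t diag(lam_t) Q_t^T."""
    n, p = X.shape
    # E-step: beta_hat = X^T Gamma_tilde (y - Z omega) / sigma_e^2
    d = 1.0 / (lam_t / s2e + 1.0 / s2b)              # diagonal of Gamma_tilde in the eigenbasis
    Gamma_t = (Q_t * d) @ Q_t.T
    beta = X.T @ Gamma_t @ (y - Z @ omega) / s2e

    # M-step: the sigma_beta^2 update adds the (p - n) * sigma_beta^2 correction term.
    omega_new = np.linalg.solve(Z.T @ Z, Z.T @ (y - X @ beta))
    s2b_new = (d.sum() + beta @ beta + (p - n) * s2b) / p
    resid = y - Z @ omega_new - X @ beta
    s2e_new = (resid @ resid + (d * lam_t).sum()) / n    # tr(Gamma X^T X) = sum(d * lam_t)
    return omega_new, s2b_new, s2e_new, beta
```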

Codes and Results


Codes

  1. Generate data.
  2. Data exploration.
  3. EM algorithm.

Results

The results below are obtained by running the EM algorithm on a generated dataset.
[Figure: result on the generated dataset]

The results in the code are obtained by running the EM algorithm on the given dataset.

TODO:

Acceleration:

  1. Aitken acceleration (a sketch follows below)
  2. numba
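
For the first item, a possible direction is Aitken extrapolation of the log-likelihood sequence, a common accelerated stopping rule for EM; this sketch is not from the original post, and the names are mine:

```python
def aitken_stop(ll_prev2, ll_prev, ll_curr, tol=1e-6):
    """Aitken-accelerated stopping rule for an EM log-likelihood sequence."""
    denom = ll_prev - ll_prev2
    if denom == 0.0:                                      # sequence already flat
        return True
    c = (ll_curr - ll_prev) / denom                       # estimated rate of convergence
    if not 0.0 < c < 1.0:                                 # extrapolation not reliable yet
        return False
    ll_inf = ll_prev + (ll_curr - ll_prev) / (1.0 - c)    # Aitken estimate of the limiting log-likelihood
    return abs(ll_inf - ll_curr) < tol
```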

  1. Notes on the derivation of the posterior distribution of $\mathbf{\beta}$: Let $g(\mathbf{\beta})= -\frac{1}{2}(\mathbf{y} - \mathbf{Z}\mathbf{\omega} - \mathbf{X}\mathbf{\beta})^T (\sigma_e^2\mathbf{I}_n)^{-1} (\mathbf{y} - \mathbf{Z}\mathbf{\omega} - \mathbf{X}\mathbf{\beta}) -\frac{1}{2} \mathbf{\beta}^T (\sigma_\beta^2 \mathbf{I}_p)^{-1} \mathbf{\beta}$. Then $\nabla g(\mathbf{\beta}) = \mathbf{X}^T (\mathbf{y} - \mathbf{Z}\mathbf{\omega} - \mathbf{X}\mathbf{\beta})/\sigma_e^2 - \mathbf{\beta}/\sigma_\beta^2$. Setting $\nabla g(\mathbf{\beta}) = 0$ gives $\mathbf{\beta} = \mathbf{\Gamma} \mathbf{X}^T (\mathbf{y} - \mathbf{Z}\mathbf{\omega})/\sigma_e^2$, where $g(\mathbf{\beta})$ is maximized; this is $\mathbf{\mu}$. To obtain the covariance of $\mathbf{\beta}$, we compute the Hessian of $g(\mathbf{\beta})$ and evaluate it at $\mathbf{\beta} = \mathbf{\mu}$: $\nabla^2 g(\mathbf{\beta}) = -\mathbf{X}^T \mathbf{X}/\sigma_e^2 - \mathbf{I}_p/\sigma_\beta^2$. The covariance is the inverse of the negative Hessian, i.e., $\mathbf{\Gamma} = \left(\frac{\mathbf{X}^T \mathbf{X}}{\sigma_e^2} + \frac{\mathbf{I}_p}{\sigma_\beta^2} \right)^{-1}$. ↩︎

  2. $\log |\mathbf{\Sigma}|$: Directly calculating the determinant of $\mathbf{\Sigma}$ can cause numerical overflow: `RuntimeWarning: overflow encountered in reduce return ufunc.reduce(obj, axis, dtype, out, **passkwargs)`. That is because "If an array has a very small or very large determinant, then a call to det may overflow or underflow". When the dimension of $\mathbf{\Sigma}$ is large, $|\mathbf{\Sigma}|$ can be extremely close to zero, so it may be treated as zero by numpy. Therefore, we use $\log |\mathbf{\Sigma}| = \sum_{i=1}^n \log \lambda_i(\mathbf{\Sigma})$, where $\lambda_i(\mathbf{\Sigma})$ are the eigenvalues of $\mathbf{\Sigma}$, to calculate the log-determinant of $\mathbf{\Sigma}$. ↩︎

  3. When computing the eigenvalue decomposition, it is better to use `numpy.linalg.eigh` instead of `numpy.linalg.eig`, since $\mathbf{X}^T\mathbf{X}$ and $\mathbf{X}\mathbf{X}^T$ are real symmetric matrices; the former is faster and more accurate for symmetric input. ↩︎