The OLM (Ordinary Linear Model) assumes that the observations are independent and identically distributed (i.i.d.). It can be written as
$$y = X\gamma + e,$$
where $y \in \mathbb{R}^n$ is the vector of responses, $X \in \mathbb{R}^{n \times p}$ is the matrix of $p$ independent variables, $\gamma \in \mathbb{R}^p$ is the vector of coefficients, and $e \sim \mathcal{N}(0, \sigma_e^2 I_n)$ is the independent noise term. To simplify the notation, we assume that $X$ and $y$ are centered, i.e., $y$ has mean 0 and every column of $X$ has mean 0, so we do not need to estimate an intercept term.
In the OLM, the coefficients $\gamma$ are obtained by minimizing the least-squares loss. However, when $n < p$, the matrix $X$ cannot have full column rank, so the least-squares solution is not unique and easily overfits: there are $p + 1$ parameters to estimate (including $\sigma_e^2$) but only $n$ observations.
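To make the rank problem concrete, here is a minimal NumPy sketch (the sizes and seed are arbitrary illustrations, not from the text) showing that $X^T X$ is rank-deficient when $n < p$ and that the least-squares fit then interpolates the training data exactly:

```python
# Minimal sketch: least squares with n < p (sizes are illustrative).
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 50                                  # fewer observations than predictors
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

print(np.linalg.matrix_rank(X.T @ X))          # at most n = 10 < p = 50: singular

# lstsq returns the minimum-norm solution among infinitely many minimizers.
gamma_hat, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(rank)                                    # 10
print(np.allclose(X @ gamma_hat, y))           # True: zero training error (overfitting)
```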
Let's assume the design matrix $X$ can be decomposed into two parts, $X = [X_1, X_2]$, where $X_1 \in \mathbb{R}^{n \times p_1}$ with $p_1 < n \ll p$ and $X_2 \in \mathbb{R}^{n \times p_2}$. Correspondingly, $\gamma$ can be broken down into a fixed-effect part $\omega \in \mathbb{R}^{p_1}$ and a random-effect part $\beta \in \mathbb{R}^{p_2}$ with $\beta \sim \mathcal{N}(0, \sigma_\beta^2 I_{p_2})$, i.e., $\gamma^T = [\omega^T, \beta^T]$. Then we have
$$y = X_1\omega + X_2\beta + e.$$
Thus, we obtain the LMM (Linear Mixed-Effect Model). In this case, the number of parameters to estimate is $p_1 + 2$ (the $p_1$ fixed effects plus $\sigma_\beta^2$ and $\sigma_e^2$), which is much smaller than the $p + 1$ of the OLM.
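As a small illustration (the sizes and variances below are arbitrary assumptions), the following NumPy sketch simulates data from this decomposition. Note that marginalizing over $\beta$ gives $y \sim \mathcal{N}(X_1\omega,\ \sigma_\beta^2 X_2 X_2^T + \sigma_e^2 I_n)$, which is the form used when the variance components are estimated.

```python
# Minimal simulation of y = X1 @ omega + X2 @ beta + e (illustrative sizes).
import numpy as np

rng = np.random.default_rng(1)
n, p1, p2 = 100, 3, 500
sigma_b, sigma_e = 0.5, 1.0

X1 = rng.standard_normal((n, p1))              # few fixed-effect covariates
X2 = rng.standard_normal((n, p2))              # many random-effect covariates
omega = np.array([1.0, -2.0, 0.5])             # fixed effects, estimated directly
beta = sigma_b * rng.standard_normal(p2)       # random effects ~ N(0, sigma_b^2 I)
e = sigma_e * rng.standard_normal(n)           # noise ~ N(0, sigma_e^2 I)

y = X1 @ omega + X2 @ beta + e

# Only p1 + 2 parameters (omega, sigma_b^2, sigma_e^2) have to be estimated,
# instead of the p + 1 parameters of the ordinary linear model.
print(y[:5])
```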
Model Description
The LMM is a statistical model that accounts for both fixed effects and random effects in a linear regression. It is used to model data whose observations are not independent and identically distributed.
Consider a dataset $\{y, X, Z\}$ with $n$ samples, where $y \in \mathbb{R}^n$ is the vector of responses, $X \in \mathbb{R}^{n \times p}$ is the matrix of $p$ independent variables, and $Z \in \mathbb{R}^{n \times c}$ is another matrix of $c$ variables. The linear mixed model assumes a linear relationship between $y$ and $(X, Z)$:
$$y = \underbrace{Z\omega}_{\text{fixed}} + \underbrace{X\beta}_{\text{random}} + \underbrace{e}_{\text{error}},$$
where $\omega \in \mathbb{R}^c$ is the vector of fixed effects, $\beta \in \mathbb{R}^p$ is the vector of random effects with $\beta \sim \mathcal{N}(0, \sigma_\beta^2 I_p)$, and $e \sim \mathcal{N}(0, \sigma_e^2 I_n)$ is the independent noise term.
The LMM can be fitted by various methods, such as restricted maximum likelihood (REML) and maximum likelihood (ML). REML does not base its estimates on a maximum-likelihood fit of all the information; instead, it uses a likelihood derived from a transformed data set, so that nuisance parameters (here, the fixed effects) have no effect on the variance-component estimates. ML estimates the parameters of a statistical model by finding the parameter values that maximize the likelihood of the observations. ML can be viewed as a special case of maximum a posteriori (MAP) estimation in which the prior over the parameters is uniform or non-informative.
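As one possible illustration (plain ML, not REML; the sizes, seed, and optimizer choices are assumptions made for the sketch), the code below maximizes the marginal likelihood $y \sim \mathcal{N}(Z\omega,\ \sigma_\beta^2 XX^T + \sigma_e^2 I_n)$ with NumPy/SciPy, profiling out the fixed effects $\omega$ by generalized least squares:

```python
# Minimal ML sketch for the LMM via its marginal likelihood (illustrative only).
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_vars, y, Z, X):
    s_b, s_e = np.exp(log_vars)                        # variance components > 0
    n = y.shape[0]
    V = s_b * (X @ X.T) + s_e * np.eye(n)              # marginal covariance of y
    omega = np.linalg.solve(Z.T @ np.linalg.solve(V, Z),
                            Z.T @ np.linalg.solve(V, y))   # GLS fixed effects
    r = y - Z @ omega
    _, logdet = np.linalg.slogdet(V)
    return 0.5 * (logdet + r @ np.linalg.solve(V, r) + n * np.log(2 * np.pi))

# Simulated data with the same structure as the model above.
rng = np.random.default_rng(2)
n, c, p = 200, 2, 300
Z = rng.standard_normal((n, c))
X = rng.standard_normal((n, p))
beta = 0.3 * rng.standard_normal(p)                    # true sigma_b^2 = 0.09
y = Z @ np.array([1.0, -1.0]) + X @ beta + 0.8 * rng.standard_normal(n)

res = minimize(neg_log_marginal_likelihood, x0=np.zeros(2), args=(y, Z, X))
print(np.exp(res.x))                                   # estimated (sigma_b^2, sigma_e^2)
```

Working with the $n \times n$ marginal covariance keeps the random effects out of the optimization entirely; REML would additionally adjust the likelihood for the degrees of freedom consumed by $\omega$.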
Formal Description of Bayesian Inference and Variational Inference
The following content is adapted mainly from Wikipedia-BI and Wikipedia-VBM. The purpose of this section is to provide a formal description of Bayesian inference and variational inference, which will be used in the following sections.
Definitions
$x$: a data point in general.
$X$: the sample, a set of $n$ observed data points, i.e., $X = \{x_1, x_2, \cdots, x_n\}$.
$\tilde{x}$: a new data point to be predicted.
$\theta$: the parameters of the data point's distribution, i.e., $x \sim p(x \mid \theta)$.
$\alpha$: the hyperparameters of the distribution of $\theta$, i.e., $\theta \sim p(\theta \mid \alpha)$.
$Z$: the latent variables of the data point's distribution, i.e., $x \sim p(x \mid Z)$.
$q(Z)$: the variational distribution, a distribution over the latent variables $Z$ that approximates the posterior, i.e., $p(Z \mid X) \approx q(Z)$.
Bayesian Inference
Prior: $p(\theta \mid \alpha)$, the distribution of the parameters $\theta$ before observing the data $X$.
Sampling Distribution and Likelihood: $p(X \mid \theta) = L(\theta \mid X)$, the distribution of the data $X$ given the parameters $\theta$, or equivalently the likelihood of the parameters $\theta$ given the data $X$.
Marginal Likelihood (Evidence): $p(X \mid \alpha)$, the distribution of the observed data $X$ marginalized over the prior distribution of the parameters $\theta$, i.e., $p(X \mid \alpha) = \int p(X \mid \theta)\, p(\theta \mid \alpha)\, d\theta$.
Posterior: $p(\theta \mid X, \alpha)$, the distribution of the parameters $\theta$ after observing the data $X$. It is proportional to the product of the likelihood and the prior, i.e., $p(\theta \mid X, \alpha) \propto p(X \mid \theta)\, p(\theta \mid \alpha)$.
Posterior Predictive Distribution: $p(\tilde{x} \mid X, \alpha) = \int p(\tilde{x} \mid \theta)\, p(\theta \mid X, \alpha)\, d\theta$, where $p(\tilde{x} \mid \theta)$ is the distribution of the new data point $\tilde{x}$ given the parameters $\theta$, and $p(\theta \mid X, \alpha)$ is the posterior distribution of the parameters $\theta$ given the observed data $X$.
Prior Predictive Distribution: $p(\tilde{x} \mid \alpha) = \int p(\tilde{x} \mid \theta)\, p(\theta \mid \alpha)\, d\theta$, where $p(\tilde{x} \mid \theta)$ is the distribution of the new data point $\tilde{x}$ given the parameters $\theta$, and $p(\theta \mid \alpha)$ is the prior distribution of the parameters $\theta$. (A worked conjugate-Gaussian example of these quantities follows this list.)
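As a minimal worked example (not from the source), take the conjugate Gaussian-mean model: $x_i \sim \mathcal{N}(\theta, \sigma^2)$ with known variance $\sigma^2$, prior $\theta \sim \mathcal{N}(\mu_0, \tau_0^2)$, and hyperparameters $\alpha = (\mu_0, \tau_0^2)$. All of the quantities above are then Gaussian and available in closed form; the numbers below are purely illustrative.

```python
# Conjugate Gaussian-mean example: prior, posterior, and predictive distributions.
import numpy as np

rng = np.random.default_rng(3)
sigma2 = 1.0                             # known likelihood variance
mu0, tau02 = 0.0, 10.0                   # hyperparameters alpha of the prior on theta
X = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=50)   # observed sample
n, xbar = X.size, X.mean()

# Posterior p(theta | X, alpha): Gaussian by conjugacy.
taun2 = 1.0 / (1.0 / tau02 + n / sigma2)
mun = taun2 * (mu0 / tau02 + n * xbar / sigma2)

# Prior predictive p(x~ | alpha) and posterior predictive p(x~ | X, alpha):
# Gaussians obtained by integrating theta out analytically.
prior_pred = (mu0, sigma2 + tau02)       # (mean, variance)
post_pred = (mun, sigma2 + taun2)        # (mean, variance)

print("posterior mean/variance:      ", mun, taun2)
print("prior predictive mean/var:    ", prior_pred)
print("posterior predictive mean/var:", post_pred)
```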
Variational Inference
KL Divergence: $\mathrm{KL}(q \parallel p) = \int q(Z) \log \frac{q(Z)}{p(Z \mid X, \theta)}\, dZ$, where $q(Z)$ and $p(Z \mid X, \theta)$ are two distributions over the latent variables $Z$.
Q Function: $Q(\theta^{\text{old}}, \theta) = \mathbb{E}_{p(Z \mid X, \theta^{\text{old}})}[\log p(X, Z \mid \theta)] + \text{const}$, where $p(X, Z \mid \theta)$ is the joint distribution of the observed data $X$ and the latent variables $Z$, i.e., the likelihood of the complete data set $(X, Z)$.
Evidence Lower Bound (ELBO): $\mathcal{L}(q, \theta) = \mathrm{ELBO}(q, \theta) = \int q(Z) \log \frac{p(X, Z \mid \theta)}{q(Z)}\, dZ = \int q(Z) \log p(X \mid Z, \theta)\, dZ - \mathrm{KL}(q(Z) \parallel p(Z \mid \theta))$, where $p(X, Z) = p(X \mid Z)\, p(Z)$ is the joint distribution of the observed data $X$ and the latent variables $Z$, and $p(Z)$ is the prior distribution of the latent variables $Z$. Since $\log p(X \mid \theta) = \mathrm{ELBO}(q, \theta) + \mathrm{KL}(q(Z) \parallel p(Z \mid X, \theta))$, maximizing the ELBO over $q$ is equivalent to minimizing the KL divergence between $q(Z)$ and the true posterior $p(Z \mid X, \theta)$.
Mean Field Approximation: $q(Z) = \prod_{i=1}^{n} q_i(Z_i)$, where $Z_i$ is the $i$-th latent variable.
Coordinate Ascent Variational Inference: $q_i^*(Z_i) = \arg\max_{q_i(Z_i)} \mathrm{ELBO}(q)$, where $q_{-i}(Z_{-i}) = \prod_{j \neq i} q_j(Z_j)$ is held fixed.
Expectation Maximization with Mean Field Approximation: $q_i^*(Z_i) = \arg\max_{q_i(Z_i)} \mathbb{E}_{q_{-i}(Z_{-i})}[\mathrm{ELBO}(q)]$, where $q_{-i}(Z_{-i}) = \prod_{j \neq i} q_j(Z_j)$ is in turn updated by $q_{-i}^*(Z_{-i}) = \arg\max_{q_{-i}(Z_{-i})} \mathbb{E}_{q_i(Z_i)}[\mathrm{ELBO}(q)]$. In practice, we usually work in log space, i.e., $\log q_i^*(Z_i \mid X) = \mathbb{E}_{q_{-i}(Z_{-i})}[\log p(X, Z)] + \text{const}$ (see the sketch after this list).
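As a concrete sketch of coordinate ascent under the mean-field approximation (the classic bivariate-Gaussian example; the target distribution and numbers are assumptions, not from the source), let the target be $p(z_1, z_2) = \mathcal{N}(\mu, \Lambda^{-1})$ with precision matrix $\Lambda$ and take $q(z) = q_1(z_1)\, q_2(z_2)$. Applying $\log q_i^*(z_i) = \mathbb{E}_{q_{-i}}[\log p(z)] + \text{const}$ yields Gaussian factors whose means are updated in turn:

```python
# Mean-field coordinate ascent for a bivariate Gaussian target (illustrative).
import numpy as np

mu = np.array([1.0, -1.0])                 # target mean
Lam = np.array([[2.0, 0.9],                # target precision matrix
                [0.9, 1.5]])

# Each factor is Gaussian, q_i(z_i) = N(m_i, 1 / Lam[i, i]); coordinate ascent
# only updates the means, the factor variances are fixed by the precision.
m = np.zeros(2)                            # initial factor means
for _ in range(50):
    m[0] = mu[0] - (Lam[0, 1] / Lam[0, 0]) * (m[1] - mu[1])
    m[1] = mu[1] - (Lam[1, 0] / Lam[1, 1]) * (m[0] - mu[0])

print(m)                                   # converges to the true mean [1, -1]
print(1 / np.diag(Lam))                    # factor variances; note they are smaller
                                           # than the true marginal variances of p
```

A well-known property of this factorized approximation is that it matches the exact mean but underestimates the marginal variances of the target distribution.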