Technical interview prep series for statistics jobs - Mixed model

modifiedname


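I won't repeat which companies hold statistics technical interviews; I've said it countless times.
Requirements differ by position, so please look them up yourself.

Feel free to nitpick and point out mistakes in the technical details, so that I don't say something wrong in an interview!! Thanks in advance!!

I'll first write down the questions I remember and fill in answers as I go.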
======================================================================================
why mixed model?

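A linear regression model requires that, given X, the error terms of the y's are independent of one another. If independence of the errors can't be guaranteed (e.g., repeated measurements on the same subject), the dependence has to be modeled explicitly.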

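If there is no missing data and the observations are complete and balanced, you can use multivariate ANOVA (the response is a matrix instead of a vector), but it is rarely used in practice.
Also, of course, MANOVA is only for categorical predictors.

When data are missing completely at random (or, with likelihood-based fitting, more generally missing at random), a mixed model can still use the observed part of a subject's incomplete record instead of dropping the subject, so it is the better choice.

For missing value imputation I only know KNN; what else can be done?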

I won't type out the equations.
-------------------------------
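Two broad ways to write the model:
1. random coefficient model: Random vs. Fixed effect, when to use which?
If the levels of the effect are a random sample from a population and you want to infer to the general case, then random. Example: a taste test (tea or whatever), where the tasters are randomly selected and you ultimately want to infer to people's taste in general, so taster is random.
If there are only a handful of possible levels in total, e.g., a treatment effect, it is definitely fixed.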

TBD
random intercept
random intercept, random slope model
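
For reference, the two models named above can be written out roughly as follows. This is only a sketch in common textbook notation; the indices (i for subject or cluster, j for the observation within it) and the single covariate x_{ij} are my own assumptions, not something from the original post, and the random effects are assumed independent of the errors.

```latex
% Random intercept model
y_{ij} = \beta_0 + \beta_1 x_{ij} + b_{0i} + \varepsilon_{ij},
\qquad b_{0i} \sim N(0, \sigma_b^2), \quad \varepsilon_{ij} \sim N(0, \sigma_e^2)

% Implied dependence between two observations in the same cluster (j \neq k)
\operatorname{Cov}(y_{ij}, y_{ik}) = \sigma_b^2,
\qquad
\operatorname{Corr}(y_{ij}, y_{ik}) = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_e^2}

% Random intercept, random slope model
y_{ij} = (\beta_0 + b_{0i}) + (\beta_1 + b_{1i})\, x_{ij} + \varepsilon_{ij},
\qquad
(b_{0i}, b_{1i})^\top \sim N(\mathbf{0}, G)
```

Note that a random intercept alone already implies a compound symmetry covariance within each cluster.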

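2. covariance pattern model: unstructured, compound symmetry, AR(1), and so on. These model the shape of the error covariance matrix directly. Which one to use depends on the nature of the data; to compare fits, look at the LR test (?) when the structures are nested.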
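
To make the covariance patterns concrete, here is a small sketch that builds the compound symmetry and AR(1) matrices for n repeated measures. The function names and the parameterization (a common variance `sigma2` and a correlation `rho`) are my own illustrative choices.

```python
def compound_symmetry(n, sigma2, rho):
    """n x n covariance: sigma2 on the diagonal, sigma2 * rho everywhere else."""
    return [[sigma2 if i == j else sigma2 * rho for j in range(n)]
            for i in range(n)]


def ar1(n, sigma2, rho):
    """n x n covariance: entry (i, j) is sigma2 * rho ** |i - j|."""
    return [[sigma2 * rho ** abs(i - j) for j in range(n)]
            for i in range(n)]


def n_cov_params(structure, n):
    """Number of free covariance parameters for n time points."""
    return {"cs": 2, "ar1": 2, "un": n * (n + 1) // 2}[structure]


if __name__ == "__main__":
    # Correlation decays with the time lag under AR(1), but not under CS.
    for row in ar1(4, 1.0, 0.5):
        print(row)
    for row in compound_symmetry(4, 1.0, 0.5):
        print(row)
```

Unstructured is the most flexible but costs n(n+1)/2 parameters, while compound symmetry and AR(1) use only two each, which is why the choice matters with many time points.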
-------------------------------
Model fit, point estimates: get the MLE for the fixed effects, report the variance estimates for the random effects.
Predicted values are based on the above.

ML -> underestimates the variance components.
REML -> unbiased (or at least less biased) variance components, even though the fixed-effect coefficients typically change little or not at all.
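
The simplest case I can think of to illustrate this is an intercept-only model y_i = mu + e_i with no random effects: ML divides the residual sum of squares by n, while REML divides by n - 1 because one degree of freedom is used up estimating mu. A minimal sketch (the function name is hypothetical):

```python
def ml_and_reml_variance(y):
    """Return (ML, REML) estimates of sigma^2 for y_i = mu + e_i."""
    n = len(y)
    mean = sum(y) / n                      # same estimate of mu under ML and REML
    ss = sum((v - mean) ** 2 for v in y)   # residual sum of squares
    return ss / n, ss / (n - 1)            # ML divides by n, REML by n - 1


if __name__ == "__main__":
    ml, reml = ml_and_reml_variance([1.0, 2.0, 3.0, 4.0, 5.0])
    print(ml, reml)
```

The estimate of mu is the same either way; only the variance estimate changes, and the ML one is always the smaller of the two.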

How exactly do REML and ML differ? (........TBD)

-------------------------------
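get MLE: EM algorithm, Newton-Raphson. The gist of Newton-Raphson: take a Taylor expansion of the log-likelihood around the current point, solve the expansion for the next point, and repeat until convergence. Replacing the Hessian (the second derivative) with its expectation gives Fisher scoring. EM is a different scheme: the E step takes the expectation of the complete-data log-likelihood given the current estimates, and the M step maximizes it; again you iterate until convergence.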
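
As a toy illustration of the Newton-Raphson iteration (not a mixed model), here is a sketch that finds the MLE of theta = log(rate) for i.i.d. Poisson counts, where the answer is known in closed form as log(mean(y)). The function name and defaults are my own.

```python
import math


def newton_raphson_poisson(y, theta=0.0, tol=1e-10, max_iter=100):
    """MLE of theta = log(rate) for i.i.d. Poisson counts y.

    logLik(theta) = sum(y) * theta - n * exp(theta) + const
    """
    n, total = len(y), sum(y)
    for _ in range(max_iter):
        score = total - n * math.exp(theta)    # first derivative
        hessian = -n * math.exp(theta)         # second derivative
        step = score / hessian
        theta -= step                          # theta_new = theta - score / hessian
        if abs(step) < tol:
            break
    return theta


if __name__ == "__main__":
    y = [2, 3, 4, 7]
    print(newton_raphson_poisson(y), math.log(sum(y) / len(y)))
```

In this particular example the Hessian does not depend on the data, so it equals its own expectation and Newton-Raphson coincides with Fisher scoring.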

-------------------------------
Hypothesis testing: linear contrasts are straightforward.

-------------------------------
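model selection: LR, AIC, BIC. AIC and BIC penalize additional parameters; the difference is the penalty added to -2 logLik, +2q for AIC versus +q log(n) for BIC (q = number of parameters, n = sample size). AIC tends to pick larger models and BIC smaller ones, so one gives an upper bound on the number of parameters in the final model and the other a lower bound; the final model can lie in between. For all of these numbers, smaller is better, given blablabla... And of course, don't use LR on models that aren't nested.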
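
A small sketch of the two criteria, with made-up log-likelihood values chosen only to show AIC and BIC disagreeing. The numbers are hypothetical, not from any real fit.

```python
import math


def aic(loglik, q):
    """AIC = -2 logLik + 2 q, where q is the number of parameters."""
    return -2.0 * loglik + 2.0 * q


def bic(loglik, q, n):
    """BIC = -2 logLik + q log(n), where n is the sample size."""
    return -2.0 * loglik + q * math.log(n)


if __name__ == "__main__":
    n = 100
    small = {"loglik": -100.0, "q": 3}   # hypothetical smaller model
    large = {"loglik": -97.5, "q": 5}    # hypothetical larger model
    print("AIC:", aic(**small), aic(**large))         # AIC favors the larger model
    print("BIC:", bic(n=n, **small), bic(n=n, **large))  # BIC favors the smaller one
```

The BIC penalty per parameter exceeds the AIC penalty as soon as log(n) > 2, which is why BIC leans toward smaller models for all but tiny samples.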

-------------------------------
diagnostics: normal plot and residuals
remedy?? TBD

relakuma

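The Gibbs sampler is a long story. In short, the goal in Bayes is to draw samples from the posterior distribution and then do inference through descriptive statistics of those draws. But in many cases, even when the posterior can be written down explicitly, there is no way to sample from it directly, so people have to resort to MCMC. The Gibbs sampler is a special kind of MCMC. For a full likelihood, joint sampling is sometimes hard, but the conditional distribution of a single parameter, conditioning on the current values of the other parameters, is often very simple. For example: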

p(x, y) \propto e^{-yx^2/2-y}y^2
This joint distribution looks hard to sample from directly, but you can write down the two conditional distributions (Gamma parameterized by shape and rate),

x|y ~ N(0, 1/y)
y|x ~ Gamma(3, x^2/2+1)
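and both of these are easy to sample from. So the rough idea of the Gibbs sampler is: start from a given pair of initial values (x_0, y_0) and sample iteratively: x_1 ~ p(x|y_0), y_1 ~ p(y|x_1), x_2 ~ p(x|y_1)...
You can show that the distribution of this Markov chain converges to your target distribution as the number of steps grows, i.e., after enough burn-in steps (x_n, y_n) is approximately distributed as e^{-yx^2/2-y}y^2, which is how this method achieves joint sampling.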

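P.S.: integrating out x first actually makes this joint distribution very simple, so it is only meant as an illustration. Also, the marginal of x here should be a (scaled) t distribution with 5 degrees of freedom.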
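
Here is a minimal sketch of the Gibbs sampler for exactly this example, using only the Python standard library. The burn-in length, seed, and starting values are arbitrary choices of mine. Note that `random.gammavariate` takes a shape and a scale, so the rate x^2/2 + 1 is inverted.

```python
import math
import random


def gibbs(n_iter, burn_in=1000, x0=0.0, y0=1.0, seed=0):
    """Gibbs sampler for p(x, y) proportional to exp(-y x^2 / 2 - y) * y^2."""
    rng = random.Random(seed)
    x, y = x0, y0
    draws = []
    for t in range(n_iter + burn_in):
        # x | y ~ N(0, 1/y): variance 1/y, so standard deviation 1/sqrt(y)
        x = rng.gauss(0.0, 1.0 / math.sqrt(y))
        # y | x ~ Gamma(shape=3, rate=x^2/2 + 1); gammavariate wants scale = 1/rate
        y = rng.gammavariate(3.0, 1.0 / (x * x / 2.0 + 1.0))
        if t >= burn_in:
            draws.append((x, y))
    return draws


if __name__ == "__main__":
    draws = gibbs(20000)
    mean_x = sum(d[0] for d in draws) / len(draws)
    mean_y = sum(d[1] for d in draws) / len(draws)
    print(mean_x, mean_y)
```

As a sanity check, integrating out x gives y a Gamma(5/2, 1) marginal, so the sample mean of y should land near 2.5 and the sample mean of x near 0.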

觉即不随

Many thanks to wwrechard for helping me review Bayesian!

About REML and ML:
The biggest difference between the two lies in how the variance components of a mixed model are estimated.

ML: The estimate that maximizes the log-likelihood function. For example, to estimate the fixed effects and variance components in a linear mixed model, we write down the log-likelihood function, which involves both, and then find the maximum.
The MLE is asymptotically unbiased. But in a mixed effect model, the ML estimate of the variance components is biased downward in finite samples, because it ignores the degrees of freedom used up in estimating the fixed effects.

REML: Aims at an unbiased (or less biased) estimate of the variance components in the mixed model. Conceptually, it first removes the fixed part of the model and then maximizes the likelihood of the residuals, which depends only on the variance components; the fixed effects are then estimated given those variance components.
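
Written out, the two objective functions for a linear mixed model look roughly like this. This is a sketch in standard notation that I am assuming here (it does not appear in the thread): y = X beta + Z b + epsilon, with V(theta) the marginal covariance of y and theta the variance components.

```latex
% Marginal model: y \sim N(X\beta, V(\theta)), \quad V(\theta) = Z G Z^\top + R

% ML log-likelihood (up to a constant)
\ell_{ML}(\beta, \theta) = -\tfrac{1}{2}\Big[ \log|V|
    + (y - X\beta)^\top V^{-1} (y - X\beta) \Big]

% REML log-likelihood (up to a constant)
\ell_{REML}(\theta) = -\tfrac{1}{2}\Big[ \log|V| + \log\big|X^\top V^{-1} X\big|
    + (y - X\hat\beta)^\top V^{-1} (y - X\hat\beta) \Big],
\qquad
\hat\beta = (X^\top V^{-1} X)^{-1} X^\top V^{-1} y
```

The extra term log|X^T V^{-1} X| is what accounts for the fixed effects having been estimated. It also depends on X, which is why REML likelihoods are not comparable across models with different fixed effects.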

In R and SAS, both options are generally available. In SAS, PROC MIXED defaults to REML, and PROC GLIMMIX defaults to a residual (restricted) pseudo-likelihood method.

relakuma

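For example, suppose you are studying the outcome of a surgery. The whole study is run at five different hospitals (each with many cases), and there are many other factors, such as which drugs were used. Consider a simple logistic model, P(y = 1) = g^{-1}(alpha[i] + X beta), where g is the logit link, alpha is the hospital effect, and the subscript i indicates which hospital the case was done at. Bayes hierarchical modeling puts a set of hyper-parameters, alpha_0 and sigma_0, on top of these 5 alpha[i] to model the similarity and dissimilarity between hospitals, i.e., we assume alpha[i] ~ N(alpha_0, sigma_0^2). So the whole model (including priors) ends up like this:

P(y = 1) = g^{-1}(alpha[i] + X beta)
alpha[i] ~ N(alpha_0, sigma_0^2), i =1,2,3,4,5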
alpha_0 ~ prior(alpha_0)
sigma_0 ~ prior(sigma_0)
beta ~ prior(beta)

Then just use Gibbs sampling to draw from the posterior; for hierarchical models the Gibbs sampler is generally fairly easy to write down.
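
To see what the hierarchy means generatively, here is a sketch that simulates data from the model top-down. For simplicity the hyper-parameters and beta are fixed at made-up values instead of being drawn from priors, and there is a single standard-normal covariate; all of these choices (and the function name) are hypothetical.

```python
import math
import random


def simulate_hospitals(n_per_hospital=200, n_hospitals=5,
                       alpha_0=-0.5, sigma_0=0.8, beta=1.0, seed=1):
    """Simulate (hospital, x, y) rows from the hierarchical logistic model."""
    rng = random.Random(seed)
    # Top level: hospital effects share a common distribution N(alpha_0, sigma_0^2)
    alphas = [rng.gauss(alpha_0, sigma_0) for _ in range(n_hospitals)]
    data = []
    for i, alpha_i in enumerate(alphas):
        for _ in range(n_per_hospital):
            x = rng.gauss(0.0, 1.0)                              # one covariate
            p = 1.0 / (1.0 + math.exp(-(alpha_i + beta * x)))    # inverse logit
            y = 1 if rng.random() < p else 0                     # Bernoulli outcome
            data.append((i, x, y))
    return alphas, data


if __name__ == "__main__":
    alphas, data = simulate_hospitals()
    for i, a in enumerate(alphas):
        ys = [y for h, _, y in data if h == i]
        print(i, round(a, 2), sum(ys) / len(ys))   # hospital, alpha[i], success rate
```

Fitting the model runs this logic in reverse: given the (hospital, x, y) rows, the sampler draws alpha[i], alpha_0, sigma_0, and beta from their posterior.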

modifiedname

Yes, this is going to be a series.

next topic is linear models,

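I'll try to make the what and why clear. The how is hard to test in an interview anyway, and assuming you've done a bit of this before, once you know the what and why, searching for how to do it is pretty quick.

why am I doing this? Because I believe that trying to explain something to others is the best way to master it yourself.
If anything I say is unclear, feel free to ask, and I'll try to make it clear.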

relakuma

Last edited by wwrechard on 2014-1-11 10:28

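Isn't missing value imputation usually done with MICE (multiple imputation by chained equations)? Also, the difference between REML and ML (the difference between random effect and fixed effect) is super intuitive in Bayes: the latter is an ordinary model, the former is a hierarchical model. Also, I don't understand what you mean by mixed model; is it a mixture model? If it is mixture, then as I understand it, the benefit of modeling with a mixture model is heavy tails. Also, there is a lot more you can do for diagnostics: besides the QQ plot and the Box-Cox transform, you can look at leverage, leave-one-(predictor)-out plots, residuals vs fitted, and Cook's distance.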

modifiedname

mixed != mixture...

relakuma

I googled it and found that a mixed model is just a model with both fixed effects and random effects... (embarrassed)... Oh well, in Bayes, random effects are a very trivial thing anyway...

modifiedname

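My Bayesian is as good as never learned; please explain why random effects are trivial in Bayesian.

btw, about post #1 mentioning EM: it isn't necessarily needed. You only need EM when the problem is too complex to solve directly.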

relakuma

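Another benefit of doing this kind of hierarchical modeling with Bayes is that software like JAGS exists. With JAGS you don't even need to derive the posterior yourself; as long as you can write the model from top to bottom like I did in my previous post, it will work out the posterior by itself (using slice sampling). JAGS itself is written in C++, and there is a directly corresponding package in R, so it is quite convenient in practice.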

modifiedname

Thanks~~~~~~ Please keep explaining: how the Gibbs sampler works and when it applies.
