Linear model and cell perturbation probability

Iwo

unread,

Mar 10, 2017, 4:42:25 AM3/10/17

to Perturb-seq

Hi,

Thanks for this excellent work and making the data available! I have some questions about maths and computations.

1. Do your linear models include an intercept? Is it set so that intercept equals to the mean of expression of unperturbed cells (cell withouth sgRNAs) or cells with a control guide?

2. I am also really curious in how you determine the probability that a cell has been perturbed. Would you mind explaining this process? If I understood correctly the section in the supplement, the final equation gives the probability that a cell j is perturbed (P(Xj = 1), but could you explain the meaning of the variables? I presume that the Y indicates observed expression level for gene i (and Y-hat the predicted value), Beta is the respective coefficient vector for the gene (column from the coefficient matrix for the main linear model), X0 is the row from the design matrix with 1 replaced by 0 for the corresponding sgRNA tested, is that correct? What about the sigma, could you explain which standard deviation it corresponds to? If you have some example code it would be hugely helpful.

Many thanks in advance,

Iwo

Atray Dixit

unread,

Mar 10, 2017, 11:01:08 AM3/10/17

to Perturb-seq

Hi Iwo,

1. Yes they do include an intercept. We have tried both setting the intercept to be equal to the mean expression of all cells (perturbed and unperturbed) as well as referencing it to just the control guide cells. Since we only had one nontargeting guide for the BMDC screen we set the intercept to the mean across all cells. In the pooled cloning figure (FigS5), we had several controls and set the intercept to those.

2. You are correct on all points re: variables. Our standard deviation was set to the standard deviation of the gene across all cells. You should be able to find code here (https://github.com/asncd/MIMOSCA) particularly in the run_model_bycol and bayes_cov_col functions in mimosca.py and this ipython notebook. The assumption of Gaussianity and constant standard deviation between perturbed and unperturbed population are not strictly true, but we found worked in practice especially when there is a strong signal across many genes. The residuals become closer to normally distributed as relevant covariates are added (ex: cell state covariates can explain bimodal expression patterns).

Good luck,

Atray

iwo.ku...@gmail.com

unread,

Mar 14, 2017, 11:12:45 AM3/14/17

to Perturb-seq

Thanks a lot for the reply and providing the example, that's hugely helpful!

Best,
Iwo

Atray Dixit

unread,

Mar 14, 2017, 7:40:10 PM3/14/17

to Perturb-seq

No problem!

Reply all

Reply to author

Forward