RE: Enquiry about SCALE


Jiang, Yuchao

Apr 28, 2017, 4:53:31 PM
to Roy Moh Lik Ang, SCALE_s...@googlegroups.com, min...@mail.med.upenn.edu, Zhang, Nancy R

Hi Roy,

 

Thanks for your interest in SCALE! Yes, SCALE can be applied to datasets where only a subset of the cells have spike-ins. In fact, in the real datasets in our paper (human fibroblast and mouse blastocyst), only 12 and 30 cells, respectively, have spike-ins. We estimate the technical-noise parameters assuming they are shared across cells that are sequenced and processed in the same batch.

 

You can feed the normalization step only the cells that have spike-ins. That said, I’m not sure what leads to only 60% of the cells having **detected** spike-in reads. It seems to me this could result from poor library prep (e.g., imbalanced concentrations of spike-ins and endogenous RNAs), and if that is the case, I would recommend stringent QC on the entire cohort of cells, not just the ones without spike-ins.
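To illustrate, here is a minimal R sketch of the kind of subsetting described above; the matrix layout, variable names, and toy counts are my assumptions, not part of the SCALE API:

```r
# Toy spike-in count matrix (spike-ins x cells); values are made up.
spikein_counts <- matrix(c(5, 0, 12,
                           3, 0,  8), nrow = 2, byrow = TRUE,
                         dimnames = list(c("ERCC-1", "ERCC-2"),
                                         c("cell1", "cell2", "cell3")))

# Keep only cells with detected spike-in reads before the normalization step.
has_spikein <- colSums(spikein_counts) > 0
cells_with_spikeins <- colnames(spikein_counts)[has_spikein]
cells_with_spikeins  # "cell1" "cell3"
```

The technical-noise parameters estimated from these cells can then be applied to all cells from the same batch.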

 

Hope that this helps and feel free to let us know if you have further questions!

 

Yuchao

 

 

From: Roy Moh Lik Ang [mailto:ra...@stanford.edu]
Sent: Friday, April 28, 2017 4:20 PM
To: Jiang, Yuchao <yuc...@wharton.upenn.edu>
Subject: Enquiry about SCALE

 

Dear Yuchao,

 

I am Roy, a second-year genetics PhD student at Stanford University. I recently came across your paper in Genome Biology, “SCALE: modeling allele-specific gene expression by single-cell RNA sequencing”. I am interested in applying your package to study transcriptional bursting in another scRNA-seq dataset with allele-specific counts. However, although spike-ins were added to every cell, only ~60% of the sequenced cells have detectable spike-in reads. I was wondering whether SCALE can still be applied to the entire dataset, or whether cells without detected spike-in reads have to be filtered out before analysis. Is there some way to estimate technical noise for all single cells using only the spike-ins from a subset of them?

 

Your help is greatly appreciated. Congratulations on your publication!

 

Regards,

Roy

 

Jiang, Yuchao

Apr 28, 2017, 5:05:36 PM
to Roy Moh Lik Ang, SCALE_s...@googlegroups.com

Just to clarify -- in our case, only 12 and 30 cells have spike-ins added, and we use all of them. The other cells don’t have spike-ins added.

 

Yuchao

Jiang, Yuchao

May 3, 2017, 11:44:05 AM
to Roy Moh Lik Ang, SCALE_s...@googlegroups.com
Hi Roy,

No problem! Happy to help. As we mention in the paper, correctly calculating the cell size factor isn’t trivial. It can be inferred either from GAPDH expression levels (after adjusting for the library size factor, which reflects sequencing depth rather than cell size differences; see the first equation in the Methods section of our paper), or from the ratio of endogenous reads to the total number of spike-in reads. To keep the measurements consistent, you should NOT combine the two.
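As a toy illustration of the two options (not SCALE’s actual code; all variable names and numbers here are made up):

```r
# Per-cell toy inputs (invented values for illustration only).
gapdh_counts  <- c(cell1 = 200, cell2 = 400)   # GAPDH reads per cell
library_size  <- c(cell1 = 1e5, cell2 = 1e5)   # total reads per cell
endo_reads    <- c(cell1 = 9e4, cell2 = 8e4)   # endogenous reads per cell
spikein_reads <- c(cell1 = 1e4, cell2 = 2e4)   # total spike-in reads per cell

# Option 1: GAPDH expression, adjusted for library size (depth, not cell size).
size_gapdh <- gapdh_counts / library_size
size_gapdh <- size_gapdh / mean(size_gapdh)    # rescale to mean 1

# Option 2: ratio of endogenous to spike-in reads (spike-in cells only).
size_ratio <- endo_reads / spikein_reads
size_ratio <- size_ratio / mean(size_ratio)
```

Pick one option for the whole cohort; mixing the two scales would make the size factors incomparable across cells.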

Note that we find it hard to measure the cell size factor accurately from the sequencing reads alone. We are hoping that direct measurements from the single-cell isolation machine (e.g., Fluidigm C1) will become available soon. I recommend first estimating the bursting kinetics with all cell size factors set to 1, and then trying again with the estimated cell size factors. The correlation between the bursting kinetics of the two alleles serves as a good sanity check: genome-wide, they should be correlated with a fairly decent correlation coefficient.
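That sanity check can be sketched with simulated numbers (purely illustrative; real burst-frequency estimates would come from SCALE itself):

```r
# Simulate correlated burst frequencies for the two alleles and check the
# genome-wide correlation on the log scale (toy data only).
set.seed(1)
kon_alleleA <- rlnorm(200)                              # allele A burst frequencies
kon_alleleB <- kon_alleleA * rlnorm(200, sdlog = 0.3)   # allele B: correlated + noise
cor(log(kon_alleleA), log(kon_alleleB))                 # should be fairly high
```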

Hope that this answers your question and feel free to let us know if you have any additional ones.

Cheers,
Yuchao





On May 2, 2017, at 8:21 PM, Roy Moh Lik Ang <ra...@stanford.edu> wrote:

Hi Yuchao,
 
Sorry to trouble you again. I have been looking through the SCALE vignette (thank you for making it so straightforward and easy to read and use!), and I am trying to figure out how you generated the cell size input. In the paper, you recommend using either the expression of GAPDH or, when spike-ins are available, the ratio of endogenous reads to the total number of spike-in reads.
 
I have a couple questions:
  • Since not all of your cells have spike-ins, I assume you used the ratio for the cells with spike-ins and, for the cells without, took the total GAPDH counts per cell from the Deng et al. (2014) total-counts data? Is it okay to use two different methods to estimate cell size for the same cell population?
  • Did you have to normalize for sequencing depth/library size before entering the cell size input, specifically when using GAPDH expression as the cell size estimator?
 
I do apologize if these questions have already been addressed previously. Your help is much appreciated!
 
Regards,
Roy
 
 
 
Sent: Friday, April 28, 2017 2:14 PM
To: Jiang, Yuchao
Subject: RE: Enquiry about SCALE
 
 
Hi Yuchao,
 
Ah excellent. Thank you for addressing my query!
 
Have a good weekend,
Roy
 
 
From: Jiang, Yuchao
Sent: Friday, April 28, 2017 1:53 PM
To: Roy Moh Lik Ang
Cc: SCALE_s...@googlegroups.com; min...@mail.med.upenn.edu; Zhang, Nancy R
Subject: RE: Enquiry about SCALE
 

Jiang, Yuchao

May 10, 2017, 6:14:40 PM
to Roy Moh Lik Ang, SCALE_s...@googlegroups.com, Zhang, Nancy R, Mingyao Li
Hi Roy,

I will put a more detailed explanation on our GitHub page later, but let me try to explain here. The “poor” fit of kappa and tau that you see, compared to our Supplementary Figure S5, is due to the low sequencing depth and low amplification efficiency of your dataset. These are reflected in the alpha and beta terms: as you can see from the attached PDF (compared to Supplementary Figure S5 in the SCALE paper), alpha is smaller (note that it is on the log scale) and beta deviates further from 1. As a result, alpha * Y^beta is much smaller than Y, and even well below 1 for lowly spiked-in molecules or lowly expressed endogenous genes. In that regime, whether or not a transcript drops out (captured by kappa and tau), sequencing will almost always return a zero from the Poisson sampling. Indeed, with the estimated kappa = 29.4 and tau = 8.54, the capture rate pi = expit(kappa + tau * log(Y)) is almost always 1, meaning the zeros are attributed not to dropout but to Poisson sampling with a small mean lambda = alpha * Y^beta. This doesn’t mean there is no dropout; it simply means that the sequencing depth and amplification are so low that dropout cannot be estimated from the spike-ins (i.e., kappa and tau are not identifiable).
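To make this concrete, here is a small R sketch of the quantities involved; kappa and tau are the estimates quoted above, while alpha and beta are placeholder values I chose for illustration (see the attached PDF for the real fits):

```r
# With a very large kappa, the capture probability pi is ~1 for all Y,
# so observed zeros come from the Poisson sampling, not from dropout.
expit <- function(x) 1 / (1 + exp(-x))

kappa <- 29.4; tau <- 8.54          # estimates quoted in this email
alpha <- 0.01; beta <- 0.8          # placeholder values, for illustration only

Y <- c(1, 10, 100, 1000)            # true molecule counts
pi_capture <- expit(kappa + tau * log(Y))  # P(transcript is not dropped out)
lambda <- alpha * Y^beta                   # Poisson mean after amplification
p_zero <- pi_capture * exp(-lambda) + (1 - pi_capture)  # P(observed count == 0)
```

Even with pi essentially 1 everywhere, p_zero stays high for small Y because lambda is far below 1, which is exactly the non-identifiability described above.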

I’ve carried out further downstream analysis (estimating the kinetic parameters) with a version that doesn’t allow zeros from Poisson sampling. The fitted curve is the black one in the attached plot, with kappa and tau in a reasonable range. The estimated kinetic parameters don’t differ between kappa = 29, tau = 8.5 and kappa = -3.8, tau = 0.983. The code I used is attached. Note that I apply some QC to cells and genes/spike-ins, which I think you also did in your later attempts.

I hope this answers your question and reassures you about the good correlations of burst frequency and burst size. I will reply to your question on coordinated bursting in the other email and also post it on our GitHub page.

Cheers,
Yuchao



On May 10, 2017, at 4:36 PM, Roy Moh Lik Ang <ra...@stanford.edu> wrote:

Hi Yuchao,
 
Thanks for pointing that out. Somehow I had not been subsetting my spike-in data correctly; I have now corrected my code. What I notice is that taking different subsets of the spike-in data as input to tech_bias still produces a poor fit of the kappa and tau curve.
 
Keep me posted on further developments of this issue. :)
 
Roy
 
 
 
From: Jiang, Yuchao
Sent: Wednesday, May 10, 2017 7:25 AM
To: Roy Moh Lik Ang
Subject: Re: Estimation of abkt terms
 
Hi Roy,

Sorry for my delayed reply. I’m starting to look into this issue. I know what is going on with the kappa and tau terms and will reply once I’ve made sure everything runs properly. As for the error in the email below: judging from the error message, it seems that the column names of

fib.spikein_input.subset

cannot be found among the column names of

fib.alleleA

These two need to match, with the former being a subset of the latter (since we don’t require all cells to have spike-ins).
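A quick way to verify this (a sketch with made-up matrices; in the real inputs the spike-in table also carries non-cell annotation columns, which would need to be excluded first):

```r
# Toy matrices: allele-A counts for four cells, spike-in subset for two of them.
fib.alleleA <- matrix(0, nrow = 3, ncol = 4,
                      dimnames = list(NULL, paste0("cell", 1:4)))
fib.spikein_input.subset <- matrix(0, nrow = 2, ncol = 2,
                                   dimnames = list(NULL, c("cell2", "cell4")))

# Every cell in the spike-in subset must appear among the allele-A cells.
all(colnames(fib.spikein_input.subset) %in% colnames(fib.alleleA))  # TRUE
```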

Let me know if this solves this error.

Thanks,
Yuchao




On May 8, 2017, at 3:02 PM, Roy Moh Lik Ang <ra...@stanford.edu> wrote:

Hi Yuchao,
 
I have been feeding subsets of my spike-in data into the tech_bias function to see whether some of the cells with spike-ins might be bad or outliers. Essentially, I selected cells with at least a certain number of total spike-in reads. However, I keep running into this error:
 
> fib.spikein_input.subset = fib.spikein_input[, (colSums(fib.spikein_input[, 3:length(fib.spikein_input[1,])]) > 70000)]
> dim(fib.spikein_input.subset)
[1] 72 46
> abkt.new = tech_bias(spikein_input = fib.spikein_input.subset, alleleA = fib.alleleA,
+                  alleleB = fib.alleleB, pdf = TRUE)
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 
  NA/NaN/Inf in 'y'
 
Here’s another example:
> fib.spikein_input.subset = fib.spikein_input[, (colSums(fib.spikein_input[, 3:length(fib.spikein_input[1,])]) < 100000)]
> dim(fib.spikein_input.subset)
[1] 72 75
> abkt.new = tech_bias(spikein_input = fib.spikein_input.subset, alleleA = fib.alleleA,
+                  alleleB = fib.alleleB, pdf = TRUE)
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 
  NA/NaN/Inf in 'y'
 
 
Other times, it reports ‘x’ instead of ‘y’ as the matrix with NA/NaN/Inf:
> fib.spikein_input.subset = fib.spikein_input[, (colSums(fib.spikein_input[, 3:length(fib.spikein_input[1,])]) < 60000)]
> dim(fib.spikein_input.subset)
[1] 72 43
> abkt.new = tech_bias(spikein_input = fib.spikein_input.subset, alleleA = fib.alleleA,
+                  alleleB = fib.alleleB, pdf = TRUE)
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 
  NA/NaN/Inf in 'x'
 
 
I can’t say I am too familiar with R, but I took a look at the source code and was wondering whether this could be the cause of the problem:
Y = Y[Q != 0]
Q = Q[Q != 0]
lmfit = lm(log(Q) ~ log(Y))
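For instance, with toy data I can reproduce the concern: filtering only on Q leaves zeros in Y, and log(0) = -Inf, which would surface as “NA/NaN/Inf in ‘x’”. Filtering on both vectors avoids it (my own sketch, not SCALE’s code):

```r
# Toy spike-in data: one zero in Y, one zero in Q (invented values).
Y <- c(0, 10, 100, 1000)   # true spike-in molecule counts
Q <- c(2,  0,  80,  900)   # observed read counts

# Keep only pairs where both values are positive, so both logs are finite.
keep <- Y > 0 & Q > 0
lmfit <- lm(log(Q[keep]) ~ log(Y[keep]))
coef(lmfit)                # finite intercept and slope
```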
 
Thanks for the help!
Roy
 
 
 
 
From: Roy Moh Lik Ang
Sent: Friday, May 5, 2017 6:53 PM
To: Jiang, Yuchao
Subject: RE: Estimation of abkt terms
 
Hi Yuchao,
 
Yeah I looked into the distribution of burst frequencies and sizes but nothing out of the ordinary stands out. I would love to know more about what you find.
 
I have uploaded the rar file on Google Drive. You may access it here.
 
Thank you for looking into this!!
Roy
 
 
From: Jiang, Yuchao
Sent: Friday, May 5, 2017 6:39 PM
To: Roy Moh Lik Ang
Subject: Re: Estimation of abkt terms
 
Hi Roy,

Hmmm, I looked at the supplement of the paper and saw 2x125bp, but that is for the human T-cell scRNA-seq. The read lengths for the mouse fibroblast data are indeed ~43 bp on average, as you said. I’m not sure what is going on, but the correlation plot for the dataset you sent (which is based on alpha, beta, kappa, and tau) makes a lot of sense and actually looks very good. I will look into this.

Can you send the rar file via Dropbox or Google Drive? Our email server blocked it. My account for both is yj...@cornell.edu .

Thanks very much!
Yuchao

On May 5, 2017, at 9:14 PM, Roy Moh Lik Ang <ra...@stanford.edu> wrote:

WARNING: This e-mail has been altered by MIMEDefang.  Following this
paragraph are indications of the actual changes made.  For more
information about your site's MIMEDefang policy, contact
Wharton Email Services <postm...@wharton.upenn.edu>.  For more information about MIMEDefang, see:

            http://www.roaringpenguin.com/mimedefang/enduser.php3

An attachment named allele_counts.rar was removed from this document as it
constituted a security hazard.  If you require this document, please contact
the sender and arrange an alternate means of receiving it.

alleleA.rda
alleleB.rda
roy.R
spikein_input_roy.txt
tech_bias_no_poisson.R
tech_bias.R
technical_bias_abkt.pdf

Roy Moh Lik Ang

May 10, 2017, 6:52:11 PM
to Jiang, Yuchao, SCALE_s...@googlegroups.com, Zhang, Nancy R, Mingyao Li

Hi Yuchao,

 

Thank you for that very elaborate explanation. In that case, do you recommend I use the kappa and tau terms obtained with no Poisson sampling (i.e. kappa=29, tau=8.5) or the other set of values for downstream analysis of the kinetic parameters, or would it make little difference to the outcome?

 

Also, thank you for sharing your scripts with me. I will see if I can reproduce the graph you provided.

Jiang, Yuchao

May 10, 2017, 6:56:12 PM
to Roy Moh Lik Ang, SCALE_s...@googlegroups.com, Zhang, Nancy R, Mingyao Li
Hi Roy,

You are welcome. Happy to help, and glad that we solved this issue! I would recommend using kappa = 29 and tau = 8.5, which is what SCALE outputs by default. But it makes very little difference — I verified this, and the code is in my previous email.

Yuchao

Roy Moh Lik Ang

May 10, 2017, 6:58:51 PM
to Jiang, Yuchao, SCALE_s...@googlegroups.com, Zhang, Nancy R, Mingyao Li

Hi Yuchao,

 

Understood. Thank you so much for your guidance! I am glad this has been resolved :)
