Pseudo-Pooled Replicates

Sammy Klasfeld

Oct 1, 2017, 11:58:26 PM
to idr-discuss
In the google code for the IDR pipeline it says:

nlines=$( zcat ${fileName} | wc -l ) # Number of reads in the tagAlign file
nlines=$(( (nlines + 1) / 2 )) # half that number
zcat "${fileName}" | shuf | split -d -l ${nlines} - "${outputDir}/${outputStub}" # This will shuffle the lines in the file and split it into two parts

What I believe it should say is:

nreps=3 # Number of treatment replicates (no spaces around "=" in bash)
nlines=$( zcat "${fileName}" | wc -l ) # Number of reads in the tagAlign file
nlines=$(( (nlines + nreps - 1) / nreps )) # That number divided by nreps, rounded up so split writes at most nreps parts
zcat "${fileName}" | shuf | split -d -l ${nlines} - "${outputDir}/${outputStub}" # This will shuffle the lines in the file and split it into nreps parts. The first two of them will be your pooled pseudoreplicates.

If the pool is split into two parts rather than into as many parts as there are replicates (nreps=3 in my case), true replicates of size X are compared to pooled pseudoreplicates of size 1.5X (or larger if nreps>3). The question then becomes: are the two true replicates more than twice as different from each other as two random samples of the entire diversity that each also contain half again as many reads? If you split by 3 instead and compare only the first two pooled pseudoreplicates, you ask a more logical question: are the two true replicates more than twice as different from each other as two random samples of the entire diversity at the same depth? With only two replicates it makes sense to split the pool by 2, but with more than two replicates, splitting by 2 no longer answers the original question.
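
To make the depth argument concrete, here is a minimal sketch of the arithmetic. The helper name pseudorep_depth and the read counts are made up for illustration; they are not part of the IDR pipeline.

```shell
#!/usr/bin/env bash
# pseudorep_depth (hypothetical helper): reads per pooled pseudoreplicate
# when the pool of all true replicates is split into $1 parts.
# Remaining arguments: read counts of the true replicates (illustrative numbers).
pseudorep_depth() {
    local parts=$1; shift
    local total=0 n
    for n in "$@"; do
        total=$(( total + n ))               # pool all replicates
    done
    echo $(( (total + parts - 1) / parts ))  # ceiling division, as split -l would need
}

# Three true replicates of 20M reads each (X = 20M):
pseudorep_depth 2 20000000 20000000 20000000   # prints 30000000, i.e. 1.5X
pseudorep_depth 3 20000000 20000000 20000000   # prints 20000000, i.e. X
```

Splitting by 2 gives each pseudoreplicate 1.5 times the depth of a true replicate, while splitting by nreps keeps the depths matched.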

If you worry that two of the three pooled pseudoreplicates do not capture all the variance of the full three replicates, you can repeat the random 2-of-3 draw (bootstrap) to get a mean and variance.
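
A rough sketch of the summary step, assuming you have already repeated the shuffle-and-split several times and recorded one number per repetition (for example the count of peaks passing the IDR threshold). The helper mean_var is hypothetical, and the peak calling and IDR runs themselves are not shown.

```shell
#!/usr/bin/env bash
# mean_var (hypothetical helper): read one number per line on stdin and
# print the mean and the sample variance (n - 1 denominator).
# Fails with exit status 1 if fewer than two values are given.
mean_var() {
    awk '{ n++; s += $1; ss += $1 * $1 }
         END {
             if (n < 2) exit 1
             m = s / n
             printf "%.4f %.4f\n", m, (ss - n * m * m) / (n - 1)
         }'
}

# Example with made-up counts from three repetitions:
printf '2\n4\n6\n' | mean_var   # prints "4.0000 4.0000"
```

In practice each repetition would reshuffle the pooled file, keep the first two of the nreps parts, call peaks on each, run IDR, and append the resulting count to the input of mean_var.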

Could anyone correct me if I am wrong? If not, feel free to send me validation.

Anshul Kundaje

Oct 2, 2017, 5:30:47 PM
to idr-d...@googlegroups.com
Your suggestion is certainly valid.

However, we simply want to check the best possible signal we can retrieve from the data. Obtaining 2 pooled pseudoreplicates from N true replicates definitely gives an advantage to the pooled pseudoreps, but that is what the current design is looking for. If you did want an exact head-to-head comparison matched for depth, what you suggest would be optimal.

We generally deal with at most 2 reps, and the pipeline is stable and deployed as part of ENCODE, so we won't be pushing a modification to the procedure. But feel free to use whichever option suits your needs.

-Anshul.

--
You received this message because you are subscribed to the Google Groups "idr-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to idr-discuss+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Sammy Klasfeld

Oct 3, 2017, 12:34:14 AM
to idr-discuss
Thank you for your response. However, I am still confused. I thought the purpose of the pooled pseudoreplicates was to compare the number of reproducible peaks between them with the number of reproducible peaks between the true replicates. I understand that you will get more peaks if you split the pooled reads by 2 rather than N, because you are artificially creating more depth and will therefore get more reproducible peaks between the pooled pseudoreplicates. However, you wouldn't be comparing comparable datasets, since the pooled pseudoreplicate comparison will by definition have extra signal. What do you mean by "best possible signal"? Do you just mean more reproducible peaks in the pooled pseudoreplicate comparison?

Anshul Kundaje

Oct 3, 2017, 12:48:34 AM
to idr-d...@googlegroups.com
That isn't the only purpose. 

The pooled pseudoreps give us the highest possible reproducible sensitivity for peak detection based on sampling noise alone. The true replicates provide conservative calls based on biological reproducibility. In many applications we want the highest reproducible sensitivity we can get out of all the replicates. It also provides an optimistic upper bound on how much one would gain if the same total number of reads were spent on two replicates that were sequenced deeper and more equally.

Like I said, your strategy is much better suited if you want head-to-head comparisons of pooled pseudoreps vs. true reps.

In most cases where we have encountered more than 2 reps, the replicates were more often than not poorly sequenced. So we intentionally want to boost detection power.

Anshul

Anshul Kundaje

Oct 3, 2017, 12:50:11 AM
to idr-d...@googlegroups.com
Also, our most common use case is 2 reps.

Anshul.