Why Sailfish gives unexpected performance in the strand specific library

75 views
Skip to first unread message

Azim

unread,
Dec 4, 2014, 6:04:35 AM12/4/14
to sailfis...@googlegroups.com
Hi,

I'm using Sailfish for the quantification of the reads. our libraries are paired end and strand specific by design. the R1 is the sense reads and the R2 is anti-sens reads. the read length are around 100 bps.

when I try to use Sailfish for quantification by using the R1 as sense and R2 as anti-sense , it result in very poor mapping ratio (0.004), but when I use R1 as anti-sense and R2 as sense then mapping ratio increase (0.34). but we know that by design of our library protocol, R1 is sense and R2 is antisense.

moreover,  according to our prior knowledge we expect that some specific genes should highly expressed in specific conditions, but when I checked the results of quantification (R1 as sense and R2 as antisense, also R1 as anti-sense and R2 as sense), it was not the case.

Here is the command I use for specification of strands information to Sailfish :


R1 as sense and R2 as anti-sense :

sailfish quant  -i indexfile   -l "T=PE:O=><:S=SA" -1 <(gunzip -c Merged_sample_A1_R1.fastq.gz)  -2 <(gunzip -c Merged_sample_A1_R2.fastq.gz) -o Merged_sample_A1


R1 anti-sense and R2 as sense  :

sailfish quant  -i indexfile   -l "T=PE:O=><:S=AS" -1 <(gunzip -c Merged_sample_A1_R1.fastq.gz)  -2 <(gunzip -c Merged_sample_A1_R2.fastq.gz) -o Merged_sample_A1_a
 

I was wondering why we are getting weird results. what can be wrong ?

BTW,  can  it because of the case that genes are more similar to each other in that specific organism ?

since the reads length are around 100 bps, should I use higher value of k-meres size instead of k=20 ?

Rob

unread,
Dec 4, 2014, 12:16:25 PM12/4/14
to sailfis...@googlegroups.com
Hi Azim,

On Thursday, December 4, 2014 6:04:35 AM UTC-5, Azim wrote:
Hi,

I'm using Sailfish for the quantification of the reads. our libraries are paired end and strand specific by design. the R1 is the sense reads and the R2 is anti-sens reads. the read length are around 100 bps.

when I try to use Sailfish for quantification by using the R1 as sense and R2 as anti-sense , it result in very poor mapping ratio (0.004), but when I use R1 as anti-sense and R2 as sense then mapping ratio increase (0.34). but we know that by design of our library protocol, R1 is sense and R2 is antisense.

What is the mapping ratio if you don't enforce the directionality constraints?  One thought I have is that there may be an "inversion" in the interpretation here.  If R1 is generated from the sense strand, then that implies that it will map to the antisense strand and vice-versa.  In sailfish (and salmon, see below) the directionality that is specified actually represents the strand to which you expect the k-mers to map.  This is the opposite of the strand from which they were generated.  The distinction is a bit confusing, so I'll update the documentation accordingly.
 

moreover,  according to our prior knowledge we expect that some specific genes should highly expressed in specific conditions, but when I checked the results of quantification (R1 as sense and R2 as antisense, also R1 as anti-sense and R2 as sense), it was not the case.

This is a difficult question to answer.  Specifically, this is the case because all of the expression estimates are relative abundances.  Thus, it's possible for a gene to be "highly" expressed, but still have a low relative abundance (if, e.g. many other genes are also highly expressed).  
 

Here is the command I use for specification of strands information to Sailfish :


R1 as sense and R2 as anti-sense :

sailfish quant  -i indexfile   -l "T=PE:O=><:S=SA" -1 <(gunzip -c Merged_sample_A1_R1.fastq.gz)  -2 <(gunzip -c Merged_sample_A1_R2.fastq.gz) -o Merged_sample_A1


R1 anti-sense and R2 as sense  :

sailfish quant  -i indexfile   -l "T=PE:O=><:S=AS" -1 <(gunzip -c Merged_sample_A1_R1.fastq.gz)  -2 <(gunzip -c Merged_sample_A1_R2.fastq.gz) -o Merged_sample_A1_a
 

I was wondering why we are getting weird results. what can be wrong ?

BTW,  can  it because of the case that genes are more similar to each other in that specific organism ?

since the reads length are around 100 bps, should I use higher value of k-meres size instead of k=20 ?
 
To answer your above questions directly, if you're in an organism with a large number of very sequence-similar genes (as opposed to isoforms of the same gene), 
this can complicate the inference task.  However, that is true of any model.  With sailfish, I would expect that increasing the k-mer size could help 
(they reported this, e.g. in the RNA-Skim paper). But this may not be ideal in this scenario since the mapping rate is already fairly low.

Given that you have a 100bp, paired-end strand-aware library, I strongly suggest that you give Salmon a try.  It retains the speed benefits of Sailfish,
but has the potential of being significantly more accurate --- specifically with longer and/or paired-end reads.  We've done quite a bit of testing on the software 
so far, and already have a number of people using it "in the wild" reporting promising results.  Certainly, it would be worth it to see if the results provided 
by Salmon agree more strongly with your a priori expectation about which genes are highly expressed in different conditions.  If you still see that the 
genes appear to not be highly expressed, then this provides evidence that it may be due to a large number of other, relatively highly expressed genes.
However, if the estimates do agree better with your expectations, this is evidence that the improved model of salmon may provide a significant benefit 
in your case.  Please, keep us posted!

Best,
Rob

Azim

unread,
Dec 4, 2014, 12:48:14 PM12/4/14
to sailfis...@googlegroups.com
Thanks Rob,

what do you mean by directionality constraint? if you mean without specifying the strandness information ( -l "T=PE:O=><:S=U"), the mapping ratio is almost the same (0.341232). if you mean the without orientation constraint (-l "T=PE:S=U", -l "T=PE:S=AS")gives almost similar mapping ratio (0.34xxx)  , but -l "T=PE:S=SA"  gives very poor result(~ 0.004).
Reply all
Reply to author
Forward
0 new messages