TPM/FPKM/ReadCount falls off a cliff

103 views
Skip to first unread message

Matthew Hockin

unread,
May 22, 2015, 1:18:28 PM5/22/15
to sailfis...@googlegroups.com
I noticed something I did not expect in the output of a Salmon run using 

 nohup salmon quant -i Salmon_index/GRCm38_cDNAall_SALMON_index -l U -p 110 -r 11145X1_141017_D00294_0137_AC5N9DANXX_3.fastq -o X1Salmon_GRCm38v3cDNAall &


What I observe, which is very different than what I see in the same data set using bowtie/cufflinks is the following, if you look through a sorted list (TPM) I find that the TPM values appear to fall on a "continuous" scale right up until they hit ~10 and then abruptly fall to zero.  This is I think unexpected, and I am worried I did something wrong in building an index, because its almost like there is no sequence representation for the genes falling below the TPM cliff in the index, and thus Salmon could not place reads there- but this cannot be the case because the index was built off of a Ensemble mouse cDNA all fasta file- and clearly that has sequence for every named transcript- Is this actually somehow expected behavior or do I have some detective work to do here? 

Here is the informative region of the output.

ENSMUST00000131444      686     10.0052 5.66702 839.658

ENSMUST00000101367      1519    10.0034 5.66601 1858.91

ENSMUST00000034547      3405    10.0028 5.66567 4166.7

ENSMUST00000105348      1222    10.0018 5.66514 1495.22

ENSMUST00000029909      2651    10.0018 5.66513 3243.72

ENSMUST00000185852      3946    10.0003 5.66428 4827.53

ENSMUST00000028351      1666    10      5.66412 2038.13

ENSMUST00000153446      2778    9.99936 5.66374 3398.28

ENSMUST00000149564      5763    9.99921 5.66366 7049.67

______and what happens here???????_____________________

ENSMUST00000165910      548     9.99878e-28     5.66341e-28     6.70319e-26

ENSMUST00000187005      1359    9.99847 5.66324 1662.29

ENSMUST00000169741      812     9.99812e-22     5.66303e-22     9.93181e-20

ENSMUST00000195782      665     9.99788e-64     5.6629e-64      8.13362e-62

ENSMUST00000189076      665     9.99788e-64     5.6629e-64      8.13362e-62

ENSMUST00000123141      1358    9.99787 5.66289 1660.97

ENSMUST00000022599      4582    9.99672 5.66224 5603.6

ENSMUST00000000939      4388    9.99541 5.6615  5365.64

ENSMUST00000155730      396     9.9953e-80      5.66144e-80     4.84223e-78

ENSMUST00000190964      1131    9.9929e-91      5.66008e-91     1.38264e-88


Rob

unread,
May 22, 2015, 1:26:30 PM5/22/15
to sailfis...@googlegroups.com, mho...@gmail.com
Hi Matthew,

  Can I ask what command you used to sort by TPM?  It looks to me like the values are being sorted by their string representation rather than their numerical values.
Since small numbers in floating point notation look like [0-9].XXXe-YYY, this output is exactly what you'd expect to get if you sort by the string rep.  If you're using the 
standard unix "sort" command, you can pass the option "-g" to have a field interpreted as a floating point value.  This should give you the output in numerically sorted order.
Please let me know if this was the issue, since a discrete "jump" in TPM/FPKM etc. values like that is not expected and shouldn't be occurring.

Best,
Rob

P.S.  I see you're running with -p 110 --- that's amazing :).  Out of curiosity, how quickly does salmon run with this option.

Matthew Hockin

unread,
May 22, 2015, 1:54:57 PM5/22/15
to sailfis...@googlegroups.com, mho...@gmail.com
DOH!

My bad- 
cat quant.sf | sort -nrk3 > sorted_list

was the command-

Indeed- 
cat quant.sf | sort -grk3 > sorted_list produces a list that looks far better.... 

I didn't time it, and I have played with -p 60 -p120 and watching the process in htop, in either configuration the processes appear to be running each core at about 20-30% duty with the rest of its capacity sitting in what I think is an iowait status- so I am not very sure that -p 120 vs -p60 makes much difference, but in terms of time- I was mapping a stranded library (Note I copied the wrong command to my first post, but it was immaterial (only library type was wrong) to this finding so I didn't bother to post back to correct) where the fastq was a full lane from an illumina HiSeq run and 36 GB in size (~220 Million reads) and its probably done mapping in about 5 minutes or so... very nice.

I usually run Salmon with as many processors as I have at hand, we have a pretty fast disc array and I just let the server sort it out- 

THANKS for catching that stupid mistake... I am well aware they are floating point and just didn't connect brain to fingers in the sort command.  

Since were talking about floating points, I had posted something on biostars asking a general question about using estimated read counts from Salmon or Kallisto as input to DESeq2.  Its pretty clear you feel this is a very good approach as your read the docs page explicitly endorses this.  I threw that post up to start a discussion about this as both packages are pretty new and Its not clear to me that a whole ton of data exists coming from this pipeline, Salmon/Callisto --> DESeq(2) using estimated read counts.  If your active on Biostars maybe you can chime in, thus far the thread has been a bit derailed with discussions of error induced by integer truncation (this seems entirely trivial to me) and could perhaps benefit from your perspective.  

Thanks again- great catch- I never got there. 
Reply all
Reply to author
Forward
0 new messages