Different target Length between samples


Jora Lin

Feb 5, 2020, 9:41:39 PM
to Sailfish Users Group
Hi, 

I am a user of Illumina BaseSpace, and I used Salmon to analyze 5 RNA-seq samples.
I found that the Length values differ across the 5 samples. Is this normal? Does it affect the accuracy of TPM?

Thank you so much!

[Attached images: S__80936967.jpg, S__80945168.jpg, S__80945167.jpg]


Rob

Feb 5, 2020, 9:50:34 PM
to Sailfish Users Group
Hi Jora,

  I am not familiar with any transformation done by BaseSpace before returning these "Length" values.  However, it is absolutely expected that 
the effective length of the transcripts changes between samples.  This is because the effective length of a transcript depends upon the fragment 
length distribution within each sample, and this distribution changes (though usually only moderately) between samples.  Assuming these length
values are derived somehow from salmon's effective lengths, then I would expect them to change a bit between samples and that is fine 
(actually, it is required to maximize the accuracy of TPM computation).
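
To make the dependence on the fragment length distribution concrete, here is a minimal Python sketch. It uses a simplified model in which the effective length is the expected number of positions at which a fragment can start, given the sample's fragment length distribution. The function name `effective_length` and the two toy distributions are assumptions made for illustration; salmon's actual computation is more involved, so this should not be read as salmon's implementation.

```python
def effective_length(length, frag_len_probs):
    """Simplified effective length of a transcript of `length` nucleotides.

    frag_len_probs maps fragment length -> probability for one sample.
    This is an illustrative model, not salmon's exact computation.
    """
    # Only fragments that fit inside the transcript can originate from it.
    usable = {f: p for f, p in frag_len_probs.items() if 1 <= f <= length}
    total = sum(usable.values())
    if total == 0:
        # No usable fragment length: fall back to the raw length (assumption).
        return float(length)
    # A fragment of length f has (length - f + 1) possible start positions.
    return sum(p * (length - f + 1) for f, p in usable.items()) / total


# Two hypothetical samples whose fragment length distributions differ slightly.
sample_a = {150: 0.2, 200: 0.6, 250: 0.2}
sample_b = {180: 0.2, 230: 0.6, 280: 0.2}

eff_a = effective_length(1000, sample_a)
eff_b = effective_length(1000, sample_b)
print(eff_a, eff_b)
```

The same 1000-nucleotide transcript gets a different effective length in each sample purely because the fragment length distributions differ.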

--Rob

Jora Lin

Feb 5, 2020, 10:17:19 PM
to Sailfish Users Group
Dear Rob,

Thank you so much for your reply! 
This value is generated from the quantification file: https://salmon.readthedocs.io/en/latest/file_formats.html#quantification-file
and the explanation of Length there is: "This is the length of the target transcript in nucleotides."
I understand that "EffectiveLength" is adjusted based on the reads in the sample, but I don't understand why the target transcript length would change during quantification. (I thought it was a fixed value, because the original transcript length should not differ between samples.) Do you know why?

I really appreciate your help!

Best,
Jora

Rob

Feb 5, 2020, 10:21:33 PM
to Sailfish Users Group
Hi Jora,

  You are correct.  The "Length" is not expected to change between samples.  It is based on the length of the transcript in the index, and is not 
a sample-specific quantity.  Moreover, the "Length" should always be an integral value, so I'm not sure why you are seeing non-integer lengths 
here.  Can you say a bit about how you are getting these length values into your table (which looks to be an Excel spreadsheet)?  Do you 
have access to the raw `quant.sf` files for these samples?
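
One quick way to run this check on a raw file is sketched below in Python. It assumes the tab-separated layout described in the salmon documentation (columns `Name`, `Length`, `EffectiveLength`, `TPM`, `NumReads`); the helper name `non_integer_lengths` and the two example rows are made up for illustration.

```python
import csv
import io


def non_integer_lengths(handle):
    """Return the names of targets whose Length value is not a whole number."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["Name"] for row in reader
            if not float(row["Length"]).is_integer()]


# Made-up rows standing in for a real file; with a real file, pass
# open("quant.sf") instead of the StringIO object.
example = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "tx1\t1500\t1320.5\t10.0\t25.0\n"
    "geneA\t1432.7\t1250.1\t4.0\t9.0\n"
)
bad = non_integer_lengths(io.StringIO(example))
print(bad)
```

An empty result is what one would expect for a transcript-level `quant.sf` file.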

--Rob

Jora Lin

Feb 5, 2020, 10:38:01 PM
to Sailfish Users Group
Dear Rob,

Thank you for the explanation.

I downloaded the quant.genes.sf file from the app and simply copied and pasted it into the Excel file.
I have attached the raw file for sample5, in which you can see many floating-point values in the Length column :(

Should I trust the TPM values of genes that have floating-point values in the Length column?

Best,
Jora


sample5.quant.genes.sf

Rob

Feb 5, 2020, 10:44:33 PM
to Sailfish Users Group
Dear Jora,

  Ahhh... this explains a lot.  Thank you for uploading the file!  This is a `quant.genes.sf` file, not a `quant.sf` file.
The difference here is that these are abundance values aggregated to the gene level, rather than reported at the 
transcript level.  In this case, it is expected for the Length field to be both variable between samples and non-integral.
This is because gene-level abundances are determined by summing the abundances of the gene's constituent transcripts,
while the gene "Length" is computed as an abundance-weighted average of the lengths of the underlying expressed transcripts.
Because isoform composition can change between samples, so can this length field.  This is expected behavior in a
quant.genes.sf file, but should not happen in a transcript-level quant.sf file.
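
A small Python sketch of this aggregation may help. It assumes the weights are the transcript TPMs and that raw transcript lengths are averaged; the function name `gene_summary` and all numbers are hypothetical, and the exact weighting salmon uses is not confirmed here.

```python
def gene_summary(transcripts):
    """Aggregate (length, tpm) pairs for one gene in one sample.

    Returns (gene_tpm, gene_length): the TPMs are summed, and the length is
    the TPM-weighted average of the transcript lengths.
    """
    gene_tpm = sum(tpm for _, tpm in transcripts)
    if gene_tpm == 0:
        # Unexpressed gene: fall back to the unweighted mean (assumption).
        gene_len = sum(length for length, _ in transcripts) / len(transcripts)
    else:
        gene_len = sum(length * tpm for length, tpm in transcripts) / gene_tpm
    return gene_tpm, gene_len


# The same gene with two isoforms (1000 nt and 2000 nt) in two samples.
# The isoform lengths are fixed; only the isoform proportions change.
sample1 = [(1000, 25.0), (2000, 10.0)]
sample2 = [(1000, 10.0), (2000, 30.0)]

tpm1, len1 = gene_summary(sample1)
tpm2, len2 = gene_summary(sample2)
print(tpm1, len1)
print(tpm2, len2)
```

Both isoform lengths are integers, yet the gene-level length in the first sample is not, and it shifts toward 2000 in the second sample where the longer isoform dominates.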

--Rob

P.S.  One minor note.  Though one can get gene-level abundances from salmon in this manner by providing a gene-to-transcript
mapping, the recommended route for aggregating transcript-level abundances to the gene level is to get transcript-level abundances
from salmon and then use tximport to aggregate these to the gene level.  That is because tximport views all of the input quantifications (over all samples) at once, and
can therefore properly adjust for differences in average gene length between samples when preparing abundances for differential 
testing.  If you plan to use these values for gene-level DE analysis, the current approach will work, but the transcript-level salmon -> tximport 
approach is preferred.

Jora Lin

Feb 5, 2020, 11:07:35 PM
to Sailfish Users Group
Dear Rob,

Thank you so much!!
Can I understand it this way: the target length at the gene level is adjusted based on the reads in each sample, so different samples may get different lengths in quant.genes.sf?
So, if the Length differs between samples, the TPM values should not be compared; in that situation, I need to use tximport with the quant.sf files to aggregate transcripts to genes and get more accurate gene-level TPM values across samples.

Am I understanding this correctly?

About tximport: is it an R package? Is the function you suggest the one below?

[Attached image: tximport for salmon.PNG]


You are so amazing, I really appreciate you!!

Best
Jora



Rob Patro

Feb 5, 2020, 11:16:15 PM
to sailfis...@googlegroups.com

Dear Jora,

  Yes, the target length of the gene will be adjusted by the reads in the samples.  This actually doesn't make it inaccurate.  If the gene consists of different isoform proportions in different samples, it actually makes sense for the gene length to be considered as different.

  However, it's also true that the transformation done by tximport is even better than simply scaling the gene length in each sample by the isoform decomposition within that sample.  It can look across all samples in the experiment, and determine a global, average gene length that can be best used to compare the abundances between samples.  The code you show in your post (from the tximport vignette) is exactly what you want to use.  You can use BaseSpace to get the transcript-level results from salmon for your samples.  Then you read them into R with tximport as below, and that is where you provide the transcript to gene mapping.  Tximport will load all of the data for you, and aggregate the transcript abundances to the gene level.  Moreover, it will prepare a dataframe that can be directly passed off to e.g. DESeq2, if you're planning on doing differential expression analysis.  On the other hand, if you simply want to do some visualization and comparisons (e.g. look at a PCA of your samples), the dataframe generated by tximport is appropriate for that as well, and you can do that with a few lines as Mike points out here (https://support.bioconductor.org/p/102202/).
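
The idea of using one average gene length across all samples can be sketched in Python as follows. This is only an illustration, loosely modelled on what tximport's `countsFromAbundance = "lengthScaledTPM"` option does; it is not tximport's code, and the function name `length_scaled_counts` and all numbers are hypothetical.

```python
def length_scaled_counts(tpm, lengths, library_sizes):
    """Scale gene TPMs by a cross-sample average gene length.

    tpm[gene][s] and lengths[gene][s] hold per-sample values;
    library_sizes[s] is the total count the result should sum to in sample s.
    """
    genes = list(tpm)
    n = len(library_sizes)
    # One average length per gene, shared by all samples.
    avg_len = {g: sum(lengths[g]) / n for g in genes}
    scaled = {g: [tpm[g][s] * avg_len[g] for s in range(n)] for g in genes}
    # Rescale each sample (column) so that it sums to its library size.
    col_sums = [sum(scaled[g][s] for g in genes) for s in range(n)]
    return {g: [scaled[g][s] * library_sizes[s] / col_sums[s] for s in range(n)]
            for g in genes}


tpm = {"geneA": [40.0, 40.0], "geneB": [60.0, 60.0]}
lengths = {"geneA": [1250.0, 1750.0], "geneB": [1000.0, 1000.0]}
counts = length_scaled_counts(tpm, lengths, [1000.0, 2000.0])
print(counts)
```

Because every sample uses the same average length for a gene, a shift in isoform usage no longer changes the scaling applied to that gene from one sample to the next.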

I'm glad I could be helpful here.

Best,
Rob

--
Sailfish is available at https://github.com/kingsfordgroup/sailfish
Citation:
Patro, Rob, Stephen M. Mount, and Carl Kingsford. "Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms." Nature biotechnology 32.5 (2014): 462-464.
To view this discussion on the web visit https://groups.google.com/d/msgid/sailfish-users/57d722be-9f53-4288-af88-edb157997b0f%40googlegroups.com.

Jora Lin

Feb 6, 2020, 4:13:14 AM
to Sailfish Users Group
Dear Rob,

Thank you so much for all the explanations and help on this topic.
It is really helpful :)

Best
Jora 
