Different target Length between samples

Jora Lin

Feb 5, 2020, 9:41:39 PM

to Sailfish Users Group

Hi,

I am a user of Illumina BaseSpace, and I use Salmon to analyze 5 RNA-seq samples.

I found that the Length values differ between the 5 samples. Is this normal? Does it affect the accuracy of TPM?

Thank you so much!

Rob

Feb 5, 2020, 9:50:34 PM

to Sailfish Users Group

Hi Jora,

I am not familiar with any transformation done by BaseSpace before returning these "Length" values. However, it is absolutely expected that the effective length of the transcripts changes between samples. This is because the effective length of a transcript depends upon the fragment length distribution within each sample, and this distribution changes (though usually only moderately) between samples. Assuming these length values are derived somehow from salmon's effective lengths, I would expect them to change a bit between samples, and that is fine (actually, it is required to maximize the accuracy of TPM computation).
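
To make the dependence on the fragment length distribution concrete, here is a minimal Python sketch of one common textbook definition of effective length: the expected number of positions at which a fragment could start on the transcript. It is not salmon's actual implementation; the function name `effective_length`, the toy distributions, and the fallback for transcripts shorter than every fragment are all assumptions made for illustration.

```python
def effective_length(transcript_len, frag_len_probs):
    """Expected number of valid fragment start positions on a transcript.

    frag_len_probs maps fragment length -> probability (hypothetical input).
    Only fragment lengths that fit on the transcript contribute, and their
    probabilities are renormalized.
    """
    fits = {l: p for l, p in frag_len_probs.items() if l <= transcript_len}
    total = sum(fits.values())
    if total == 0:
        # Assumption for this sketch only: if no fragment fits, fall back
        # to the annotated length.
        return float(transcript_len)
    return sum(p * (transcript_len - l + 1) for l, p in fits.items()) / total


# Two made-up samples whose fragment length distributions differ slightly.
sample_a = {180: 0.25, 200: 0.50, 220: 0.25}
sample_b = {200: 0.25, 220: 0.50, 240: 0.25}

# Same 1000 nt transcript, different effective length in each sample.
eff_a = effective_length(1000, sample_a)
eff_b = effective_length(1000, sample_b)
print(eff_a, eff_b)
```

The annotated length (1000) is identical in both calls; only the sample-specific distribution changes the result.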

--Rob

Jora Lin

Feb 5, 2020, 10:17:19 PM

to Sailfish Users Group

Dear Rob,

Thank you so much for your reply!

This value comes from the quantification file: https://salmon.readthedocs.io/en/latest/file_formats.html#quantification-file

The documentation describes Length as: "This is the length of the target transcript in nucleotides."

I understand that "EffectiveLength" is adjusted according to the reads in each sample, but I don't understand why the target transcript length would change during quantification. (I thought it was a fixed value, because the original transcript length should not differ between samples.) Do you know why?

I really appreciate your help!

Best,

Jora

Rob

Feb 5, 2020, 10:21:33 PM

to Sailfish Users Group

Hi Jora,

You are correct. The "Length" is not expected to change between samples. It is based on the length of the transcript in the index, and is not a sample-specific quantity. Moreover, the "Length" should always be an integral value, so I'm not sure why you are seeing non-integer lengths here. Can you say a bit about how you are getting these length values into your table (which looks to be an Excel spreadsheet)? Do you have access to the raw `quant.sf` files for these samples?
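
One quick way to check a file for this is to scan the Length column for values that are not whole numbers. The sketch below is only an illustration: it assumes a tab-separated file with a header row containing `Name` and `Length` columns (as in the documented quant.sf layout), and the helper name `non_integer_lengths` and the example rows are made up.

```python
import csv
import io


def non_integer_lengths(tsv_text):
    """Return the names of rows whose Length value is not a whole number."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row["Name"] for row in reader
            if not float(row["Length"]).is_integer()]


# Made-up rows using the quant.sf column names; not real data.
example = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "tx1\t1500\t1301.2\t10.5\t120.0\n"
    "geneA\t1730.45\t1531.7\t4.2\t60.0\n"
)

flagged = non_integer_lengths(example)
print(flagged)
```

For a real file, the text could be read with `open(path).read()` and passed to the same function. An empty result is what one would expect for a transcript-level file.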

--Rob

Jora Lin

Feb 5, 2020, 10:38:01 PM

to Sailfish Users Group

Dear Rob,

Thank you for the explanation.

I downloaded the quant.genes.sf file from the app and simply copied and pasted it into the Excel file.

I have attached the raw file for sample5, in which you can see many non-integer values in the Length column :(

Should I trust the TPM values of genes that have a non-integer value in the Length column?

Best,

Jora

On Thursday, February 6, 2020 at 11:21:33 AM UTC+8, Rob wrote:

sample5.quant.genes.sf

Rob

Feb 5, 2020, 10:44:33 PM

to Sailfish Users Group

Dear Jora,

Ahhh... this explains a lot. Thank you for uploading the file! This is a `quant.genes.sf` file, not a `quant.sf` file. The difference is that this file contains abundance values aggregated to the gene level, rather than reported at the transcript level. In this case, it is expected for the Length field to be both variable between samples and non-integral. This is because gene-level abundances are determined by summing the abundances of the gene's constituent transcripts, while the gene "Length" is computed by taking an abundance-weighted mixture of the lengths of the underlying expressed transcripts. Because isoform composition can change between samples, so can this length field. This is expected behavior in a `quant.genes.sf` file, but should not happen in a transcript-level `quant.sf` file.
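
As a rough illustration of the aggregation described above, the following Python sketch sums transcript abundances to get the gene abundance and uses an abundance-weighted average of transcript lengths as the gene length. Using TPM as the weight, the fallback for unexpressed genes, the function name `aggregate_gene`, and the numbers are all assumptions for this example, not salmon's code.

```python
def aggregate_gene(transcripts):
    """Aggregate one gene's isoforms, given as (length, tpm) pairs.

    Returns (gene_tpm, gene_length): the summed abundance and the
    abundance-weighted mean of the isoform lengths.
    """
    gene_tpm = sum(tpm for _, tpm in transcripts)
    if gene_tpm == 0:
        # Assumption for this sketch: an unexpressed gene gets the plain
        # mean of its isoform lengths.
        gene_len = sum(length for length, _ in transcripts) / len(transcripts)
    else:
        gene_len = sum(length * tpm for length, tpm in transcripts) / gene_tpm
    return gene_tpm, gene_len


# A hypothetical gene with a 1000 nt and a 2000 nt isoform, in two samples
# where the dominant isoform switches.
sample1 = [(1000, 30.0), (2000, 15.0)]
sample2 = [(1000, 15.0), (2000, 30.0)]

tpm1, len1 = aggregate_gene(sample1)
tpm2, len2 = aggregate_gene(sample2)
print(tpm1, len1)
print(tpm2, len2)
```

Both samples give the same gene-level TPM, but the gene length is non-integral and differs between them because the isoform mixture differs.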

--Rob

P.S. One minor note. Though one can get gene-level abundances from salmon in this manner by providing a gene-to-transcript mapping, the recommended route for aggregating transcript-level abundances to the gene level is to get transcript-level abundances from salmon and then to use tximport (https://bioconductor.org/packages/devel/bioc/vignettes/tximport/inst/doc/tximport.html) to aggregate these to the gene level. That is because tximport views all of the input quantifications (over all samples) at once, and can therefore properly adjust for differences in average gene length between samples when preparing abundances for differential testing. If you plan to use these values for gene-level DE analysis, the current approach will work, but the transcript-level salmon -> tximport approach is preferred.

Jora Lin

Feb 5, 2020, 11:07:35 PM

to Sailfish Users Group

Dear Rob,

Thank you so much!!

Do I understand correctly that the gene-level target length is adjusted according to the reads in each sample, so different samples may get different lengths in quant.genes.sf?

So, if the Length differs between samples, the TPM values should not be compared directly. In that situation, I need to use tximport with the quant.sf files to aggregate transcripts to genes and get gene-level TPM values that are more accurate to compare between samples.

Am I understanding this correctly?

About tximport: is it an R package? Is the function you suggest the one shown below?

tximport for salmon.PNG

You are so amazing, I really appreciate you!!

Best

Jora

On Thursday, February 6, 2020 at 11:44:33 AM UTC+8, Rob wrote:

Rob Patro

Feb 5, 2020, 11:16:15 PM

to sailfis...@googlegroups.com

Dear Jora,

Yes, the target length of the gene will be adjusted by the reads in the samples. This actually doesn't make it inaccurate. If the gene consists of different isoform proportions in different samples, it actually makes sense for the gene length to be considered as different.

However, it's also true that the transformation done by tximport is even better than simply scaling the gene length in each sample by the isoform composition within that sample. It can look across all samples in the experiment and determine a global, average gene length that is better suited to comparing abundances between samples. The code you show in your post (from the tximport vignette) is exactly what you want to use. You can use BaseSpace to get the transcript-level results (the quant.sf files) from salmon for your samples. Then you read them into R with tximport, as in that code, and that is where you provide the transcript-to-gene mapping.

tximport will load all of the data for you and aggregate the transcript abundances to the gene level. Moreover, it will prepare an object that can be passed directly to e.g. DESeq2 if you're planning on doing differential expression analysis. On the other hand, if you simply want to do some visualization and comparisons (e.g. look at a PCA of your samples), the output of tximport is appropriate for that as well, and you can do that with a few lines, as Mike points out here (https://support.bioconductor.org/p/102202/).
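
To give a feel for the "global, average gene length" idea, here is a small Python sketch of what a length-scaled abundance computation could look like: each gene's abundance is multiplied by that gene's length averaged over all samples, and each sample is then rescaled to its original total count. This is modeled on my understanding of tximport's `countsFromAbundance = "lengthScaledTPM"` option and should be read as an approximation; the function name `length_scaled_counts` and all numbers are invented, and the real workflow is the R code from the tximport vignette.

```python
def length_scaled_counts(abundance, length, counts):
    """Length-scaled counts from gene-by-sample matrices (nested lists).

    Rows are genes and columns are samples. Assumes every sample has a
    nonzero total abundance.
    """
    n_genes, n_samples = len(abundance), len(abundance[0])
    # One length per gene: the average over all samples.
    mean_len = [sum(row) / n_samples for row in length]
    scaled = [[abundance[g][s] * mean_len[g] for s in range(n_samples)]
              for g in range(n_genes)]
    result = [[0.0] * n_samples for _ in range(n_genes)]
    for s in range(n_samples):
        col_total = sum(scaled[g][s] for g in range(n_genes))
        lib_size = sum(counts[g][s] for g in range(n_genes))
        for g in range(n_genes):
            # Rescale so each sample keeps its original total count.
            result[g][s] = scaled[g][s] * lib_size / col_total
    return result


# Two hypothetical genes in two samples; gene 0 changes length between samples.
abundance = [[40.0, 40.0], [60.0, 60.0]]
length = [[1250.0, 1750.0], [1000.0, 1000.0]]
counts = [[50.0, 70.0], [50.0, 30.0]]

scaled_counts = length_scaled_counts(abundance, length, counts)
print(scaled_counts)
```

Because gene 0 is given a single averaged length, its equal abundance in the two samples leads to equal scaled counts, which would not be the case for the per-sample lengths.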

I'm glad I could be helpful here.

Best,
Rob

--
Sailfish is available at https://github.com/kingsfordgroup/sailfish
Citation:
Patro, Rob, Stephen M. Mount, and Carl Kingsford. "Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms." Nature biotechnology 32.5 (2014): 462-464.

Jora Lin

Feb 6, 2020, 4:13:14 AM

to Sailfish Users Group

Dear Rob,

Thank you so much for all the explanations and help on this topic.

It is really helpful :)

Best

Jora

On Thursday, February 6, 2020 at 12:16:15 PM UTC+8, Rob wrote:
