Systematic bias in transcript length using Salmon?

bik...@hotmail.fr

May 19, 2017, 10:15:10 AM

to Sailfish Users Group, bruno.s...@parisdescartes.fr

Dear all,
First, I'm not a specialist in RNA-seq data analysis, so I apologize if my question sounds a bit naive.

We generated RNA-seq data from two related but distinct freshly isolated cell populations, say C1 and C2 (single-end 75 bp reads, 3 biological replicates, 4 technical replicates).
Salmon was used to calculate TPM and transcript-level counts. These were summed within each gene using tximport, and DESeq2 was used for the DGE analysis.
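To make "summed within gene" concrete, here is a minimal sketch in base R. It is not tximport's actual implementation, only the basic idea of collapsing transcript-level counts to gene level. The count matrix, the sample names and the tx2gene table are all invented for illustration; only the TXNAME/GENEID column layout follows the usual tx2gene convention.

```r
# Toy transcript-level counts (transcripts x samples); values are invented.
tx_counts <- matrix(
  c(10, 30,  5,   # sample s1
    20, 10, 15),  # sample s2
  nrow = 3,
  dimnames = list(c("tx1", "tx2", "tx3"), c("s1", "s2"))
)

# Hypothetical transcript-to-gene map in the usual two-column layout.
tx2gene <- data.frame(
  TXNAME = c("tx1", "tx2", "tx3"),
  GENEID = c("gA", "gA", "gB"),
  stringsAsFactors = FALSE
)

# Look up the gene of each transcript row, then sum rows within each gene.
gene_of_tx <- tx2gene$GENEID[match(rownames(tx_counts), tx2gene$TXNAME)]
gene_counts <- rowsum(tx_counts, group = gene_of_tx)

print(gene_counts)
```

With real data, tximport does this step and additionally returns a gene-level average transcript length matrix, which is what Question 1 below is about.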

Salmon 0.8.0 was run in quasi-mapping mode with the following options:
--fldMean=208 and --fldSD=52 (empirical estimates from electrophoresis analysis of the fragments)
--gcBias
--seqBias

tximport was called with:
txi <- tximport(files, type="salmon", countsFromAbundance="no", tx2gene=tx2gene)

The DESeqDataSet was generated with:
dds <- DESeqDataSetFromTximport(txi, sampleTable, ~condition)

Question 1
When I plot the average transcript length for the 19315 genes of the dataset (see attached figure), there seems to be a systematic bias towards higher values in C1 vs C2.
However, when I ran Salmon without --seqBias, and even more so without both --gcBias and --seqBias, the average transcript lengths are similar in C1 and C2 (I mean that about as many genes have a larger average transcript length in C1 as in C2), which is closer to what I would expect (see attached figure).
I'd like to figure out what is going wrong, since errors in the average effective length might affect the DGE analysis.
Can you please comment on that?
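For reference, the comparison behind the figure can be sketched roughly as follows. This is only a sketch on a made-up matrix standing in for txi$length (the gene-by-sample average transcript length matrix returned by tximport); the gene names, sample names and values are invented, and the sign test at the end is just one possible way to quantify the imbalance, not something from my original analysis.

```r
# Toy stand-in for txi$length (genes x samples); all values are invented.
len <- matrix(
  c(1500, 1520, 1480, 1400, 1410, 1390,
    2000, 2010, 1990, 2050, 2040, 2060,
     900,  910,  905,  880,  870,  875),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"),
                  c("C1_1", "C1_2", "C1_3", "C2_1", "C2_2", "C2_3"))
)
condition <- factor(c("C1", "C1", "C1", "C2", "C2", "C2"))

# Per-gene mean of the average transcript length within each condition.
mean_c1 <- rowMeans(len[, condition == "C1", drop = FALSE])
mean_c2 <- rowMeans(len[, condition == "C2", drop = FALSE])

# Positive values: longer average transcript length in C1 than in C2.
log2_ratio <- log2(mean_c1 / mean_c2)

n_longer_c1 <- sum(log2_ratio > 0)
n_longer_c2 <- sum(log2_ratio < 0)

# Assumed check: a two-sided sign test of "as many genes longer in C1 as in C2".
sign_test <- binom.test(n_longer_c1, n_longer_c1 + n_longer_c2, p = 0.5)

cat("longer in C1:", n_longer_c1, " longer in C2:", n_longer_c2, "\n")
```

Running the same comparison on the output of each Salmon configuration (with and without the bias flags) would show directly how the balance between the two counts shifts.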

Question 2
This is a more general question, and I guess there might not be a definite answer to it.
How should I choose a threshold for the genes to be considered for further analysis? So far I keep only genes with baseMean > 1000 (6363 out of 19315 genes), but I have no rationale for this.
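In case it helps to see exactly what I mean by this filter, here is a minimal sketch on a toy table. The data frame is a hypothetical stand-in for a DESeq2 results table converted with as.data.frame(); the gene names and baseMean values are invented, and 1000 is the arbitrary cutoff I am asking about.

```r
# Toy stand-in for a DESeq2 results table; all values are invented.
res <- data.frame(
  gene = c("geneA", "geneB", "geneC", "geneD"),
  baseMean = c(2500, 40, 1000, 12000),
  stringsAsFactors = FALSE
)

# Arbitrary cutoff currently in use (no rationale behind it).
threshold <- 1000

# Strict inequality: a gene with baseMean exactly at the threshold is dropped.
keep <- res$baseMean > threshold
res_kept <- res[keep, ]
n_kept <- sum(keep)

cat(n_kept, "of", nrow(res), "genes kept\n")
```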

Best regards
bruno