Identical transcripts in fasta file


dmr210

Apr 25, 2017, 5:44:33 AM
to Sailfish Users Group

Hi,

I am using Salmon to quantify transcript expression from RNA-seq data.

I am using the Ensembl annotation and, since I am interested in non-coding RNA (lncRNA in particular), I merged the "cdna.all" and the "ncrna" fasta files. (see ftp://ftp.ensembl.org/pub/release-87/fasta/mus_musculus/)

After looking at these two transcriptomes in more detail, I found that 20 transcripts are common to the two files, i.e. they have the same ID.

My question is relatively silly... but I wasn't able to answer it based on the documentation or FAQ:

Would Salmon get 'confused' by this and consider the reads as ambiguous in some way, or would it 'notice' that the ID is identical and 'ignore' the repetition?

Thanks!

Rob

Apr 27, 2017, 2:19:36 PM
to Sailfish Users Group
Hi dmr210,

  The question isn't silly at all!  Currently, Salmon indexes the provided transcriptome as given.  So, if the transcriptome contains duplicates, reads will map equally well to each copy.  Thus, I would recommend removing duplicates from the transcriptome prior to indexing.
Though you refer here only to transcripts that are complete duplicates (even in terms of their name), I'd actually recommend checking for duplicates at the sequence level, since Ensembl sometimes puts the same transcript in the transcriptome multiple times with different names (sometimes with the only difference being the biotype).  Actually, I'm currently adding a feature to the indexer to detect and optionally remove sequence-level duplicates automatically.  This should make it into the next release of Salmon.  You should vote in this poll about what default behavior you would prefer (remove duplicates by default with an option to keep duplicates, or keep duplicates by default with an option to remove them).
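
Until then, sequence-level de-duplication can be done outside of Salmon before indexing. Below is a minimal sketch in plain Python. It is only an illustration and not how the Salmon indexer works internally: the function names (`read_fasta`, `dedupe_by_sequence`, `write_fasta`) are made up for this example, sequences are compared as exact, case-insensitive strings (reverse complements are not considered), the transcript ID is assumed to be the first whitespace-delimited token of the header, and the first record seen for each sequence is the one kept.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dedupe_by_sequence(records):
    """Keep the first record for each distinct sequence.

    Returns (kept, dropped). `kept` is a list of (header, sequence) pairs.
    `dropped` maps the ID of each retained transcript to the IDs of the
    later records that had an identical sequence and were removed.
    """
    first_id = {}   # upper-cased sequence -> ID of the first record seen
    kept = []
    dropped = {}
    for header, seq in records:
        # Assumption: the ID is the first whitespace-delimited header token.
        tid = (header.split() or [""])[0]
        key = seq.upper()
        if key in first_id:
            dropped.setdefault(first_id[key], []).append(tid)
        else:
            first_id[key] = tid
            kept.append((header, seq))
    return kept, dropped


def write_fasta(records, handle, width=60):
    """Write (header, sequence) pairs as FASTA, wrapping sequence lines."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


# Example usage (file names are hypothetical):
#
#     with open("merged.fa") as fin:
#         kept, dropped = dedupe_by_sequence(read_fasta(fin))
#     with open("merged.dedup.fa", "w") as fout:
#         write_fasta(kept, fout)
```

Because entries that repeat both the name and the sequence share the same sequence, they are removed by this as well. Records that share an ID but differ in sequence are not caught and would need a separate check on the IDs. It is worth looking over `dropped` before indexing, since which name survives depends only on the order of records in the input.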

Best,
Rob