using STAR+Cufflinks for transcript assembly turns unstranded RNA-seq to stranded?

Runxuan zhang

unread,

Jul 22, 2015, 10:24:45 AM7/22/15

to rna-...@googlegroups.com

I am trying to use STAR+Cufflinks to do a reference based transcript assembly using unstranded RNA-seq data.

As mentioned in the STAR manual "If you have un-stranded RNA-seq data, and wish to run Cufflinks/Cuffdiff on STAR alignments, you will
need to run STAR with --outSAMstrandField intronMotif option, which will generate the XS strand attribute for all alignments that contain splice junctions"

Thus in the generated SAM file, strand will be derived from the intron motif. Unstranded RNA-seq data will be assigned a strand, which results in a lot of genes have both sense and antisense transcripts in the merged transcript assembly.

My questions are:

1) how reliable is the derived strand info from intron motif?
2) Is the assembled transcripts affected by this?

Thank you very much!

Runxuan

Alexander Dobin

unread,

Jul 23, 2015, 5:36:48 PM7/23/15

to rna-star, zhangr...@gmail.com, zhangr...@gmail.com

Hi Runxuan,

strand derived from the intron motifs is very reliable: the error could only be caused by reads spliced across non-canonical junctions with CT/AC intron motif, which is very unlikely.

Cufflinks requires this strand determination for spliced reads, and cannot work without it.

Assembled transcripts will be affected by it - all spliced transcripts will be given a strand, and, as you stated, there transcripts on opposite strands of each other may be detected.

Cheers

Alex

Runxuan zhang

unread,

Jul 24, 2015, 5:33:27 AM7/24/15

to rna-star, ado...@gmail.com

Thank you very much, Alex. That helps a lot!

barry kesner

unread,

Sep 24, 2015, 10:15:57 AM9/24/15

to rna-star, zhangr...@gmail.com

Alex,

You said in this post that strand calling was very accurate. I have strand specific RNA-seq data. I have interrogated my sam files generated with XS:A: requested. Here are my results

count sam_flag last_field_in_sam_file

757773 147 RG:Z:ref.CAb15_8Sep_E1

157421 147 XS:A:-

6176 147 XS:A:+

8569543 163 RG:Z:ref.CAb15_8Sep_E1

714631 163 XS:A:-

1589699 163 XS:A:+

7144052 339 RG:Z:ref.CAb15_8Sep_E1

830819 339 XS:A:-

1641485 339 XS:A:+

1049871 355 RG:Z:ref.CAb15_8Sep_E1

10422 355 XS:A:-

7391 355 XS:A:+

1049871 403 RG:Z:ref.CAb15_8Sep_E1

10422 403 XS:A:-

7391 403 XS:A:+

7144052 419 RG:Z:ref.CAb15_8Sep_E1

830819 419 XS:A:-

1641485 419 XS:A:+

8569543 83 RG:Z:ref.CAb15_8Sep_E1

714631 83 XS:A:-

1589699 83 XS:A:+

757773 99 RG:Z:ref.CAb15_8Sep_E1

157421 99 XS:A:-

6176 99 XS:A:+

My library in tophat terms is fr-first stranded. From these results, for my data, I don't think it is calling strands accurately.

Here is how I called STAR (@PG) from SAM header:

@PG ID:STAR PN:STAR VN:STAR_2.4.2a CL:STAR --runThreadN 8 --genomeDir /data/leelab/projects/common/mtrDmm10 --readFilesIn E1_R1.00.fastq E1_R2.00.fastq --outSAMstrandField intronMotif --outSAMattributes All --outSAMattrRGline ID:ref.CAb15_8Sep_E1 SM:p.all.nb --outFilterMismatchNmax 36 --clip3pNbases 100

I am stating the above because I need strand specific (XS:A:) fields in all my SAM reads. I was thinking of removing the ones that were generated and replacing them based on knowing library type and if the mapping of the reads were rc or not (along with which read). Your thoughts are welcome!

Regards,

Barry

Alexander Dobin

unread,

Sep 25, 2015, 3:21:53 PM9/25/15

to rna-star, zhangr...@gmail.com

Hi Barry,

For two groups of reads (FLAG=147 and FLAG=99) the results make sense. For your RNA-seq library, 2nd read strand corresponds to the RNA strand (like Tru-Seq). The FLAG=147 means 2nd read, (-)strand, so the strannd of the original RNA must have been (-). The majority of the reads with the XS tag have it with (-) strand. A minority, ~3.8%, have a "wronng" strand, which is likely caused by the "strand errors" in the library making. All strand specific protocols I know of are prone to some degree of strand errors. 3.8% seems on the high side, typically it's <1%. FLAG=99 corresponds to the 1st read of the same pairs, so the numbers are reversed.

The 163/83 flags are for the reads from (+) strand RNAs, and for some reason the strand error rate is much higher. This is a very strange observation, and I cannot see how this can happen.

Could you please post/send the Log.final.out file from this run?

Another way to check the strandedness of your library is to calculate read counts per annotated genes with --quantMode GeneCounts.