Very strict alignment of RNAseq reads using STAR

1,699 views
Skip to first unread message

Rahul Kanwar

unread,
Jul 2, 2013, 12:01:41 PM7/2/13
to rna-...@googlegroups.com
Hello,

   I have ~500M 200bp paired end RNAseq reads from mouse. I am interested in doing a de-novo transcript reconstruction using cufflinks. To achieve this I have decided upon the following parameters:

a) Consider only those reads where both the ends align concordantly [no chimeric reads or where one end align and the other dosen't]
b) In case a read spans a splice junction then length of the overhang should be at-least 40bp on either side [I have ~500M reads so at-least a few of them will satisfy this contraint for most of the splice junctions] 
c) In case the splice overhang is less than 40bp on one of the sides then soft clip the read 
d) Keep only uniquely mapping paired end read in the output SAM file
e) Take the illumina error rate as 3% of the read length

  To achieve this I used the following command:


STAR \
--runThreadN                            16 \
--genomeDir                             ~/mouse/star \
--readFilesIn                             ~/fq1.fastq.gz \
                                               ~/fq2.fastq.gz \
--readFilesCommand                 zcat \
--outFileNamePrefix                  ./ \
--outFilterIntronMotifs                RemoveNoncanonical \
--outSJfilterReads                     Unique \
--alignSJoverhangMin                40 \
--alignSJDBoverhangMin            40 \
--outFilterMultimapNmax            1 \
--outFilterMismatchNoverLmax   0.03 \
--outStd SAM | samtools view -bS - > star.bam

  I think the above command achieves my goals [any advise to improve it is most welcome], but I could not figure out how to tell STAR to report only 'properly' paired alignments. Can someone help me with it ? Thanks.

Regards,
Rahul

Shawn Driscoll

unread,
Jul 2, 2013, 6:19:05 PM7/2/13
to rna-...@googlegroups.com
STAR always only reports concordant alignments. This is it's normal, and I think only, behavior. 

I think your command looks good.  The only thing I'm not sure about is your "soft clip if less than 40bp overhang" expectation.  I'm not sure how to tell STAR to specifically do that.  You can tell it to not report those junctions but I don't think that necessarily means it will report soft clipped alignments in their place.

Alexander Dobin

unread,
Jul 5, 2013, 11:06:23 PM7/5/13
to rna-...@googlegroups.com
Hi Rahul,

Shawn is right, STAR output only correctly paired alignments by default, although there is a way to output single-end alignments (which I do not recommend).
If you restrict overhang with --alignSJoverhangMin  40 --alignSJDBoverhangMin  40, STAR may output a soft-clipped alignment instead of a junction, however, it may also output a completely different alignment if it's score is higher than the score of the soft-clipped alignment. This is dangerous since it may lead to false alignments - for example to pseudogenes. 
If you really want to get rid of all alignments with overhangs shorter than 40, I would recommend the following procedure:
1. Map with default restrictions on overhangs: --alignSJoverhangMin  5 --alignSJDBoverhangMin  3
2. Filter out reads with short overhangs from CIGARs in the SAM file, or replace them with soft-clips.

Cheers
Alex

Shawn Driscoll

unread,
Jul 6, 2013, 2:16:40 AM7/6/13
to rna-...@googlegroups.com
I don't exactly understand why so much overlap is necessary.  To some extent this is going to bias away from some exons that aren't even 40 bases long.  Consider also that sometimes a single alignment can jump across two introns. I never really thought about these things until I started writing my own alternative exon inclusion analysis.  In addition I think the canonical junctions that STAR reports (even if they aren't annotated) have the extra support of the detected intron motif plus however many times the junction was read.  Anyways, just my two cents.

You might consider tweaking the cufflinks settings which allows for customizing the minimum amount of overhang required for a spliced alignment to be considered in the analysis.
This is the '--small-anchor-fraction' setting which is set to 0.09 by default.  That is, for a 100bp read the minimum overhang across a junction is 0.09*100 or 9 bases.  You can change that to 0.4 and that should solve your problem.  You may also want to tweak the '--min-isoform-fraction',  '--min-frags-per-transfrag' and '--overlap-radius' settings to fine tune the output a little depending on how much information you want to get back.

-shawn
Reply all
Reply to author
Forward
0 new messages