Using STAR Aligner as INPUT for RSEM --sam Option

4,918 views
Skip to first unread message

Alexander Czerny

unread,
Aug 21, 2013, 7:50:49 AM8/21/13
to rna-...@googlegroups.com
Hi guys iam using RSEM instead of HTSeq for detecting counts on rnaseq data.

Therefore i need to align against a transcriptome instead of genome, because there no conversion of reference genome based  to reference transcriptome based  at the moment. My command therefore is as follows:

STAR --genomeDir mygenomedir --readFilesIn Sample_.fastq  --runThreadN 60 --genomeLoad NoSharedMemory  --alignIntronMax 1 --alignIntronMin 2 --outFileNamePrefix myprefix_Transcriptome_mm005_SE --outFilterMismatchNoverLmax 0.05 &

With that i get splice free alignments for RSEM input.
Sadly RSEm can just work with GAPLESS data so i need to grab all reads with just  matched or missmatched TAGs in the cigar string using grep.

egrep "^@|(\s[0-9]+[MX=]\s)+"     // header+gapless alignments


is there a possibility to do this on the fly with STAR, because i lose so much time with the egrep command that i just could use RSEM as a whole (but i want to use STAR ^^).

Thx in advance, Alex.



Alexander Dobin

unread,
Aug 21, 2013, 11:25:31 AM8/21/13
to rna-...@googlegroups.com
Hi Alex,

I think you should be able to get rid of all gaps and indels by using 
--alignIntronMax 1 --alignIntronMin 2  --scoreDelOpen -10000  --scoreInsOpen -10000  
I have not tried it myself,  please let us know if it works.

Can RSEM work with soft-clipped alignments, or you need to get rid of them as well?

Cheers
Alex

Alexander Czerny

unread,
Aug 27, 2013, 6:41:11 AM8/27/13
to
As far as i know and asked the rsem developer it is just allowed to use M,= and X in the cigar string.


EDIT: Here the text of the final.log file

                                 Started job on |       Aug 27 11:25:12
                             Started mapping on |       Aug 27 11:25:34
                                    Finished on |       Aug 27 11:34:08
       Mapping speed, Million of reads per hour |       601.63

                          Number of input reads |       85898850
                      Average input read length |       51
                                    UNIQUE READS:
                   Uniquely mapped reads number |       29587260
                        Uniquely mapped reads % |       34.44%
                          Average mapped length |       50.52
                       Number of splices: Total |       0
            Number of splices: Annotated (sjdb) |       0
                       Number of splices: GT/AG |       0
                       Number of splices: GC/AG |       0
                       Number of splices: AT/AC |       0
               Number of splices: Non-canonical |       0
                      Mismatch rate per base, % |       0.52%
                         Deletion rate per base |       0.00%
                        Deletion average length |       0.00
                        Insertion rate per base |       0.00%
                       Insertion average length |       0.00
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       42773929
             % of reads mapped to multiple loci |       49.80%
        Number of reads mapped to too many loci |       1892999
             % of reads mapped to too many loci |       2.20%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.00%
                 % of reads unmapped: too short |       13.43%
                     % of reads unmapped: other |       0.13%


 Seems good to me just 13,43% are not mapped and there is no Deletion/Insertion event. Thx for the help, only the softclipping has to go, then it should work with RSEM.

Alexander Dobin

unread,
Aug 27, 2013, 4:31:47 PM8/27/13
to rna-...@googlegroups.com
Hi Alex,

unfortunately there is no way right now to prevent soft-clipping. Or rather, you can require the number of matches to be equal to read length, but then you will get rid of all reads with mismatches which is not good.
I would have to change the algorithm to force the full length extension of the reads.

Cheers
Alex

Alexander Czerny

unread,
Aug 28, 2013, 1:31:31 AM8/28/13
to rna-...@googlegroups.com
Hi Alex,

thats a pitty. Thx for your help till now, if i find some way arround it, i will post it. Thx for your great support and i love STAR :D.

XiaoYu Liu

unread,
Oct 29, 2013, 11:48:13 AM10/29/13
to rna-...@googlegroups.com
I was searching for a way to use STAR with RSEM too. But was not able to find a solution.

On the other hand, is it possible to extract transcript sequences from the STAR alignment, and use the transcripts as the templates for the downstream RSEM analysis?

What software can be used to reconstruct transcriptome from the genome alignment? I know Cufflink and Scripture probably can do that. Any other  would work?

Thanks!




Alexander Dobin

unread,
Nov 10, 2013, 10:40:02 PM11/10/13
to rna-...@googlegroups.com
Hi XiaoYu,

here is what you can do to make STAR work with RSEM;
1. Run STAR with --alignIntronMax 1 --alignIntronMin 2  --scoreDelOpen -10000  --scoreInsOpen -10000  
This should prevent junctions and indels - however, it will still contain soft-clipping
2. Run this awk script to remove soft-clipping:
awk 'BEGIN {OFS="\t"} {split($6,C,/[0-9]*/); split($6,L,/[SMDIN]/); if (C[2]=="S") {$10=substr($10,L[1]+1); $11=substr($11,L[1]+1)}; if (C[length(C)]=="S") {L1=length($10)-L[length(L)-1]; $10=substr($10,1,L1); $11=substr($11,1,L1); }; gsub(/[0-9]*S/,"",$6); print}' Aligned.out.sam > Aligned.noS.sam

Hopefully the resulting file will be compatible with RSEM.
Please share your experience if you try it out
Cheers
Alex

Alexander Czerny

unread,
Dec 11, 2013, 8:12:33 AM12/11/13
to
Hi Alex,

i got this problem after using the awk script:
/mountpoints/kfoshare/programs/rsem/rsem-run-em ..." failed! Plase check if you provide correct parameters/options for the pipeline!
[samopen] SAM header is present: 195584 sequences.
The alignment of read HWI-ST933:97:D1P8MACXX:1:1210:7463:29613 to transcript 150006 starts at -1 from the forward direction, which should be a non-negative number! It is possible that the aligner you use gave different read lengths for a same read in SAM file.
The alignment of read HWI-ST933:97:D1P8MACXX:1:1101:17149:2421 to transcript 121631 starts at -1 from the forward direction, which should be a non-negative number! It is possible that the aligner you use gave different read lengths for a same read in SAM file.
The alignment of read HWI-ST933:97:D1P8MACXX:1:2208:8490:2214 to transcript 94115 starts at -1 from the forward direction, which should be a non-negative number! It is possible that the aligner you use gave different read lengths for a same read in SAM file.
The alignment of read HWI-ST933:97:D1P8MACXX:1:1201:21319:5238 to transcript 41001 starts at -1 from the forward direction, which should be a non-negative number! It is possible that the aligner you use gave different read lengths for a same read in SAM file.

A read grep from HWI-ST933:97:D1P8MACXX:1:1210:7463:29613 before and after the awk script shows these results, do u know what the problem might be?
before awk:
HWI-ST933:97:D1P8MACXX:1:1207:13121:11352    16    ENST00000454798.2    1939    0    43M    *    0    0    GGTTGGATGAAGATGGCAAAGTCCTGACCCCGGAGGAGCTGCT    IJHGJGIGGJIGGEJIHIIGD<GIIHFJJJHHHHHFFFFFB@C    NH:i:9    HI:i:1    AS:i:42    nM:i:0
HWI-ST933:97:D1P8MACXX:1:1207:13121:11352    272    ENST00000248929.9    2042    0    43M    *    0    0    GGTTGGATGAAGATGGCAAAGTCCTGACCCCGGAGGAGCTGCT    IJHGJGIGGJIGGEJIHIIGD<GIIHFJJJHHHHHFFFFFB@C    NH:i:9    HI:i:2    AS:i:42    nM:i:0
HWI-ST933:97:D1P8MACXX:1:1207:13121:11352    272    ENST00000485962.1    2227    0    43M    *    0    0    GGTTGGATGAAGATGGCAAAGTCCTGACCCCGGAGGAGCTGCT    IJHGJGIGGJIGGEJIHIIGD<GIIHFJJJHHHHHFFFFFB@C    NH:i:9    HI:i:3    AS:i:42    nM:i:0
HWI-ST933:97:D1P8MACXX:1:1207:13121:11352    272    ENST00000470518.1    621    0    43M    *    0    0    GGTTGGATGAAGATGGCAAAGTCCTGACCCCGGAGGAGCTGCT    IJHGJGIGGJIGGEJIHIIGD<GIIHFJJJHHHHHFFFFFB@C    NH:i:9    HI:i:4    AS:i:42    nM:i:0
HWI-ST933:97:D1P8MACXX:1:1207:13121:11352    272    ENST00000478085.1    3000    0    43M    *    0    0    GGTTGGATGAAGATGGCAAAGTCCTGACCCCGGAGGAGCTGCT    IJHGJGIGGJIGGEJIHIIGD<GIIHFJJJHHHHHFFFFFB@C    NH:i:9    HI:i:5    AS:i:42    nM:i:0
HWI-ST933:97:D1P8MACXX:1:1207:13121:11352    272    ENST00000427834.1    188    0    43M    *    0    0    GGTTGGATGAAGATGGCAAAGTCCTGACCCCGGAGGAGCTGCT    IJHGJGIGGJIGGEJIHIIGD<GIIHFJJJHHHHHFFFFFB@C    NH:i:9    HI:i:6    AS:i:42    nM:i:0
HWI-ST933:97:D1P8MACXX:1:1207:13121:11352    272    ENST00000469719.1    113    0    43M    *    0    0    GGTTGGATGAAGATGGCAAAGTCCTGACCCCGGAGGAGCTGCT    IJHGJGIGGJIGGEJIHIIGD<GIIHFJJJHHHHHFFFFFB@C    NH:i:9    HI:i:7    AS:i:42    nM:i:0
HWI-ST933:97:D1P8MACXX:1:1207:13121:11352    272    ENST00000462457.1    71    0    43M    *    0    0    GGTTGGATGAAGATGGCAAAGTCCTGACCCCGGAGGAGCTGCT    IJHGJGIGGJIGGEJIHIIGD<GIIHFJJJHHHHHFFFFFB@C    NH:i:9    HI:i:8    AS:i:42    nM:i:0
HWI-ST933:97:D1P8MACXX:1:1207:13121:11352    272    ENST00000417424.1    1    0    1S42M    *    0    0    GGTTGGATGAAGATGGCAAAGTCCTGACCCCGGAGGAGCTGCT    IJHGJGIGGJIGGEJIHIIGD<GIIHFJJJHHHHHFFFFFB@C    NH:i:9    HI:i:9    AS:i:41    nM:i:0


after awk:
HWI-ST933:97:D1P8MACXX:1:1207:13121:11352    16    ENST00000454798.2    1939    0    43M    *    0    0    GGTTGGATGAAGATGGCAAAGTCCTGACCCCGGAGGAGCTGCT    IJHGJGIGGJIGGEJIHIIGD<GIIHFJJJHHHHHFFFFFB@C    NH:i:9    HI:i:1    AS:i:42    nM:i:0
HWI-ST933:97:D1P8MACXX:1:1207:13121:11352    272    ENST00000248929.9    2042    0    43M    *    0    0    GGTTGGATGAAGATGGCAAAGTCCTGACCCCGGAGGAGCTGCT    IJHGJGIGGJIGGEJIHIIGD<GIIHFJJJHHHHHFFFFFB@C    NH:i:9    HI:i:2    AS:i:42    nM:i:0
HWI-ST933:97:D1P8MACXX:1:1207:13121:11352    272    ENST00000485962.1    2227    0    43M    *    0    0    GGTTGGATGAAGATGGCAAAGTCCTGACCCCGGAGGAGCTGCT    IJHGJGIGGJIGGEJIHIIGD<GIIHFJJJHHHHHFFFFFB@C    NH:i:9    HI:i:3    AS:i:42    nM:i:0
HWI-ST933:97:D1P8MACXX:1:1207:13121:11352    272    ENST00000470518.1    621    0    43M    *    0    0    GGTTGGATGAAGATGGCAAAGTCCTGACCCCGGAGGAGCTGCT    IJHGJGIGGJIGGEJIHIIGD<GIIHFJJJHHHHHFFFFFB@C    NH:i:9    HI:i:4    AS:i:42    nM:i:0
HWI-ST933:97:D1P8MACXX:1:1207:13121:11352    272    ENST00000478085.1    3000    0    43M    *    0    0    GGTTGGATGAAGATGGCAAAGTCCTGACCCCGGAGGAGCTGCT    IJHGJGIGGJIGGEJIHIIGD<GIIHFJJJHHHHHFFFFFB@C    NH:i:9    HI:i:5    AS:i:42    nM:i:0
HWI-ST933:97:D1P8MACXX:1:1207:13121:11352    272    ENST00000427834.1    188    0    43M    *    0    0    GGTTGGATGAAGATGGCAAAGTCCTGACCCCGGAGGAGCTGCT    IJHGJGIGGJIGGEJIHIIGD<GIIHFJJJHHHHHFFFFFB@C    NH:i:9    HI:i:6    AS:i:42    nM:i:0
HWI-ST933:97:D1P8MACXX:1:1207:13121:11352    272    ENST00000469719.1    113    0    43M    *    0    0    GGTTGGATGAAGATGGCAAAGTCCTGACCCCGGAGGAGCTGCT    IJHGJGIGGJIGGEJIHIIGD<GIIHFJJJHHHHHFFFFFB@C    NH:i:9    HI:i:7    AS:i:42    nM:i:0
HWI-ST933:97:D1P8MACXX:1:1207:13121:11352    272    ENST00000462457.1    71    0    43M    *    0    0    GGTTGGATGAAGATGGCAAAGTCCTGACCCCGGAGGAGCTGCT    IJHGJGIGGJIGGEJIHIIGD<GIIHFJJJHHHHHFFFFFB@C    NH:i:9    HI:i:8    AS:i:42    nM:i:0
HWI-ST933:97:D1P8MACXX:1:1207:13121:11352    272    ENST00000417424.1    1    0    42M    *    0    0    GTTGGATGAAGATGGCAAAGTCCTGACCCCGGAGGAGCTGCT    JHGJGIGGJIGGEJIHIIGD<GIIHFJJJHHHHHFFFFFB@C    NH:i:9    HI:i:9    AS:i:41    nM:i:0

EDIT:
It seems that the leftmost position has not changed, shouldnt it be now 2 instead of 1 ?
I did a mini example wihtout reads that were altered by the awk script:
...
@SQ    SN:ENSTR0000583047.1    LN:200
@SQ    SN:ENSTR0000578699.1    LN:200
@SQ    SN:ENSTR0000581137.1    LN:199
@SQ    SN:ENSTR0000580687.1    LN:199
@SQ    SN:ENSTR0000577896.1    LN:439
@SQ    SN:ENSTR0000580266.1    LN:200
@SQ    SN:ENSTR0000577553.1    LN:226
@SQ    SN:ENSTR0000285718.7    LN:1729
@SQ    SN:ENSTR0000483543.2    LN:898
HWI-ST933:97:D1P8MACXX:1:1207:13121:11352\t16\tENST00000454798.2\t1939\t0\t43M\t*\t0\t0\tGGTTGGATGAAGATGGCAAAGTCCTGACCCCGGAGGAGCTGCT\tIJHGJGIGGJIGGEJIHIIGD<GIIHFJJJHHHHHFFFFFB@C\tNH:i:1\tHI:i:1\tAS:i:42\tnM:i:0
...

and it worked:
Expression Results are written!
[samopen] SAM header is present: 195584 sequences.
Bam output file is generated!
Time Used for EM.cpp : 0 h 00 m 08 s

Greetings, Alex.

Alexander Dobin

unread,
Dec 11, 2013, 3:34:17 PM12/11/13
to rna-...@googlegroups.com
Hi Alex,

I think RSEM complains because after the awk script removed the "S" from CIGAR, the same read has different "lengths" of its multiple alignments, which is not good indeed.
I guess we need to replace "S" with "M", not simply clip it. I will update the awk script and send it to you shortly.

Cheers
Alex

Alexander Czerny

unread,
Dec 12, 2013, 3:08:11 AM12/12/13
to rna-...@googlegroups.com
Hi Alex,

ah okay seems logic. I'am very thankful for your help.


greetings, Alex.

Alexander Dobin

unread,
Dec 21, 2013, 8:22:43 AM12/21/13
to rna-...@googlegroups.com
Hi Alex,

I have implemented the end-to-end alignment in STAR, please try this patch:
To prohibit soft-clipping you would need to use
--alignEndsType EndToEnd

Please let me know if it works for you, I have not tested it very thoroughly yet.
Cheers
Alex

Alexander Czerny

unread,
Jan 7, 2014, 6:26:46 AM1/7/14
to
HI Alex, i did use your new version but its not fully "unclipped" , i paid attention on indel and splices:
                  Average mapped length |    51.00

                       Number of splices: Total |    0
            Number of splices: Annotated (sjdb) |    0
                       Number of splices: GT/AG |    0
                       Number of splices: GC/AG |    0
                       Number of splices: AT/AC |    0
               Number of splices: Non-canonical |    0
                      Mismatch rate per base, % |    0.94%

                         Deletion rate per base |    0.00%
                        Deletion average length |    0.00
                        Insertion rate per base |    0.00%
                       Insertion average length |    0.00

EDIT: i uploaded for you the 4491 alignments from my sam file, if u need them.

used your new version:

STAR svn revision compiled=STAR_2.3.1v_r378
STAR compilation time,server,dir=Sat Dec 21 00:04:57 EST 2013 verona.cshl.edu:/data/gingeras/user/dobin/STAR/SandBox/STAR/STAR_2.3.1v

and counted the cigar strings after the mapping:

12963543 *
   4491 1S50M
21671767 51M

so there are still 4491 softclipped reads in the data.

EDIT 2: After excluding these reads from the sam-file RSEM went sucessfully through.
softclips.sam

Alexander Dobin

unread,
Jan 9, 2014, 5:16:37 PM1/9/14
to rna-...@googlegroups.com
Hi Alex,

thanks for reporting this - it was a bug that is fixed in this patch:
Cheers
Alex


On Monday, January 6, 2014 8:40:08 AM UTC-5, Alexander Czerny wrote:
HI Alex, i did use your new version but its not fully "unclipped" , i paid attention on indel and splices:
                  Average mapped length |    51.00
                       Number of splices: Total |    0
            Number of splices: Annotated (sjdb) |    0
                       Number of splices: GT/AG |    0
                       Number of splices: GC/AG |    0
                       Number of splices: AT/AC |    0
               Number of splices: Non-canonical |    0
                      Mismatch rate per base, % |    0.94%

                         Deletion rate per base |    0.00%
                        Deletion average length |    0.00
                        Insertion rate per base |    0.00%
                       Insertion average length |    0.00

Shawn Driscoll

unread,
Jan 10, 2014, 4:04:25 PM1/10/14
to rna-...@googlegroups.com
I posted this over in the corresponding discussion in RSEM's group:

"One thing I'd be cautious of when using STAR, or at least something to consider, is when I export unmapped reads from STAR (which I've aligned for use with eXpress or RSEM) I've found that often 10 to 50% of those unmapped reads can be mapped with Bowtie.  That's from attempting to have approximately the same alignment parameters set in both aligners.  I've made no attempt to discover if those reads SHOULD be mapped but many of them have only between 0 and 2 mismatches so I assume they are pretty good matches and at least valid as input for RSEM/eXpress. Here are my command lines just for reference (single end example):

star --runMode alignReads --genomDir <path> --outFilterMismatchNoverLmax 0.2 --outFilterMatchNminOverLread 0.95 \
  --alignIntronMax 1 --alignIntronMin 2 --outReadsUnmapped Fastx --outFilterMultimapNmax 500 --readFilesIn <reads>

Then aligning the unmapped reads afterwords with bowtie[1]

bowtie -n 2 -e 500 -m 500 -k 500 -S <ref> <reads>

In the data I'm testing now the STAR stage mapped about 16M reads and the following bowtie stage mapped an additional 8M reads...so it's no small chunk! 

So for good practice it's not a bad idea to run an initial mapping with STAR to cut down on the amount of data that has to be mapped with bowtie but to also still run bowtie on the unmapped data from STAR since it doesn't do a complete job.  Or maybe I'm not using it right. 

Alexander Czerny

unread,
Jan 14, 2014, 2:21:55 AM1/14/14
to rna-...@googlegroups.com
Hi Alex,

i tested your new version of STAR and RSEM went through smoothly. Thanks for your great work.

Alexander Dobin

unread,
Jan 15, 2014, 6:52:39 PM1/15/14
to rna-...@googlegroups.com
Hi Shawn,

thanks for this observation - it is worrisome for me. Could you share some part of this dataset, a few thousands of the reads that were not mapped by STAR, but mapped by bowtie, and the annotations from which you generated the transcriptome reference.

Cheers
Alex

Shawn Driscoll

unread,
Jan 15, 2014, 8:08:54 PM1/15/14
to rna-...@googlegroups.com
I'd be happy to share some reads.  I spent some time yesterday trying to work with the alignment settings and rules for STAR but I wasn't able to get STAR to map as many reads as bowtie2 (or bwa for that matter).  I'll make a subset for you tomorrow - I should have time.  Because of this I typically keep the unmapped reads from STAR and map them with bowtie2 afterwards and merge the set before running eXpress.  I don't bother to do anything with unmapped reads from genome alignments with STAR - I don't think there's any reason to think any other software will beat it right now for that.

Shawn

Shawn Driscoll

unread,
Jan 16, 2014, 7:32:16 PM1/16/14
to rna-...@googlegroups.com
Hi Alex,

I'm investigating this more today.  It's a little difficult to get each aligner to work in the same way and I have avoided doing post filtering to control for things like maximum number of mismatches and indels/clipping, etc.  I have a feeling, however, that the large difference I saw in what I noted there has a lot to do with my STAR alignment options.  I guess I'm just starting to understand them.  For example to control maximum mismatches with bowtie2 you have to properly set mismatch penalties, possibly ignore base qualities and then adjust the minimum score for reporting alignments.  Until recently I thought that I could control maximum mismatches in STAR by ONLY changing the "outFilterMismatchNoverLmax" value OR the "outFilterMismatchNmax" setting.  Now (I think) I understand that I have to set both of those and either one can be a limit depending on which is hit first.  So I've taken to setting the latter to a very high number (> length of my reads) and the former to somewhere between 0.04 and 0.1 depending on how lax I want the setting to be.  It seems like with those options combined and the new end-to-end alignment mode I get mismatches in the range that I'm looking for.  So using the 0.1 setting with 85 base reads I'm getting up to 8 mismatches which is what I wanted.

Another example is I didn't fully understand what was going on with the multi-alignment allowance.  Most other aligners have a single value you set to allow reporting of multi-mappers but again it seems that you have to set two values in STAR.  I just figured out that I can get more multi-mapping output by increasing the value of "outFilterMultimapScoreRange" in combination with a high value for "outFilterMultimapNmax".  In other aligners when you ask for multi-mappings they report as many as you ask for (or as many as are found) with scores down to whatever the overall minimum alignment score is for the entire alignment job.  Ideally I'd like to be able to do this same thing in STAR for running alignments into eXpress or RSEM. So maybe I can set "outFilterMultimapScoreRange" to something large (maybe larger than the read length?) and then I would effectively be using the same limit based on the minimum alignment score setting.

I also didn't fully understand the "outFilterMatchNminOverLread" setting. I thought I was using it to control soft-clipping but now I see that it limits the alignment score so in the example where bowtie2 aligned 8M more reads I think I had been restricting the alignment score to 98 or something very high while bowtie2 was being allowed a lot of wiggle room.

I've run some more tests on the same single-end data and now the difference between STAR and bowtie2 are much smaller.  bowtie2 aligned 57% of the reads and star aligned 56%.  About 4% of the reads aligned by bowtie2 were not aligned by STAR and about 2% of the reads aligned by STAR were not aligned by bowtie2.  So there's some difference on both sides.  From what I can tell the bowtie2 unique alignments tend to have a combination of high mismatches and 1 to 2 indel events while the STAR unique alignments are more commonly just high mismatches with the occasional indel.  Oddly one read that simply had 8 mismatches that STAR aligned I couldn't get bowtie2 to find an alignment no matter what setting I used.  bowtie1, however, aligned it no problem in the exact same location as STAR.

So what I'm seeing is probably more aligner noise and possibly an artifact of not getting the exact same mismatch/minimum score/indel allowance settings between aligners. 

I tried my test data with bowtie2, STAR, GEM and BWA (aln) and each of them had their own unique bag of reads that they mapped that the others didn't.  So, in conclusion, if one really wants to get as much data aligned as possible one should actually use all of the currently available aligners.

I'll edit my post over in the RSEM group to reflect what I'm seeing here. 

Any input from you regarding how the STAR aligner settings can be manipulated to allow similar alignment policies to tools like BWA or bowtie would be super helpful.  Hopefully I'm on the right track in terms of my understanding of how the paired settings for mismatches and multi-mappers work.

Alexander Dobin

unread,
Jan 18, 2014, 12:14:41 PM1/18/14
to rna-...@googlegroups.com
Hi Shawn,

thanks a lot for sharing this detailed analysis. Your observations about the behavior of STAR parameters are right to the point.

The output filters in STAR are applied in the "AND" fashion, so the most stringent filter will control the final decision.
If you want only one of the similar filters to work, you have to make the other completely nonrestrictive - as you did with --outFilterMismatchNoverLmax 0.1 --outFilterMismatchNmax  <very large>
Of course, if you wanted to set absolute number of mismatches Nmm rather than relative, you could do --outFilterMismatchNoverLmax 1 --outFilterMismatchNmax  <Nmm>.

The logic for the multi-mapping reads is indeed different from the DNA aligners since STAR was originally designed for RNA alignments. As far as I understand, DNA aligners typically report the multi-mapping alignments to define the "mapping quality" of the best alignment. On the other hand, for RNA-seq we typically care about all possible loci from which a read could have originated - so we only need to explore alignment scores very close to the best score. STAR will report all multimapping alignments with the scores >= (bestScore-outFilterMultimapScoreRange). If the number of these alignments is > outFilterMultimapNmax, no alignments of this read will be output - this is just to prevent pollution of the output.
If you set --outFilterMultimapScoreRange to a very large number (> read length), STAR will output a lot of suboptimal alignment pieces, more or less like BLAT does - I am not sure if this is very useful. I guess I could introduce an option to limit the absolute minimum alignment score for the sub-optimal alignments.

outFilterMatchNminOverLread is restricting the minimum number of the bases matched to the genome, excluding mismatches or indels. If you set it very high it will indeed limit the number of allowed mismatches. Other parameters of similar meaning are (again, these filter will be applied in the "AND" fashion):
outFilterMatchNmin              0
    int: alignment will be output only if the number of matched bases is higher than this value
outFilterScoreMin               0
    int: alignment will be output only if its score is higher than this value
outFilterScoreMinOverLread      0.66
        float: outFilterScoreMin normalized to read length (sum of mates' lengths for paired-end reads)
By default these filters are quite nonrestrictive, allowing alignments longer than 2/3 of the read length.

Your observation about fewer indels found by STAR than the DNA aligners is correct - STAR is very cautious about the indels that are close to read ends.
Accurate identification of such indels is not easy for RNA-seq, since they could also be mapped as splice junctions with much longer gaps.
My preference is to have a lower sensitivity but higher precision. For DNA alignments, calling short indels near the ends is important, and I will have to think about an efficient algorithm to detect them. 

I would like to make STAR more friendly to DNA in the future, but I would have to delve into DNA alignnment - I do not have much experience with it.
I think you have figured out the main points already: end-to-end alignments vs. local, controlling the number of mismatches, controlling the multi-mappers, prohibiting splicing.

Cheers
Alex

Shawn Driscoll

unread,
Jan 18, 2014, 4:28:44 PM1/18/14
to rna-...@googlegroups.com
Interesting.  Thank you for the detailed response.  So when we align rna-seq reads directly to transcripts we're actually doing a DNA type alignment which, as it sounds, requires a slightly different strategy. So at some super fine-grain level STAR isn't necessarily appropriate for this task (or not as appropriate as using a DNA mapper). Though it does work quite well once the settings are right.

Thanks for the insight.
Reply all
Reply to author
Forward
0 new messages