Re: Increase unique mapper ratio dramatically after providing splicing junction annotation


Alexander Dobin

Apr 11, 2013, 6:44:19 PM
to rna-...@googlegroups.com
Hi Wei,

the annotations improve mappability of the reads with short overhangs over splice junctions. If your reads are short, this could make all the difference in terms of unique/multi-mappers.
The magnitude of the effect is a bit surprising. If you could post the Log.final.out files from both sjdb/no-sjdb runs, I could try to look for an explanation.

Cheers
Alex

On Thursday, April 11, 2013 5:22:41 PM UTC-4, WEI WANG wrote:
Hi Alex,

I ran STAR twice on my paired-end RNA-seq datasets, with and without an sjdb index.

Without sjdb (I ran it using STAR_2.3.0e), I got 35.26% unique mappers and 48.23% multiple mappers.

With sjdb annotation (STAR_2.3.1h), I got 74.32% unique mappers and 8.97% multiple mappers when --sjdbScore was set to 2;
72.98% unique mappers and 10.26% multiple mappers, when --sjdbScore was set to 0. 

What could explain the huge discrepancy in the unique mapper ratio?

Thanks

Wei

WEI WANG

Apr 12, 2013, 10:32:13 AM
to rna-...@googlegroups.com
The Log.final.out without sjdb:
 
                                 Started job on | Mar 30 23:42:53
                             Started mapping on | Mar 31 00:13:20
                                    Finished on | Mar 31 03:57:35
       Mapping speed, Million of reads per hour | 45.15

                          Number of input reads | 168752969
                      Average input read length | 180
                                    UNIQUE READS:
                   Uniquely mapped reads number | 59496854
                        Uniquely mapped reads % | 35.26%
                          Average mapped length | 176.46
                       Number of splices: Total | 2421397
            Number of splices: Annotated (sjdb) | 0
                       Number of splices: GT/AG | 2235903
                       Number of splices: GC/AG | 10639
                       Number of splices: AT/AC | 2550
               Number of splices: Non-canonical | 172305
                      Mismatch rate per base, % | 1.51%
                         Deletion rate per base | 0.06%
                        Deletion average length | 2.39
                        Insertion rate per base | 0.03%
                       Insertion average length | 2.36
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci | 81388848
             % of reads mapped to multiple loci | 48.23%
        Number of reads mapped to too many loci | 0
             % of reads mapped to too many loci | 0.00%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches | 0.00%
                 % of reads unmapped: too short | 16.40%
                     % of reads unmapped: other | 0.12%

The one with sjdb:
                                 Started job on | Apr 10 10:55:28
                             Started mapping on | Apr 10 11:10:00
                                    Finished on | Apr 10 13:17:35
       Mapping speed, Million of reads per hour | 79.36

                          Number of input reads | 168752969
                      Average input read length | 180
                                    UNIQUE READS:
                   Uniquely mapped reads number | 125416524
                        Uniquely mapped reads % | 74.32%
                          Average mapped length | 176.33
                       Number of splices: Total | 80268755
            Number of splices: Annotated (sjdb) | 77598712
                       Number of splices: GT/AG | 78967027
                       Number of splices: GC/AG | 763542
                       Number of splices: AT/AC | 205914
               Number of splices: Non-canonical | 332272
                      Mismatch rate per base, % | 1.58%
                         Deletion rate per base | 0.06%
                        Deletion average length | 2.46
                        Insertion rate per base | 0.04%
                       Insertion average length | 2.24
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci | 15132014
             % of reads mapped to multiple loci | 8.97%
        Number of reads mapped to too many loci | 0
             % of reads mapped to too many loci | 0.00%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches | 0.00%
                 % of reads unmapped: too short | 16.53%
                     % of reads unmapped: other | 0.18%
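
For a quick side-by-side of the two runs, here is a minimal sketch that parses the `name | value` lines of a Log.final.out and compares a few fields. The function names are made up for illustration, and only four fields from the logs above are pasted in; with real files one would read the text from disk instead.

```python
def parse_log_final(text):
    """Return {field name: value string} for every 'name | value' line."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers such as 'UNIQUE READS:' carry no pipe
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def as_number(value):
    """Convert '35.26%' or '2421397' to a float."""
    return float(value.rstrip("%"))


# Values copied from the two logs in this thread (subset of fields only).
no_sjdb = parse_log_final("""
                        Uniquely mapped reads % | 35.26%
                       Number of splices: Total | 2421397
            Number of splices: Annotated (sjdb) | 0
             % of reads mapped to multiple loci | 48.23%
""")
with_sjdb = parse_log_final("""
                        Uniquely mapped reads % | 74.32%
                       Number of splices: Total | 80268755
            Number of splices: Annotated (sjdb) | 77598712
             % of reads mapped to multiple loci | 8.97%
""")

for field in no_sjdb:
    before = as_number(no_sjdb[field])
    after = as_number(with_sjdb[field])
    print(f"{field:40s} {before:>14,.2f} {after:>14,.2f}")

# Fold change in total splices and gain in unique mappers (percentage points).
splice_fold = (as_number(with_sjdb["Number of splices: Total"])
               / as_number(no_sjdb["Number of splices: Total"]))
unique_gain = (as_number(with_sjdb["Uniquely mapped reads %"])
               - as_number(no_sjdb["Uniquely mapped reads %"]))
print(f"splices: {splice_fold:.1f}x, unique mappers: {unique_gain:+.2f} points")
```

The splice count is the field that changes most between the two runs, far more than the mismatch or indel rates, which stay nearly the same.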

WEI WANG

Apr 12, 2013, 10:10:18 PM
to rna-...@googlegroups.com
While I'm still waiting for your reply, I played with the parameter --outFilterMultimapScoreRange, changing it from the default of 1 to 40, and got the following statistics:
                                 Started job on |       Apr 12 18:03:54
                             Started mapping on |       Apr 12 18:19:11
                                    Finished on |       Apr 12 20:59:24
       Mapping speed, Million of reads per hour |       63.20

                          Number of input reads |       168752969
                      Average input read length |       180
                                    UNIQUE READS:
                   Uniquely mapped reads number |       60115001
                        Uniquely mapped reads % |       35.62%
                          Average mapped length |       176.40
                       Number of splices: Total |       16095272
            Number of splices: Annotated (sjdb) |       15661688
                       Number of splices: GT/AG |       15977408
                       Number of splices: GC/AG |       43364
                       Number of splices: AT/AC |       11620
               Number of splices: Non-canonical |       62880
                      Mismatch rate per base, % |       1.49%
                         Deletion rate per base |       0.02%
                        Deletion average length |       2.18
                        Insertion rate per base |       0.01%
                       Insertion average length |       1.93
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       80364589
             % of reads mapped to multiple loci |       47.62%
        Number of reads mapped to too many loci |       0
             % of reads mapped to too many loci |       0.00%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.00%
                 % of reads unmapped: too short |       16.57%
                     % of reads unmapped: other |       0.18%

Which makes sense now... But what's your official answer? :)



Alexander Dobin

Apr 13, 2013, 8:51:28 AM
to rna-...@googlegroups.com
Hi Wei,

you have an interesting dataset, and while I think we can understand qualitatively what's going on, the magnitude of the effect is surprisingly large.
Without sjdb, you get only 2.4M splices, and 60M unique / 81M multi-mappers.
With sjdb, you get 80M splices, and 125M unique / 15M multi-mappers.

It looks like a very large number of your reads had short junction overhangs, which could not be placed without sjdb but were placed at annotated junctions with sjdb, thus increasing the number of splices. This jump, from 2.4M to 80M, is very large compared to our data: we usually see just a 30-70% increase in the number of splices when switching from -sjdb to +sjdb. I guess it could be explained by some strange interplay between the read length and the expressed exon length in your samples. At the same time, the majority of the +sjdb spliced reads appear to have been multi-mappers without sjdb, and once they become spliced with +sjdb, they also acquire enough score difference from the next best locus to become unique mappers. Your third test is indeed consistent with this explanation: you increased the allowed score difference for multi-mappers from 1 to 40, and so you reverted to numbers similar to -sjdb. However, you should still see a lot of splices in the multi-mapping reads.
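
To make the score-difference argument concrete, here is a toy model, not STAR's actual code: a read counts as a multi-mapper when more than one locus has an alignment score within the allowed range of the best score. The `classify` function and all the scores below are hypothetical and chosen only for illustration.

```python
def classify(scores, score_range=1):
    """Toy model: 'unique' if only one locus scores within score_range of the best."""
    best = max(scores)
    n_loci = sum(1 for s in scores if s >= best - score_range)
    return "unique" if n_loci == 1 else "multi"


# Hypothetical read with a short overhang across a junction.
# Without sjdb the overhang cannot be placed, so two loci tie.
scores_no_sjdb = [170, 170]
# With sjdb the overhang is aligned across the annotated junction
# at one locus, which raises that locus's score above the other.
scores_with_sjdb = [178, 170]

print(classify(scores_no_sjdb, score_range=1))     # tie between two loci
print(classify(scores_with_sjdb, score_range=1))   # gap of 8 exceeds range 1
print(classify(scores_with_sjdb, score_range=40))  # gap of 8 is within range 40
```

In this model the spliced alignment becomes unique only because its score pulls ahead of the second locus by more than the allowed range, which is why widening the range to 40 would push such reads back into the multi-mapper category.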

My official position :) is that it is always preferable to use the annotations (sjdb) for mapping. While the magnitude of the effect you are observing is unusually large, qualitatively it follows the expectations.

If you want to dig deeper into it for your dataset, I would look at a few reads that were mapped as multi-mappers without sjdb and became unique mappers with sjdb. We could try to understand how STAR made the decisions in the -sjdb vs +sjdb runs.
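
One way to pull out such reads is sketched below. It assumes both runs wrote SAM output with the `NH:i:` attribute (number of loci the read maps to). The helper names and the three SAM records are invented for illustration; with real data you would pass open file handles for the two SAM files instead of the lists.

```python
def hits_per_read(sam_lines):
    """Map read name -> NH value (number of reported loci) from SAM records."""
    hits = {}
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11 or int(fields[1]) & 4:
            continue  # skip malformed or unmapped records
        for tag in fields[11:]:
            if tag.startswith("NH:i:"):
                hits[fields[0]] = int(tag[5:])
                break
    return hits


def became_unique(no_sjdb_lines, with_sjdb_lines):
    """Read names with NH > 1 in the first run and NH == 1 in the second."""
    before = hits_per_read(no_sjdb_lines)
    after = hits_per_read(with_sjdb_lines)
    return sorted(name for name, n in before.items()
                  if n > 1 and after.get(name) == 1)


# Invented records: read1 changes from two loci to one, read2 stays unique.
no_sjdb_sam = [
    "@HD\tVN:1.4",
    "read1\t0\tchr1\t100\t3\t50M\t*\t0\t0\t*\t*\tNH:i:2",
    "read1\t256\tchr2\t500\t3\t50M\t*\t0\t0\t*\t*\tNH:i:2",
    "read2\t0\tchr1\t900\t255\t50M\t*\t0\t0\t*\t*\tNH:i:1",
]
with_sjdb_sam = [
    "@HD\tVN:1.4",
    "read1\t0\tchr1\t100\t255\t40M1000N10M\t*\t0\t0\t*\t*\tNH:i:1",
    "read2\t0\tchr1\t900\t255\t50M\t*\t0\t0\t*\t*\tNH:i:1",
]

print(became_unique(no_sjdb_sam, with_sjdb_sam))
```

Once you have a handful of such read names, compare their alignments in the two runs side by side (CIGAR, position, and score) to see where the annotated junction changed the outcome.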

Cheers
Alex

Alisha Holloway

Oct 24, 2013, 1:37:31 PM
to rna-...@googlegroups.com
Can you share your command line calls?  Also, where did you find 2.3.1h?  I only see 2.3.0e on the download page.

Thanks!

Alisha

Kamil Cygan

Dec 10, 2013, 2:45:58 AM
to rna-...@googlegroups.com
Hey Alisha, 

You can find alpha versions of the software here: ftp://ftp2.cshl.edu/gingeraslab/tracks/STARrelease/Alpha/

Regards, 

Kamil