Re: Increase unique mapper ratio dramatically after providing splicing junction annotation

Alexander Dobin


Apr 11, 2013, 6:44:19 PM

to rna-...@googlegroups.com

Hi Wei,

the annotations improve mappability of the reads with short overhangs over splice junctions. If your reads are short, this could make all the difference in terms of unique/multi-mappers.

The magnitude of the effect is a bit surprising. If you could post the Log.final.out files from both sjdb/no-sjdb runs, I could try to look for an explanation.

Cheers

Alex

On Thursday, April 11, 2013 5:22:41 PM UTC-4, WEI WANG wrote:

Hi Alex,

I ran STAR twice on my paired-end RNA-seq datasets, with and without an sjdb index.

Without sjdb (using STAR_2.3.0e), I got 35.26% unique mappers and 48.23% multiple mappers.

With sjdb annotation (STAR_2.3.1h), I got 74.32% unique mappers and 8.97% multiple mappers when --sjdbScore was set to 2, and 72.98% unique mappers and 10.26% multiple mappers when --sjdbScore was set to 0.

What could explain the huge discrepancy in the unique mapper ratio?

Thanks

Wei


WEI WANG


Apr 12, 2013, 10:32:13 AM

to rna-...@googlegroups.com

The Log.final.out without sjdb:

Started job on | Mar 30 23:42:53

Started mapping on | Mar 31 00:13:20

Finished on | Mar 31 03:57:35

Mapping speed, Million of reads per hour | 45.15

Number of input reads | 168752969

Average input read length | 180

UNIQUE READS:

Uniquely mapped reads number | 59496854

Uniquely mapped reads % | 35.26%

Average mapped length | 176.46

Number of splices: Total | 2421397

Number of splices: Annotated (sjdb) | 0

Number of splices: GT/AG | 2235903

Number of splices: GC/AG | 10639

Number of splices: AT/AC | 2550

Number of splices: Non-canonical | 172305

Mismatch rate per base, % | 1.51%

Deletion rate per base | 0.06%

Deletion average length | 2.39

Insertion rate per base | 0.03%

Insertion average length | 2.36

MULTI-MAPPING READS:

Number of reads mapped to multiple loci | 81388848

% of reads mapped to multiple loci | 48.23%

Number of reads mapped to too many loci | 0

% of reads mapped to too many loci | 0.00%

UNMAPPED READS:

% of reads unmapped: too many mismatches | 0.00%

% of reads unmapped: too short | 16.40%

% of reads unmapped: other | 0.12%

The one with sjdb:

Started job on | Apr 10 10:55:28

Started mapping on | Apr 10 11:10:00

Finished on | Apr 10 13:17:35

Mapping speed, Million of reads per hour | 79.36

Number of input reads | 168752969

Average input read length | 180

UNIQUE READS:

Uniquely mapped reads number | 125416524

Uniquely mapped reads % | 74.32%

Average mapped length | 176.33

Number of splices: Total | 80268755

Number of splices: Annotated (sjdb) | 77598712

Number of splices: GT/AG | 78967027

Number of splices: GC/AG | 763542

Number of splices: AT/AC | 205914

Number of splices: Non-canonical | 332272

Mismatch rate per base, % | 1.58%

Deletion rate per base | 0.06%

Deletion average length | 2.46

Insertion rate per base | 0.04%

Insertion average length | 2.24

MULTI-MAPPING READS:

Number of reads mapped to multiple loci | 15132014

% of reads mapped to multiple loci | 8.97%

Number of reads mapped to too many loci | 0

% of reads mapped to too many loci | 0.00%

UNMAPPED READS:

% of reads unmapped: too many mismatches | 0.00%

% of reads unmapped: too short | 16.53%

% of reads unmapped: other | 0.18%

WEI WANG


Apr 12, 2013, 10:10:18 PM

to rna-...@googlegroups.com

While waiting for your reply, I played with the parameter --outFilterMultimapScoreRange, changing it from the default of 1 to 40, and got the following statistics:

Started job on | Apr 12 18:03:54

Started mapping on | Apr 12 18:19:11

Finished on | Apr 12 20:59:24

Mapping speed, Million of reads per hour | 63.20

Number of input reads | 168752969

Average input read length | 180

UNIQUE READS:

Uniquely mapped reads number | 60115001

Uniquely mapped reads % | 35.62%

Average mapped length | 176.40

Number of splices: Total | 16095272

Number of splices: Annotated (sjdb) | 15661688

Number of splices: GT/AG | 15977408

Number of splices: GC/AG | 43364

Number of splices: AT/AC | 11620

Number of splices: Non-canonical | 62880

Mismatch rate per base, % | 1.49%

Deletion rate per base | 0.02%

Deletion average length | 2.18

Insertion rate per base | 0.01%

Insertion average length | 1.93

MULTI-MAPPING READS:

Number of reads mapped to multiple loci | 80364589

% of reads mapped to multiple loci | 47.62%

Number of reads mapped to too many loci | 0

% of reads mapped to too many loci | 0.00%

UNMAPPED READS:

% of reads unmapped: too many mismatches | 0.00%

% of reads unmapped: too short | 16.57%

% of reads unmapped: other | 0.18%

Which makes sense now... But what's your official answer? :)


Alexander Dobin


Apr 13, 2013, 8:51:28 AM

to rna-...@googlegroups.com

Hi Wei,

you have an interesting dataset, and while I think we can understand qualitatively what's going on, the magnitude of the effect is surprisingly large.

Without sjdb, you get only 2.4M splices, and 60M unique / 81M multi-mappers.

With sjdb, you get 80M splices, and 125M unique / 15M multi-mappers.
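For anyone who wants to redo this comparison on their own runs, here is a minimal Python sketch that reads the "key | value" lines of a Log.final.out and prints the change between two runs. The function names are made up for illustration, and the embedded text is just an excerpt of the two logs posted earlier in this thread.

```python
def parse_log_final(text):
    """Parse 'key | value' lines of a STAR Log.final.out into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # skips blank lines and headers such as "UNIQUE READS:"
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def as_number(value):
    """Convert a value such as '35.26%' or '2421397' to a float."""
    return float(value.rstrip("%"))


# Excerpts copied from the two logs posted in this thread.
no_sjdb = parse_log_final("""
Uniquely mapped reads number | 59496854
Number of splices: Total | 2421397
Number of reads mapped to multiple loci | 81388848
""")
with_sjdb = parse_log_final("""
Uniquely mapped reads number | 125416524
Number of splices: Total | 80268755
Number of reads mapped to multiple loci | 15132014
""")

for key in no_sjdb:
    before = as_number(no_sjdb[key])
    after = as_number(with_sjdb[key])
    print(f"{key}: {before:,.0f} -> {after:,.0f} (x{after / before:.2f})")

# How many reads changed category between the two runs.
unique_gain = as_number(with_sjdb["Uniquely mapped reads number"]) - as_number(
    no_sjdb["Uniquely mapped reads number"]
)
multi_loss = as_number(no_sjdb["Number of reads mapped to multiple loci"]) - as_number(
    with_sjdb["Number of reads mapped to multiple loci"]
)
print(f"unique gained: {unique_gain:,.0f}, multi-mappers lost: {multi_loss:,.0f}")
```

On the posted numbers, the splice count rises roughly 33-fold, and the gain in unique mappers (about 65.9M) is almost the same as the drop in multi-mappers (about 66.3M).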

It looks like a very large number of your reads had short junction overhangs, which could not be placed without sjdb, but were placed to annotated junctions with sjdb, thus increasing the number of splices. This jump, from 2.4M to 80M, is very large compared to our data; we usually see just a 30-70% increase in the number of splices when switching from -sjdb to +sjdb. I guess it could be explained by some strange interplay between the read length and expressed exon length in your samples. At the same time, the majority of the +sjdb spliced reads appear to have been multi-mappers without sjdb, and once they become spliced with +sjdb, they also acquire enough score difference with the next best locus to become unique mappers. Your third test is indeed consistent with this explanation: you increased the allowed score difference for multi-mappers from 1 to 40, and so you are reverting back to numbers similar to -sjdb. However, you should still see a lot of splices in the multi-mapping reads.
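The score-difference argument above can be illustrated with a toy Python model. This is a simplification under stated assumptions, not STAR's actual code: it assumes a read is reported at every locus whose alignment score is within --outFilterMultimapScoreRange of the best score, and the scores used here (174 and 180) are invented for the example.

```python
def classify(scores, score_range=1):
    """Classify a read from the alignment scores of its candidate loci.

    Toy model (an assumption, not STAR's implementation): every locus whose
    score is within `score_range` of the best score counts as a reported hit.
    """
    best = max(scores)
    n_loci = sum(1 for s in scores if s >= best - score_range)
    return "unique" if n_loci == 1 else "multi"


# Hypothetical read with a short overhang across a junction.
# Without sjdb the overhang cannot be placed, so the true locus scores
# no better than a competing locus.
print(classify([174, 174], score_range=1))   # two loci tie

# With sjdb the read is spliced at the annotated junction and aligns
# end to end, so the true locus pulls ahead of the competitor.
print(classify([180, 174], score_range=1))   # gap of 6 exceeds the range of 1

# Widening the range to 40 pulls the competing locus back in.
print(classify([180, 174], score_range=40))  # gap of 6 is within the range
```

In this model the three calls give multi, unique, and multi, which is the same pattern as the three runs reported in this thread.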

My official position :) is that it is always preferable to use the annotations (sjdb) for mapping. While the magnitude of the effect you are observing is unusually large, qualitatively it follows the expectations.

If you want to dig deeper into it for your dataset, I would look at a few reads that were mapped as multi-mappers without sjdb and became unique mappers with sjdb. We could try to understand how STAR made the decisions in the -sjdb vs +sjdb runs.
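One possible way to pull out such reads is sketched below in Python. It assumes both runs wrote SAM output carrying the NH:i tag (the number of loci reported for the read); the helper names and the tiny in-memory records are invented for illustration. In practice the lines would come from the SAM files of the two runs, ideally from a subsample, since this keeps one dictionary entry per read in memory.

```python
def read_hit_counts(sam_lines):
    """Map read name -> NH value (number of reported loci) from SAM lines."""
    counts = {}
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        for tag in fields[11:]:  # optional fields follow the 11 mandatory ones
            if tag.startswith("NH:i:"):
                counts[fields[0]] = int(tag[5:])
                break
    return counts


def multi_to_unique(sam_without, sam_with):
    """Names of reads that are multi-mappers in the first run and unique in the second."""
    before = read_hit_counts(sam_without)
    after = read_hit_counts(sam_with)
    return sorted(
        name for name, nh in after.items() if nh == 1 and before.get(name, 0) > 1
    )


# Made-up records, only to show the expected input shape.
demo_without = [
    "@HD\tVN:1.0",
    "readA\t0\tchr1\t100\t3\t50M\t*\t0\t0\t*\t*\tNH:i:2",
    "readA\t256\tchr5\t900\t3\t50M\t*\t0\t0\t*\t*\tNH:i:2",
    "readB\t0\tchr2\t200\t255\t50M\t*\t0\t0\t*\t*\tNH:i:1",
]
demo_with = [
    "@HD\tVN:1.0",
    "readA\t0\tchr1\t100\t255\t40M500N10M\t*\t0\t0\t*\t*\tNH:i:1",
    "readB\t0\tchr2\t200\t255\t50M\t*\t0\t0\t*\t*\tNH:i:1",
]

print(multi_to_unique(demo_without, demo_with))
```

The read names returned this way can then be looked up in both SAM files to compare the alignments side by side.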

Cheers

Alex

Alisha Holloway


Oct 24, 2013, 1:37:31 PM

to rna-...@googlegroups.com

Can you share your command line calls? Also, where did you find 2.3.1h? I only see 2.3.0e on the download page.

Thanks!

Alisha

Kamil Cygan


Dec 10, 2013, 2:45:58 AM

to rna-...@googlegroups.com

Hey Alisha,

You can find alpha versions of the software here: ftp://ftp2.cshl.edu/gingeraslab/tracks/STARrelease/Alpha/