STAR 2-pass generated fewer uniquely mapped reads compared to 1-pass

Dadi Gao

May 9, 2016, 11:32:10 AM

to rna-star

Hi,

Thanks for developing this fantastic tool! I'm a little confused about 2-pass and looking for some help.

I got 11617 novel junctions in the 1st pass. Then I generated a new genome index for the 2nd pass. I found that only 11616 new lines were added to sjdbList.out.tab, meaning one novel junction was "skipped". My first question is: what is the principle behind this?

After that, I re-mapped all the samples to this new genome, with exactly the same parameters as in the 1st pass. I ended up with slightly fewer uniquely mapped reads for each sample in the 2nd pass than in the 1st pass. I was expecting the new junctions to increase the chance of mapping; however, it seems the new junctions made previously unique alignments map to additional loci, reducing their uniqueness. Is this normal? Have I done something wrong, or how should I explain it?

P.S. I allowed 2 mismatches (--outFilterMismatchNmax 2), only reported uniquely mapped reads (--outFilterMultimapNmax 1), and used "--sjdbFileChrStartEnd -" because I had already built the reference for the 2nd pass. All other parameters were left at their defaults.
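
For readers following along, the manual 2-pass workflow described above can be sketched as three STAR invocations. This is only a sketch under stated assumptions: the file and directory names, the thread count, and the --sjdbOverhang value of 100 are placeholders that do not come from the post, and the optional GTF argument is hypothetical. The function only builds the command lines; it does not run STAR.

```python
import shlex


def star_two_pass_commands(genome_fasta, pass1_index, pass2_index, reads,
                           pass1_prefix="pass1_", pass2_prefix="pass2_",
                           threads=8, sjdb_overhang=100, gtf=None):
    """Build argv lists for a manual STAR 2-pass run (nothing is executed)."""
    # Mapping options shared by both passes, taken from the post.
    mapping_options = [
        "--runThreadN", str(threads),
        "--readFilesIn", *reads,
        "--outFilterMismatchNmax", "2",   # allow at most 2 mismatches
        "--outFilterMultimapNmax", "1",   # report uniquely mapped reads only
    ]
    # Pass 1: map against the original index; STAR writes <prefix>SJ.out.tab.
    pass1 = ["STAR", "--genomeDir", pass1_index,
             "--outFileNamePrefix", pass1_prefix, *mapping_options]
    # Re-generate the index with the pass-1 junctions inserted.
    regenerate = [
        "STAR", "--runMode", "genomeGenerate",
        "--genomeDir", pass2_index,
        "--genomeFastaFiles", genome_fasta,
        "--sjdbFileChrStartEnd", pass1_prefix + "SJ.out.tab",
        "--sjdbOverhang", str(sjdb_overhang),  # assumed value, not from the post
        "--runThreadN", str(threads),
    ]
    if gtf is not None:
        # Assumption: the annotation is supplied again when rebuilding.
        regenerate += ["--sjdbGTFfile", gtf]
    # Pass 2: identical mapping options, new index.
    pass2 = ["STAR", "--genomeDir", pass2_index,
             "--outFileNamePrefix", pass2_prefix, *mapping_options]
    return pass1, regenerate, pass2


if __name__ == "__main__":
    commands = star_two_pass_commands(
        "genome.fa", "index_pass1", "index_pass2",
        ["reads_1.fastq", "reads_2.fastq"])
    for command in commands:
        print(shlex.join(command))
```

Keeping the mapping options in one shared list makes it easy to verify that both passes really use the same parameters, which is what the comparison of uniquely mapped reads relies on.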

Thanks a lot,

Dadi

Alexander Dobin

May 9, 2016, 5:22:45 PM

to rna-star

Hi Dadi,

> Thanks for developing this fantastic tool! I'm a little confused about 2-pass and looking for some help.
> I got 11617 novel junctions in the 1st pass. Then I generated a new genome index for the 2nd pass. I found that only 11616 new lines were added to sjdbList.out.tab, meaning one novel junction was "skipped". My first question is: what is the principle behind this?

This looks like a bug. I will check it and release a patch - if necessary - tomorrow.

> After that, I re-mapped all the samples to this new genome, with exactly the same parameters as in the 1st pass. I ended up with slightly fewer uniquely mapped reads for each sample in the 2nd pass than in the 1st pass. I was expecting the new junctions to increase the chance of mapping; however, it seems the new junctions made previously unique alignments map to additional loci, reducing their uniqueness. Is this normal? Have I done something wrong, or how should I explain it?

This is normal behavior. Since in the 2nd pass the reference sequence "space" increases owing to the novel junctions, reads have a higher chance of mapping to more than one locus with a similar quality.

This is especially true for reads with very short junction overhangs.
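
One way to see this shift in practice is to compare the summary counts in the Log.final.out files of the two passes. The sketch below is an illustration, not part of STAR: it assumes the "label | value" layout of Log.final.out and the three label strings listed in LABELS, and the function names are hypothetical. With --outFilterMultimapNmax 1, reads that become multimappers would be expected to move out of the unique count and into one of the multi-locus counts.

```python
# Labels assumed to match the wording used in STAR's Log.final.out.
LABELS = (
    "Uniquely mapped reads number",
    "Number of reads mapped to multiple loci",
    "Number of reads mapped to too many loci",
)


def read_star_log(path):
    """Parse the 'label | value' lines of a Log.final.out file into a dict."""
    stats = {}
    with open(path) as handle:
        for line in handle:
            if "|" not in line:
                continue  # section headers and blank lines carry no value
            label, value = line.split("|", 1)
            stats[label.strip()] = value.strip()
    return stats


def pass_difference(log_pass1, log_pass2, labels=LABELS):
    """Return pass-2 minus pass-1 read counts for each label found in both logs."""
    first = read_star_log(log_pass1)
    second = read_star_log(log_pass2)
    return {
        label: int(second[label]) - int(first[label])
        for label in labels
        if label in first and label in second
    }
```

Calling `pass_difference("pass1_Log.final.out", "pass2_Log.final.out")` (placeholder file names) returns a dictionary of count changes; a negative value for the unique count paired with a positive value for a multi-locus count is consistent with the explanation above.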

Cheers

Alex

Dadi Gao

May 9, 2016, 10:48:09 PM

to rna-star

Hi Alex,

Thanks heaps for the explanation. That really makes sense now. May I also confirm the other concern please?

When building the human genome index for the first pass, I used an Ensembl GTF file and got sjdbList.fromGTF.out.tab and sjdbList.out.tab in the STAR reference index folder. The former contains about 100 more junctions than the latter. Is this because the latter is purely putative, based on the FASTA file only?

And during the 1st-pass mapping, I only gave the STAR reference folder, without any GTF file options. If I understand correctly, STAR automatically uses both sjdbList files to maximise the number of reference junctions if sjdbList.fromGTF.out.tab exists inside the folder. Am I right?

Best,

Dadi

Alexander Dobin

May 10, 2016, 5:10:54 PM

to rna-star

Hi Dadi,

These are insightful observations.

The sjdbList.fromGTF.out.tab file is the list of junctions extracted from the GTF file and trivially collapsed, i.e. junctions with identical chr/start/end/strand from alternative isoforms are collapsed.

sjdbList.out.tab is the list of junctions after more sophisticated collapsing. Namely, STAR collapses junctions that are indistinguishable because of microrepeats, and junctions with exactly the same coordinates on opposite strands.

Usually, very few of these strange junctions are annotated.

I think this is also why you see fewer "new" junctions in the 2nd pass than in the SJ.out.tab of the 1st pass: basically, one novel junction from the 1st pass is collapsed into another junction.
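
The two collapsing stages can be illustrated with a small sketch. This is not STAR's implementation: the function names are hypothetical, the microrepeat collapsing is left out entirely, and keeping the first-seen strand when two junctions share coordinates is an arbitrary choice made for the example.

```python
def collapse_identical(junctions):
    """Trivial collapse: drop exact duplicates of (chr, start, end, strand)."""
    # dict.fromkeys keeps the first occurrence and preserves input order.
    return list(dict.fromkeys(junctions))


def collapse_opposite_strands(junctions):
    """Keep one junction per (chr, start, end), ignoring the strand."""
    kept = {}
    for chrom, start, end, strand in junctions:
        # Assumption: the first strand seen wins; STAR's actual rule may differ.
        kept.setdefault((chrom, start, end), (chrom, start, end, strand))
    return list(kept.values())


if __name__ == "__main__":
    # Toy junctions: one duplicated across isoforms, one mirrored on "-".
    from_gtf = [
        ("1", 1000, 2000, "+"),
        ("1", 1000, 2000, "+"),
        ("1", 1000, 2000, "-"),
        ("2", 500, 900, "-"),
    ]
    trivially_collapsed = collapse_identical(from_gtf)
    fully_collapsed = collapse_opposite_strands(trivially_collapsed)
    print(len(from_gtf), len(trivially_collapsed), len(fully_collapsed))
```

In this toy input the trivial collapse leaves three junctions and the strand collapse leaves two, which mirrors how the second list can end up shorter than the first.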

If you send me the _STARpass1/SJ.out.tab and _STARgenome/sjdbList.out.tab, I can point out which junction that is.

> And during the 1st-pass mapping, I only gave the STAR reference folder, without any GTF file options. If I understand correctly, STAR automatically uses both sjdbList files to maximise the number of reference junctions if sjdbList.fromGTF.out.tab exists inside the folder. Am I right?

This is correct. STAR will use the GTF file in both the 1st and 2nd pass, and for the 2nd pass it will add the novel junctions from the 1st pass.

Cheers

Alex

Dadi Gao

May 10, 2016, 7:57:26 PM

to rna-star

Hi Alex,

Thanks very much for your help. Yes, the "missing" junction has a symmetric junction on the opposite strand. I forgot to include the strand column in my earlier matching, so I thought it was present; what I had actually matched was the junction on the opposite strand.
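
For anyone repeating this check, a minimal sketch of the comparison follows. It assumes that SJ.out.tab is tab-separated with chr, start, end and a numeric strand (0, 1, 2) in the first four columns, and that sjdbList.out.tab holds chr, start, end and a strand character (".", "+", "-"); the function names are hypothetical. The `use_strand` switch reproduces the pitfall described above: without the strand column the missing junction appears to be present.

```python
# Assumed mapping from the numeric strand in SJ.out.tab to the character
# used in sjdbList.out.tab.
STRAND_CODES = {"0": ".", "1": "+", "2": "-"}


def load_junctions(path, numeric_strand):
    """Read (chr, start, end, strand) keys from the first four columns."""
    keys = set()
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 4:
                continue  # skip blank or malformed lines
            strand = STRAND_CODES[fields[3]] if numeric_strand else fields[3]
            keys.add((fields[0], int(fields[1]), int(fields[2]), strand))
    return keys


def missing_junctions(sj_out_path, sjdb_list_path, use_strand=True):
    """Return junctions in SJ.out.tab that are absent from sjdbList.out.tab."""
    first_pass = load_junctions(sj_out_path, numeric_strand=True)
    inserted = load_junctions(sjdb_list_path, numeric_strand=False)
    if not use_strand:
        # Dropping the strand hides junctions mirrored on the opposite strand.
        first_pass = {key[:3] for key in first_pass}
        inserted = {key[:3] for key in inserted}
    return first_pass - inserted
```

Running `missing_junctions` once with `use_strand=True` and once with `use_strand=False` on the same pair of files shows whether a difference in counts comes from an opposite-strand duplicate.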