If building genome index with transcriptome GTF, is 2-Pass scheme needed/ relevant then?

2,226 views
Skip to first unread message

amit mandal

unread,
Apr 21, 2014, 8:11:41 AM4/21/14
to rna-...@googlegroups.com
hi,
 I am trying to implement the 2-Pass scheme and was wondering that in case I have built the Genome index with a GTF file then does 2-Pass do any quality addition (mapping accuracy etc.)?

I am working on cancer patient samples and there are a bunch of them which can't be treated as replicates. To implement the 2-Pass means building separate genome index (with the STAR out SJdb) for each sample.
If I have built the Genome index with the transcriptome GTF at the first place, what I might loose if I do just one round of alignment?

Similar question has had been asked by Daniel Fernandez in this thread -https://groups.google.com/forum/#!topic/rna-star/39AV3t3FDrk
but had not been answered.

Any suggestions or thoughts would be appreciated.

-
am

Alexander Dobin

unread,
Apr 22, 2014, 2:05:56 PM4/22/14
to rna-...@googlegroups.com
Hi Amit,

the 2-pass scheme increases sensitivity to unannotated splices, i.e. it will allow for more reads to be mapped to unannotated junctions.
If your annotations are comprehensive (such as Gencode), there will be only small overall effect on mapping accuracy.
However, if you are looking for low abundance novel junctions, 2-pass approach could be useful. It  is also useful for SNP calling, as GATK team found.

If you decide to go with 2-pass, you will not need to generate the 2-nd pass genome for all samples. 
A better approach is to combine novel junctions from the 1st pass of *all* samples, generate the 2-nd pass genome with this combined set, and then re-run all samples with this singular genome.

Cheers
Alex

JC Grenier

unread,
May 7, 2014, 11:33:17 AM5/7/14
to rna-...@googlegroups.com
Hi Alex,

I also have a question kind of related to that one. I've tested couple things with STAR and I don't quite understand the results that I'm getting. Here's what I've tested :

1) STAR with annotation GTF file coming from UCSC. (getting 94% of uniquely mapped reads)
2) Do a reference index using the SJ.out file coming from my freshly processed sample and the same UCSC GTF file than with the first analysis.
3) Re-running STAR (91% of uniquely mapped reads now..)

What can be wrong with this approach?

Thanks a lot for your help!

Alexander Dobin

unread,
May 7, 2014, 1:12:41 PM5/7/14
to rna-...@googlegroups.com
Hi JC,

one of the possible explanations is that 3% of the reads were mapping with a short overhang to GTF annotated junctions only in the 1st pass, while in the 2nd pass they map equally well to the novel junctions added to the database. It would be great if you could check that. The magnitude of the effect seems a bit high. You may try to do more stringent filtering on the junctions that are included in the database, e.g. require 2 or more reads support in the 1st pass. In the end it's a trade-off between sensitivity and precision - if you include more junctions, you increase precision, but reduce sensitivity.

Cheers
Alex

JC Grenier

unread,
May 7, 2014, 4:54:57 PM5/7/14
to rna-...@googlegroups.com
Hi Alex, thanks for this answer. I'm presently redoing my analysis by only including the splice junctions supported by at least 2 reads. It's increasing a little the amount of uniquely mapped reads but not a lot (going from 91.9% to 92.1%). However, if I'm looking intot the final.out file, the bigger difference is coming from the % of reads unmapped : too short, passing from 1.63% to 3.68%. So I suppose that the new junctions are inferring some soft clipping that are inferring too short reads? (I assume that based on the average mapped length : 191.22 vs 186.46). 

How can this approach be an advantage if I already trimmed my reads for adaptors or bad end qualities?

Thanks a lot,

Alexander Dobin

unread,
May 8, 2014, 12:12:34 PM5/8/14
to rna-...@googlegroups.com
Hi JC,

this is worrisome, I was expecting these reads to become multimappers, not unmapped - too short - reads.
Could you test/replicate this effect on a small subset of reads (a few million reads), and send it to me - I will have a closer look.

Cheers
Alex

JC Grenier

unread,
May 8, 2014, 2:10:41 PM5/8/14
to rna-...@googlegroups.com
Hi Alex, 

I think I found what was actually happening. I'm testing it, but it's probably due to the option --sjdbOverhang. 
For my original index, I put 75 to this value. While now, I was putting 99, as my reads are 100bp.

I suppose that will correct this issue.
I'll let you know as soon as I have the results.

Thanks

JC Grenier

unread,
May 8, 2014, 3:55:16 PM5/8/14
to rna-...@googlegroups.com
Finally this option doesn't change a thing for my problem.

I'll send you an example. How do you want me to proceed?

Alexander Dobin

unread,
May 9, 2014, 12:24:45 PM5/9/14
to rna-...@googlegroups.com
Hi JC,

I think the easiest way would be to cut ~100k reads from the middle of your fastq, map them to the normal genome and your 2-nd pass genome.
If the effect persists, it should give me a few thousands reads that are not mapped to the 2-nd pass genome.
Then please send me the 100k reads and the splice junctions file you used to generate 2-nd pass genome.

Cheers
Alex

JC Grenier

unread,
May 9, 2014, 1:22:56 PM5/9/14
to rna-...@googlegroups.com
Hi Alex, I already did the test but with the first 1M reads of my FASTQs. It should not make any difference right? Anyway the stats are the same so clearly not. How do I send that to you?

Thanks a lot for your help!

Alexander Dobin

unread,
May 14, 2014, 12:28:41 PM5/14/14
to rna-...@googlegroups.com
Hi JC,

I run the 2-pass scheme on your data and cannot replicate the behavior you observed. I get very small decrease of the unique mapping rate in the 2nd pass, and all of it due to increase in multi-mapping rate:
1                         Uniquely mapped reads % |     93.22%
2                         Uniquely mapped reads % |     93.13%
1          % of reads mapped to multiple loci |     4.49%
2          % of reads mapped to multiple loci |     4.61%
1               % of reads unmapped: too short |     2.23%
2               % of reads unmapped: too short |     2.20%
How did you setup your 2-pass runs?
Here is what I did:
1-pass run:
genome files:
STAR   --runMode genomeGenerate   --runThreadN 6   --genomeDir Genome1   --genomeFastaFiles hg19_fasta/chr{[0-9],[0-9][0-9],X,Y,M,[0-9]_gl*,[0-9][0-9]_gl*,Un_gl*}.fa   --sjdbGTFfile human_refGene.UCSC2013-03-06-11-23-03.hg19.gtf   --sjdbOverhang 100
mapping:
STAR --genomeDir Genome1/ --genomeLoad LoadAndKeep --readFilesIn SampleTest_R1_trimmed.1M.fastq.gz SampleTest_R2_trimmed.1M.fastq.gz --readFilesCommand zcat

2-pass run:
prepare splice junctions:
awk 'BEGIN {OFS="\t"; strChar[0]="."; strChar[1]="+"; strChar[2]="-";} {if($5>0){print $1,$2,$3,strChar[$4]}}' ../SampleTest.trimQ20.STAR-2pass.sjbd99.SJ.out.tab > SJin.tab
genome files:
STAR   --runMode genomeGenerate   --runThreadN 6   --genomeDir Genome2   --genomeFastaFiles hg19_fasta/chr{[0-9],[0-9][0-9],X,Y,M,[0-9]_gl*,[0-9][0-9]_gl*,Un_gl*}.fa   --sjdbGTFfile human_refGene.UCSC2013-03-06-11-23-03.hg19.gtf   --sjdbOverhang 100 --sjdbFileChrStartEnd SJin.tab
mapping:
STAR --genomeDir Genome2/ --genomeLoad LoadAndKeep --readFilesIn SampleTest_R1_trimmed.1M.fastq.gz SampleTest_R2_trimmed.1M.fastq.gz --readFilesCommand zcat

Cheers
Alex

JC Grenier

unread,
May 15, 2014, 1:27:28 PM5/15/14
to rna-...@googlegroups.com
Thanks Alex,

What can differ between my way and your way to do it is the reference used and the way you are taking the genomeFastaFiles. I only give one fasta with all the chromosomes in it. sjdbOverhang is at 99 instead of 100 as well. You also used the --genomeLoad LoadAndKeep, which I do not use.
Maybe the --outFilterIntronMotifs RemoveNoncanonicalUnannotated option as well is what is causing the problem..!

The version that I used is the 2.3.1y one.

Here's what I did :

1-pass run : 
Genome file :
STAR --runMode genomeGenerate --genomeDir $HOME/References/hg19/CEU/STAR --genomeFastaFiles $HOME/References/hg19/CEU/CEUref_hg19.fasta --runThreadN 12 --sjdbGTFfile $HOME/References/hg19/GTF/human_refGene.UCSC2013-03-06-11-23-03.hg19.gtf --sjdbOverhang 99

Sample Run:
STAR --genomeDir $HOME/References/hg19/CEU/STAR/ --readFilesIn SampleTest_R1_trimmed.1M.fastq.gz SampleTest_R2_trimmed.1M.fastq.gz --runThreadN 12 --readFilesCommand zcat --outSAMstrandField intronMotif --outFileNamePrefix SampleTest.trimQ20.STAR.i2. --outFilterIntronMotifs RemoveNoncanonicalUnannotated

2-pass :
awk '$7>=5' SampleTest.trimQ20.STAR.i2.SJ.out.tab > SampleTest.trimQ20.STAR.i2.SJ.cov5.out.tab
Genome file :
STAR --runMode genomeGenerate --genomeDir genomeStar_2pass_sjdb99_cov5 --genomeFastaFiles $HOME/References/hg19/CEU/CEUref_hg19.fasta --sjdbFileChrStartEnd SampleTest.trimQ20.STAR.i2.SJ.cov5.out.tab --sjdbGTFfile $HOME/References/hg19/GTF/human_refGene.UCSC2013-03-06-11-23-03.hg19.gtf --sjdbOverhang 99 --runThreadN 12

Sample Run :
STAR --genomeDir genomeStar_2pass --runThreadN 12 --readFilesIn SampleTest_R1_trimmed.1M.fastq.gz SampleTest_R2_trimmed.1M.fastq.gz --readFilesCommand zcat --outSAMstrandField intronMotif --outFileNamePrefix SampleTest.trimQ20.STAR-2pass. --outFilterIntronMotifs RemoveNoncanonicalUnannotated

Cheers,

JC

Alexander Dobin

unread,
May 29, 2014, 11:31:36 AM5/29/14
to rna-...@googlegroups.com
Hi JC,

I had a thought that maybe it's the absence of "strand" conversion that makes things go wrong, please see this post.
Can you try to do this
awk 'BEGIN {OFS="\t"; strChar[0]="."; strChar[1]="+"; strChar[2]="-";} {if($5>0){print $1,$2,$3,strChar[$4]}}' SampleTest.trimQ20.STAR.i2.SJ.out.tab > SampleTest.trimQ20.STAR.i2.SJ.cov5.out.tab
and see if the effect disappears?

Cheers
Alex

JC Grenier

unread,
Jun 11, 2014, 3:51:13 PM6/11/14
to rna-...@googlegroups.com
Sorry for the delay. I will try that for sure! Thanks for your help! 

I keep you posted.

JC Grenier

unread,
Jun 12, 2014, 3:34:52 PM6/12/14
to rna-...@googlegroups.com
Hi Alex,

It clearly seems to do something here when I'm applying your filter. I'm recovering 0.8% of the reads that were aligning at multiple locations and recovering 2.7% that were labelled as too short.

Thanks again for your precious help!
Reply all
Reply to author
Forward
0 new messages