The way I use 2-pass mode is it good ?

2,056 views
Skip to first unread message

darbot...@gmail.com

unread,
Mar 14, 2019, 10:36:26 AM3/14/19
to rna-star
Hello !

In order to find differential alternative spliced genes between my two conditions and (3 replicates for 2 conditions), I use the 2-pass protocol with STAR, and after that I use rMATS for the statistic part

Before to use rMATS , I want to be sure that the way I used the 2 pass mode was the right one (especially for the parameters and flag us.

So firstly, I generate genome index for the 1-pass in this way:

STAR --runThreadN 8 --runMode genomeGenerate --genomeDir pass1/Ref --genomeFastaFiles Reference/reference.fasta --sjdbGTFfile Reference/reference.gtf 


After that, I filter the SJ out files in this way:


gawk '$6==1 || ($6==0 && ($7>2))' pass1/*SJ.out.tab > SJ.filt.tab

(if the SJ come from the GTF file, I keep it ($6==1), and if the SJ is detected de novo, I keep it only if there at least 2 reads associated with this SJ, because I see that many off the SJ detected de novo have 1 or 2 reads and I think that can be associated with many false positives, am I wrong ?


Well, I rearrange the file in this way in order to keep only columns for generate a gif file:
awk 'BEGIN {OFS="\t"; strChar[0]="."; strChar[1]="+"; strChar[2]="-";} {if($5>0){print $1,$2,$3,strChar[$4]}}' SJ.filt.tab > SJ.out.tab

Finally, in order to eliminate duplicates SJ and to create the final db, I do this:

sort SJ.out.tab | uniq > SJ.out.tab.db


And after that, I run the 2nd mapping in this way:

nohup STAR --runThreadN 8 --genomeDir genomeForPass2/ \

--sjdbGTFfile Reference/reference.gtf \

--outFileNamePrefix pass2/Col_3 \

--readFilesIn TrimmingHS/Col_3_R1.trim.adapt.fastq.gz TrimmigHS/Col_3_R2.trim.adapt.fastq.gz \

--alignEndsType EndToEnd --sjdbOverhang 114 --readFilesCommand "gunzip -c" --outSAMtype BAM SortedByCoordinate &




My paired ends reads have 115bp length (I have trimmed them to a unique length because rMATS require a specific read length), so the "--sjdbOverhang 114" is the good parameter ? (I don't understand well this parameter but I have read that the better value is read length - 1).

I have also read that it better to use " --alignEndsType EndToEnd ", I don't understand why but my mapping stats are good (about 97% uniquely mapped reads), 


If you have any suggestion, thank's by advance to help me

Alexander Dobin

unread,
Mar 15, 2019, 10:04:59 AM3/15/19
to rna-star
Hi @darbotvincent

your pipeline looks good, though the description seems to be missing the 2-pass genome index generation?

A few comments:

1. These steps are not necessary, STAR will do them internally:
$ awk 'BEGIN {OFS="\t"; strChar[0]="."; strChar[1]="+"; strChar[2]="-";} {if($5>0){print $1,$2,$3,strChar[$4]}}' SJ.filt.tab > SJ.out.tab
$ sort SJ.out.tab | uniq > SJ.out.tab.db
You can feed the SJ.filt.tab directly in the 2-pass genome index generation:
$ STAR --runThreadN 8 --runMode genomeGenerate --genomeDir  genomeForPass2/ --genomeFastaFiles Reference/reference.fasta --sjdbGTFfile Reference/reference.gtf --sjdbFileChrStartEnd SJ.filt.tab

OR

You can bypass the genome generation, and use --sjdbFileChrStartEnd SJ.filt.tab direclty at the mapping stage, i.e.

$ STAR --runThreadN 8 --genomeDir pass1/Ref --sjdbFileChrStartEnd SJ.filt.tab \

--sjdbGTFfile Reference/reference.gtf \ do not need this option

--outFileNamePrefix pass2/Col_3 \

--readFilesIn TrimmingHS/Col_3_R1.trim.adapt.fastq.gz TrimmigHS/Col_3_R2.trim.adapt.fastq.gz \

--alignEndsType EndToEnd --sjdbOverhang 114 --readFilesCommand "gunzip -c" --outSAMtype BAM SortedByCoordinate 


2. You can concatenate SJ.filt.tab from all the samples, and generate one 2-pass genome for all samples. This should allow for more accurate comparisons between samples, however, if there are too many novel junctions, it may cause mapping slowdown and loss of unique mappers.

3. I do not recommend --alignEndsType EndToEnd for long RNA-seq and reads longer than ~50b. If you want to increase the stringency of mapping, I would recommend increasing --outFilterScoreMinOverLread from default 0.66 to 0.9 or even 0.95.

Cheers
Alex


On Thursday, March 14, 2019 at 10:36:26 AM UTC-4, 
Reply all
Reply to author
Forward
0 new messages