Phred64 encoding consideration in STAR

Dhirendra Kumar

Oct 24, 2013, 7:17:18 AM

to rna-...@googlegroups.com

Hi,

I am in the process of switching from TopHat to STAR for read alignment. In TopHat the Phred quality encoding needs to be specified, but I could not find a comparable parameter in STAR. Does STAR automatically determine the Phred encoding?

Another question relates to the genomeGenerate step.

As I am using publicly available data, the read lengths vary. Do I need to generate separate genome indexes with different --sjdbOverhang values?

How much would a general "--sjdbOverhang 100" setting cost in the case of short reads?

Alexander Dobin

Oct 24, 2013, 3:22:31 PM

to rna-...@googlegroups.com

Hi Dhirendra,

STAR does not need to know the Phred quality encoding; at the moment it does not actually use quality scores for mapping. It simply copies the quality strings into the SAM output. If you want to convert the quality scores that are output in SAM for some reason, you can use --outQSconversionAdd <positive or negative number>.
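To make the arithmetic behind such a conversion concrete: Phred+64 and Phred+33 differ only in the ASCII offset, so going from Phred+64 to Phred+33 means adding -31 to every quality character. The sketch below only illustrates that shift; `shift_quality` is a hypothetical helper, not STAR's code, and raising an error for out-of-range values is an assumption of this example rather than documented STAR behaviour.

```python
def shift_quality(qual: str, add: int) -> str:
    """Shift every ASCII quality character by `add` code points.

    Hypothetical helper that mirrors the idea of adding a fixed number
    to each quality value; it is not taken from STAR.
    """
    shifted = [ord(c) + add for c in qual]
    # Assumption of this sketch: refuse results outside printable ASCII.
    if any(v < 33 or v > 126 for v in shifted):
        raise ValueError("shifted quality falls outside printable ASCII (33-126)")
    return "".join(chr(v) for v in shifted)


# Phred+64 'h' is Q40 (104 - 64); in Phred+33, Q40 is 'I' (40 + 33).
print(shift_quality("hhhB", -31))
```

Under these assumptions, the equivalent STAR setting for Phred+64 input would be --outQSconversionAdd -31.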

Using --sjdbOverhang 100 for shorter read lengths is safe - I have never seen a case where it makes a lot of difference. It may reduce the speed a little for very short reads, and it may very slightly reduce sensitivity for reads mapping to highly repetitive regions.
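For readers who would still rather build a per-dataset index, the rule of thumb in the STAR manual is, to the best of my reading, read (mate) length minus 1. A minimal sketch of that calculation follows; `ideal_sjdb_overhang` and the dataset names are made up for illustration.

```python
def ideal_sjdb_overhang(read_lengths):
    """Rule-of-thumb overhang for one dataset: longest read (mate) length minus 1."""
    lengths = list(read_lengths)
    if not lengths or min(lengths) < 2:
        raise ValueError("need at least one read length of 2 or more")
    return max(lengths) - 1


# Hypothetical datasets with different read lengths.
datasets = {"sample_36bp": [36], "sample_80bp": [80], "sample_101bp": [101]}
for name, lengths in datasets.items():
    print(name, ideal_sjdb_overhang(lengths))
```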

Cheers

Alex

Dhirendra Kumar

Oct 24, 2013, 4:29:06 PM

to rna-...@googlegroups.com

Hi Alex,

Thanks a lot. I am simply amazed by how well STAR performs, especially in terms of speed.

I followed your post on the SEQanswers forum comparing TopHat/TopHat2 and STAR, but I could not find your final comment on it. I am trying to compare these two programs, especially in terms of splice junction discovery. It would be a great help if you could suggest optimal STAR parameters for discovering new splice variants from 80 bp HiSeq reads.

Alexander Dobin

Oct 25, 2013, 11:09:27 AM

to rna-...@googlegroups.com

Hi Dhirendra,

My final comment on the TopHat2 paper is not out yet... but (for real!) it will be out in the next few days.

Default parameters should work fine for 80b reads. Even if you are interested in the novel splices, I highly recommend using annotations at the genome generation step - that will make the detection of novel splices more accurate.

You can reduce --seedSearchStartLmax to 25 to increase sensitivity a bit. STAR also has a number of filters for the junctions that are output to SJ.out.tab; these parameters start with --outSJfilter*.

Ultimately, for the most sensitive novel junction discovery, I would recommend running STAR in 2-pass fashion.

It does not increase the number of detected novel junctions, but it allows more spliced reads mapping to novel junctions to be detected.

You would need to run the 1st pass with the usual parameters, then convert SJ.out.tab into the splice junction database file, which is used (together with the annotations) to generate a new genome index for STAR, and then run the 2nd pass of STAR with the new genome index. I have attached a simple 2-pass script that you can modify to fit your needs.
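As a rough illustration of the conversion step only (this is not the attached script), the sketch below turns SJ.out.tab rows into four-column junction lines of the form chromosome, start, end, strand. It assumes the SJ.out.tab layout as I read it in the STAR manual: column 4 encodes strand as 0/1/2 (undefined, +, -) and column 6 is 1 for annotated junctions. It also assumes the output is the tab-separated format accepted by --sjdbFileChrStartEnd. `sj_to_sjdb` is a hypothetical name; check all of this against the manual for your STAR version.

```python
# Assumed SJ.out.tab strand codes: 0 = undefined, 1 = "+", 2 = "-".
STRAND = {"0": ".", "1": "+", "2": "-"}


def sj_to_sjdb(lines, novel_only=False):
    """Convert SJ.out.tab lines to 'chr<TAB>start<TAB>end<TAB>strand' lines."""
    out = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed rows
        chrom, start, end, strand, _motif, annotated = fields[:6]
        if novel_only and annotated != "0":
            continue  # keep only junctions absent from the annotation
        out.append("\t".join((chrom, start, end, STRAND[strand])))
    return out


# Made-up rows, for illustration only.
example = [
    "chr1\t14830\t14969\t2\t2\t1\t12\t3\t38\n",
    "chr1\t20000\t21000\t0\t0\t0\t1\t0\t12\n",
]
for row in sj_to_sjdb(example):
    print(row)
```

The resulting lines would be written to a file and supplied, alongside the annotations, when regenerating the genome index for the 2nd pass.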

If you have many samples, you can collect the novel junctions from all the samples (SJ.out.tab files), possibly filter them for reliability, and merge them into one common set of novel junctions. Then you generate a new genome using the annotated junctions plus this common set of novel junctions, and re-run all the samples with the new genome - this would be the 2nd pass.
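The multi-sample merge can be sketched along the following lines. This is an illustrative outline under stated assumptions, not a recommended filter: it assumes column 6 of SJ.out.tab is 0 for unannotated junctions and column 7 is the number of uniquely mapping reads crossing the junction. The thresholds `min_unique_reads` and `min_samples`, and the function name `merge_novel_junctions`, are hypothetical choices.

```python
from collections import defaultdict

# Assumed SJ.out.tab strand codes: 0 = undefined, 1 = "+", 2 = "-".
STRAND = {"0": ".", "1": "+", "2": "-"}


def merge_novel_junctions(samples, min_unique_reads=3, min_samples=1):
    """Merge novel junctions from several samples into one filtered set.

    `samples` is an iterable with one entry per sample, each an iterable of
    SJ.out.tab lines. Returns sorted 'chr<TAB>start<TAB>end<TAB>strand' lines.
    """
    reads = defaultdict(int)  # unique reads summed over all samples
    seen = defaultdict(int)   # number of samples containing the junction
    for lines in samples:
        in_this_sample = set()
        for line in lines:
            f = line.rstrip("\n").split("\t")
            if len(f) < 7 or f[5] != "0":
                continue  # skip malformed rows and annotated junctions
            key = (f[0], int(f[1]), int(f[2]), STRAND[f[3]])
            reads[key] += int(f[6])
            in_this_sample.add(key)
        for key in in_this_sample:
            seen[key] += 1
    kept = [k for k in reads
            if reads[k] >= min_unique_reads and seen[k] >= min_samples]
    return ["\t".join(map(str, k)) for k in sorted(kept)]


# Made-up rows for two samples, for illustration only.
s1 = ["chr1\t100\t200\t1\t1\t0\t2\t0\t30\n", "chr1\t300\t400\t2\t1\t1\t50\t0\t40\n"]
s2 = ["chr1\t100\t200\t1\t1\t0\t2\t0\t25\n", "chr2\t10\t90\t0\t0\t0\t1\t0\t12\n"]
for row in merge_novel_junctions([s1, s2]):
    print(row)
```

In practice each sample's lines would come from reading its SJ.out.tab file, and the merged list would be written out once and reused for the single new genome index.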

Cheers

Alex

STAR_2pass.sh
