STAR recommended parameters for S cerevisiae (yeast) genome

Marco Di Stefano

unread,

May 2, 2016, 12:23:36 PM5/2/16

to rna-star

Hi,

which is the set of parameters to use for mapping RNA-seq data of S cerevisiae?

Thanks,

Marco

Alexander Dobin

unread,

May 3, 2016, 12:51:25 PM5/3/16

to rna-star

Hi Marco,

this is what I would recommend for the first run:

at the genome generation stage:

--genomeSAindexNbases 10

at the mappings stage:

--alignIntronMin <minimum expected intron size> --alignIntronMax <maximum expected intron size>

You can estimate expected intron size distribution from the annotated junctions. I would increase the range compared to the annotations to allow for detection of short and long novel introns.

Cheers

Alex

Marco Di Stefano

unread,

May 4, 2016, 6:32:35 AM5/4/16

to rna-...@googlegroups.com

Hi Alex,

Thanks for your reply. This is the first time I do this kind of analysis, so I will be pedantic to be sure that I'm implementing it correctly.

I ask you to check the next lines and tell me if they look reasonable to you.

At the genome generation stage, I use the command:

STAR --runMode genomeGenerate --runThreadN 8 --genomeDir . --genomeFastaFiles Saccharomyces_cerevisiae.R64-1-1.dna.chromosome.fa --sjdbGTFfile Saccharomyces_cerevisiae.R64-1-1.81.gtf --outFileNamePrefix Saccharomyces_cerevisiae.R64-1-1.81.list_of_transcripts --genomeSAindexNbases 10

which should implement your suggestion.

To set the mapping stage, I estimate the expected intron size distribution looking at the file sjdbList.out.tab.

I Run the commands:

cat sjdbList.out.tab | awk '{print sqrt(($2-$3)*($2-$3))}' | sort -k 1n | head -1

0

cat sjdbList.out.tab | awk '{print sqrt(($2-$3)*($2-$3))}' | sort -k 1n | tail -1

2482

So my first guess for the command --alignIntronMin <minimum expected intron size> --alignIntronMax <maximum expected intron size>

is

--alignIntronMin 0 --alignIntronMax 5000

because the average gene size in S cerevisiae is about 1000 (estimated looking at the average region spanned by the loci annotated in the

.gtf file).

STAR --runThreadN 12 --outSAMunmapped Within --genomeDir . --readFilesIn XXXread1XXX XXXread2XXX --outFileNamePrefix XXXoutputXXX --readFilesCommand zcat --quantMode TranscriptomeSAM --limitBAMsortRAM 4000000000 --outFilterType BySJout --outFilterMultimapNmax 1 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 --alignIntronMin 0 --alignIntronMax 5000 --alignMatesGapMax 2000 --outTmpDir $TMPDIR/STAR_ --outSAMattrIHstart 0 --outSAMattributes NH HI NM MD AS nM --outSAMtype BAM Unsorted SortedByCoordinate

I left the rest of the parameters to be the default ones

I'm analysing 42 samples. The results are summarised in the attached file. I provide the average (Min = Max =) per field.

Thanks a lot,

Marco

averaged_final.out

Alexander Dobin

unread,

May 4, 2016, 6:23:39 PM5/4/16

to rna-star

Hi Marco,

your parameters look reasonable.

I would use at least 4 for --alignIntronMin , I do not think the very small introns in the annotations are real https://groups.google.com/d/msg/rna-star/LqxVCE34464/GBordrd6AQAJ.

I would increase --alignMatesGapMax to at least --alignIntronMin since the gap between mates may contain a splice junction.

The mapping stats look good on average. 68% of unique mappers for the worst sample is on the low inside, and same for the 92 bases mapped out of 98.

Since you used --outFilterMultimapNmax 1, all multi-mappers map "too many times", so you have high % of reads mapped to too many loci

Cheers

Alex

Marco Di Stefano

unread,

May 4, 2016, 8:16:23 PM5/4/16

to rna-...@googlegroups.com

Hi Alex,

I run the analysis, following your suggestions:

--alignIntronMin 10

--alignMatesGapMax 5000

obtaining very similar results wrt to the previous analysis (see attached file).

The case in which are found only 68% of uniquely mapped reads is still there, but it is the only case below 75%.

I'm already happy with these parameters.

Just for curiosity, a part from increasing --outFilterMultimapNmax 1, are there other parameters that I could tune to try to increase this percentage?

(Log.final.out file in attachment)

Thanks,

Marco

averaged_final_new_set.out

Log.final.out

Alexander Dobin

unread,

May 5, 2016, 9:42:40 AM5/5/16

to rna-star

Hi Marco,

changing these parameters would only change a very small % of alignments, not affecting much the summary statistics - exactly as you see it.

The sample with 68% mapping rate looks OK for the mapped length and mismatch error rate, so the sequencing quality is unlikely to be the cause.

The next usual suspect is contamination - you could try to BLAST a few of the unmapped reads to see where they map.

Cheers

Alex

Marco Di Stefano

unread,

May 5, 2016, 9:47:59 AM5/5/16

to rna-star

Hi Alex,

Yes I'll try to use blast and check this.

I consider tho conversation completed.

Thanks Alexander for your time and very useful suggestions,

Marco

Marco Di Stefano

unread,

May 5, 2016, 9:48:20 AM5/5/16

to rna-...@googlegroups.com

Hi Alex,

Yes I'll try to use blast and check this.

I consider the conversation completed.

Thanks a lot for your time and very useful suggestions,

Marco

Reply all

Reply to author

Forward