STAR recommended parameters for S cerevisiae (yeast) genome

1,446 views
Skip to first unread message

Marco Di Stefano

unread,
May 2, 2016, 12:23:36 PM5/2/16
to rna-star
Hi,

which is the set of parameters to use for mapping RNA-seq data of S cerevisiae?

Thanks,
Marco

Alexander Dobin

unread,
May 3, 2016, 12:51:25 PM5/3/16
to rna-star
Hi Marco,

this is what I would recommend for the first run:

at the genome generation stage:
--genomeSAindexNbases 10
at the mappings stage:
--alignIntronMin   <minimum expected intron size> --alignIntronMax <maximum expected intron size>

You can estimate expected intron size distribution from the annotated junctions. I would increase the range compared to the annotations to allow for detection of short and long novel introns.

Cheers
Alex

Marco Di Stefano

unread,
May 4, 2016, 6:32:35 AM5/4/16
to rna-...@googlegroups.com
Hi Alex,

Thanks for your reply. This is the first time I do this kind of analysis, so I will be pedantic to be sure that I'm implementing it correctly.
I ask you to check the next lines and tell me if they look reasonable to you.

At the genome generation stage, I use the command:

STAR --runMode genomeGenerate --runThreadN 8 --genomeDir . --genomeFastaFiles Saccharomyces_cerevisiae.R64-1-1.dna.chromosome.fa --sjdbGTFfile Saccharomyces_cerevisiae.R64-1-1.81.gtf --outFileNamePrefix Saccharomyces_cerevisiae.R64-1-1.81.list_of_transcripts --genomeSAindexNbases 10


which should implement your suggestion.

To set the mapping stage, I estimate the expected intron size distribution looking at the file sjdbList.out.tab. 
I Run the commands:


cat sjdbList.out.tab | awk '{print sqrt(($2-$3)*($2-$3))}' | sort -k 1n | head -1

0

cat sjdbList.out.tab | awk '{print sqrt(($2-$3)*($2-$3))}' | sort -k 1n | tail -1

2482


So my first guess for the command --alignIntronMin   <minimum expected intron size> --alignIntronMax <maximum expected intron size>
is
--alignIntronMin 0 --alignIntronMax 5000

because the average gene size in S cerevisiae is about 1000 (estimated looking at the average region spanned by the loci annotated in the
.gtf file).

STAR --runThreadN 12 --outSAMunmapped Within --genomeDir . --readFilesIn XXXread1XXX XXXread2XXX --outFileNamePrefix XXXoutputXXX --readFilesCommand zcat --quantMode TranscriptomeSAM --limitBAMsortRAM 4000000000 --outFilterType BySJout --outFilterMultimapNmax 1 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 --alignIntronMin 0 --alignIntronMax 5000 --alignMatesGapMax 2000 --outTmpDir $TMPDIR/STAR_ --outSAMattrIHstart 0 --outSAMattributes NH HI NM MD AS nM --outSAMtype BAM Unsorted SortedByCoordinate 


I left the rest of the parameters to be the default ones

I'm analysing 42 samples. The results are summarised in the attached file. I provide the average (Min = Max =) per field.



Thanks a lot,
Marco 
averaged_final.out

Alexander Dobin

unread,
May 4, 2016, 6:23:39 PM5/4/16
to rna-star
Hi Marco,

your parameters look reasonable.
I would use at least 4 for --alignIntronMin , I do not think the very small introns in the annotations are real https://groups.google.com/d/msg/rna-star/LqxVCE34464/GBordrd6AQAJ.
I would increase --alignMatesGapMax to at least --alignIntronMin since the gap between mates may contain a splice junction.

The mapping stats look good on average. 68% of unique mappers for the worst sample is on the low inside, and same for the 92 bases mapped out of 98.
Since you used --outFilterMultimapNmax 1, all multi-mappers map "too many times", so you have high % of reads mapped to too many loci 

Cheers
Alex

Marco Di Stefano

unread,
May 4, 2016, 8:16:23 PM5/4/16
to rna-...@googlegroups.com
Hi Alex,

I run the analysis, following your suggestions:

--alignIntronMin               10
--alignMatesGapMax  5000

obtaining very similar results wrt to the previous analysis (see attached file).
The case in which are found only 68% of uniquely mapped reads is still there, but it is the only case below 75%.

I'm already happy with these parameters. 
Just for curiosity, a part from increasing --outFilterMultimapNmax 1, are there other parameters that I could tune to try to increase this percentage?
(Log.final.out file in attachment)


Thanks,
Marco


averaged_final_new_set.out
Log.final.out

Alexander Dobin

unread,
May 5, 2016, 9:42:40 AM5/5/16
to rna-star
Hi Marco,

changing these parameters would only change a very small % of alignments, not affecting much the summary statistics - exactly as you see it.

The sample with 68% mapping rate looks OK for the mapped length and mismatch error rate, so the sequencing quality is unlikely to be the cause.
The next usual suspect is contamination - you could try to BLAST a few of the unmapped reads to see where they map.

Cheers
Alex

Marco Di Stefano

unread,
May 5, 2016, 9:47:59 AM5/5/16
to rna-star
Hi Alex,

Yes I'll try to use blast and check this.

I consider tho conversation completed.

Thanks Alexander for your time and very useful suggestions,
Marco

Marco Di Stefano

unread,
May 5, 2016, 9:48:20 AM5/5/16
to rna-...@googlegroups.com
Hi Alex,


Yes I'll try to use blast and check this.
I consider the conversation completed.



Thanks a lot for your time and very useful suggestions,
Marco
Reply all
Reply to author
Forward
0 new messages