Getting started with STAR and single-ended reads; getting output usable for differential expression


Adam Cornwell

Jul 18, 2013, 5:55:49 PM
to rna-...@googlegroups.com
I'm just getting started here with STAR, and with high-throughput sequencing data in general, after working with microarrays for some time. I have eight mouse samples which were run on a HiSeq with 1x100 reads, and I'm looking to align them for differential expression analysis downstream.
Are there any suggested parameters, or should the defaults be good enough? What might be some telltale signs in the results that the parameters should be changed?

I got STAR to run on all eight samples by concatenating the full paths with ',' which is pretty messy, and also noticed that the output appears to be in one large SAM file. I was expecting individual aligned files, since they're different samples. What am I missing here? Is that expected?

Thanks!

James Blachly

Jul 19, 2013, 2:55:52 PM
to Adam Cornwell, rna-...@googlegroups.com
Dear Adam

You'll need to run STAR 8 separate times. Once for each sample.


Adam Cornwell

Jul 22, 2013, 12:35:59 PM
to rna-...@googlegroups.com, Adam Cornwell
That explains why it seemed like everyone posting about multiple samples was using scripts. I don't recall the documentation being explicit about that, but thanks.

Alexander Dobin

Jul 22, 2013, 2:38:32 PM
to rna-...@googlegroups.com, Adam Cornwell
Hi Adam,

James was absolutely right: if you want your alignments in separate files, you need to run STAR separately on each input file.
Note that, in addition to the alignments (Aligned.out.sam), STAR produces a number of other files. You need to run STAR in a fresh directory for each input, or use a distinct --outFileNamePrefix for each run.
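
A minimal sketch of the "one run per sample" approach described above. The genome directory, the FASTQ names and the star_out/<sample>/ layout are placeholders, and the helper functions run and align_each are hypothetical names, not part of STAR. It assumes uncompressed files ending in .fastq. With DRY_RUN=1 (the default here) the commands are only printed.

```shell
#!/usr/bin/env bash
# Sketch: one STAR run per sample, each with its own --outFileNamePrefix.
# All paths and sample names below are placeholders (assumptions).

# With DRY_RUN=1 (the default here) commands are printed, not executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Align every FASTQ listed after the first two arguments, one run per file.
align_each() {
    local genome_dir="$1" out_root="$2"
    shift 2
    local fastq sample
    for fastq in "$@"; do
        sample="$(basename "$fastq" .fastq)"
        # Create the per-sample directory that the prefix points into.
        run mkdir -p "$out_root/$sample"
        run STAR --genomeDir "$genome_dir" \
                 --readFilesIn "$fastq" \
                 --runThreadN 4 \
                 --outFileNamePrefix "$out_root/$sample/"
    done
}

align_each /path/to/genome/dir star_out sample1.fastq sample2.fastq
```

Setting DRY_RUN=0 would execute the printed commands instead; each sample then gets its own Aligned.out.sam and log files under its own directory.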

You can start with the default parameters if you are working with mammalian genomes. For non-mammalian genomes the most important options to tweak are --alignIntronMin, --alignIntronMax and --alignMatesGapMax.
I highly recommend using annotation .gtf files when generating the genome files. Another recommended option that is not switched on by default is --outFilterType BySJout; it will reduce the number of "spurious" junctions.
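
As a rough illustration of how these options might be added to a mapping command: the helper extra_opts is hypothetical, the limits 20 and 5000 are placeholder values rather than recommendations, and reusing the intron maximum for --alignMatesGapMax is only an assumption made for this sketch (that option matters for paired-end reads only).

```shell
#!/usr/bin/env bash
# Sketch: build the optional part of a STAR mapping command line.
# --outFilterType BySJout is always added; the intron/gap limits are added
# only when both values are given (e.g. for a non-mammalian genome).
extra_opts() {
    local intron_min="${1:-}" intron_max="${2:-}"
    printf '%s' '--outFilterType BySJout'
    if [ -n "$intron_min" ] && [ -n "$intron_max" ]; then
        # Assumption: the mate gap limit is set to the same value as the intron maximum.
        printf ' --alignIntronMin %s --alignIntronMax %s --alignMatesGapMax %s' \
            "$intron_min" "$intron_max" "$intron_max"
    fi
    printf '\n'
}

# Mammalian genome: defaults plus BySJout.
echo "STAR --genomeDir /path/to/genome/dir --readFilesIn reads.fastq $(extra_opts)"
# Hypothetical small-intron genome: placeholder limits, not recommendations.
echo "STAR --genomeDir /path/to/genome/dir --readFilesIn reads.fastq $(extra_opts 20 5000)"
```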

Cheers
Alex

Adam Cornwell

Jul 24, 2013, 6:01:13 PM
to rna-...@googlegroups.com, Adam Cornwell
Since I'm currently running mouse samples with 1x100 reads, is there any reason to create a genome, or should the provided pre-constructed one be adequate? So far I've been working with the provided genome file. Thanks!

Alexander Dobin

Jul 25, 2013, 2:02:28 PM
to rna-...@googlegroups.com, Adam Cornwell
Hi Adam,

I would recommend creating your own genome; that way you will know for sure which reference sequences you are using. Also, I highly recommend using annotations for mapping, and annotations change quite often (e.g. ENSEMBL updates its annotations every 3 months). A few things to think about before generating the genome:
1. Which assembly do you want to use (mm9 vs mm10)?
2. Should you include non-chromosomal scaffolds? I recommend it, especially for total RNA samples, since we found lately that a significant number of reads may map to rRNA loci on the scaffolds, at least for the human genome.
3. Which annotations should you use (ENSEMBL, UCSC genes, RefSeq, etc.)?

Generating your own genome files is easy; the basic command is:

STAR --genomeDir /path/to/genome/dir/ --runMode genomeGenerate --genomeFastaFiles  /path/to/genome1.fa  /path/to/genome2.fa …  --sjdbGTFfile /path/to/annotation.gtf --sjdbOverhang 100 --runThreadN 4

--sjdbOverhang <N> should ideally equal the read (mate) length minus 1 (so 99 for 1x100 reads), but a generic value of 100 also works.
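
If the read length is not known in advance, the "length minus 1" value could be derived from the reads themselves. This is only a sketch: sjdb_overhang is a hypothetical helper, it assumes an uncompressed four-line-per-record FASTQ, and it looks only at the first 10000 reads by default.

```shell
#!/usr/bin/env bash
# Sketch: print (longest read length - 1) over the first N records of a FASTQ.
# Assumes an uncompressed FASTQ with exactly four lines per record.
sjdb_overhang() {
    local fastq="$1" n_reads="${2:-10000}"
    head -n "$((n_reads * 4))" "$fastq" |
        awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
             END { print max - 1 }'
}

# Tiny demo on a made-up two-read file (longest read is 10 bases).
demo_fastq="$(mktemp)"
printf '@r1\nACGTACGTAC\n+\nIIIIIIIIII\n@r2\nACGTA\n+\nIIIII\n' > "$demo_fastq"
sjdb_overhang "$demo_fastq"   # prints 9
rm -f "$demo_fastq"
```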
Note that the chromosome names in the genome FASTA files and in the annotation .gtf file must agree.
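
One way to check that the names agree before generating the genome is to list every sequence name used in the GTF that has no matching FASTA header. This is a sketch with a hypothetical helper name and made-up demo files; a typical mismatch is "chr1"-style names in one file and "1"-style names in the other.

```shell
#!/usr/bin/env bash
# Sketch: print GTF sequence names (column 1) that do not appear as
# sequence names in the FASTA headers. No output means every name matched.
gtf_names_missing_from_fasta() {
    local fasta="$1" gtf="$2"
    awk '
        NR == FNR {                 # first file: the FASTA
            if (/^>/) { sub(/^>/, "", $1); seen[$1] = 1 }
            next
        }
        /^#/ { next }               # skip GTF comment lines
        !($1 in seen) && !reported[$1]++ { print $1 }
    ' "$fasta" "$gtf"
}

# Demo with made-up files: the GTF uses "chr2", which the FASTA lacks.
demo_fa="$(mktemp)"; demo_gtf="$(mktemp)"
printf '>1 dna\nACGT\n>MT dna\nAC\n' > "$demo_fa"
printf '1\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g1";\nchr2\tsrc\texon\t1\t2\t.\t+\t.\tgene_id "g2";\n' > "$demo_gtf"
gtf_names_missing_from_fasta "$demo_fa" "$demo_gtf"   # prints chr2
rm -f "$demo_fa" "$demo_gtf"
```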

Cheers
Alex

Adam Cornwell

Jul 29, 2013, 1:24:44 PM
to rna-...@googlegroups.com, Adam Cornwell
I'm going to try to build a genome with GRCm38 from Ensembl. I was going to work from the individual chromosome FASTAs (for 1-19 plus X, Y, MT, and nonchromosomal), but then I noticed that the Ensembl-based genome you provided was built with the toplevel file. Is there any downside to using the toplevel file instead? It seems like it could only be beneficial for most purposes, since it includes the patches and the normal chromosome files don't?
Thank you

Alexander Dobin

Jul 29, 2013, 4:48:56 PM
to rna-...@googlegroups.com, Adam Cornwell
Hi Adam,

My recommendation is not to use the patches/haplotypes, since they add sequence variants but not new sequences. In the latest ENSEMBL releases the toplevel files have grown very large.
Including non-chromosomal scaffolds (GL in ENSEMBL releases) is important. Some discussion about this is in this post.
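
To see what a given FASTA file actually contains (chromosomes, scaffolds, patches) before building a genome from it, the sequence names and lengths can be listed. This is a sketch; fasta_seq_lengths is a hypothetical helper and the demo sequence names are made up.

```shell
#!/usr/bin/env bash
# Sketch: print "name<TAB>length" for every sequence in a FASTA file.
# The name is the first whitespace-delimited word of each header line.
fasta_seq_lengths() {
    awk '
        /^>/ {
            if (name != "") print name "\t" len
            name = substr($1, 2)
            len = 0
            next
        }
        { len += length($0) }
        END { if (name != "") print name "\t" len }
    ' "$1"
}

# Demo on a made-up two-sequence file.
demo_fa="$(mktemp)"
printf '>1 dna:chromosome\nACGT\nAC\n>GL_example dna:scaffold\nACG\n' > "$demo_fa"
fasta_seq_lengths "$demo_fa"
rm -f "$demo_fa"
```

Summing the second column for the non-chromosomal names gives a quick idea of how much sequence the scaffolds add.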

Cheers
Alex

Adam Cornwell

Jul 29, 2013, 5:58:35 PM
to rna-...@googlegroups.com, Adam Cornwell
(last question in this thread, since I think I'm getting the hang of this...)
So the current recommendation would be to do what I was thinking about first and build the genome with the FASTA files for chromosomes 1-19, X, Y, MT, and nonchromosomal (in the case of mouse)?

Thanks for all the help!
Adam Cornwell

Alexander Dobin

Jul 29, 2013, 7:03:11 PM
to rna-...@googlegroups.com, Adam Cornwell
Yes. The link in the previous post did not work; here is the post that discusses it: https://groups.google.com/d/msg/rna-star/1ngCYlgAbow/g5B5g83Vim8J
I cannot connect to the ENSEMBL web site to check what they have for the latest mouse release.
As Shawn pointed out, for the latest releases they should have a *dna.primary.fa file that contains only chromosomes 1-19, X, Y, MT, and the nonchromosomal GL contigs. This is my current recommendation.

N Far

Jan 19, 2015, 1:11:00 PM
to rna-...@googlegroups.com, cornwe...@gmail.com
Hi Alexander,

I have a question in this regard related to mm9 genome. When trying to download the .fasta files for Ensembl mm9, there are no *dna.primary.fa files (ftp://ftp.ensembl.org/pub/release-67/fasta/mus_musculus/dna/). I can see the list of *.dna.chromosome*.fa for 1-19, X, Y and MT, but don't see any file that appears to include "nonchromosomal GL contigs" info. 
Do you suggest using only these files to build the STAR genome?
Also, there used to be some default genomes available to download from STAR website; have they been removed?

Thank You,
Noushin

Alexander Dobin

Jan 22, 2015, 3:43:54 PM
to rna-...@googlegroups.com, cornwe...@gmail.com
Hi Noushin,

For this (old) ENSEMBL release you can use the "toplevel" file, which combines both chromosomes and non-chromosomal contigs; the latter contain only 62 MB of sequence.
Note that in later releases there is a primary_assembly file, e.g.:

You can find a few STAR genomes here:

Cheers
Alex