Need recommendations for mouse sequence/annotation files to use for building index.

kows...@gmail.com

Feb 17, 2015, 9:54:35 AM
to rna-...@googlegroups.com

Hello everyone,

 

I apologize in advance for what I believe is going to be a rather silly question, but better safe than sorry. Brief background: I am starting to use STAR for DE analysis of mouse cell lines that received different treatments. However, I am not sure which mouse genome sequence and annotation files to use to fully satisfy the STAR manual's recommendations. If I understood correctly, STAR Manual v2.4.0.1 suggests that the sequence file should contain chromosomal DNA, mitochondrial DNA and scaffolds, and that GENCODE annotations of those sequences are recommended. Also, Alex suggests that patches and alternative haplotypes should not be included in the genome (Manual, and http://seqanswers.com/forums/showthread.php?t=27470&page=5).

Since GENCODE has both the annotations and sequence files for mouse, the most obvious thing to do was to use those. It would seem that the latest mouse genome sequence at GENCODE is ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_mouse/release_M4/GRCm38.p3.genome.fa.gz, and the corresponding GTF file is ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_mouse/release_M4/gencode.vM4.chr_patch_hapl_scaff.annotation.gtf.gz. However, in addition to chromosomal, mitochondrial and scaffold DNA, these files apparently also contain patches and haplotypes, and by looking at either the sequence file or the corresponding annotation file I am unable to differentiate between them. I am sure I am missing something obvious, but how can I tell which is which, so that I can remove them from the sequence and annotation files and keep only what is recommended for building the genome index?

 

Thanks to everyone in advance for reading and helping out.

Alexander Dobin

Feb 25, 2015, 3:40:57 PM
to rna-...@googlegroups.com
Hi,

I have compared GENCODE files with those provided by NCBI:
Full genome:
In these files there is a clear identification of the contig types in the reference names, with *_alt and *_patch.
The second file is the one that I would recommend for standard RNA-seq mapping.

However, the GENCODE FASTA and GTF files that you mention use a somewhat different set of contigs: they exclude the alternative contigs, but include the patches.
Moreover, they use a different naming convention for the contigs, so it's not possible to use their GTF file with NCBI fasta files.

The patch contigs in the GENCODE GRCm38.p3.genome.fa file can be recognized by the 2nd field of the sequence name, which has the form CHR_*, e.g. CHR_MG132_PATCH.
So I think the best option would be to remove these patch contigs from the GENCODE GRCm38.p3.genome.fa file, and then use it with the GENCODE GTF file, which does not need to be modified.
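
The filtering step suggested above could be sketched roughly as follows. This is only a minimal sketch, assuming that patch contigs are exactly those records whose second whitespace-separated header field starts with CHR_ (as described above). The function name `drop_patch_contigs` and the toy records are made up for illustration, so it is worth checking the headers of the actual download before relying on it.

```python
def drop_patch_contigs(lines):
    """Yield FASTA lines, skipping every record whose 2nd header field starts with 'CHR_'."""
    keep = True
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            # Assumption: patch contigs carry a 2nd field such as CHR_MG132_PATCH.
            keep = not (len(fields) > 1 and fields[1].startswith("CHR_"))
        if keep:
            yield line


# Toy input; these headers and sequences are invented for illustration only.
toy_fasta = [
    ">chr1 1\n", "ACGTACGT\n",
    ">contigX CHR_MG132_PATCH\n", "TTTTGGGG\n",
    ">chrM MT\n", "GGCCGGCC\n",
]
filtered = list(drop_patch_contigs(toy_fasta))
print("".join(filtered), end="")
```

For the real genome, the same generator can be fed an open handle on the decompressed FASTA file and its output written to a new file with `writelines`, so the whole genome never has to be held in memory.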

Cheers
Alex

kows...@gmail.com

Mar 11, 2015, 5:52:10 AM
to rna-...@googlegroups.com
Hello Alex,

Thanks for the detailed response and all the work you've done. I will certainly do as you suggested. Also, I understand that you have a lot on your plate and I hope this doesn't sound too pushy, but I think it would help new and inexperienced users if future versions of the manual included direct links to suggested references and annotations for the most commonly used genomes and types of analyses (with brief notes on any further modifications needed for working with STAR).

Once again, thanks a bunch for your help.

Best regards,

Bero.

Alexander Dobin

Mar 15, 2015, 11:52:14 AM
to rna-...@googlegroups.com
Hi Bero,

this will definitely be very useful, and I will try to move it up my TODO list. I would also like to encourage everyone to share their experiences with various genomes. I should probably set up a wiki.

Cheers
Alex

kows...@gmail.com

Jun 7, 2015, 3:31:34 PM
to rna-...@googlegroups.com
Hi all,
 
just a quick note in case anyone's interested. It looks like, in the new M5 release, GENCODE now offers both FASTA and GTF files for the primary assembly (chromosomes + scaffolds only).
Info:
http://www.gencodegenes.org/mouse_releases/current.html
DL links:
ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_mouse/release_M5/GRCm38.primary_assembly.genome.fa.gz
ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_mouse/release_M5/gencode.vM5.primary_assembly.annotation.gtf.gz
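
With the primary-assembly files, building the index might look roughly like the sketch below. It only assembles a STAR `--runMode genomeGenerate` command line and does not run it. The helper name `build_star_index_cmd`, the output directory, the thread count and the overhang value are placeholders, and both files are assumed to have been downloaded and decompressed first.

```python
def build_star_index_cmd(genome_dir, fasta, gtf, overhang=100, threads=4):
    """Return the argument list for a STAR genome-generation run."""
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", fasta,   # decompressed FASTA (assumed local path)
        "--sjdbGTFfile", gtf,          # decompressed GTF (assumed local path)
        "--sjdbOverhang", str(overhang),
    ]


cmd = build_star_index_cmd(
    "star_index_GRCm38",  # hypothetical output directory
    "GRCm38.primary_assembly.genome.fa",
    "gencode.vM5.primary_assembly.annotation.gtf",
)
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

The overhang of 100 is just a placeholder here. The STAR manual suggests read length minus 1, so it should be adjusted to the data at hand.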

Best regards,

Bero.

Alexander Dobin

Jun 10, 2015, 6:27:35 PM
to rna-...@googlegroups.com, kows...@gmail.com
Hi Bero,

thanks for sharing this, it will make life easier for us. Hopefully, they will do it for human soon.

Cheers
Alex