Ensembl GRCh37 Human Genome size, 28 GB - opinions on it as a reference?


German Leparc

Jun 20, 2013, 5:07:33 AM
to rna-...@googlegroups.com

Dear fellow rna-star users,

For some time I have been using Illumina's iGenomes as my reference genomes. However, at our site we have decided to stick with the official Ensembl genome releases.

The latest release, Homo_sapiens.GRCh37.71.dna.toplevel.fa, surprised me: it is about 28 GB in size, roughly 10 times what the human genome is thought to be.
A closer look shows this release is full of PATCH fixes and GL0000XX supercontigs. Is this just a lot of redundant sequence? I cannot find much information about these, or about the recommended best practices for using this data as a genome reference.

I would like to include as much reference sequence as possible. It was previously discussed (I think by Heng Li of samtools) that it is important to include not just the karyotypic chromosomes but also the unfinished contigs, so that reads align to their "original" loci and you do not introduce alignment artifacts by leaving out the incompletely assembled parts of the genome.

But I cannot imagine that this extra 20+ GB of sequence is really unique and correct to use in an RNA-seq/DNA-seq short-read alignment strategy.

Any thoughts? What are you doing about choosing a reference genome? It is much easier for the mouse, rat, and other genomes, since their contigs have not exploded yet as they have in human...
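One way to take the "closer look" described above is to tally, per sequence, how many bases a FASTA file holds and how many of them are N. The following is a minimal sketch, not an established tool: the function name `summarize_fasta` is hypothetical, and it assumes a plain-text FASTA whose record name is the first word after `>`.

```python
def summarize_fasta(lines):
    """Return {record name: (total bases, N bases)} for FASTA lines.

    Hypothetical helper for illustration; `lines` can be an open file
    handle or any iterable of strings.
    """
    stats = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record name is the first word of the header line.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            stats.setdefault(name, [0, 0])
            continue
        if name is None:
            continue  # sequence text before any header; skip it
        stats[name][0] += len(line)
        stats[name][1] += line.upper().count("N")
    return {key: tuple(value) for key, value in stats.items()}
```

For a real file this could be called as `summarize_fasta(open(path))`, where `path` points at the uncompressed FASTA. Comparing the two numbers per record shows how much of each sequence is actual base calls rather than N.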



Shawn Driscoll

Jun 20, 2013, 9:00:52 PM
to rna-...@googlegroups.com
Use the *dna.primary_assembly.fa file. The "toplevel" file includes many alternative versions of entire chromosomes, and aligning to that reference will undoubtedly cause some confusion.

On the subject of Ensembl: I like the Ensembl genomes mostly because they also provide very detailed GTF files. Because of their chromosome naming convention, though, you have to keep a file handy for translating the names to UCSC names if you want to use the UCSC Genome Browser or IGV.
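The name translation mentioned above can be sketched for the primary chromosomes without a lookup file. This is an illustration only: the function name `ensembl_to_ucsc` is hypothetical, and it deliberately returns `None` for GL contigs and anything else, since those names do need a translation table.

```python
def ensembl_to_ucsc(name):
    """Translate an Ensembl primary chromosome name to UCSC style.

    Hypothetical helper for illustration. Returns None for names that
    cannot be converted by a simple rule (e.g. GL contigs).
    """
    primary = {str(i) for i in range(1, 23)} | {"X", "Y"}
    if name in primary:
        return "chr" + name
    if name == "MT":
        # UCSC names the mitochondrial sequence chrM.
        return "chrM"
    return None


for ensembl_name in ["1", "22", "X", "MT", "GL000191.1"]:
    print(ensembl_name, ensembl_to_ucsc(ensembl_name))
```

The same rule applies to the first column of a GTF file when moving annotations between the two naming schemes.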

Alexander Dobin

Jun 22, 2013, 12:36:04 PM
to rna-...@googlegroups.com
Hi German,

Shawn's suggestion is spot on. This file contains the major chromosomes (1-22, X, Y, MT) and the GL contigs, but not the patches or haplotypes. The GL contigs add just a few megabases to the genome length.
I strongly agree with the recommendation to include the GL contigs in the genome. We have found that whenever ribo-depletion does not work well, a substantial number of reads map to ribosomal repeats in the GL contigs, and these reads would be reported as unmapped if the GL contigs were not included in the genome.

Cheers
Alex
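If only the toplevel file is at hand, the selection described in this thread (chromosomes 1-22, X, Y, MT plus GL contigs, without patches or haplotypes) can be approximated by filtering on record names. A minimal sketch, assuming names follow the patterns quoted in this thread; `filter_fasta` and the `EXAMPLE_PATCH` record are hypothetical and used only for illustration.

```python
import re

# Assumption: wanted records are named 1-22, X, Y, MT, or GL<digits>.<version>.
KEEP = re.compile(r"^(?:[1-9]|1[0-9]|2[0-2]|X|Y|MT|GL\d+\.\d+)$")


def filter_fasta(lines):
    """Yield only the FASTA lines that belong to records matching KEEP."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # The record name is the first word of the header line.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            keep = bool(KEEP.match(name))
        if keep:
            yield line


# Tiny in-memory demonstration with made-up records.
demo = [
    ">1 dna:chromosome\n", "ACGT\n",
    ">EXAMPLE_PATCH dna:chromosome\n", "NNNN\n",
    ">GL000191.1 dna:supercontig\n", "GATC\n",
]
kept = list(filter_fasta(demo))
print("".join(kept), end="")
```

Because the function is a generator, it can stream a large file line by line into an output handle without holding the genome in memory.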

German Leparc

Jul 31, 2013, 1:40:43 PM
to rna-...@googlegroups.com
Thank you Shawn and Alexander. That cleared up everything for me!
