Dear fellow rna-star users,
For some time I have been using Illumina's iGenomes as my reference genomes. However, at our site we have decided to stick with the official Ensembl genome releases.
A look at the latest release, Homo_sapiens.GRCh37.71.dna.toplevel.fa, is surprising, at least to me. The file is about 28 GB, which is about 10 times the expected size of the human genome (roughly 3 billion bases).
A closer look shows this release is full of PATCH fixes and GL0000XX supercontigs. Is this just a lot of redundant sequence? I cannot find much information about these sequences, or about the recommended best practices for using this data as a genome reference.
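To see where the bulk actually comes from, a rough tally of sequence lengths per header type might help. Below is a minimal Python sketch, not a tested tool: the function names (`classify`, `tally_fasta`) are my own, and the classification by name pattern (1-22/X/Y/MT as primary, names containing "PATCH" as patches, names starting with "GL" as unplaced supercontigs) is an assumption based on the headers I saw, so adjust it to whatever your file contains. It also counts N bases separately, in case some of the bulk is N padding rather than real sequence.

```python
from collections import defaultdict

# Assumed naming of the karyotypic chromosomes in the Ensembl file.
PRIMARY = {str(i) for i in range(1, 23)} | {"X", "Y", "MT"}


def classify(name):
    """Guess a category from the sequence name (heuristic, not official)."""
    if name in PRIMARY:
        return "primary"
    if "PATCH" in name:
        return "patch"
    if name.startswith("GL"):
        return "unplaced"
    return "other"


def tally_fasta(lines):
    """Return {category: (n_sequences, n_bases, n_N_bases)} for FASTA lines."""
    totals = defaultdict(lambda: [0, 0, 0])
    category = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The sequence name is the first whitespace-delimited token.
            parts = line[1:].split()
            category = classify(parts[0] if parts else "")
            totals[category][0] += 1
        elif category is not None:
            totals[category][1] += len(line)
            totals[category][2] += line.upper().count("N")
    return {key: tuple(value) for key, value in totals.items()}
```

Passing an open file handle works, since the function only iterates over lines, e.g. `tally_fasta(open("Homo_sapiens.GRCh37.71.dna.toplevel.fa"))`. Comparing the base count with the N count per category should show how much of each class is real sequence.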
I would like to include as much reference sequence as possible. It was previously discussed (I think by Heng Li of samtools) that it is important to include not just the karyotypic chromosomes but also the unfinished contigs, so that reads align to their "original" loci and leaving out the incompletely assembled parts of the genome does not introduce alignment artifacts.
But I cannot imagine that this extra 20+ GB of sequence is really unique, or that it is correct to use for an RNA-seq/DNA-seq short-read alignment strategy.
Any thoughts? What are you doing when it comes to choosing a reference genome? It is much easier for the mouse, rat, etc. genomes, since the contigs have not exploded yet as they have in human...