rRNA track is short fragments?


J. C. Szamosi

Mar 30, 2016, 1:57:05 PM
to gen...@soe.ucsc.edu
Hello all,

I downloaded the human rRNA track following the instructions here:
https://groups.google.com/a/soe.ucsc.edu/forum/#!searchin/genome/rrna/genome/0X06cAZgHjU/YekSLOG7CQAJ

And did the same for mouse following similar instructions. In both cases, the GTF files I downloaded annotate a large number of very small sequences (50-150 bp in length). I find this very surprising. Are the whole rRNA regions not annotated as full, long sequences? The short sequences in the GTF file aren't even contiguous, so together they don't cover a long rRNA sequence as I would expect.

I'm hoping to use the GTF file to generate a FASTA file of all the rRNA sequences in the genome, but right now this isn't working. Am I missing something?

Thanks

Jake
--
J. C. Szamosi, M.Sc.
Bioinformatician
Farncombe Metagenomics Facility
McMaster University

Matthew Speir

Mar 31, 2016, 1:52:59 PM
to J. C. Szamosi, gen...@soe.ucsc.edu
Hi Jake,

Thank you for your question about obtaining coordinates for rRNA genes.

Many of the results generated by the Table Browser query in the answer to the previous mailing list question are 5S rRNA pseudogenes. You can see these details by examining one of these annotations in the Genome Browser, clicking into the GENCODE details page for the item, and then clicking through to the Ensembl website, where next to "Description" you should see some text like "RNA, 5S ribosomal pseudogene 40". For example: http://www.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;g=ENSG00000252956;r=1:9437669-9437778;t=ENST00000517147.

I believe that many of these pseudogenes can be removed from the output by tightening the Table Browser filters applied to the results. In step 6 of the instructions in the answer here: https://groups.google.com/a/soe.ucsc.edu/d/msg/genome/0X06cAZgHjU/Kn0WPsAiCgAJ, you can add the following filter under the "hg38.wgEncodeGencodeAttrsV22 based filters" section, in addition to those already described:

geneName doesn't match *5SP*
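
If you already have the output downloaded, the same kind of wildcard exclusion can be sketched in Python. This is only a rough illustration, not what the Table Browser runs internally: the function name and the example gene names are assumptions made for the example, and the match here is case-sensitive, which may or may not mirror the Table Browser's behavior.

```python
from fnmatch import fnmatchcase


def drop_5s_pseudogenes(gene_names, pattern="*5SP*"):
    """Return the gene names that do NOT match the wildcard pattern.

    Hypothetical helper: mimics a "doesn't match *5SP*" filter on a
    plain list of gene names.
    """
    return [name for name in gene_names if not fnmatchcase(name, pattern)]


# Illustrative gene names only; substitute the geneName values from your output.
names = ["RNA5SP40", "RNA5S1", "RNA5-8S5"]
kept = drop_5s_pseudogenes(names)
print(kept)
```

Only names containing the literal substring "5SP" are dropped, so any pseudogenes named with a different pattern would need their own filter.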

Additionally, this list may not be exhaustive: it only represents those rRNA genes that GENCODE has annotated, not necessarily all rRNA genes throughout the genome. From a quick scan of the gene names output using the instructions in the previous mailing list answer plus the additional geneName filter I described, it looks like GENCODE only includes 5S and 5.8S rRNA genes in their annotations. According to the "Ribosomal RNA" Wikipedia page, https://en.wikipedia.org/wiki/Ribosomal_RNA#Prokaryotes_vs._eukaryotes, the 5S and 5.8S genes are expected to be between 100 and 156 bp in length.
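
To check whether the features in a downloaded GTF file fall in that size range, something like the following Python sketch could be used. GTF coordinates are 1-based and inclusive, so the length is end - start + 1. The helper names are made up for illustration, and the example line is assembled by hand from the coordinates in the Ensembl link above; its source, strand, and attribute fields are placeholders rather than real Table Browser output.

```python
def gtf_feature_length(gtf_line):
    """Length in bp of one GTF feature (GTF is 1-based, end-inclusive)."""
    fields = gtf_line.rstrip("\n").split("\t")
    start, end = int(fields[3]), int(fields[4])
    return end - start + 1


def in_expected_range(length, low=100, high=156):
    """True if the length falls in the quoted 5S/5.8S size range."""
    return low <= length <= high


# Hand-built example line; only the chromosome and coordinates come from
# the Ensembl link (r=1:9437669-9437778). Other fields are placeholders.
example = "\t".join([
    "chr1", "example_source", "exon", "9437669", "9437778",
    ".", "+", ".",
    'gene_id "ENST00000517147"; transcript_id "ENST00000517147";',
])

length = gtf_feature_length(example)
print(length, in_expected_range(length))
```

Running the same two functions over every line of the file would give a quick picture of how many annotations are in the expected 5S/5.8S range.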

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group
--


Matthew Speir

Apr 5, 2016, 6:13:01 PM
to J. C. Szamosi, gen...@soe.ucsc.edu
Hi Jake,

Thank you for the suggestion. I will make sure it's taken into consideration for future responses about getting rRNA gene coordinates from the GENCODE track in the UCSC Genome Browser.

As for a track that may have a complete set of rRNA annotations, I'm not sure that we have one on our public site. I was going to suggest the RefSeq Genes track, but even that appears to contain only a subset of the possible rRNA gene annotations. According to the GenBank description of the 28S rRNA, https://www.ncbi.nlm.nih.gov/nuccore/NR_003287.2?report=genbank, the regions containing the 45S rRNA precursor for the 18S, 5.8S and 28S rRNAs should be found on chromosomes 13, 14, 15, 21 and 22. However, our RefSeq Genes track only contains annotations for these rRNA genes on chr21, an unlocalized chr22 scaffold (chr22_KI270733v1_random), and an unplaced scaffold (chrUn_GL000220v1):

NR_003287	chr21	+	8213887	8401980	rRNA	RNA28S5	
NR_003285	chr21	+	8212571	8212727	rRNA	RNA5-8S5	
NR_003285	chr21	+	8256780	8256936	rRNA	RNA5-8S5	
NR_003286	chr21	+	8209630	8211499	rRNA	RNA18S5	
NR_003285	chr21	+	8395606	8395762	rRNA	RNA5-8S5	
NR_003285	chr21	+	8439822	8439978	rRNA	RNA5-8S5	
NR_046235	chr21	+	8433221	8446572	rRNA	RNA45S5	
NR_003287	chr21	+	8441145	8446211	rRNA	RNA28S5	
NR_003286	chr21	+	8436875	8438744	rRNA	RNA18S5	
NR_003286	chr21	+	8392665	8394534	rRNA	RNA18S5	
NR_003287	chrUn_GL000220v1	+	113347	118417	rRNA	RNA28S5	
NR_046235	chrUn_GL000220v1	+	105423	118780	rRNA	RNA45S5	
NR_003285	chrUn_GL000220v1	+	112024	112180	rRNA	RNA5-8S5	
NR_003286	chrUn_GL000220v1	+	109077	110946	rRNA	RNA18S5	
NR_003285	chrUn_GL000220v1	+	155996	156152	rRNA	RNA5-8S5	
NR_003286	chrUn_GL000220v1	+	153049	154918	rRNA	RNA18S5	
NR_003287	chr22_KI270733v1_random	+	130203	135280	rRNA	RNA28S5	
NR_046235	chr22_KI270733v1_random	+	122272	135645	rRNA	RNA45S5	
NR_003285	chr22_KI270733v1_random	+	128876	129032	rRNA	RNA5-8S5	
NR_003286	chr22_KI270733v1_random	+	125930	127799	rRNA	RNA18S5	
NR_003285	chr22_KI270733v1_random	+	173955	174111	rRNA	RNA5-8S5	
NR_003286	chr22_KI270733v1_random	+	171011	172880	rRNA	RNA18S5	

In addition to these annotations, there are a number of 5S rRNAs. The RefSeq Genes track is built by aligning RNAs from RefSeq to the genome and then selecting the best alignments. In cases where there are two best alignments, both are kept. I know that GENCODE annotated a number of 5S pseudogenes, so it's possible that some of the multiple alignments for a particular 5S gene overlap with some of the GENCODE 5S pseudogenes, though I haven't investigated this. You can read more about how the RefSeq Genes track is constructed on the track description page here: http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&g=refGene. If you're interested, I can give some instructions on how to extract rRNA annotations from the RefSeq Genes track.
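
To illustrate why a single RefSeq RNA can show up at more than one location, here is a toy Python sketch of a "keep every top-scoring alignment" rule. It is not the actual track-building pipeline, whose scoring isn't described here; the function name, accessions, positions, and scores are all invented for the example.

```python
from collections import defaultdict


def keep_best_alignments(alignments):
    """Keep, for each accession, every alignment that ties for the top score.

    alignments: iterable of (accession, chrom, start, score) tuples.
    Hypothetical helper; the score is just an integer for illustration.
    """
    by_accession = defaultdict(list)
    for aln in alignments:
        by_accession[aln[0]].append(aln)

    kept = []
    for alns in by_accession.values():
        best = max(a[3] for a in alns)
        kept.extend(a for a in alns if a[3] == best)
    return kept


# Invented data: NR_X has two equally good placements and one worse one.
alignments = [
    ("NR_X", "chr1", 100, 990),
    ("NR_X", "chr2", 500, 990),
    ("NR_X", "chr3", 900, 900),
    ("NR_Y", "chr4", 50, 970),
]
best = keep_best_alignments(alignments)
print(best)
```

Under a rule like this, a repeated gene such as a 5S rRNA can legitimately appear several times in the track, once for each equally good placement.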

Even though this output contains a few more annotations for the larger rRNA genes, I doubt that it is a complete annotation of all the rRNA genes in the genome. This is because, according to Wikipedia, each cluster on chromosomes 13, 14, 15, 21, and 22 contains 30-40 repeats of the 45S rRNA precursor gene, whereas the above list only contains one 45S annotation per chromosome or scaffold and only a few 5.8S, 18S, and 28S annotations outside of these. That's not even counting the rRNA annotations missing from chr13, chr14, and chr15. You can try asking this question on other, more general online biology help forums, such as Biostars or SEQanswers, to see if others have recommendations for finding a full set of all rRNA genes in the genome or for how best to mask these rRNA genes from your RNA-seq analysis.
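
As a quick sanity check on those counts, the rows listed above can be tallied per sequence with a short Python sketch. The rows are copied from the output above (whitespace-separated here rather than tab-separated); the function name and the assumption that the sequence name is the second column and the gene name the seventh are choices made for this example.

```python
from collections import Counter

# Rows copied from the RefSeq Genes output listed above.
ROWS = """\
NR_003287 chr21 + 8213887 8401980 rRNA RNA28S5
NR_003285 chr21 + 8212571 8212727 rRNA RNA5-8S5
NR_003285 chr21 + 8256780 8256936 rRNA RNA5-8S5
NR_003286 chr21 + 8209630 8211499 rRNA RNA18S5
NR_003285 chr21 + 8395606 8395762 rRNA RNA5-8S5
NR_003285 chr21 + 8439822 8439978 rRNA RNA5-8S5
NR_046235 chr21 + 8433221 8446572 rRNA RNA45S5
NR_003287 chr21 + 8441145 8446211 rRNA RNA28S5
NR_003286 chr21 + 8436875 8438744 rRNA RNA18S5
NR_003286 chr21 + 8392665 8394534 rRNA RNA18S5
NR_003287 chrUn_GL000220v1 + 113347 118417 rRNA RNA28S5
NR_046235 chrUn_GL000220v1 + 105423 118780 rRNA RNA45S5
NR_003285 chrUn_GL000220v1 + 112024 112180 rRNA RNA5-8S5
NR_003286 chrUn_GL000220v1 + 109077 110946 rRNA RNA18S5
NR_003285 chrUn_GL000220v1 + 155996 156152 rRNA RNA5-8S5
NR_003286 chrUn_GL000220v1 + 153049 154918 rRNA RNA18S5
NR_003287 chr22_KI270733v1_random + 130203 135280 rRNA RNA28S5
NR_046235 chr22_KI270733v1_random + 122272 135645 rRNA RNA45S5
NR_003285 chr22_KI270733v1_random + 128876 129032 rRNA RNA5-8S5
NR_003286 chr22_KI270733v1_random + 125930 127799 rRNA RNA18S5
NR_003285 chr22_KI270733v1_random + 173955 174111 rRNA RNA5-8S5
NR_003286 chr22_KI270733v1_random + 171011 172880 rRNA RNA18S5
"""


def count_by_sequence(rows, gene):
    """Count how many rows for `gene` fall on each chromosome or scaffold."""
    counts = Counter()
    for line in rows.splitlines():
        fields = line.split()
        if len(fields) == 7 and fields[6] == gene:
            counts[fields[1]] += 1
    return counts


counts_45s = count_by_sequence(ROWS, "RNA45S5")
for seq_name, n in sorted(counts_45s.items()):
    print(seq_name, n)
```

Each of the three sequences carries a single 45S annotation, far fewer than the 30-40 repeats per cluster mentioned above.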


Matthew Speir
UCSC Genome Bioinformatics Group


On 4/4/16 7:28 AM, J. C. Szamosi wrote:
Thanks for your help, Matthew.

It seems that if only the 5S and 5.8S rRNAs are annotated, people should be aware of that fact when using that track as a mask file for RNA-seq experiments. Do you know of a track that annotates all the rRNA in the human or mouse genomes? If not, it might be worth mentioning this in future posts that give the previously linked instructions, so investigators aren't led astray.

Thanks!

Jake

J. C. Szamosi

Apr 6, 2016, 10:11:15 AM
to gen...@soe.ucsc.edu
Thanks so much for your help, Matthew!