Mask GTF File for RNA Seq


Elmer Ker

Aug 27, 2013, 8:34:10 PM
to gen...@soe.ucsc.edu
hi,
I am trying to download a GTF file using the UCSC Table Browser for the human and bovine genomes. What is the correct way to get a GTF file for rRNA, tRNA and mitochondria? I know that for mitochondria, I can use the filter and enter 'chrM' under Chromosome, but I am not sure how to do so for rRNA and tRNA. I saw this link: http://onetipperday.blogspot.com/2012/08/how-to-get-trnarrnamitochondrial-gene.html but it seems that the 'rmsk' table suggested on that page is for repeats rather than actual rRNAs or tRNAs (please correct me if I am wrong). I tried looking at previous posts in the UCSC genome support forum, but most of the answers link to older posts that no longer exist. I also could not register ('Administrator has disabled registration for today'), so I could not post a new message.
thank you,
elmer

Jonathan Casper

Sep 3, 2013, 5:10:25 PM
to Elmer Ker, gen...@soe.ucsc.edu
Hello Elmer,

Thank you for your question about obtaining GTF output. Your method of obtaining mitochondrial genes is good, but the UCSC Genes data on chromosome M is a bit limited. UCSC Genes is built largely on RefSeq, which doesn't provide annotation of chromosome M. You may wish instead to use an alternate genes table like wgEncodeGencodeBasicV17 (part of the Gencode Genes V17 track, under the Genes and Gene Prediction Tracks heading), which has slightly more detailed data for that region. You're also quite right - the rmsk table is for repetitive elements. The page you link to seems specifically designed for obtaining a list of repetitive tRNAs and rRNAs, which explains why they use that table. If you would instead like a list of all tRNAs and rRNAs, you should use different tables.

For rRNA info, we again suggest you try retrieving data from the wgEncodeGencodeBasicV17 table. For more information on obtaining rRNAs from this track, please see the answer to this question: https://groups.google.com/a/soe.ucsc.edu/d/topic/genome/p-7101I71ak/discussion. The only required change to those instructions is to select GTF output instead of BED.

For tRNA info, we suggest you retrieve data from the tRNA Genes track (table tRNAs), also under the Genes and Gene Prediction Tracks heading. This set of predicted tRNAs includes some pseudo-tRNAs that you may wish to filter out. If so, you can do this by clicking the "create" button next to "filter" in the Table Browser and applying the following setting: aa doesn't match Pseudo.
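
If the table has already been downloaded without the filter, the same "aa doesn't match Pseudo" condition can be applied locally. The following is only a minimal sketch: it assumes a tab-separated export whose header line starts with "#" and includes a column named aa, and the sample rows and the function name drop_pseudo_trnas are made up for illustration.

```python
import csv
import io

def drop_pseudo_trnas(table_text):
    """Return the rows whose 'aa' column is not 'Pseudo'."""
    # Assumption: the header line is prefixed with '#', which is stripped here.
    reader = csv.DictReader(io.StringIO(table_text.lstrip("#")), delimiter="\t")
    return [row for row in reader if row["aa"] != "Pseudo"]

# Made-up sample rows, not real annotation data.
sample = (
    "#chrom\tchromStart\tchromEnd\tname\tstrand\taa\n"
    "chr1\t100\t172\ttRNA-a\t+\tAla\n"
    "chr1\t500\t573\ttRNA-b\t-\tPseudo\n"
    "chr6\t900\t971\ttRNA-c\t+\tGly\n"
)

kept = drop_pseudo_trnas(sample)
print([row["name"] for row in kept])  # ['tRNA-a', 'tRNA-c']
```

Applying the filter in the Table Browser itself remains the simpler route when GTF output is wanted directly, since the filter and the output format are set on the same page.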

You can post questions to the UCSC Genome Discussion group by doing exactly what you've done here - sending email to gen...@soe.ucsc.edu. If you're interested in following traffic on the UCSC Genome Discussion list, you can subscribe to the Google group at https://groups.google.com/a/soe.ucsc.edu/forum/#!forum/genome. We can also subscribe you directly to the mailing list with another email address if you don't have a Google Groups account. The old mailing list archives are currently being transitioned to this group, which is why the links to older posts are breaking. If there is a specific thread you're searching for, you should be able to find it by using the Google Groups search box.

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group



Elmer Ker

Sep 3, 2013, 5:39:02 PM
to gen...@soe.ucsc.edu
hi,
thank you so much for your reply. I am doing RNA-seq for both human and bovine sequences. While wgEncodeGencodeBasicV17 is available for human, it is not available for bovine. Does it matter if I use different tables for different species? Ideally, I'd like to compare the two eventually [a very rough human vs cow RNA-seq experiment], so I'm guessing it's preferable to use settings that are as similar as possible for consistency. If that is not possible, I would like to follow whatever recommendations you may have for the tracks and tables for chromosome M, rRNA and tRNA for bovine. Sorry to bother you with so many questions - I am new to DNA/RNA-seq.
elmer


Jonathan Casper

Sep 5, 2013, 4:59:00 PM
to Elmer Ker, gen...@soe.ucsc.edu

Hello Elmer,

Unfortunately the annotation we have for the cow genome is somewhat limited, which constrains your options if you're set on comparing data from corresponding tables between the cow and human genomes. Your only option may be to use the tRNA and rRNA data from the rmsk table as described in the link you provided. Whether this is appropriate for your research is a question that goes beyond the scope of this mailing list - it will depend on the specific question you're trying to ask and how the datasets were originally constructed.
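
For readers who take the rmsk route, here is a rough sketch of turning rRNA and tRNA repeat records into GTF lines. It assumes each record has been reduced to (chrom, start, end, strand, repName, repClass) with 0-based, half-open coordinates as UCSC tables store them; the sample records, the "rmsk" source label, and the way gene_id is built are illustrative choices, not a UCSC convention.

```python
def rmsk_to_gtf(rows, wanted_classes=("rRNA", "tRNA")):
    """Convert (chrom, start, end, strand, repName, repClass) tuples to GTF lines.

    Input coordinates are assumed 0-based half-open. GTF is 1-based and
    inclusive, so only the start coordinate is shifted by one.
    """
    lines = []
    for n, (chrom, start, end, strand, name, rep_class) in enumerate(rows):
        if rep_class not in wanted_classes:
            continue  # skip LINEs, SINEs and other repeat classes
        ident = f"{name}_{n}"  # hypothetical unique id: repeat name plus row index
        attrs = f'gene_id "{ident}"; transcript_id "{ident}";'
        fields = [chrom, "rmsk", "exon", str(start + 1), str(end), ".", strand, ".", attrs]
        lines.append("\t".join(fields))
    return lines

# Illustrative records only.
rows = [
    ("chr1", 1000, 1072, "+", "tRNA-Ala-GCA", "tRNA"),
    ("chr1", 2000, 2300, "-", "L1MB7", "LINE"),
    ("chr2", 500, 620, "+", "5S", "rRNA"),
]

for line in rmsk_to_gtf(rows):
    print(line)
```

The Table Browser can produce GTF output itself, so a conversion like this is mainly useful when the rmsk table has been downloaded in another format.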

Using RepeatMasker data may well be enough for your initial rough comparison. Following the instructions in the link you provided generates about 2000 tRNA records from the human rmsk table. This is rather larger than the ~600 in the predicted tRNAs track, but at least appears to cover all but 34 of the predicted items. For more information about RepeatMasker data you can consult their website at www.repeatmasker.org.
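
One plausible way to check this kind of coverage locally is to count the predicted items that overlap no RepeatMasker record. The sketch below is not necessarily how the figures above were obtained; it assumes both sets have been reduced to (chrom, start, end) tuples with half-open coordinates, and it uses made-up intervals.

```python
def count_uncovered(predicted, repeats):
    """Count predicted intervals that overlap no repeat interval on the same chromosome."""
    uncovered = 0
    for chrom, start, end in predicted:
        # Two half-open intervals overlap when each starts before the other ends.
        hit = any(c == chrom and s < end and start < e for c, s, e in repeats)
        if not hit:
            uncovered += 1
    return uncovered

# Made-up intervals for illustration.
predicted = [("chr1", 100, 170), ("chr1", 400, 470), ("chr2", 50, 120)]
repeats = [("chr1", 90, 180), ("chr2", 119, 200), ("chr3", 400, 470)]

print(count_uncovered(predicted, repeats))  # 1
```

This brute-force comparison is quadratic in the number of records, which is fine for a few thousand tRNAs; larger sets would call for sorting the intervals or using a dedicated interval tool.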

As an alternative for tRNA data, you can run the same prediction software on the cow genome that we used to generate the human tRNAs track. More information about that track and its creation is available on the track description page here: http://genome.ucsc.edu/cgi-bin/hgTrackUi?g=tRNAs. A similar project involving other organisms is described in this paper:

Goodenbour JM, Pan T. Diversity of tRNA genes in eukaryotes. Nucleic Acids Res. 2006;34(21):6137-46. PMID: 17088292; PMC: PMC1693877.
http://nar.oxfordjournals.org/content/34/21/6137

Chromosome M annotation is similarly limited, particularly for the cow genome. You may be able to obtain some results by taking the Gencode chromosome M annotation from the human genome and searching for those sequences in cow with an alignment tool like BLAT (http://genome.ucsc.edu/cgi-bin/hgBlat). BLAT has difficulty aligning repetitive sequences like rRNAs and tRNAs, but should have less trouble finding matches for mitochondrial sequences.

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group


Elmer Ker

Sep 5, 2013, 9:17:01 PM
to gen...@soe.ucsc.edu

hi Jonathan,

thank you for your help. I will probably just stick to the rmsk table for cow rRNA and tRNA. I tried the BLAT approach you suggested for cow chrM but could not find a way to easily download the search results in GTF format. In addition, BLAT pulled up sequences that were not on cow chrM. I ended up doing the following in the Table Browser:

1. Select “Mammal”, “Cow”, “Oct 2011 (Baylor Btau_4.6.1/bosTau7)”

2. Group: Genes and Gene Prediction Tracks

3. Track: RefSeq Genes

4. Table: Cow ESTs (All ESTs)

5. Output format: GTF – gene transfer format

6. Filename: bosTau7_GenesNGenePredictions_RefSeq_CowESTs.gtf

7. Click "region"

8. Select mitochondria with: ‘chrM’ in the ‘position’ field and click 'lookup'.

9. Click ‘get output’ on that page


I wanted to know if you think that's a 'legal' thing to do?

elmer



Jonathan Casper

Sep 6, 2013, 4:53:23 PM
to Elmer Ker, gen...@soe.ucsc.edu

Hello Elmer,

That is certainly a legal thing to do from the perspective of the Table Browser. Whether examining ESTs is appropriate for your research is a question that really only you can answer. The steps you describe will give you a list of all ESTs from the all_est table that appear in chromosome M of the cow genome. You can also access this table from the track group "mRNA and EST Tracks", track "Cow ESTs", table "all_est". In that same group of tracks we have a related "spliced ESTs" track that may be useful to you - it contains only those ESTs that show evidence of at least one canonical intron. More information about that track is available here: http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=bosTau7&g=intronEst.
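
As a local cross-check of a chromosome-restricted export like the one discussed here, a GTF file can also be filtered by its first column after download. This is a minimal sketch that assumes standard tab-separated GTF with the sequence name in column 1; the sample lines and the function name keep_chromosome are made up for illustration.

```python
def keep_chromosome(gtf_lines, chrom="chrM"):
    """Return the GTF records whose first (seqname) column equals chrom."""
    kept = []
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        if line.split("\t", 1)[0] == chrom:
            kept.append(line)
    return kept

# Made-up GTF records for illustration.
sample = [
    "# example export",
    'chrM\tall_est\texon\t101\t400\t.\t+\t.\tgene_id "est1"; transcript_id "est1";',
    'chr1\tall_est\texon\t5001\t5300\t.\t-\t.\tgene_id "est2"; transcript_id "est2";',
    'chrM\tall_est\texon\t901\t1200\t.\t-\t.\tgene_id "est3"; transcript_id "est3";',
]

for record in keep_chromosome(sample):
    print(record)
```

Comparing on the exact column value rather than a prefix avoids accidentally matching unrelated sequence names that merely begin with the same characters.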

You may also be interested in the answers to these mailing list questions: https://groups.google.com/a/soe.ucsc.edu/d/topic/genome/zR-BFms-lEk/discussion and https://groups.google.com/a/soe.ucsc.edu/d/topic/genome/-JMFaCvkUBM/discussion.

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group
