GENCODE gtf


Trakhtenberg, Feliks

Aug 11, 2014, 11:23:19 AM
to gen...@soe.ucsc.edu

Hello,

 

I would appreciate it if someone could explain why the GENCODE GTF generated through the Table Browser lacks the gene, transcript, UTR, and Selenocysteine rows that are present in the original GENCODE file. I plan to use this GTF for TopHat/Cufflinks RNA-seq analysis and want to make sure I am using the right file.


When will the GENCODE mouse V3 be available through the Table Browser?


Does the table option called Comprehensive contain the most GENCODE transcripts, including those that are only predicted? Or do other GENCODE tables, such as pseudogenes, contain additional transcripts?


Is everything that is in the UCSC Gene table also included in the Comprehensive GENCODE table?


Thank you

Ephraim Trakhtenberg, PhD

Steve Heitner

Aug 11, 2014, 5:46:56 PM
to Trakhtenberg, Feliks, gen...@soe.ucsc.edu

Hello, Ephraim.

To address all of your questions:

1.  We recommend that you get the GTF files from GENCODE (http://www.gencodegenes.org).  The Table Browser generates lowest-common-denominator GTFs for many tracks, and these will not contain all of the information available in the official GENCODE GTFs.

2.  The GENCODE mouse V3 track will hopefully be available this month (August 2014).

3.  For information regarding the different GENCODE subtracks available at UCSC, I recommend reading through the description page at http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=mm10&g=wgEncodeGencodeVM2.

4.  Concerning whether or not the GENCODE track contains everything contained in the UCSC Genes track, I don’t believe this can be answered definitively.  The UCSC Genes track is based on GenBank while the GENCODE track is based on Ensembl.  Because these are constructed using completely different methods, you will find in many cases that GenBank contains items that Ensembl does not and vice versa.
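Regarding point 1, one way to see which row types a given GTF contains is to tally the feature column (column 3). The following is only a minimal sketch in Python: the function name and the example records are made up for illustration and are not real annotation data.

```python
from collections import Counter


def count_gtf_features(lines):
    """Count the values in column 3 (feature type) of GTF lines."""
    counts = Counter()
    for line in lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed 9-column GTF record
        counts[fields[2]] += 1
    return counts


# Made-up records for illustration only (hypothetical IDs and coordinates).
example = [
    "##description: hypothetical example\n",
    'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "G1";\n',
    'chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    'chr1\tHAVANA\texon\t100\t300\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    'chr1\tHAVANA\tCDS\t150\t300\t.\t+\t0\tgene_id "G1"; transcript_id "T1";\n',
    'chr1\tHAVANA\tUTR\t100\t149\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
]
feature_counts = count_gtf_features(example)
print(sorted(feature_counts.items()))
```

To compare two real files, pass an open file handle for each to `count_gtf_features` and compare the resulting counts; feature types present in one file and absent from the other will show up as missing keys.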

Please contact us again at gen...@soe.ucsc.edu if you have any further questions. 
All messages sent to that address are archived on a publicly-accessible Google Groups forum.  If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

---
Steve Heitner
UCSC Genome Bioinformatics Group

--

Trakhtenberg, Feliks

Aug 12, 2014, 3:02:10 PM
to gen...@soe.ucsc.edu

Hello,

 

Regarding your answer in point 4, is it possible to identify which UCSC Genes track transcripts from GenBank are not found in Ensembl and GENCODE M3? I would like to add them to the GENCODE GTF but do not want redundancies.

What about RefSeq transcripts? Might there also be some that are included in the UCSC Genes track but not in GENCODE M3, similar to what you explained about the GenBank transcripts?

 

Thank you,

Ephraim

 



Jonathan Casper

Aug 14, 2014, 9:10:21 PM
to Trakhtenberg, Feliks, gen...@soe.ucsc.edu

Hello Ephraim,

Our engineers comment that it is difficult to advise you on how to combine gene sets without knowing what you're trying to accomplish specifically. Different gene sets use different predictive models, making it hard to combine them in a scientifically meaningful way.

That said, you can use the UCSC Table Browser intersection tool to get a list of entries found in UCSC Genes but not in GENCODE.

1. Open the UCSC Table Browser at http://genome.ucsc.edu/cgi-bin/hgTables
2. Use the following settings

clade: Mammal
genome: Mouse
assembly: Dec. 2011 (GRCm38/mm10)
group: Genes and Gene Predictions
track: UCSC Genes
table: knownGene
region: genome

3. Click the "intersection: create" button
4. On the "Intersect with UCSC Genes" page, set the following options:

group: Genes and Gene Predictions
track: GENCODE Genes VM2 (or V3, after it is released)
table: Basic (wgEncodeGencodeBasicVM2)

If you decide after reading the GENCODE track page that the Comprehensive table would be more useful to you, that is also an option.

5. Choose to return "All UCSC Genes records that have no overlap with GENCODE Genes VM2"

Note that the "no overlap" requirement here is fairly strict. You may wish to instead restrict to UCSC Genes records with no more than 50% overlap, for example, depending on your needs.

6. Click "submit" to return to the main Table Browser page

Note that the output format has been changed to BED. You can leave it that way or change it to GTF output. Just remember that the GTF output of the UCSC Table Browser will not exactly match the format of your GENCODE GTF file.

7. Click "get output"
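To illustrate the difference between the strict "no overlap" choice and a percentage threshold, here is a rough sketch in Python. It is not how the Table Browser computes overlap: it assumes each record is a single (chrom, start, end) span in BED-style 0-based half-open coordinates, ignores exon structure and strand, and uses hypothetical function names and made-up coordinates.

```python
def overlap_fraction(record, others):
    """Fraction of `record` bases covered by any interval in `others`.

    Intervals are (chrom, start, end) tuples, 0-based half-open as in BED.
    """
    chrom, start, end = record
    # Clip each overlapping same-chromosome interval to the record.
    clipped = sorted(
        (max(start, s), min(end, e))
        for c, s, e in others
        if c == chrom and s < end and e > start
    )
    # Merge the clipped intervals so shared bases are counted once.
    covered = 0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)
    if cur_end is not None:
        covered += cur_end - cur_start
    length = end - start
    return covered / length if length > 0 else 0.0


def select_by_max_overlap(records, others, max_fraction=0.0):
    """Keep records whose covered fraction is at most `max_fraction`."""
    return [r for r in records if overlap_fraction(r, others) <= max_fraction]


# Made-up coordinates for illustration only.
ucsc = [("chr1", 0, 100), ("chr1", 200, 300), ("chr2", 0, 50)]
gencode = [("chr1", 50, 100), ("chr1", 200, 290)]

strict = select_by_max_overlap(ucsc, gencode)         # no overlap at all
relaxed = select_by_max_overlap(ucsc, gencode, 0.5)   # at most 50% overlap
print(strict)
print(relaxed)
```

In this toy input the strict setting keeps only the chr2 record, while the 50% setting also keeps the first chr1 record, which is exactly half covered. The comparison is quadratic in the number of records, so it is suitable only for small inputs.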

We also have command-line tools that will perform this kind of operation, but they are not designed to work with files in GTF format. If you would like to explore this alternative, the relevant programs are called "featureBits" and "overlapSelect". They are available as part of the kent utilities on our download server at http://hgdownload.soe.ucsc.edu. We provide precompiled binaries for these utilities at http://hgdownload.soe.ucsc.edu/admin/exe/, but only for a few computer architectures. You may need to download the source code and compile these tools yourself if your computer is not listed there. You can run each program by itself on a command line with no arguments to see a description of how to use it.

As for your other question, RefSeq is a curated set of transcripts drawn from GenBank. As with GenBank, it is quite possible that there will be RefSeq transcripts that are not represented in GENCODE.

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu or genome...@soe.ucsc.edu. Questions sent to those addresses will be archived in publicly-accessible forums for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group



--


Trakhtenberg, Feliks

Sep 8, 2014, 12:09:33 PM
to Jonathan Casper, gen...@soe.ucsc.edu

Hello,

 

Thank you for the advice. My goal is to predict novel genes/transcripts. I would like to compile a comprehensive mouse GTF, so that it does not turn out that the novel transcripts I find in my RNA-seq data have already been predicted in some major database. So, I thought that merging GENCODE and UCSC Genes would provide such a comprehensive set. Please let me know if this is insufficient.

Using the intersection tool you recommended, even with the "no overlap" selection, there are about 8k UCSC Genes transcripts that are not in GENCODE. Does the Table Browser have an option for merging these entries with the GENCODE GTF? If not, would the command "cat out.gtf0[0-1] > merged.gtf" produce a GTF that is compatible with the Table Browser?

The UCSC Genes GTF produced by the Table Browser reports gene and transcript IDs like this: gene_id "uc007aet.1"; transcript_id "uc007aet.1". However, it does not add the original database accession (e.g., RefSeq) or the gene name to the entry. The GENCODE GTF from the Table Browser is also missing the gene names. How could I have the original database IDs and the gene names included in the UCSC Genes GTF produced by the Table Browser, and the gene names included in the GENCODE GTF from the Table Browser?

 

Thanks,
Ephraim



Steve Heitner

Sep 11, 2014, 1:05:56 PM
to Trakhtenberg, Feliks, Jonathan Casper, gen...@soe.ucsc.edu

Hello, Ephraim.

There is no specific order in a GTF file, so it should not be a problem to cat both files into a single file.  Regarding the gene symbols being a part of the GTF output, this is a limitation of the way the Table Browser creates GTF output.  If you would like the gene symbols to be a part of your GTF files, it will require some scripting on your part.  We cannot provide advice on creating a script, but if you would like to use the Table Browser to provide output that will equate transcript ID to gene symbol and RefSeq ID for use in your script, you can follow these instructions:

For GENCODE:

1. As your output format, select “selected fields from primary and related tables”

2. Click the “get output” button

3. In the “Select Fields from mm10.wgEncodeGencodeCompVM2” section, check the “name” and “name2” checkboxes

4. Click the “get output” button

For UCSC Genes:

1. As your output format, select “selected fields from primary and related tables”

2. Click the “get output” button

3. In the “Select Fields from mm10.knownGene” section, check the “name” checkbox

4. In the “mm10.kgXref fields” section, check the “geneSymbol” and “refseq” checkboxes

5. Click the “get output” button
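For readers wondering what such a script might look like, here is one possible sketch in Python, not an official UCSC method. It assumes the "selected fields" output was saved as a tab-separated table with the transcript ID in column 1 and the gene symbol in column 2, and that header lines start with "#". The function names and the coordinates in the example GTF line are hypothetical; the IDs are ones quoted elsewhere in this thread.

```python
import re

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def load_symbols(lines):
    """Map transcript ID (column 1) to gene symbol (column 2)."""
    symbols = {}
    for line in lines:
        # Skip header/comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1]:
            symbols[fields[0]] = fields[1]
    return symbols


def add_gene_name(gtf_line, symbols):
    """Append a gene_name attribute when the transcript ID is in the map."""
    match = TRANSCRIPT_ID.search(gtf_line)
    if match is None or match.group(1) not in symbols:
        return gtf_line  # nothing known about this transcript
    if 'gene_name "' in gtf_line:
        return gtf_line  # already annotated
    body = gtf_line.rstrip("\n").rstrip()
    if not body.endswith(";"):
        body += ";"
    return f'{body} gene_name "{symbols[match.group(1)]}";\n'


# Assumed layout of the Table Browser "selected fields" output.
table = [
    "#name\tgeneSymbol\trefseq\n",
    "uc007aif.1\tSlco5a1\tNM_172841\n",
]
# Coordinates below are placeholders, not the real position of this transcript.
gtf_line = (
    'chr1\tmm10_knownGene\texon\t100\t200\t0.000000\t+\t.\t'
    'gene_id "uc007aif.1"; transcript_id "uc007aif.1"; \n'
)
symbols = load_symbols(table)
annotated = add_gene_name(gtf_line, symbols)
print(annotated, end="")
```

Lines whose transcript ID is not in the table are passed through unchanged, so the same function can be applied to every line of a GTF file.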

Please contact us again at gen...@soe.ucsc.edu if you have any further questions. 
Questions sent to that address will be archived in a publicly-accessible forum for the benefit of other users.  If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.

--

Trakhtenberg, Feliks

Sep 11, 2014, 2:27:16 PM
to st...@soe.ucsc.edu, Jonathan Casper, gen...@soe.ucsc.edu

Thank you for the advice. I followed your instructions for UCSC Genes, equating transcript ID to gene symbol and RefSeq ID. There are cases where UCSC Genes transcripts do not have a corresponding transcript ID from RefSeq. Were these derived from other databases? If so, how could I get those IDs included in the output as well? Here is an example:

uc007aif.1 Slco5a1 NM_172841
uc007aig.1 Slco5a1 
uc007aih.1 Slco5a1 
uc007aii.1 Slco5a1 

 

Thank you,

Ephraim

 



Trakhtenberg, Feliks

Sep 11, 2014, 4:08:39 PM
to st...@soe.ucsc.edu, Jonathan Casper, gen...@soe.ucsc.edu

Actually, checking mRNA ID instead of RefSeq ID appears to output RefSeq and GenBank accession IDs, which between them cover all UCSC Genes transcripts.
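In case it helps others separate the two kinds of accession in that column, here is a small Python sketch. It relies on the convention that RefSeq accessions begin with two capital letters and an underscore (for example NM_ or NR_), and it simply assumes anything else is a GenBank accession. The function name and the second example accession are made up for illustration.

```python
import re

# RefSeq accessions start with two capital letters and an underscore,
# optionally followed by a version suffix such as ".2".
REFSEQ = re.compile(r"^[A-Z]{2}_\d+(\.\d+)?$")


def accession_source(accession):
    """Guess the source database of an mRNA accession from its format."""
    if not accession:
        return "none"
    if REFSEQ.match(accession):
        return "RefSeq"
    # Assumption: every non-RefSeq value in this column is from GenBank.
    return "GenBank (assumed)"


# NM_172841 is quoted earlier in this thread; AK000000 is a made-up placeholder.
sources = {acc: accession_source(acc) for acc in ["NM_172841", "AK000000", ""]}
print(sources)
```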

Thanks,

Ephraim



Trakhtenberg, Feliks

Sep 18, 2014, 10:59:52 AM
to st...@soe.ucsc.edu, Jonathan Casper, gen...@soe.ucsc.edu

Hello,

 

Do you know the expected date when the GENCODE M3 track will become available in the Table Browser?

I need it in order to use the intersection tool to identify UCSC Genes entries that do not overlap with GENCODE M3. I presume that uploading the GENCODE M3 GTF as a custom track would be problematic, because if it were that simple I would expect the track to be available in the Table Browser already. Is this so? Or might uploading it as a custom track work for my purposes?

 

Thank you,

Ephraim

 



Matthew Speir

Oct 1, 2014, 11:58:27 AM
to Trakhtenberg, Feliks, st...@soe.ucsc.edu, Jonathan Casper, gen...@soe.ucsc.edu
Hello Feliks,

Thank you for your questions about GENCODE Genes VM3 for mm10. The newest update to the GENCODE Genes tracks is currently in our quality assurance queue, pending review by one of our staff. A pre-release version of this track is available on our test server, http://genome-preview.ucsc.edu/cgi-bin/hgTrackUi?db=mm10&g=wgEncodeGencodeVM3. You can intersect this GENCODE Genes track with the UCSC Genes track available on our preview server by following the same steps discussed by my colleague Steve in previous emails, replacing any mention of VM2 with VM3. However, please keep in mind that this is our test server, and that the data has not undergone our standard quality review process and may be subject to change. We do hope to release this track to our public website at http://genome.ucsc.edu/ soon, but I do not have a projected date for when that might happen.

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group
--


Trakhtenberg, Feliks

Oct 1, 2014, 1:19:31 PM
to Matthew Speir, st...@soe.ucsc.edu, Jonathan Casper, gen...@soe.ucsc.edu

Thank you very much, Matthew!

Ephraim

 

