GTF file of Introns

2,194 views
Skip to first unread message

Andy Rampersaud

unread,
Feb 20, 2015, 12:40:29 PM2/20/15
to gen...@soe.ucsc.edu
Hello,

Is there a way to generate a GTF file of intron regions the same way there are GTF files of exon regions from the Table Browser? 

To provide some context: we have RNA-Seq samples.  We're using htseq-count to count the number of reads in exon regions.  htseq-count uses a GTF file that specifies the gene structure.  The GTF file we have contains either "exon" or "CDS".

Is there a way we can get a similarly formatted GTF file of introns?

Thanks,
Andy

--
Andy Rampersaud
Graduate Student, Bioinformatics
Waxman Lab, Boston University

Jonathan Casper

unread,
Feb 25, 2015, 12:34:31 PM2/25/15
to Andy Rampersaud, gen...@soe.ucsc.edu

Hello Andy,

Thank you for your question about generating GTF files of intron regions. There is a way to generate a GTF file that contains the intron regions of a track, but it will be a two-step process. Due to the mechanics of how GTF files work, the result will be a GTF file that picks out the intron regions of your data, but they will be called exons within the GTF file itself.

The basic overview of what we will do here is take your track data and use the Table Browser to create a custom track that just contains the introns from your data. Then we will use the Table Browser a second time to generate GTF output from that custom track.

1. Load your data file as a custom track.
2. Open the UCSC Table Browser at http://genome.ucsc.edu/cgi-bin/hgTracks and select your custom track.
3. Set up any region and identifier filters that you want to impose (or just set the region to "genome"), and set the output format to "custom track".
4. Click "get output".
5. On the next page, name your new custom track (e.g., "introns from my data") and select the box to create one BED record per "Introns plus 0 bases on each end".
6. Click "get custom track in table browser". You should now be returned to the main UCSC Table Browser page.
7. Select your new custom track from the menu (e.g., Group: Custom Tracks, Track: introns from my data).
8. Change the output format to "GTF - gene transfer format".
9. Click "get output".
The result should be a GTF file that describes the intron regions of your data.

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu or genome...@soe.ucsc.edu. Questions sent to those addresses will be archived in publicly-accessible forums for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group


--


Andy Rampersaud

unread,
Mar 3, 2015, 12:27:31 PM3/3/15
to Jonathan Casper, gen...@soe.ucsc.edu
Hi Jonathan,

Thank you for your answer, I'll have the test the process soon.  I do have a related question about the format of GTF files for RefSeq genes.  I've attached a screen shot to summarize my parameters for the Table Browser:

Inline image 1

I load this file into R and run the following:

#Get all the unique accession numbers:
#length(unique(Gene_List$"name"))
#[1] 33555
#There are 33,555 unique accession numbers

Next I went back to the Table Browser and downloaded the same "refGene" table but as a GTF file.  The first thing I noticed is that the gene symbol is not present in the file.  It looks like accession numbers are in place of "gene_id" and "transcript_id".  Loading this file in R:

#length(unique(RefSeq_GTF$"gene_id"))
#[1] 33599
#length(unique(RefSeq_GTF$"transcript_id"))
#[1] 34599


Question 1: Why are there different numbers of unique accession numbers even though I'm only changing the output file format (first downloading all fields, then downloading as a GTF file)?

Question 2: Is there a way I can get a GTF file of RefSeq genes with gene symbols as they show up in the refGene table?  We need these gene symbols when we summarize our RNA-Seq read counts.

Thanks,
Andy



Jonathan Casper

unread,
Mar 4, 2015, 7:03:40 PM3/4/15
to Andy Rampersaud, gen...@soe.ucsc.edu

Hello Andy,

The GTF output from the UCSC Table Browser is a bit limited, in that it re-uses the transcript ID to fill the gene ID column. You can obtain a GTF file with actual gene identifiers (from the "name2" field of the refGene table) by using the genePredToGtf utility as described at http://genomewiki.ucsc.edu/index.php/Genes_in_gtf_or_gff_format.

Regarding the counts you found for unique identifiers, I have a few comments. I'm not sure how R arrived at the figure of 33555 unique names in the table dump - that does not match what I see in the data. The number should be 33599. The reason that there are an extra 1000 transcript IDs (for a total of 34599) is that GTF transcript identifiers are required to be unique. When we have mapped a transcript to multiple locations in the genome (e.g., NM_001025241), then the Table Browser must generate additional transcript identifiers for the alternate locations in GTF output. Those identifiers take the form of NM_001025241_dup1, NM_001025241_dup2, and so on. The Table Browser does not generate alternate gene names. For contrast, the output of genePredToGtf contains 34675 distinct transcript names - more than the output from the UCSC Table Browser. In this case, it looks like there are some transcripts for which genePredToGtf gave alternate identifiers while the Table Browser (for some reason) did not.

Overall, we recommend using the genePredToGtf output.

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu or genome...@soe.ucsc.edu. Questions sent to those addresses will be archived in publicly-accessible forums for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group

Andy Rampersaud

unread,
Mar 18, 2015, 12:02:24 PM3/18/15
to Jonathan Casper, gen...@soe.ucsc.edu
Hi Jonathan,

Thank you for your explanation and suggestion to use the genePredToGtf utility. 

I just wanted to follow up on the issue of unique accession number counts.  I'm confident that R is correctly counting the number of unique accession numbers.  I think the issue is the refGene table (when downloaded from the UCSC Table Browser) is missing these 44 accession numbers.  I did a comparison between the Table_List of accession numbers and GTF_List of accession numbers:

#length(intersect(Table_List,GTF_List))
#[1] 33555
#length(setdiff(Table_List,GTF_List))
#[1] 0
#length(setdiff(GTF_List,Table_List))
#[1] 44

This indicates the Table_List is a subset of the GTF_List.  I looked at some examples: NM_001305046 ("Uap1") and NM_001305059 ("Tex35").  These genes are present in both lists - just under different accession numbers.  I'm guessing these 44 accession numbers maybe out-dated and were re-assigned.

My preference for the 33555 unique accession numbers stems from downstream analysis completed where I downloaded the exon and intron locations from the Table Browser.  I ended up making my own GTF file with these accession numbers and did a table merge to get the gene symbols in the GTF file (hopefully these are all valid operations).  Can I confirm that there are 24,197 unique gene_symbols in RefSeq?

Andy

Jonathan Casper

unread,
Mar 26, 2015, 6:30:49 PM3/26/15
to Andy Rampersaud, gen...@soe.ucsc.edu

Hello Andy,

Sorry for the delay on this. The number of entries in the refGene table is steadily changing, as we regularly incorporate new additions to the RefSeq library of transcripts. Right now, we count 24327 distinct gene symbols in the refGene table for the mm9 assembly. Your count of 24197 sounds close enough to be accurate if you downloaded your dataset some time ago (as seems likely given the difference in transcript counts).

Using the refGene table dump as an index to insert gene symbols into your GTF file sounds like it should work just fine; the transcript names in each file are coming from the same data source anyway.

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu or genome...@soe.ucsc.edu. Questions sent to those addresses will be archived in publicly-accessible forums for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group

Reply all
Reply to author
Forward
0 new messages