How to get a GENCODE v24 GTF file for hg38 using genePredToGtf

Yung-Chih Lai

Jan 12, 2017, 3:59:54 PM
to UCSC Genome Browser Discussion List

Hi,

 

I only know how to get a refGene GTF file using the command line below. Could you tell me how to get a GENCODE v24 GTF file for hg38 using genePredToGtf? In addition, how can I select the basic or comprehensive annotations? Can I use the Ensembl transcript ID, rather than the gene symbol, as the gene ID? Many thanks.

 

genePredToGtf -source=UCSChg38refGene20170102 -utr -addComments hg38 refGene hg38refGeneUtrComment.gtf

 

Best,

 

Gary

Chris Villarreal

Jan 13, 2017, 5:16:16 PM
to Yung-Chih Lai, UCSC Genome Browser Discussion List
Dear Gary,

Thank you for your question about the UCSC Genome Browser. To produce a GENCODE v24 GTF file for hg38 with genePredToGtf on the command line, use the wgEncodeGencodeBasicV24 table for the basic annotation set and/or the wgEncodeGencodeCompV24 table for the comprehensive set. The knownGene table contains the same gene models as wgEncodeGencodeCompV24, but with UCSC identifiers that relate to our internal schema. Below is an example genePredToGtf command for the comprehensive set of the ALL GENCODE V24 track for hg38:

genePredToGtf hg38 wgEncodeGencodeCompV24 hg38FileTest.gtf

Here is sample output:

chr1 wgEncodeGencodeCompV24 transcript 17369 17436 . - . gene_id "MIR6859-1"; transcript_id "ENST00000619216.1"; gene_name "MIR6859-1";
chr1 wgEncodeGencodeCompV24 exon 17369 17436 . - . gene_id "MIR6859-1"; transcript_id "ENST00000619216.1"; exon_number "1"; exon_id "ENST00000619216.1.1"; gene_name "MIR6859-1";
chr1 wgEncodeGencodeCompV24 transcript 29554 31097 . + . gene_id "RP11-34P13.3"; transcript_id "ENST00000473358.1"; gene_name "RP11-34P13.3";
chr1 wgEncodeGencodeCompV24 exon 29554 30039 . + . gene_id "RP11-34P13.3"; transcript_id "ENST00000473358.1"; exon_number "1"; exon_id "ENST00000473358.1.1"; gene_name "RP11-34P13.3";
chr1 wgEncodeGencodeCompV24 exon 30564 30667 . + . gene_id "RP11-34P13.3"; transcript_id "ENST00000473358.1"; exon_number "2"; exon_id "ENST00000473358.1.2"; gene_name "RP11-34P13.3";
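
Because the basic and comprehensive sets differ only in the table name, both files can be written in one loop. The following is a minimal sketch, not a tested recipe: it assumes genePredToGtf is installed and can reach a UCSC MySQL database (typically configured through ~/.hg.conf). The output file names are placeholders, and the optional -utr flag is carried over from the command in the original question.

```shell
# Sketch: write one GTF per GENCODE V24 table named in this thread.
# Assumptions: genePredToGtf is on PATH and database access is configured;
# the output file names below are placeholders, not UCSC conventions.
for table in wgEncodeGencodeBasicV24 wgEncodeGencodeCompV24; do
    out="hg38.${table}.gtf"
    if command -v genePredToGtf >/dev/null 2>&1; then
        # -utr (from the original question) adds UTR features; drop it if unwanted.
        genePredToGtf -utr hg38 "$table" "$out"
    else
        echo "genePredToGtf not found; would have written $out" >&2
    fi
done
```

Use only wgEncodeGencodeBasicV24 in the loop if the basic set is all that is needed.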

After writing your GTF file (e.g., hg38FileTest.gtf) with the utility, you can use the following Perl one-liner to replace each gene_id value with the Ensembl transcript ID, which is stored in the transcript_id attribute:

perl -wpe 's/gene_id "[^"]+"; transcript_id "([^"]+)"/gene_id "$1"; transcript_id "$1"/;' hg38FileTest.gtf > file.ENST.gtf
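
To see what this substitution does without running the whole pipeline, it can be tried on a single line. The sketch below rebuilds the first sample line from this thread as a tab-separated record and applies the same pattern with sed -E instead of Perl. The file names sample.gtf and sample.ENST.gtf are placeholders chosen for illustration.

```shell
# One sample GTF line in the shape shown in this thread (tab-separated fields).
printf 'chr1\twgEncodeGencodeCompV24\ttranscript\t17369\t17436\t.\t-\t.\tgene_id "MIR6859-1"; transcript_id "ENST00000619216.1"; gene_name "MIR6859-1";\n' > sample.gtf

# Same substitution as the Perl one-liner: copy the transcript_id value
# into gene_id and leave every other attribute untouched.
sed -E 's/gene_id "[^"]+"; transcript_id "([^"]+)"/gene_id "\1"; transcript_id "\1"/' sample.gtf > sample.ENST.gtf

# Show the rewritten line.
cat sample.ENST.gtf
```

The gene symbol is still available afterwards in the gene_name attribute, which the substitution does not touch.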

Note: Although it is possible to use the Table Browser (https://genome.ucsc.edu/cgi-bin/hgTables) to output a limited version of GTF format for certain tables, we recommend using the genePredToGtf utility.
Please also see this previously answered mailing list question, which describes the table names and organization for the "ALL GENCODE V24" sets: https://groups.google.com/a/soe.ucsc.edu/forum/#!msg/genome/Oj41ZcVXyOc/nt0qTJ8C5_gJ

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

-Chris V
UCSC Genome Browser
