Dear Gary,
Thank you for your question about the UCSC Genome Browser. To generate a GENCODE v24 GTF file for hg38 with the genePredToGtf command-line utility, use the wgEncodeGencodeBasicV24 and/or wgEncodeGencodeCompV24 tables. The knownGene table contains the same gene models as wgEncodeGencodeCompV24, but with UCSC identifiers that relate to our internal schema. Below is an example genePredToGtf command for the GENCODE V24 comprehensive set on hg38:
genePredToGtf hg38 wgEncodeGencodeCompV24 hg38FileTest.gtf
Here is sample output:
chr1 wgEncodeGencodeCompV24 transcript 17369 17436 . - . gene_id "MIR6859-1"; transcript_id "ENST00000619216.1"; gene_name "MIR6859-1";
chr1 wgEncodeGencodeCompV24 exon 17369 17436 . - . gene_id "MIR6859-1"; transcript_id "ENST00000619216.1"; exon_number "1"; exon_id "ENST00000619216.1.1"; gene_name "MIR6859-1";
chr1 wgEncodeGencodeCompV24 transcript 29554 31097 . + . gene_id "RP11-34P13.3"; transcript_id "ENST00000473358.1"; gene_name "RP11-34P13.3";
chr1 wgEncodeGencodeCompV24 exon 29554 30039 . + . gene_id "RP11-34P13.3"; transcript_id "ENST00000473358.1"; exon_number "1"; exon_id "ENST00000473358.1.1"; gene_name "RP11-34P13.3";
chr1 wgEncodeGencodeCompV24 exon 30564 30667 . + . gene_id "RP11-34P13.3"; transcript_id "ENST00000473358.1"; exon_number "2"; exon_id "ENST00000473358.1.2"; gene_name "RP11-34P13.3";
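One practical note on the genePredToGtf command: the utility reads the table from a MySQL database, so it needs connection settings in an hg.conf file. The following is a minimal sketch, assuming you want to query the public UCSC MySQL server rather than a local mirror and that you do not already have a ~/.hg.conf (the snippet leaves an existing file untouched). Please check the values against the current public MySQL server documentation before relying on them.

```shell
# Assumption: connection settings for the public UCSC MySQL server.
# Skip the write if a ~/.hg.conf already exists, so nothing is overwritten.
conf="$HOME/.hg.conf"
if [ ! -e "$conf" ]; then
  cat > "$conf" <<'EOF'
db.host=genome-mysql.soe.ucsc.edu
db.user=genomep
db.password=password
central.db=hgcentral
EOF
  # The utilities expect this file to be readable only by its owner.
  chmod 600 "$conf"
fi
```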
After writing your GTF file (e.g., hg38FileTest.gtf) with the utility, you can use the following Perl one-liner to replace the value of gene_id, which holds the gene name in this output, with the Ensembl transcript ID stored in transcript_id:
perl -wpe 's/gene_id "[^"]+"; transcript_id "([^"]+)"/gene_id "$1"; transcript_id "$1"/;' hg38FileTest.gtf > file.ENST.gtf
Note: Although it is possible to use the Table Browser (https://genome.ucsc.edu/cgi-bin/hgTables) to output a limited version of GTF format for certain tables, we recommend using the genePredToGtf utility.
I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.
-Chris V
UCSC Genome Browser