GTF files for Human data hg38

neeraja M

Jul 20, 2015, 3:32:21 PM
to gen...@soe.ucsc.edu
Hi UCSC team,

I need to download a GTF file for human hg38 data. I used the Table Browser at http://genome.ucsc.edu/cgi-bin/hgTables to download the files, with the following options:

Clade: Mammal
genome : human
assembly: hg38
group: Genes and Gene predictions
track: RefSeq genes
table: refGene
region: genome
output format: GTF - Gene Transfer format
output file: hg38_refGene.gtf

I then clicked the "get output" button and got a file containing the following lines:
chr1    hg38_refGene    exon    67092176        67093604        0.000000        -       .       gene_id "NR_075077"; transcript_id "NR_075077"; 
chr1    hg38_refGene    exon    67096252        67096321        0.000000        -       .       gene_id "NR_075077"; transcript_id "NR_075077"; 
chr1    hg38_refGene    exon    67103238        67103382        0.000000        -       .       gene_id "NR_075077"; transcript_id "NR_075077"; 
chr1    hg38_refGene    exon    67111577        67111644        0.000000        -       .       gene_id "NR_075077"; transcript_id "NR_075077"; 
chr1    hg38_refGene    exon    67113614        67113756        0.000000        -       .       gene_id "NR_075077"; transcript_id "NR_075077"; 
chr1    hg38_refGene    exon    67115352        67115464        0.000000        -       .       gene_id "NR_075077"; transcript_id "NR_075077"; 

Here, gene_id and transcript_id have the same value (the RefSeq accession).
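
As a quick check of this observation, a minimal Python sketch could parse the attribute column of one line and compare the two IDs. It assumes the downloaded file is tab-separated, as GTF files are; the function name parse_attributes is only illustrative.

```python
import re

# Matches GTF attribute pairs such as: gene_id "NR_075077";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')

def parse_attributes(gtf_line):
    """Return the attribute column (field 9) of one GTF line as a dict."""
    fields = gtf_line.rstrip("\n").split("\t")
    return dict(ATTR_RE.findall(fields[8]))

# First line of the output above, written with explicit tabs.
line = ('chr1\thg38_refGene\texon\t67092176\t67093604\t0.000000\t-\t.\t'
        'gene_id "NR_075077"; transcript_id "NR_075077"; ')
attrs = parse_attributes(line)
print(attrs["gene_id"] == attrs["transcript_id"])  # prints True
```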

When I use the options below,

Clade: Mammal
genome : human
assembly: hg38
group: Genes and Gene predictions
track: RefSeq genes
table: refFlat
region: genome
output format: GTF - Gene Transfer format
output file: hg38_refFlat.gtf

I get the following output:

chr1    hg38_refFlat    exon    11874   12227   0.000000        +       .       gene_id "DDX11L1"; transcript_id "DDX11L1"; 
chr1    hg38_refFlat    exon    12613   12721   0.000000        +       .       gene_id "DDX11L1"; transcript_id "DDX11L1"; 
chr1    hg38_refFlat    exon    13221   14409   0.000000        +       .       gene_id "DDX11L1"; transcript_id "DDX11L1"; 
chr1    hg38_refFlat    exon    14362   14829   0.000000        -       .       gene_id "WASH7P"; transcript_id "WASH7P"; 
chr1    hg38_refFlat    exon    14970   15038   0.000000        -       .       gene_id "WASH7P"; transcript_id "WASH7P"; 
chr1    hg38_refFlat    exon    15796   15947   0.000000        -       .       gene_id "WASH7P"; transcript_id "WASH7P"; 
chr1    hg38_refFlat    exon    16607   16765   0.000000        -       .       gene_id "WASH7P"; transcript_id "WASH7P"; 
chr1    hg38_refFlat    exon    16858   17055   0.000000        -       .       gene_id "WASH7P"; transcript_id "WASH7P"; 
chr1    hg38_refFlat    exon    17233   17368   0.000000        -       .       gene_id "WASH7P"; transcript_id "WASH7P"; 
chr1    hg38_refFlat    exon    17606   17742   0.000000        -       .       gene_id "WASH7P"; transcript_id "WASH7P"; 
chr1    hg38_refFlat    exon    17915   18061   0.000000        -       .       gene_id "WASH7P"; transcript_id "WASH7P"; 

Here too, gene_id and transcript_id have the same value, which this time is the gene symbol.

If I want the gene symbol as the value of the gene_id tag and the RefSeq transcript ID as the value of the transcript_id tag, which options should I select? Please guide me on this. Thanks in advance.


--


Regards,

Neeraja M

Jonathan Casper

Jul 20, 2015, 7:48:26 PM
to neeraja M, gen...@soe.ucsc.edu

Hello Neeraja,

Thank you for your question about obtaining GTF output from the UCSC Table Browser. The GTF output options for the UCSC Table Browser are quite limited, and it does not have the ability to create GTF output as you request. We suggest that instead you use our command-line tool genePredToGtf, which generates GTF files with appropriate transcript IDs and gene symbols.

More information on using genePredToGtf is available on our wiki at http://genomewiki.ucsc.edu/index.php/Genes_in_gtf_or_gff_format. You can find precompiled versions of the genePredToGtf program at http://hgdownload.soe.ucsc.edu/admin/exe/. If your computer architecture is not among those listed in that directory, you can also download the source code file userApps.src.tgz and compile the program yourself. To run the program on the hg38 refGene table, you will need to use the command

   genePredToGtf hg38 refGene refGene.gtf

instead of the one listed on the wiki page, which retrieves the hg19 knownGene table.
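
To illustrate the kind of mapping involved, here is a rough Python sketch of how one genePred-style record (such as a refGene row, where the name field holds the transcript accession and name2 holds the gene symbol) could be turned into GTF exon lines. This is not the genePredToGtf implementation and does not reproduce its exact output; the function name is hypothetical, and "TX_EXAMPLE" is a placeholder rather than a real accession. The one firm detail is the coordinate shift: genePred starts are 0-based, while GTF starts are 1-based.

```python
def genepred_to_gtf_exons(name, name2, chrom, strand,
                          exon_starts, exon_ends, source="refGene"):
    """Yield GTF exon lines for one genePred-style record.

    genePred coordinates are 0-based, half-open; GTF coordinates are
    1-based, inclusive. Each start is therefore shifted by one and each
    end is kept as-is.
    """
    for start, end in zip(exon_starts, exon_ends):
        # Gene symbol goes in gene_id, transcript accession in transcript_id.
        attrs = f'gene_id "{name2}"; transcript_id "{name}";'
        yield "\t".join([chrom, source, "exon", str(start + 1), str(end),
                         ".", strand, ".", attrs])

# Placeholder transcript ID; exon coordinates chosen to match the
# DDX11L1 lines in the question after the 0-based to 1-based shift.
for gtf_line in genepred_to_gtf_exons("TX_EXAMPLE", "DDX11L1", "chr1", "+",
                                      [11873, 12612], [12227, 12721]):
    print(gtf_line)
```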

If you are unable to run the genePredToGtf tool, then you may be able to use the online text manipulation tools at Galaxy (https://usegalaxy.org) to edit the original GTF output from the UCSC Table Browser. Here is a mailing list question that discusses adding gene symbols with Galaxy: https://groups.google.com/a/soe.ucsc.edu/d/topic/genome/T5UN1mt79Tc/discussion.
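
The same kind of edit can also be sketched in a few lines of Python instead of Galaxy. The sketch below assumes you have separately obtained a lookup from transcript accession to gene symbol (for example, from the name and name2 fields of the refGene table); the function name add_gene_symbols and the symbol "GENE_A" are hypothetical placeholders, not real values.

```python
import re

# Matches only the gene_id attribute, not transcript_id.
GENE_ID_RE = re.compile(r'gene_id "([^"]*)";')

def add_gene_symbols(gtf_lines, symbol_by_transcript):
    """Replace each gene_id value with its gene symbol when one is known.

    Lines whose gene_id has no entry in the lookup are left unchanged.
    """
    def swap(match):
        accession = match.group(1)
        symbol = symbol_by_transcript.get(accession, accession)
        return f'gene_id "{symbol}";'

    for line in gtf_lines:
        yield GENE_ID_RE.sub(swap, line, count=1)

# Hypothetical lookup: "GENE_A" stands in for the real gene symbol.
lookup = {"NR_075077": "GENE_A"}
example = ('chr1\thg38_refGene\texon\t67092176\t67093604\t0.000000\t-\t.\t'
           'gene_id "NR_075077"; transcript_id "NR_075077"; ')
for fixed in add_gene_symbols([example], lookup):
    print(fixed)
```

The transcript_id attribute is left alone, so the result keeps the RefSeq accession there while gene_id carries the symbol.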

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu or genome...@soe.ucsc.edu. Questions sent to those addresses will be archived in publicly-accessible forums for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group

