Same gene_id and transcript_id in GTF annotation file for hg19


Phalchandra Venkatesh

Jun 14, 2017, 11:05:08 AM
to gen...@soe.ucsc.edu
Hello,

I'm interested in getting a GTF file for hg19 in which gene_id and transcript_id are not the same, as I want to run htseq-count.

I tried all the steps mentioned in this link to get a GTF file for hg19 knownGene:

However, I'm still unable to get a GTF file in which gene_id and transcript_id differ.

Can you please help me get a GTF file in the right format?

Thanks & Regards,
Phalchandra Venkatesh

Christopher Lee

Jun 15, 2017, 6:52:29 PM
to Phalchandra Venkatesh, UCSC Genome Browser Discussion List

Hi Phalchandra,

Thank you for your question about obtaining a valid GTF file from the knownGene table. You will need to install MySQL and query our public MySQL server if you would like to get a genePred that will result in a GTF file with non-matching gene_id and transcript_id fields. After installing MySQL, here is a command that will result in a genePred that you can use with genePredToGtf:

$ mysql --host=genome-mysql.soe.ucsc.edu --user=genome -Ne "select a.name, a.chrom, a.strand, a.txStart, a.txEnd,\
a.cdsStart, a.cdsEnd, a.exonCount, a.exonStarts, a.exonEnds, 0 as score, b.geneSymbol from knownGene a join \
kgXref b on a.name=b.kgID" hg19 > hg19.genePred

That will result in a genePred file like the following:
uc001aaa.3    chr1    +    11873    14409    11873    11873    3    11873,12612,13220,    12227,12721,14409,    0    DDX11L1
uc010nxr.1    chr1    +    11873    14409    11873    11873    3    11873,12645,13220,    12227,12697,14409,    0    DDX11L1
uc010nxq.1    chr1    +    11873    14409    12189    13639    3    11873,12594,13402,    12227,12721,14409,    0    DDX11L1
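
Before running the conversion, it can help to sanity-check the query output. The following is a minimal sketch and not part of the UCSC tools: `check_genepred` is a hypothetical helper name, and it assumes the tab-separated, 12-column layout produced by the query above (mysql writes tabs between columns in batch mode). It reports lines with the wrong column count or an exonCount that disagrees with the exon lists.

```shell
# Hypothetical helper (not a UCSC utility): flag malformed genePred lines.
# Assumes tab-separated input with the 12 columns selected by the query above.
check_genepred() {
  awk -F'\t' '
    BEGIN { bad = 0 }
    NF != 12 {
      printf "line %d: expected 12 columns, found %d\n", NR, NF
      bad = 1
      next
    }
    {
      # exonStarts and exonEnds are comma-separated lists with a trailing comma
      starts = $9; sub(/,$/, "", starts)
      ends = $10; sub(/,$/, "", ends)
      n_starts = split(starts, s, ",")
      n_ends = split(ends, e, ",")
      if (n_starts != $8 + 0 || n_ends != $8 + 0) {
        printf "line %d: exonCount %s does not match exon lists\n", NR, $8
        bad = 1
      }
    }
    END { exit bad }
  ' "$@"
}

# Demo on the first record shown above; a well-formed line prints nothing.
printf 'uc001aaa.3\tchr1\t+\t11873\t14409\t11873\t11873\t3\t11873,12612,13220,\t12227,12721,14409,\t0\tDDX11L1\n' | check_genepred
```

On the real output this would be run as `check_genepred hg19.genePred`; no output and a zero exit status suggest the file has the expected shape.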

You can then pass that file to genePredToGtf using the file option:
genePredToGtf file hg19.genePred hg19.knownGene.gtf

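The resulting GTF will have distinct gene_id and transcript_id fields (start coordinates shift by one, e.g. 11873 becomes 11874, because genePred starts are 0-based and GTF starts are 1-based):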
chr1    hg19.genePred    transcript    11874    14409    .    +    .    gene_id "DDX11L1"; transcript_id "uc001aaa.3";  gene_name "DDX11L1";
chr1    hg19.genePred    exon    11874    12227    .    +    .    gene_id "DDX11L1"; transcript_id "uc001aaa.3"; exon_number "1"; exon_id "uc001aaa.3.1"; gene_name "DDX11L1";
chr1    hg19.genePred    exon    12613    12721    .    +    .    gene_id "DDX11L1"; transcript_id "uc001aaa.3"; exon_number "2"; exon_id "uc001aaa.3.2"; gene_name "DDX11L1";
chr1    hg19.genePred    exon    13221    14409    .    +    .    gene_id "DDX11L1"; transcript_id "uc001aaa.3"; exon_number "3"; exon_id "uc001aaa.3.3"; gene_name "DDX11L1";
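
To confirm that the converted file no longer has matching identifiers, a quick check along these lines may be useful. This is a sketch only: `count_same_ids` is a hypothetical helper name, and it assumes the attributes sit in the ninth tab-separated column in the `gene_id "..."; transcript_id "...";` form shown above.

```shell
# Hypothetical helper: count GTF lines whose gene_id equals their transcript_id.
count_same_ids() {
  awk -F'\t' '
    /^#/ { next }    # skip comment lines
    {
      gene = ""; tx = ""
      # Pull the quoted value that follows each attribute name in column 9
      if (match($9, /gene_id "[^"]*"/)) { gene = substr($9, RSTART + 9, RLENGTH - 10) }
      if (match($9, /transcript_id "[^"]*"/)) { tx = substr($9, RSTART + 15, RLENGTH - 16) }
      if (gene != "" && gene == tx) { same++ }
    }
    END { print same + 0 }
  ' "$@"
}

# Demo on one exon line from the output above
printf 'chr1\thg19.genePred\texon\t11874\t12227\t.\t+\t.\tgene_id "DDX11L1"; transcript_id "uc001aaa.3"; exon_number "1";\n' | count_same_ids
```

Running `count_same_ids hg19.knownGene.gtf` on the converted file would be expected to print 0, or a very small number, if the gene symbols were picked up correctly. This matters for htseq-count because it groups exon features by the gene_id attribute by default.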

Please let us know if you have any further questions!

Thank you again for your inquiry and using the UCSC Genome Browser. If
you have any further questions, please reply to gen...@soe.ucsc.edu.
All messages sent to that address are archived on a
publicly-accessible forum. If your question includes sensitive data,
you may send it instead to genom...@soe.ucsc.edu.

Christopher Lee
UCSC Genomics Institute

