GTF get more extended file


Britta M

Jul 13, 2017, 11:26:12 AM
to gen...@soe.ucsc.edu
Dear all,

I would like to download a UCSC GTF file that contains all feature categories, such as exon, CDS, UTRs, intron and gene. I figured out that I can make custom tracks for most of these and then later merge the files into one that contains most of the categories.

However, I would also like to have "gene" as a feature, and to have exon counts or exon numbers (e.g. "exon 1") somewhere in the output.
Is this possible?
 
I saw some similar questions and answers here:
1. https://www.biostars.org/p/13290/ and in your group here:

"The basic overview of what we will do here is take your track data and use the Table Browser to create a custom track that just contains the introns from your data. Then we will use the Table Browser a second time to generate GTF output from that custom track.

1. Load your data file as a custom track.
2. Open the UCSC Table Browser at http://genome.ucsc.edu/cgi-bin/hgTables and select your custom track.
3. Set up any region and identifier filters that you want to impose (or just set the region to "genome"), and set the output format to "custom track".
4. Click "get output".
5. On the next page, name your new custom track (e.g., "introns from my data") and select the box to create one BED record per "Introns plus 0 bases on each end".
6. Click "get custom track in table browser". You should now be returned to the main UCSC Table Browser page.
7. Select your new custom track from the menu (e.g., Group: Custom Tracks, Track: introns from my data).
8. Change the output format to "GTF - gene transfer format".
9. Click "get output".
The result should be a GTF file that describes the intron regions of your data."

This works fine, but my question is: how do I get a GTF file with only "gene" as a feature (not exon or CDS)? If I choose whole gene, I get the two features "exon" and "CDS". This is not what I want.

2. One other question, about this answer: https://genome.ucsc.edu/FAQ/FAQdownloads.html#download37
Is there a genePred file for any organism? Can you tell me how to navigate to this output format?

Thanks a lot!!

Cath Tyner

Jul 13, 2017, 6:33:18 PM
to Britta M, UCSC Genome Browser Public Help Forum

Hi Britta,

Thank you for contacting the UCSC Genome Browser support team. I'm not quite sure what you are asking. In order to best support you, could you provide more information? Can you respond to this forum with the following information:

1. Please list exactly which fields you would like to see in your output. If you can provide an example of the output file you would like, that would be best. 


2. Please describe which assembly and which track you would like to get output for.

I'm not sure if this answers your question, but here is a previously answered question which described how to select fields from related tables for Table Browser output, such as gene alias and exon count.

Also, here is documentation about the genePred file format. Most of the gene track tables are in genePred format. If you follow the steps in the previously answered question above, you will see an example of how to select and remove certain fields from your Table Browser output, including fields such as gene alias.

If your response contains confidential information, you can reply directly to me. Otherwise, please respond to this forum with more information so that our support team can provide specific help for you.

Please respond to this list if you have further questions!

Please send new and follow-up questions to one of our UCSC Genome Browser mailing lists below:

  * Post to the Public Help Forum: Email gen...@soe.ucsc.edu or search the Public Archives
  * Post to the Mirror Help Forum: Email genome...@soe.ucsc.edu or search the Mirror Archives
  * Confidential/private help: Email genom...@soe.ucsc.edu

UCSC Genome Browser Announcements List (email alerts for new data & software):
  * Subscribe: Email genome-announce+subscribe...@soe.ucsc.edu
  * Unsubscribe: Email genome-announce+unsubscri...@soe.ucsc.edu

Join us on Social Media! Facebook, Twitter, Wordpress Blog, YouTube

Enjoy,
Cath
. . .
Cath Tyner
UCSC Genome Browser, Software QA & User Support
UC Santa Cruz Genomics Institute



Cath Tyner

Jul 14, 2017, 7:38:32 PM
to Britta M, UCSC Genome Browser Public Help Forum
Hello again Britta,

Thank you for writing back and including clarification on your question. I will provide some steps below which I hope will help you meet your goal. If I have misunderstood your goal, or if you have further questions, please feel free to write back again!

Converting genePred to GTF
You can convert genePred to GTF by following the instructions here: http://genomewiki.ucsc.edu/index.php/Genes_in_gtf_or_gff_format

Once you have downloaded the genePredToGtf utility (and given it execute permissions), you can run it without arguments to print the usage statement:

$ genePredToGtf

You can then run the following command, which connects to the UCSC Genome Browser database tables and writes the ensGene table for gasAcu1 to ensGene.gtf:

$ genePredToGtf gasAcu1 ensGene ensGene.gtf

The first lines of the output look like this:

$ head ensGene.gtf
chrI    ensGene    transcript    2098    9910    .    -    .    gene_id "ENSGACG00000004395.1"; transcript_id "ENSGACT00000005823.1";  gene_name "ENSGACG00000004395.1";
chrI    ensGene    exon    2098    2414    .    -    .    gene_id "ENSGACG00000004395.1"; transcript_id "ENSGACT00000005823.1"; exon_number "1"; exon_id "ENSGACT00000005823.1.1"; gene_name "ENSGACG00000004395.1";
chrI    ensGene    CDS    2101    2414    .    -    2    gene_id "ENSGACG00000004395.1"; transcript_id "ENSGACT00000005823.1"; exon_number "1"; exon_id "ENSGACT00000005823.1.1"; gene_name "ENSGACG00000004395.1";
chrI    ensGene    exon    3555    3701    .    -    .    gene_id "ENSGACG00000004395.1"; transcript_id "ENSGACT00000005823.1"; exon_number "2"; exon_id "ENSGACT00000005823.1.2"; gene_name "ENSGACG00000004395.1";
chrI    ensGene    CDS    3555    3701    .    -    2    gene_id "ENSGACG00000004395.1"; transcript_id "ENSGACT00000005823.1"; exon_number "2"; exon_id "ENSGACT00000005823.1.2"; gene_name "ENSGACG00000004395.1";
chrI    ensGene    exon    4140    4191    .    -    .    gene_id "ENSGACG00000004395.1"; transcript_id "ENSGACT00000005823.1"; exon_number "3"; exon_id "ENSGACT00000005823.1.3"; gene_name "ENSGACG00000004395.1";
chrI    ensGene    CDS    4140    4191    .    -    0    gene_id "ENSGACG00000004395.1"; transcript_id "ENSGACT00000005823.1"; exon_number "3"; exon_id "ENSGACT00000005823.1.3"; gene_name "ENSGACG00000004395.1";
chrI    ensGene    exon    4196    4284    .    -    .    gene_id "ENSGACG00000004395.1"; transcript_id "ENSGACT00000005823.1"; exon_number "4"; exon_id "ENSGACT00000005823.1.4"; gene_name "ENSGACG00000004395.1";
chrI    ensGene    CDS    4196    4284    .    -    2    gene_id "ENSGACG00000004395.1"; transcript_id "ENSGACT00000005823.1"; exon_number "4"; exon_id "ENSGACT00000005823.1.4"; gene_name "ENSGACG00000004395.1";
chrI    ensGene    exon    4367    4515    .    -    .    gene_id "ENSGACG00000004395.1"; transcript_id "ENSGACT00000005823.1"; exon_number "5"; exon_id "ENSGACT00000005823.1.5"; gene_name "ENSGACG00000004395.1";
Note that the gene name is included in the GTF output because this ensGene table is in the genePred extended format (it has fields beyond the basic genePred format).

This gtf file of the ensGene table for gasAcu1 is available for you to download here:
http://hgwdev.cse.ucsc.edu/~cath/temp/gasAcu1ensGene.gtf

Or, if you only want chrUn for gasAcu1, use this file:
http://hgwdev.cse.ucsc.edu/~cath/temp/ensGeneChrUnOnly.gtf
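
The genePredToGtf output above contains "transcript", "exon" and "CDS" lines but no "gene" lines. If you need a "gene" feature, one option is to derive it yourself from the GTF file by taking the smallest start and largest end seen for each gene_id. The Python sketch below only illustrates that idea and is not a UCSC utility: the function name gene_lines is made up for this example, it assumes tab-separated GTF lines that carry a gene_id attribute, and it groups records by chromosome, strand and gene_id.

```python
import re

GENE_ID = re.compile(r'gene_id "([^"]+)"')

def gene_lines(gtf_lines):
    """Build one synthetic "gene" GTF line per (chrom, strand, gene_id)."""
    spans = {}  # (chrom, strand, gene_id) -> [source, min_start, max_end]
    for line in gtf_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) < 9:
            continue  # skip comments and lines without 9 GTF columns
        match = GENE_ID.search(cols[8])
        if match is None:
            continue  # no gene_id attribute, nothing to group on
        key = (cols[0], cols[6], match.group(1))
        start, end = int(cols[3]), int(cols[4])
        if key not in spans:
            spans[key] = [cols[1], start, end]
        else:
            spans[key][1] = min(spans[key][1], start)
            spans[key][2] = max(spans[key][2], end)
    out = []
    for (chrom, strand, gene_id), (source, start, end) in spans.items():
        attrs = 'gene_id "%s";' % gene_id
        out.append("\t".join([chrom, source, "gene", str(start), str(end),
                              ".", strand, ".", attrs]))
    return out

# Two records abbreviated from the ensGene.gtf output shown above.
example = [
    'chrI\tensGene\ttranscript\t2098\t9910\t.\t-\t.\t'
    'gene_id "ENSGACG00000004395.1"; transcript_id "ENSGACT00000005823.1";',
    'chrI\tensGene\texon\t2098\t2414\t.\t-\t.\t'
    'gene_id "ENSGACG00000004395.1"; transcript_id "ENSGACT00000005823.1"; exon_number "1";',
]
for gene_line in gene_lines(example):
    print(gene_line)
```

To apply this to a real file, you could pass an open file handle for ensGene.gtf instead of the example list and append the returned lines to the original file.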

Table Browser output examples
If the above gtf file is not what you want, you can customize fields in the Table Browser. Below are examples to get all fields for the ensGene table, as well as an example to add/remove fields.

1. Go to the Table Browser at the UCSC Genome Browser and configure as follows:
http://genome.ucsc.edu/cgi-bin/hgTables

clade: Vertebrate
genome: Stickleback
assembly: gasAcu1
group: Genes & Gene Predictions
track: Ensembl Genes
table: ensGene
region: you can select "genome" or a particular region, such as "chrUn:1-62550211" for the entire chrUn.

The ensGene table stores values in the genePred extended data format. https://genome.ucsc.edu/FAQ/FAQformat.html#format9
In the Table Browser, you can click on "describe table schema" to see the table description:
http://genome.ucsc.edu/cgi-bin/hgTables?hgta_doSchemaDb=gasAcu1&hgta_doSchemaTable=ensGene

2. After viewing the table schema, if there are other fields you would like to add, or if there are fields you would like to remove from the output, change the "output format" drop-down to "selected fields from primary and related tables."

Otherwise, if you want all fields in the table and have no other fields to add, keep the "output format" drop-down as "all fields from selected table."

If you selected "all fields from selected table" from "output format" you will get output like this:
#bin    name    chrom    strand    txStart    txEnd    cdsStart    cdsEnd    exonCount    exonStarts    exonEnds    score    name2    cdsStartStat    cdsEndStat    exonFrames
585    ENSGACT00000000388.1    chrUn    +    36047    39906    36157    39494    3    36047,38740,39113,    36615,39035,39906,    0    ENSGACG00000000302.1    cmpl    cmpl    0,2,0,
585    ENSGACT00000000389.1    chrUn    -    48404    52111    49283    52078    4    48404,50303,51073,51945,    49295,50525,51192,52111,    0    ENSGACG00000000307.1    incmpl    cmpl    0,0,1,0,
585    ENSGACT00000000393.1    chrUn    +    84626    88602    84637    88230    15    84626,84826,85050,85317,85506,85668,85910,86262,86519,86721,87016,87355,87673,87996,88138,    84650,84885,85191,85431,85587,85804,86014,86413,86607,86899,87159,87492,87840,88051,88602,    0    ENSGACG00000000309.1    cmpl    cmpl    0,1,0,0,0,0,1,0,1,2,0,2,1,0,1,
Note: to get this output in the browser, leave the "output file" field blank. To get output as a downloaded file, enter a name for the file in that field.
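
If you want numbered exons (for example "exon 1") or introns from this table output, the exonCount, exonStarts and exonEnds fields contain everything needed. The Python sketch below is one possible way to do it, not an official tool: the function name exons_and_introns and the transcript_order flag are made up for this example, and it assumes a tab-separated "all fields" line that begins with the bin column. genePred start coordinates are 0-based while GTF coordinates are 1-based, so each start is shifted by one. By default exons are numbered in genomic order, which matches the exon_number values in the genePredToGtf output shown earlier in this message. Setting transcript_order=True numbers minus-strand exons from the 5' end of the transcript instead.

```python
def exons_and_introns(line, transcript_order=False):
    """Turn one genePred "all fields" line into numbered exon/intron spans."""
    cols = line.rstrip("\n").split("\t")
    name, chrom, strand = cols[1], cols[2], cols[3]  # cols[0] is "bin"
    exon_count = int(cols[8])
    starts = [int(x) for x in cols[9].split(",") if x]
    ends = [int(x) for x in cols[10].split(",") if x]
    if not (len(starts) == len(ends) == exon_count):
        raise ValueError("exonCount does not match exonStarts/exonEnds")
    # 0-based, half-open genePred coordinates -> 1-based, inclusive GTF ones.
    exons = [(s + 1, e) for s, e in zip(starts, ends)]
    # An intron is the gap between one exon's end and the next exon's start.
    introns = [(ends[i] + 1, starts[i + 1]) for i in range(exon_count - 1)]
    if transcript_order and strand == "-":
        exons.reverse()
        introns.reverse()
    features = [("exon", i, s, e) for i, (s, e) in enumerate(exons, 1)]
    features += [("intron", i, s, e) for i, (s, e) in enumerate(introns, 1)]
    return name, chrom, strand, features

# First data line of the output above.
row = ("585\tENSGACT00000000388.1\tchrUn\t+\t36047\t39906\t36157\t39494\t3\t"
       "36047,38740,39113,\t36615,39035,39906,\t0\tENSGACG00000000302.1\t"
       "cmpl\tcmpl\t0,2,0,")
name, chrom, strand, features = exons_and_introns(row)
for feature, number, start, end in features:
    print(chrom, feature, number, start, end, strand, name, sep="\t")
```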

If, however, you would like to add or remove fields from either the ensGene table or from a related table, you can follow the example below:

Change the "output format" drop-down to "selected fields from primary and related tables."
Click "get output" to go to the next step.
Select the fields that you would like from the ensGene table. In this example, I'll select all fields except for "bin."

If you don't want to include fields from other tables, you can now click, "get output."
Otherwise, under the section, "Linked Tables" select 1 or more linked tables and then click, "allow selection from checked tables".

You can select all of the related tables for now, and then simply choose whichever fields you want to add. For this example, I'll add the following table.fields and then click "get output".

- ensemblSource.source
- ensemblToGeneName.value

You will now get output like this, where we've removed "ensGene.bin" and added "ensemblSource.source" and "ensemblToGeneName.value":
#gasAcu1.ensGene.name    gasAcu1.ensGene.chrom    gasAcu1.ensGene.strand    gasAcu1.ensGene.txStart    gasAcu1.ensGene.txEnd    gasAcu1.ensGene.cdsStart    gasAcu1.ensGene.cdsEnd    gasAcu1.ensGene.exonCount    gasAcu1.ensGene.exonStarts    gasAcu1.ensGene.exonEnds    gasAcu1.ensGene.score    gasAcu1.ensGene.name2    gasAcu1.ensGene.cdsStartStat    gasAcu1.ensGene.cdsEndStat    gasAcu1.ensGene.exonFrames    gasAcu1.ensemblSource.source    gasAcu1.ensemblToGeneName.value
ENSGACT00000000388.1    chrUn    +    36047    39906    36157    39494    3    36047,38740,39113,    36615,39035,39906,    0    ENSGACG00000000302.1    cmpl    cmpl    0,2,0,    protein_coding    ADORA1 (1 of 2)
ENSGACT00000000389.1    chrUn    -    48404    52111    49283    52078    4    48404,50303,51073,51945,    49295,50525,51192,52111,    0    ENSGACG00000000307.1    incmpl    cmpl    0,0,1,0,    protein_coding    n/a
ENSGACT00000000393.1    chrUn    +    84626    88602    84637    88230    15    84626,84826,85050,85317,85506,85668,85910,86262,86519,86721,87016,87355,87673,87996,88138,    84650,84885,85191,85431,85587,85804,86014,86413,86607,86899,87159,87492,87840,88051,88602,    0    ENSGACG00000000309.1    cmpl    cmpl    0,1,0,0,0,0,1,0,1,2,0,2,1,0,1,    protein_coding    gnl3
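
If you save this output to a file and want to work with it in a script, the header line can be used to label each column. The Python sketch below is only an illustration under a few assumptions: the function name read_selected_fields is made up for this example, the input is assumed to be tab-separated with a header line starting with "#", and the database prefix (here "gasAcu1.") is dropped so that fields are addressed as "table.field".

```python
def read_selected_fields(text):
    """Parse Table Browser "selected fields" output into a list of dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = []
    for column in lines[0].lstrip("#").split("\t"):
        # "gasAcu1.ensGene.name" -> "ensGene.name"
        header.append(column.split(".", 1)[1] if "." in column else column)
    return [dict(zip(header, line.split("\t"))) for line in lines[1:]]

# A cut-down version of the output above (four of the columns, two rows).
example = (
    "#gasAcu1.ensGene.name\tgasAcu1.ensGene.exonCount\t"
    "gasAcu1.ensemblSource.source\tgasAcu1.ensemblToGeneName.value\n"
    "ENSGACT00000000388.1\t3\tprotein_coding\tADORA1 (1 of 2)\n"
    "ENSGACT00000000393.1\t15\tprotein_coding\tgnl3\n"
)
for record in read_selected_fields(example):
    print(record["ensGene.name"],
          record["ensGene.exonCount"],
          record["ensemblToGeneName.value"],
          sep="\t")
```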
Please let me know if this does not meet your goal so that our support team can provide further assistance!

Cath
. . .
Cath Tyner
UCSC Genome Browser, Software QA & User Support
UC Santa Cruz Genomics Institute

