Coding Exons and First Exons

Beth Marosy

Aug 13, 2013, 4:21:09 PM

to gen...@soe.ucsc.edu

Hi,

I have been using the Table Browser to export coding exons, and have also used Galaxy to do the same thing (i.e. export whole genes from the Table Browser, import them into Galaxy, and then extract the coding exons). I’ve noticed differences in what is classified as coding between these two files (i.e. UCSC-extracted coding exons vs. Galaxy-extracted coding exons), and I can’t quite figure out what is causing them. Any advice or insight you may have would be greatly appreciated.

Also, is there a way to determine the first exon from the UCSC exported information?

Best,
Beth Marosy

Steve Heitner

Aug 13, 2013, 6:30:01 PM

to Beth Marosy, gen...@soe.ucsc.edu

Hello, Beth.

Could you please provide a specific example of a gene whose results were different between UCSC and Galaxy? Please also provide a detailed list of the steps you took, both on UCSC and Galaxy, that produced the conflicting information.

I’m not certain if your final question is asking how to determine the first exon using the UCSC Table Browser or how to determine the first exon using Galaxy with UCSC data. If you’re asking how to do it via the Table Browser, you would select “sequence” as your output type. Once you click the “get output” button, you would select “genomic” as your sequence type. After clicking the “submit” button, the following screen allows you to specify whether you want to view 5’ UTR exons, CDS exons, 3’ UTR exons, etc. If you want to view each exon individually, it is important to make sure you also select the “One FASTA record per region” option.

If you are asking for support with Galaxy tools, please contact Galaxy support at http://wiki.galaxyproject.org/Support.

Please contact us again at gen...@soe.ucsc.edu if you have any further questions.

---
Steve Heitner
UCSC Genome Bioinformatics Group

Beth Marosy

Aug 14, 2013, 12:03:01 PM

to st...@soe.ucsc.edu, gen...@soe.ucsc.edu

Hi Steve,

Thanks for your response.

Here are two examples:

chr1:134,773-140,566 LOC729737, is listed as an uncharacterized non-coding RNA – it was included in the UCSC exon extraction and not the Galaxy exon extraction

chr1:1,189,292-1,209,234 UBE2J2, is listed as a coding gene - it was included in the Galaxy exon extraction and not the UCSC exon extraction

The steps I used are as follows:

For the UCSC exon extraction – in the Table Browser GUI, select the UCSC Genes track and knownGene for table, define position as chr1:1-249250621 (I’ve tried selecting the genome option, but when downloading it times out. So I have had to resort to entering the coordinates for each chromosome, then concatenate the files to get the genome). Output format = BED browser extensible data, enter filename for output file, click on get output. In the next window select coding exons, then click get BED.

For the Galaxy exon extraction – in the Table Browser GUI, select the UCSC Genes track and knownGene for table, select genome, output format = BED browser extensible data, enter filename for output file, click on get output. In the next window select whole gene, then click get BED. Import file to Galaxy using get data – upload file. Select Extract Features, Gene BED to Exon/Intron/Codon BED expander. Set Extract = coding exons only and pick file, click on execute.

As for the First Exon, I am looking to identify (chr/start/stop) which exons are considered ‘first exons’ (i.e. first exon/coding exon present in the mature mRNA) across the genome, but need a way to accurately come up with this list. I noticed in the UCSC exon extraction file there is an id (e.g. chr10 92996 94054 uc001ifi.2_cds_0_0_chr10_92997_r 0 -, with ‘uc001ifi.2_cds_0_0_chr10_92997_r’ as the ID). Does the “0” after the cds_ reflect an exon count and is 0 always considered the first exon? Or is there a different/better way to determine this?

Thanks for your help. Please let me know if I can provide you with additional information.

Best,
Beth

Steve Heitner

Aug 14, 2013, 3:58:14 PM

to Beth Marosy, gen...@soe.ucsc.edu

Hello, Beth.

I just replicated the steps that you used to extract coding exons, both at UCSC and Galaxy, and I successfully obtained the coding exons with both methods.  To obtain all of the output below, I performed my Table Browser queries using all of chr1 as the region.  I am only listing the lines specific to the genes you referenced.  I obtained the following for LOC729737 (I manually inserted the pipes for readability):

Whole gene query for Galaxy:
chr1 | 134772 | 140566 | uc021oeg.2 | 0 | - | 138529 | 139792 | 0 | 3 | 4924,58,492, | 0,5017,5302,

Coding exon query for UCSC:
chr1 | 138529 | 139696 | uc021oeg.2_cds_0_0_chr1_138530_r | 0 | -
chr1 | 139789 | 139792 | uc021oeg.2_cds_1_0_chr1_139790_r | 0 | -

I obtained the following for UBE2J2:

Whole gene query for Galaxy:
chr1 | 1189291 | 1203372 | uc001adm.4 | 0 | - | 1190582 | 1198741 | 0 | 7 | 1576,81,139,103,41,193,131, | 0,2133,3080,3296,9434,12186,13950,
chr1 | 1189291 | 1209234 | uc001ado.3 | 0 | - | 1190582 | 1203372 | 0 | 8 | 1576,81,139,103,41,48,131,189, | 0,2133,3080,3296,9434,10871,13950,19754,
chr1 | 1189291 | 1209234 | uc001adp.3 | 0 | - | 1190582 | 1203372 | 0 | 7 | 1576,81,139,103,41,131,189, | 0,2133,3080,3296,9434,13950,19754,
chr1 | 1189291 | 1209234 | uc001adq.3 | 0 | - | 1190582 | 1198741 | 0 | 7 | 1576,81,139,103,41,260,189, | 0,2133,3080,3296,9434,13821,19754,
chr1 | 1189291 | 1209234 | uc001adr.3 | 0 | - | 1190582 | 1198741 | 0 | 6 | 1576,81,139,103,41,189, | 0,2133,3080,3296,9434,19754,

Coding exon query for UCSC:
chr1 | 1190582 | 1190867 | uc001adm.4_cds_0_0_chr1_1190583_r | 0 | -
chr1 | 1191424 | 1191505 | uc001adm.4_cds_1_0_chr1_1191425_r | 0 | -
chr1 | 1192371 | 1192510 | uc001adm.4_cds_2_0_chr1_1192372_r | 0 | -
chr1 | 1192587 | 1192690 | uc001adm.4_cds_3_0_chr1_1192588_r | 0 | -
chr1 | 1198725 | 1198741 | uc001adm.4_cds_4_0_chr1_1198726_r | 0 | -
chr1 | 1190582 | 1190867 | uc001ado.3_cds_0_0_chr1_1190583_r | 0 | -
chr1 | 1191424 | 1191505 | uc001ado.3_cds_1_0_chr1_1191425_r | 0 | -
chr1 | 1192371 | 1192510 | uc001ado.3_cds_2_0_chr1_1192372_r | 0 | -
chr1 | 1192587 | 1192690 | uc001ado.3_cds_3_0_chr1_1192588_r | 0 | -
chr1 | 1198725 | 1198766 | uc001ado.3_cds_4_0_chr1_1198726_r | 0 | -
chr1 | 1200162 | 1200210 | uc001ado.3_cds_5_0_chr1_1200163_r | 0 | -
chr1 | 1203241 | 1203372 | uc001ado.3_cds_6_0_chr1_1203242_r | 0 | -
chr1 | 1190582 | 1190867 | uc001adp.3_cds_0_0_chr1_1190583_r | 0 | -
chr1 | 1191424 | 1191505 | uc001adp.3_cds_1_0_chr1_1191425_r | 0 | -
chr1 | 1192371 | 1192510 | uc001adp.3_cds_2_0_chr1_1192372_r | 0 | -
chr1 | 1192587 | 1192690 | uc001adp.3_cds_3_0_chr1_1192588_r | 0 | -
chr1 | 1198725 | 1198766 | uc001adp.3_cds_4_0_chr1_1198726_r | 0 | -
chr1 | 1203241 | 1203372 | uc001adp.3_cds_5_0_chr1_1203242_r | 0 | -
chr1 | 1190582 | 1190867 | uc001adq.3_cds_0_0_chr1_1190583_r | 0 | -
chr1 | 1191424 | 1191505 | uc001adq.3_cds_1_0_chr1_1191425_r | 0 | -
chr1 | 1192371 | 1192510 | uc001adq.3_cds_2_0_chr1_1192372_r | 0 | -
chr1 | 1192587 | 1192690 | uc001adq.3_cds_3_0_chr1_1192588_r | 0 | -
chr1 | 1198725 | 1198741 | uc001adq.3_cds_4_0_chr1_1198726_r | 0 | -
chr1 | 1190582 | 1190867 | uc001adr.3_cds_0_0_chr1_1190583_r | 0 | -
chr1 | 1191424 | 1191505 | uc001adr.3_cds_1_0_chr1_1191425_r | 0 | -
chr1 | 1192371 | 1192510 | uc001adr.3_cds_2_0_chr1_1192372_r | 0 | -
chr1 | 1192587 | 1192690 | uc001adr.3_cds_3_0_chr1_1192588_r | 0 | -
chr1 | 1198725 | 1198741 | uc001adr.3_cds_4_0_chr1_1198726_r | 0 | -

When visualizing both methods as custom tracks in the Browser, I saw exactly what I expected to see for both gene regions when viewing them alongside the UCSC Genes track.  Both custom tracks were identical and only the coding exons were visible.
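
For readers who want to check the equivalence of the two outputs themselves, here is a minimal Python sketch (not a UCSC or Galaxy tool) that derives the coding portion of each exon from one whole-gene BED line by clipping every block to the thickStart/thickEnd interval. The function name `coding_exons` is made up for illustration, and the sketch assumes standard 12-column BED with whitespace-separated fields (the pipes shown above were inserted by hand).

```python
def coding_exons(bed12_line):
    """Return (chrom, start, end, name, strand) for the coding part of each exon."""
    fields = bed12_line.split()
    chrom, chrom_start, name, strand = fields[0], int(fields[1]), fields[3], fields[5]
    thick_start, thick_end = int(fields[6]), int(fields[7])
    # blockSizes and blockStarts are comma-separated with a trailing comma.
    sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    starts = [int(x) for x in fields[11].rstrip(",").split(",")]

    result = []
    for size, rel_start in zip(sizes, starts):
        exon_start = chrom_start + rel_start
        exon_end = exon_start + size
        # Clip the exon to the coding region; skip exons with no coding overlap.
        cds_start = max(exon_start, thick_start)
        cds_end = min(exon_end, thick_end)
        if cds_start < cds_end:
            result.append((chrom, cds_start, cds_end, name, strand))
    return result


# Whole-gene row for uc021oeg.2 from this message, with the pipes removed.
line = "chr1 134772 140566 uc021oeg.2 0 - 138529 139792 0 3 4924,58,492, 0,5017,5302,"
for row in coding_exons(line):
    print(*row, sep="\t")
```

For this row the sketch yields the intervals 138529-139696 and 139789-139792, which are the same two coding-exon intervals listed in the UCSC query above.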

Based on the steps you described to me, it sounds like you’re doing the right thing, so I’m not certain why you would be getting erroneous results for these two gene regions (or any others for that matter).  To eliminate the possibility that you are somehow inheriting previous query settings, I would perform a cart reset.  You can do this from the main Table Browser screen by clicking the “click here” link just below the “get output” button.

Regarding your question concerning first exons, yes, when querying coding exons only, the “0” immediately following the “cds_” in the ID always indicates the first coding exon.
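
One caveat worth checking against your own data: in the rows listed above, the index after "cds_" increases with genomic coordinate even though the transcripts are on the minus strand. If "first exon" means first in the mature mRNA (5' end), then for minus-strand transcripts that would be the coding exon with the highest coordinates, not index 0. The Python sketch below is one possible way to pick the first coding exon per transcript using the strand column rather than the index. The ID pattern is inferred only from the IDs shown in this thread, and the function name is hypothetical.

```python
import re
from collections import defaultdict

# Pattern inferred from IDs such as uc001adm.4_cds_0_0_chr1_1190583_r (an assumption).
ID_PATTERN = re.compile(
    r"^(?P<tx>.+)_cds_(?P<index>\d+)_\d+_(?P<chrom>.+)_(?P<start1>\d+)_(?P<dir>[fr])$"
)


def first_coding_exons(bed_lines):
    """Map transcript ID -> (chrom, start, end, strand) of its 5'-most coding exon."""
    by_transcript = defaultdict(list)
    for line in bed_lines:
        chrom, start, end, name, _score, strand = line.split()[:6]
        match = ID_PATTERN.match(name)
        if match is None:
            continue  # not a coding-exon ID in the expected form
        by_transcript[match.group("tx")].append((chrom, int(start), int(end), strand))

    result = {}
    for transcript, exons in by_transcript.items():
        if exons[0][3] == "-":
            # Minus strand: transcription starts at the highest coordinate.
            result[transcript] = max(exons, key=lambda exon: exon[2])
        else:
            result[transcript] = min(exons, key=lambda exon: exon[1])
    return result


rows = [
    "chr1 138529 139696 uc021oeg.2_cds_0_0_chr1_138530_r 0 -",
    "chr1 139789 139792 uc021oeg.2_cds_1_0_chr1_139790_r 0 -",
]
print(first_coding_exons(rows))
```

Selecting by strand and coordinate avoids relying on how the index is assigned, so the result stays the same whichever numbering convention the export uses.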



Please contact us again at gen...@soe.ucsc.edu if you have any further questions.

---
Steve Heitner
UCSC Genome Bioinformatics Group

Beth Marosy

Aug 16, 2013, 4:52:38 PM

to st...@soe.ucsc.edu, gen...@soe.ucsc.edu

Hi Steve,

Thanks so much for replicating this. I have found that the file I downloaded from UCSC is ‘missing’ information that you have here. What concerns me is that I performed a cart reset before downloading this data set. I downloaded a second data set, again resetting the cart before using the Table Browser (and even rebooted my computer beforehand). This time I got 2,000 more rows in the data set than previously! This happened once before (so three sets downloaded, with three different outputs). Is there some way to confirm that I have accurately downloaded the data set? I realize there are many options and filters folks may use, so it would be impossible to guarantee that every download is complete, but in this case I am attempting to download the full genome, either as whole genes or as coding exons. Any advice? Is it possible that the internet browser used makes a difference? Typically I use Firefox, but recently it has been hanging, so I’ve switched back to IE9.

Many thanks for your assistance!

Brian Lee

Aug 16, 2013, 7:59:41 PM

to Beth Marosy, st...@soe.ucsc.edu, gen...@soe.ucsc.edu

Dear Beth,

Thank you for using the UCSC Genome Browser and your question about using the Table Browser to download coding exons for the entire genome.

It is likely that the error you are experiencing is from the web connection timing out, but without indicating such an error. Usually gene tracks do not experience this problem, unless additional filters are slowing the extraction of data, perhaps the case here.
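
As a rough way to see where two downloads diverge, the Python sketch below counts BED rows per chromosome in each file and reports the chromosomes whose counts differ. A download that was cut short would be expected to show up as chromosomes with fewer or no rows, though this is only a diagnostic sketch and not an official check. The function names and the tiny in-memory examples are made up for illustration.

```python
from collections import Counter


def rows_per_chrom(bed_lines):
    """Count data rows per chromosome, ignoring blank, track, browser and comment lines."""
    counts = Counter()
    for line in bed_lines:
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        counts[line.split()[0]] += 1
    return counts


def compare_downloads(lines_a, lines_b):
    """Return {chrom: (rows_in_a, rows_in_b)} for chromosomes whose counts differ."""
    a, b = rows_per_chrom(lines_a), rows_per_chrom(lines_b)
    return {c: (a[c], b[c]) for c in sorted(set(a) | set(b)) if a[c] != b[c]}


# Tiny made-up downloads; in practice pass open file handles instead of lists.
first = ["track name=x", "chr1 0 10 a 0 +", "chr1 20 30 b 0 +", "chr2 0 10 c 0 -"]
second = ["chr1 0 10 a 0 +", "chr1 20 30 b 0 +"]
print(compare_downloads(first, second))
```

With real files, something like `compare_downloads(open("download1.bed"), open("download2.bed"))` would work the same way, where both file names are placeholders.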

Can you reply with more information? As you saw in the last response, we couldn't reproduce your error. When you encounter the problem, it would help us to see what your cart contains. When the problem arises, you can change your URL from "http://genome.ucsc.edu/cgi-bin/hgTables...." to "http://genome.ucsc.edu/cgi-bin/cartDump" and your screen will show settings that might give us clues as to what is going on. You can send a copy of this information as a text file, off the list, directly to my email, bria...@soe.ucsc.edu.

One solution is to download the entire table as a BED file (in the Table Browser, select BED as the output for the entire genome and give a file name like "knownGeneAsBed") and then run the command-line utilities bedToExons and bedGeneParts to split up the BED file representation of the table. Be sure to remove the top custom track line before running the utilities on your downloaded table.
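
Removing the custom track line can be done in any text editor; the following Python sketch is one scripted alternative. It copies the download to a new file while skipping lines that begin with "track " or "browser ". The function name, the output file name, and the example track line are all illustrative assumptions, not actual Table Browser output.

```python
import os
import tempfile


def strip_header_lines(in_path, out_path):
    """Copy a BED download, skipping 'track' and 'browser' lines; return rows kept."""
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(("track ", "browser ")):
                continue
            dst.write(line)
            kept += 1
    return kept


# Demo on a small temporary file; the track line here is made up.
workdir = tempfile.mkdtemp()
raw_path = os.path.join(workdir, "knownGeneAsBed")
clean_path = os.path.join(workdir, "knownGeneAsBed.noHeader.bed")
with open(raw_path, "w") as fh:
    fh.write('track name="example" description="illustrative header line"\n')
    fh.write("chr1\t134772\t140566\tuc021oeg.2\t0\t-\t138529\t139792\t0\t3\t4924,58,492,\t0,5017,5302,\n")

kept = strip_header_lines(raw_path, clean_path)
print(kept, "BED rows kept")
```

The cleaned file can then be passed as the input BED to the utilities described below.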

These utilities are accessible for download here under the appropriate operating system matching your computer (you can run "uname -a" to determine your machine type): http://hgdownload.cse.ucsc.edu/admin/exe/

Running each on the command line will provide indications of the options (remember to chmod to executable):

$ ./bedGeneParts
bedGeneParts - Given a bed, spit out promoter, first exon, or all introns.
usage:
bedGeneParts part in.bed out.bed
Where part is either 'exons' or 'firstExon' or 'introns' or 'promoter' or 'firstCodingSplice'
or 'secondCodingSplice'
options:
-proStart=NN - start of promoter relative to txStart, default -100
-proEnd=NN - end of promoter relative to txStart, default 50

$ ./bedToExons
bedToExons - Split a bed up into individual beds.
One for each internal exon.
usage:
bedToExons originalBeds.bed splitBeds.bed
options:
-cdsOnly - Only output the coding portions of exons.

Thank you again for your inquiry and using the UCSC Genome Browser. If you have further questions, please feel free to contact the mailing list again at gen...@soe.ucsc.edu.

All the best,

Brian Lee
UCSC Genome Bioinformatics Group

Beth Marosy

Aug 29, 2013, 1:23:52 PM

to Brian Lee, st...@soe.ucsc.edu, gen...@soe.ucsc.edu

Hi Brian,

Thanks for your response. It seems my dilemma is that the files I have downloaded directly from the Table Browser (i.e. the entire table/whole-gene BED file for the human genome) are different each time I download them. If I’ve reset the cart and can only hope that the web connection has not timed out, is there some way to confirm that what was downloaded is correct? Otherwise, I would have to re-download every day and compare against the previous download to find out that it was wrong to begin with, and by then I can’t capture the cart settings from the previous download for troubleshooting. I need to be able to confirm that I have successfully downloaded the content; otherwise there is no way for me to ensure that I have everything, and downstream analysis using this file will be compromised if I can’t reliably and reproducibly download the same file.

I was able to use the bedToExons and bedGeneParts tools you suggested! That was fabulous!! Thanks much.

Best,
Beth

Luvina Guruvadoo

Aug 29, 2013, 6:52:42 PM

to Beth Marosy, gen...@soe.ucsc.edu

Hi Beth,

There is a way to detect whether your file was downloaded correctly. First, in the Table Browser, enter an output file name (e.g. "tableBrowserResults") and select "gzip compressed" as the file type returned. Then, once the download finishes, run this from the command line to check that everything was downloaded completely:

gzip -t tableBrowserResults.gz

The -t is used to test the integrity of the gzip file. If there were any errors or the file was not downloaded properly, then gzip -t will complain of a problem.
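
If the gzip command is not available (for example on a Windows machine), a similar check can be sketched with Python's standard gzip module by reading the whole file and watching for errors. This is an approximation of what gzip -t does, not a replacement for it, and the function name `gzip_is_intact` is made up for this example.

```python
import gzip
import os
import tempfile
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the gzip file decompresses to the end without errors."""
    try:
        with gzip.open(path, "rb") as fh:
            # Read and discard everything; truncation or corruption raises an error.
            while fh.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        return False
    return True


# Demo: one complete gzip file and one deliberately truncated copy.
workdir = tempfile.mkdtemp()
good = os.path.join(workdir, "tableBrowserResults.gz")
bad = os.path.join(workdir, "truncated.gz")
with gzip.open(good, "wb") as fh:
    fh.write(b"chr1\t138529\t139696\n" * 1000)
with open(good, "rb") as src, open(bad, "wb") as dst:
    data = src.read()
    dst.write(data[: len(data) // 2])

print(gzip_is_intact(good), gzip_is_intact(bad))
```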

I hope this helps. If you have further questions or comments, please reply to gen...@soe.ucsc.edu.

---
Luvina Guruvadoo
UCSC Genome Bioinformatics Group
