Good morning!!
I am trying to create a BED file with the common gene name, canonical transcript, and exon structure. Unfortunately, when I select just for the BED format with knownCanonical I don’t get the exons and when I select BED format for knownGene I get a different gene identifier than the common gene name. I have tried to pick several characteristics from the different tables and I can get everything I want except the exon sizes!!! Can you please tell me which table will give me the exon sizes?
I could write a program, or derive this data by subtracting the exon starts from the exon ends, but I was just hoping I could get the sizes from the tables.
Thank you~
Jen
Jennifer Hauenstein
Emory Healthcare
Oncology Cytogenetics
Hello Jen,
Thank you for using the Genome Browser and for your question regarding Table Browser outputs.
Unfortunately, there is no direct Table Browser query that will get you all this information in one go. I will outline some steps that will provide you with a BED file that includes the exon sizes and the gene symbol. I will assume you are working with the hg38 assembly; if that is not the case, the steps would be very similar for hg19.
I will break the process into four steps:
1. Get a list of knownCanonical transcripts
2. Produce a BED file from canonical transcripts
3. Generate a gene symbols file for canonical transcripts
4. Join the BED file with the gene symbols file
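For readers who prefer to script the final join (step 4), here is a minimal Python sketch. It assumes the BED file from step 2 carries the transcript ID in its name field (column 4) and that the symbols file from step 3 is tab-separated with the transcript ID first and the gene symbol second. The function names and the sample IDs and coordinates are made up for illustration and are not part of the Table Browser output.

```python
import csv
import io


def load_symbols(handle):
    """Map transcript ID -> gene symbol from a two-column tab-separated file."""
    symbols = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and header/comment lines that start with "#".
        if not row or row[0].startswith("#"):
            continue
        symbols[row[0]] = row[1]
    return symbols


def join_bed_with_symbols(bed_handle, symbols):
    """Yield BED lines with the name field (column 4) replaced by the gene symbol."""
    for line in bed_handle:
        line = line.rstrip("\n")
        # Skip blank lines, comments, and track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        # Fall back to the transcript ID when no symbol is known for it.
        fields[3] = symbols.get(fields[3], fields[3])
        yield "\t".join(fields)


# Made-up example data (hypothetical IDs and coordinates).
symbols_file = io.StringIO("#transcript\tsymbol\ntxA.1\tGENE1\n")
bed_file = io.StringIO("chr1\t100\t500\ttxA.1\t0\t+\t100\t500\t0\t2\t50,100,\t0,300,\n")

for bed_line in join_bed_with_symbols(bed_file, load_symbols(symbols_file)):
    print(bed_line)
```

Replacing the name field keeps the result a valid 12-column BED line; if both the symbol and the transcript ID are needed, one option is to combine them into a single label in that same field.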
I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.
Lou Nassar
UCSC Genomics Institute
Thank you Lou!! This will be helpful in the future. I did figure out how to do it with some similar steps below in Galaxy…joining the tables in Galaxy. I did start writing a Python program but you can’t beat the convenience of Galaxy!!
One more question: if I used the SQL access to the databases, would I be able to get the exon sizes and exon start positions?
Thank you so much for your assistance!
Jen
Hello Jen,
I'm glad you were able to join the tables with Galaxy; it's a great tool.
As far as using MySQL to access our databases directly, you can certainly do that. We have a public MySQL server which hosts most of our data tables. In order to extract the exon start and end positions, you could query our knownGene table, which is the same table I used in the previous Table Browser example, and select the exonStarts and exonEnds columns.
Below is an example of this query with the inclusion of the transcript (name), and the number of exons (exonCount):
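What follows is a minimal sketch of such a query, not necessarily the exact one originally sent. It is written in Python so that the SQL text and the matching mysql command line can be assembled and inspected without connecting. The host and user are the publicly documented ones for the UCSC MySQL server; the row limit of 5 and the helper names are assumptions made for illustration.

```python
import shlex

# Publicly documented connection details for the UCSC MySQL server.
HOST = "genome-mysql.soe.ucsc.edu"
USER = "genome"


def build_query(limit=5):
    """Return SQL selecting the transcript name, exon count, and exon coordinates."""
    return (
        "SELECT name, exonCount, exonStarts, exonEnds "
        f"FROM knownGene LIMIT {int(limit)}"
    )


def build_mysql_command(database="hg38", limit=5):
    """Return the argument list for the mysql command-line client (not executed here)."""
    return [
        "mysql",
        f"--host={HOST}",
        f"--user={USER}",
        "-A",                 # skip auto-rehash of table and column names
        "-e", build_query(limit),
        database,
    ]


# Print a shell-ready command line; running it requires the mysql client.
print(shlex.join(build_mysql_command()))
```

The exonStarts and exonEnds values come back as comma-separated lists of absolute coordinates, one entry per exon.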
We have a blog post which offers additional examples and more information about accessing the Genome Browser through MySQL: http://genome.ucsc.edu/blog/accessing-the-genome-browser-programmatically-part-2-using-the-public-mysql-server-and-gbdb-system/
Lou Nassar
UCSC Genomics Institute
Hello Jen,
We wanted to follow up to clarify some concepts.
We do not have any tables that explicitly store exon sizes, though they can be computed from the exonStarts and exonEnds columns. It is worth noting, however, that the knownGene table (from the previous example), from which these coordinates are pulled, is in genePred format, and there are some considerations if you are trying to produce a BED file.
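As a rough illustration of computing the sizes from the start/end columns, here is a short Python sketch. It assumes the two columns arrive as comma-separated strings with a trailing comma, as in genePred, possibly as bytes depending on the MySQL client. The function names and the example coordinates are hypothetical.

```python
def parse_coords(field):
    """Parse a genePred coordinate list such as "100,400," into [100, 400]."""
    # Some MySQL clients return these blob columns as bytes.
    if isinstance(field, bytes):
        field = field.decode("ascii")
    # The trailing comma leaves an empty last item, which is dropped here.
    return [int(item) for item in field.split(",") if item]


def exon_sizes(exon_starts, exon_ends):
    """Return one size per exon, computed as end minus start."""
    starts = parse_coords(exon_starts)
    ends = parse_coords(exon_ends)
    if len(starts) != len(ends):
        raise ValueError("exonStarts and exonEnds have different lengths")
    return [end - start for start, end in zip(starts, ends)]


# Made-up coordinates for a two-exon transcript.
print(exon_sizes("100,400,", "150,500,"))  # [50, 100]
```

Because the coordinates are 0-based with an exclusive end, a plain subtraction gives the exon length with no +1 adjustment.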
In genePred, the exonStarts and exonEnds coordinates are absolute chromosome positions, while BED blockStarts are offsets relative to chromStart, so a BED blockStarts list always begins with "0," (i.e. the first block begins at chromStart itself). BED also stores blockSizes (exonEnds minus exonStarts) rather than end coordinates. When you use the Table Browser "Output as BED" option, it automatically makes these conversions for you. This is worth keeping in mind if you are trying to put together your own BED files by querying our SQL data tables.
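To make the conversion concrete, here is a minimal Python sketch that turns one genePred-style record into a 12-column BED line. It is an illustration only, not the code the Table Browser uses. The function name and the sample values are hypothetical, and score and itemRgb are simply set to 0.

```python
def genepred_to_bed12(name, chrom, strand, tx_start, tx_end,
                      cds_start, cds_end, exon_starts, exon_ends):
    """Build a BED12 line from genePred fields (coordinates 0-based, end exclusive)."""
    starts = [int(item) for item in exon_starts.split(",") if item]
    ends = [int(item) for item in exon_ends.split(",") if item]
    # blockSizes: exon end minus exon start.
    block_sizes = [end - start for start, end in zip(starts, ends)]
    # blockStarts: absolute exon start shifted to be relative to chromStart.
    block_starts = [start - tx_start for start in starts]
    fields = [
        chrom, tx_start, tx_end, name,
        0,                      # score (placeholder)
        strand,
        cds_start, cds_end,     # thickStart, thickEnd
        0,                      # itemRgb (placeholder)
        len(starts),            # blockCount
        ",".join(map(str, block_sizes)) + ",",
        ",".join(map(str, block_starts)) + ",",
    ]
    return "\t".join(str(field) for field in fields)


# Made-up two-exon transcript on the plus strand.
print(genepred_to_bed12("txA.1", "chr1", "+", 100, 500, 120, 480,
                        "100,400,", "150,500,"))
```

For this example the last two columns come out as 50,100, and 0,300, showing the first block offset of 0 described above.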
We also have a program in our utilities directory, genePredToBed, which will automatically convert from genePred format to BED format. You can find that directory here: http://hgdownload.soe.ucsc.edu/admin/exe/
Lou Nassar
UCSC Genomics Institute
Lou!!
Thank you for all your help! I actually ended up writing a Python program that would take output from the knownCanonical table, from a common gene name list, and calculate the exon sizes, exon starts, and then rearrange the columns into BED format! It was a great learning experience for me!
Thank you again!