Exon sizes in which table?

Hauenstein, Jennifer

Jan 15, 2019, 11:54:12 AM

to gen...@soe.ucsc.edu

Good morning!!

I am trying to create a BED file with the common gene name, canonical transcript, and exon structure. Unfortunately, when I select just for the BED format with knownCanonical I don’t get the exons and when I select BED format for knownGene I get a different gene identifier than the common gene name. I have tried to pick several characteristics from the different tables and I can get everything I want except the exon sizes!!! Can you please tell me which table will give me the exon sizes?

I could write a program, or derive the sizes by subtracting the exon starts from the exon ends, but I was hoping I could get the sizes directly from the tables.

Thank you~

Jen

Jennifer Hauenstein

Emory Healthcare

Oncology Cytogenetics

404-712-5833

Luis Nassar

Jan 17, 2019, 12:53:30 PM

to Hauenstein, Jennifer, gen...@soe.ucsc.edu

Hello Jen,

Thank you for using the Genome Browser and for your question regarding Table Browser outputs.

Unfortunately, there is no direct Table Browser query that will get you all this information in one go. I will outline some steps that will provide you with a BED file that includes the exon sizes and the gene symbol. I will assume you are working with the hg38 assembly; if that is not the case the steps would be very similar with hg19.

I will break the process into four steps:

1. Get a list of knownCanonical transcripts

In the Table Browser select the GENCODE v29 track and knownCanonical table
For output format choose "Selected fields..."
Choose output file name, for this example "CanonicalTranscripts.txt"
Get output: Then select "transcript" from the first table -> get output

2. Produce a BED file from canonical transcripts

Go back to the Table Browser and select the knownGene table in the GENCODE v29 track
In the "identifiers (names/accessions):" option, select "upload list" and upload the CanonicalTranscripts.txt file (you will see an error related to the header line '#transcript', it is safe to ignore)
Select output format "BED..." and a file name, for this example "CanonicalTranscriptsBED.txt"
Get output -> get BED

3. Generate a gene symbols file for canonical transcripts

(First two steps same as above) Go back to Table Browser and select the knownGene table in the GENCODE v29 track
In the "identifiers (names/accessions):" option, select "upload list" and upload the CanonicalTranscripts.txt file (you will see an error related to the header line '#transcript', it is safe to ignore)
Select output format "Selected fields..." and choose a file name, for this example "CanonicalGeneSymbols.txt"
Select "name" from the first hg38.knownGene table, and "geneSymbol" from the hg38.kgXref table -> get output

4. Join the BED file with the gene symbols file

Using a shell command, you can add the gene symbols to the end of your BED file as follows:

$ join -1 4 -2 1 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,1.10,1.11,1.12,2.2  <(sort -k4,4 CanonicalTranscriptsBED.txt) <(sort -k1,1 CanonicalGeneSymbols.txt) > CanonicalTranscriptGeneNames.bed

With the '-o' flag you can specify which fields of each file you wish to keep. The output of the above command should be a BED file with the exon sizes as well as the gene symbols. Below is a snippet of the first two entries in the BED file.

chr1 24357004 24413725 ENST00000003583.12 0 - 24358542 24401366 0 8 1615,191,166,109,171,102,138,52, 0,3846,12669,16697,22648,26897,44314,56669, STPG1
chr1 209583716 209613938 ENST00000009105.5 0 + 209594983 209612875 0 13 216,121,129,75,139,124,76,113,79,88,425,128,899, 0,11238,16266,19497,21819,22603,24141,25263,26134,27748,28075,29068,29323, CAMK1G
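For anyone who prefers not to use the shell, the join in step 4 can be sketched in Python. This is only a minimal sketch, not part of the original instructions: the function names `load_symbols` and `append_symbols` are hypothetical, and it assumes both files are tab-separated, that header lines in the gene symbols file start with '#', and that the transcript ID is in column 4 of the BED file. Like `join`, it drops BED lines that have no matching gene symbol. The demo rows are the first six fields of the two entries shown above.

```python
def load_symbols(lines):
    """Map transcript ID -> gene symbol from 'name<TAB>geneSymbol' lines."""
    symbols = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):  # assumed: headers start with '#'
            continue
        name, symbol = line.split("\t")[:2]
        symbols[name] = symbol
    return symbols


def append_symbols(bed_lines, symbols):
    """Append the gene symbol as an extra last column of each BED line."""
    joined = []
    for line in bed_lines:
        line = line.rstrip("\n")
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        symbol = symbols.get(fields[3])  # column 4 holds the transcript ID
        if symbol is not None:           # unmatched lines are dropped, as join does
            joined.append("\t".join(fields + [symbol]))
    return joined


# Demo with rows truncated to six columns for brevity.
demo_symbols = load_symbols([
    "#hg38.knownGene.name\thg38.kgXref.geneSymbol\n",
    "ENST00000003583.12\tSTPG1\n",
    "ENST00000009105.5\tCAMK1G\n",
])
demo_bed = [
    "chr1\t24357004\t24413725\tENST00000003583.12\t0\t-\n",
    "chr1\t209583716\t209613938\tENST00000009105.5\t0\t+\n",
]
for row in append_symbols(demo_bed, demo_symbols):
    print(row)
```

To run it on the real files, pass `open("CanonicalGeneSymbols.txt")` and `open("CanonicalTranscriptsBED.txt")` in place of the demo lists. The output is tab-separated, whereas `join` writes space-separated fields by default.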

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Lou Nassar
UCSC Genomics Institute

Hauenstein, Jennifer

Jan 17, 2019, 4:09:49 PM

to Luis Nassar, gen...@soe.ucsc.edu

Thank you Lou!! This will be helpful in the future. I did figure out how to do it with similar steps in Galaxy, by joining the tables there. I did start writing a Python program, but you can’t beat the convenience of Galaxy!!

One more question: if I used the SQL access to the databases, would I be able to get the exon sizes and exon start positions?

Thank you so much for your assistance!

Jen

Luis Nassar

Jan 18, 2019, 12:04:52 PM

to Hauenstein, Jennifer, gen...@soe.ucsc.edu

Hello Jen,

I'm glad you were able to join the tables with Galaxy; it's a great tool.

As far as using MySQL to access our databases directly, you can certainly do that. We have a public MySQL server which hosts most of our data tables. To extract the exon start and end positions, from which the sizes can be computed, you could query our knownGene table, the same table I used in the previous Table Browser example, and select the exonStarts and exonEnds columns.

Below is an example of this query with the inclusion of the transcript (name), and the number of exons (exonCount):

mysql -h genome-mysql.soe.ucsc.edu -ugenome -A -e "select name,exonCount,exonStarts,exonEnds from knownGene limit 5" hg38
+-------------------+-----------+--------------------+--------------------+
| name              | exonCount | exonStarts         | exonEnds           |
+-------------------+-----------+--------------------+--------------------+
| ENST00000619216.1 |         1 | 17368,             | 17436,             |
| ENST00000473358.1 |         3 | 29553,30563,30975, | 30039,30667,31097, |
| ENST00000469289.1 |         2 | 30266,30975,       | 30667,31109,       |
| ENST00000607096.1 |         1 | 30365,             | 30503,             |
| ENST00000417324.1 |         3 | 34553,35276,35720, | 35174,35481,36081, |
+-------------------+-----------+--------------------+--------------------+
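Since the query returns starts and ends rather than sizes, the sizes still have to be computed. A minimal Python sketch of that subtraction follows. The helper names `parse_coords` and `exon_sizes` are hypothetical; the only assumption is that exonStarts and exonEnds arrive as the comma-separated strings with a trailing comma shown in the output above.

```python
def parse_coords(field):
    """Turn a genePred list such as '29553,30563,30975,' into a list of ints."""
    return [int(x) for x in field.split(",") if x]  # skip the empty trailing item


def exon_sizes(exon_starts, exon_ends):
    """Return one size per exon, computed as end - start."""
    starts = parse_coords(exon_starts)
    ends = parse_coords(exon_ends)
    if len(starts) != len(ends):
        raise ValueError("exonStarts and exonEnds differ in length")
    return [end - start for start, end in zip(starts, ends)]


# Second row of the query output above (ENST00000473358.1).
print(exon_sizes("29553,30563,30975,", "30039,30667,31097,"))  # [486, 104, 122]
```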

We have a blog post which offers additional examples and more information about accessing the Genome Browser through MySQL: http://genome.ucsc.edu/blog/accessing-the-genome-browser-programmatically-part-2-using-the-public-mysql-server-and-gbdb-system/

Lou Nassar
UCSC Genomics Institute

Luis Nassar

Jan 18, 2019, 2:38:52 PM

to Hauenstein, Jennifer, gen...@soe.ucsc.edu

Hello Jen,

We wanted to follow up to clarify some concepts.

We do not have any tables that explicitly store exon sizes, though they can be computed using the start/end columns. It is worth noting however that the knownGene table (from the previous example), where these coordinates are pulled from, is in genePred format, and there are some considerations if you are trying to produce a BED file.

In genePred, the exonStarts (and exonEnds) coordinates are absolute chromosome positions, while BED blockStarts coordinates are relative to chromStart, so BED's blockStarts always begins with "0," (i.e. the first block begins at the overall chromStart). When you use the Table Browser "Output as BED", it automatically makes these conversions for you. This is worth keeping in mind if you are putting together your own BED files by querying our SQL data tables.

We also have a program in our utilities directory, genePredToBed, that will automatically convert from genePred format to BED format. You can find that directory here: http://hgdownload.soe.ucsc.edu/admin/exe/
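To illustrate the coordinate conversion described above, here is a minimal Python sketch that builds a 12-column BED line from genePred-style values. It is not the genePredToBed source. The function name `genepred_to_bed12` is hypothetical, score and itemRgb are simply set to 0 as in the Table Browser output earlier in this thread, and the absolute exon coordinates in the example were back-computed from the STPG1 BED line shown earlier rather than taken from a query.

```python
def genepred_to_bed12(name, chrom, strand, tx_start, tx_end,
                      cds_start, cds_end, exon_starts, exon_ends):
    """Build a tab-separated BED12 line from genePred-style fields."""
    starts = [int(x) for x in exon_starts.split(",") if x]
    ends = [int(x) for x in exon_ends.split(",") if x]
    # blockSizes: exon end minus exon start.
    block_sizes = [end - start for start, end in zip(starts, ends)]
    # blockStarts: absolute exon start made relative to chromStart (txStart).
    block_starts = [start - tx_start for start in starts]
    fields = [
        chrom, tx_start, tx_end, name,
        0,                      # score (assumed 0)
        strand,
        cds_start, cds_end,     # thickStart, thickEnd
        0,                      # itemRgb (assumed 0)
        len(starts),            # blockCount
        ",".join(map(str, block_sizes)) + ",",
        ",".join(map(str, block_starts)) + ",",
    ]
    return "\t".join(str(f) for f in fields)


# Absolute coordinates back-computed from the STPG1 BED line in this thread.
print(genepred_to_bed12(
    "ENST00000003583.12", "chr1", "-", 24357004, 24413725, 24358542, 24401366,
    "24357004,24360850,24369673,24373701,24379652,24383901,24401318,24413673,",
    "24358619,24361041,24369839,24373810,24379823,24384003,24401456,24413725,",
))
```

The printed line should match the first BED entry shown earlier in the thread, minus the gene symbol column.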

Lou Nassar
UCSC Genomics Institute

Hauenstein, Jennifer

Jan 22, 2019, 1:48:03 PM

to Luis Nassar, gen...@soe.ucsc.edu

Lou!!

Thank you for all your help! I actually ended up writing a Python program that takes output from the knownCanonical table for a list of common gene names, calculates the exon sizes and exon starts, and then rearranges the columns into BED format! It was a great learning experience for me!

Thank you again!
