Dear Shanker Swaminathan,
Thank you for using the UCSC Genome Browser and for your question about
using the table browser to obtain hg19 gene coordinates.
You can use the table browser to pull the start and end positions and gene
symbol from RefSeq, but the table browser cannot perform the additional
steps you desire. However, you can use Galaxy's tools,
https://main.g2.bx.psu.edu, to perform the more complex steps
needed to merge your data.
If you are unfamiliar with Galaxy, their tutorial video may be a good
place to start:
https://main.g2.bx.psu.edu/u/aun1/p/galaxy101. If you
have any questions regarding Galaxy tools, please direct them to their
help desk:
http://wiki.galaxyproject.org/Support.
1. Access the table browser from Galaxy's website,
https://main.g2.bx.psu.edu.
Click on the "Get Data" link in the top left and then on the "UCSC Main
table browser" link.
Use the following settings:
Clade: Mammal
Genome: Human
Assembly: Feb. 2009 (GRCh37/hg19)
Group: Genes and Gene Prediction Tracks
Track: RefSeq Genes
Table: refGene
2. Be sure the region is set to genome.
3. Select for Output Format: "Selected Fields from Primary and Related Tables".
4. Make sure the box next to "Send output to Galaxy" is checked, then click
"get output".
5. Now select the fields "chrom", "cdsStart", "cdsEnd", and "name2" from
the refGene table, click "done with selection", and then click "send query
to galaxy".
6. On the right side of your Galaxy session you can now view your data
by clicking the eye icon. You will see something like:
chr1 48999844 50489468 AGBL4
chr1 16767256 16785385 NECAP2
chr1 16767256 16785385 NECAP2
chr1 16767256 16785491 NECAP2
Note that the isoforms here share the same name because we selected the
"name2" field, which is not a unique identifier.
7. On the left-hand side of your Galaxy session, select
"Operate on Genomic Intervals" and then click "Merge".
Be aware that this operation may merge two or more different genes
together if they overlap.
8. Choose your input data as the target dataset and select "Execute".
9. You will now have a new dataset in Galaxy that looks like the following:
chr21 30248706 30257667
chr21 30302769 30365264
You may want to ask the Galaxy team whether there is a way to retain the
name2 column information, or whether there is a better way to use their
tools to get your desired results:
http://wiki.galaxyproject.org/Support
10. One way to bring back the gene names is to join this table to the
original data. Under "Operate on Genomic Intervals" click "Join", setting
the merge output first and the original data second. You will get output
like:
chrY 9195451 9368097 chrY 9304609 9307170 TSPY1
chrY 9195451 9368097 chrY 9236075 9368097 TSPY4
chrY 9195451 9368097 chrY 9195451 9218292 TSPY4
chrY 9195451 9368097 chrY 9195451 9198014 TSPY8
chrY 9195451 9368097 chrY 9236075 9238638 TSPY3
chrY 9195451 9368097 chrY 9215730 9238638 TSPY4
This shows the merged region on the left and the various isoforms
on the right. In the example above, overlapping regions were merged
even though they have different gene names.
11. You could then use the "Cut" tool under Galaxy's "Text Manipulation"
tools to pull out the first three columns and the last (name) column,
giving information about the merge such as this:
chrY 9195451 9368097 TSPY1
chrY 9195451 9368097 TSPY4
chrY 9195451 9368097 TSPY4
chrY 9195451 9368097 TSPY8
chrY 9195451 9368097 TSPY3
chrY 9195451 9368097 TSPY4
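
For readers who would rather check the logic outside Galaxy, steps 7 through 11 can be approximated with a short Python sketch. This is only an illustration of the merge, join, and cut idea, not Galaxy's implementation. The function names are hypothetical, the rows are copied from the example output above, and treating intervals that merely touch as non-overlapping is an assumption.

```python
from collections import defaultdict

# Rows as returned in step 6: (chrom, start, end, name2).
# Values are copied from the example output in this message.
rows = [
    ("chr1", 16767256, 16785385, "NECAP2"),
    ("chr1", 16767256, 16785385, "NECAP2"),
    ("chr1", 16767256, 16785491, "NECAP2"),
    ("chrY", 9304609, 9307170, "TSPY1"),
    ("chrY", 9236075, 9368097, "TSPY4"),
    ("chrY", 9195451, 9218292, "TSPY4"),
    ("chrY", 9195451, 9198014, "TSPY8"),
    ("chrY", 9236075, 9238638, "TSPY3"),
    ("chrY", 9215730, 9238638, "TSPY4"),
]


def merge_intervals(rows):
    """Merge overlapping intervals per chromosome (hypothetical helper)."""
    by_chrom = defaultdict(list)
    for chrom, start, end, _name in rows:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        cur_start, cur_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if cur_start is None:
                cur_start, cur_end = start, end
            elif start < cur_end:
                # Overlap: extend the current region (assumption: intervals
                # that only touch end-to-start are kept separate).
                cur_end = max(cur_end, end)
            else:
                merged.append((chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end))
    return merged


def names_per_merged(merged, rows):
    """Join merged regions back to the rows and keep the distinct names."""
    result = []
    for chrom, m_start, m_end in merged:
        names = sorted({name for c, s, e, name in rows
                        if c == chrom and s < m_end and e > m_start})
        result.append((chrom, m_start, m_end, names))
    return result


merged = merge_intervals(rows)
annotated = names_per_merged(merged, rows)
for chrom, start, end, names in annotated:
    print(chrom, start, end, ",".join(names), sep="\t")
```

Listing all names per merged region makes it easy to spot regions, such as the chrY one here, where several different genes were fused into a single interval.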
One last note: you may want to use the UCSC Genes knownCanonical
table. Its genes are clustered based on whether they have overlapping
exons; then, within each cluster, the gene with the highest number of
coding bases is chosen as the representative canonical gene.
1. Navigate to the table browser, and make the following selections:
Clade: Mammal
Genome: Human
Assembly: Feb. 2009 (GRCh37/hg19)
Group: Genes and Gene Prediction Tracks
Track: UCSC Genes
Table: knownCanonical
2. Be sure to have the region set to "genome".
3. Select "Selected fields from primary and related tables" in the
Output Format, and click "get output".
4. Select "chrom", "chromStart", and "chromEnd" from the top knownCanonical
table, and also "geneSymbol" from the next "hg19.kgXref" table.
5. Click the "get output" button. You will get output like:
chr1 367658 368597 OR4F16
chr1 420205 421839 BC036251
chr1 566092 566115 JA429830
chr1 566134 566155 JA429831
chr1 566239 566263 JA429505
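
To make the knownCanonical selection rule described above more concrete, here is a minimal Python sketch of "pick the member with the most coding bases per cluster". It is a simplified illustration, not the actual UCSC pipeline. The function name, cluster ids, transcript ids, and base counts are all made up, and keeping the first transcript seen on a tie is an assumption.

```python
def pick_canonical(transcripts):
    """Return {cluster_id: transcript_id} with the most coding bases.

    transcripts: iterable of (cluster_id, transcript_id, coding_bases).
    Hypothetical helper; ties keep the first transcript seen (assumption).
    """
    best = {}
    for cluster_id, tx_id, coding_bases in transcripts:
        if cluster_id not in best or coding_bases > best[cluster_id][1]:
            best[cluster_id] = (tx_id, coding_bases)
    return {cid: tx_id for cid, (tx_id, _bases) in best.items()}


# Made-up example: two clusters of overlapping transcripts.
example = [
    (1, "txA", 900),
    (1, "txB", 1500),
    (2, "txC", 300),
]
canonical = pick_canonical(example)
print(canonical)
```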
Thank you again for your inquiry and for using the UCSC Genome Browser.
If you have further questions, please feel free to contact the mailing list
again at
gen...@soe.ucsc.edu.
All the best,
Brian Lee
UCSC Genome Bioinformatics Group