mapping protein domains to genomic coordinates


frederic lepretre

Oct 24, 2012, 4:28:44 AM
to gen...@soe.ucsc.edu
Hi,

I know this is an old question, but I cannot find a database or table
that would help me with my research.
For a given gene, I would like to get the genomic coordinates of each
of its associated protein domains.
Specifically, I would like to extract a table that looks like this:

Chrm  startPos  endPos  geneName  DomainNAME       databaseNAME
chrX  100250    100550  XXX       Tyrosine-Kinase  pFam

However, I have not been able to find all of this information in one place.
Could you help me with this? Thanks.
fred




--
*********************************************************
* Dr Frédéric Leprêtre *
* Institut pour la Recherche sur le Cancer de Lille *
* Plateforme de génomique fonctionnelle *
* Agilent - Affymetrix - SoliD4 - PGM *
* Place de Verdun 59045 Lille cedex *
* *
* tel. 03 20 16 92 20 poste 2339 *
* fax. 03 20 16 92 29 *
* *
* http://www.ircl.org/plate-forme-genomique.html *
* http://www.univ-lille2.fr/ *
* *
* frederic...@inserm.fr *
* ou *
* frederic...@univ-lille2.fr *
*********************************************************

Brooke Rhead

Oct 25, 2012, 9:34:20 PM
to frederic lepretre, gen...@soe.ucsc.edu
Hi Fred,

There is protein domain information displayed on items in the UCSC Genes
track from InterPro, Pfam, and SCOP that we can help you extract from
our databases. (For instance, see:
http://genome.ucsc.edu/cgi-bin/hgGene?hgg_gene=uc002ypa.3&hgg_prot=P00441&hgg_chrom=chr21&hgg_start=33031934&hgg_end=33041243&hgg_type=knownGene&db=hg19#domains)

The Pfam and SCOP domains are relatively easy to get from the Table
Browser (http://genome.ucsc.edu/cgi-bin/hgTables). There are problems
with extracting the InterPro domains using the Table Browser (the query
is complicated and often times out), but you can get them using our
public mysql server instead.

To get Pfam and SCOP domains along with genomic coordinates and gene
name, select in the Table Browser:

clade: mammal
genome: human
assembly: hg19
track: UCSC Genes
table: knownGene
region: genome
output format: selected fields from primary and related tables
Also enter a name in the "output file" box if you would like the output
to go to a file.

Hit "get output," and on the next page, scroll down to the Linked Tables
section and select:

hg19 knownToPfam
hg19 kgProtMap2
(and leave hg19 kgXref selected)

then scroll to the bottom and hit "allow selection from primary and
related tables." Now there should be some extra tables displayed under
Linked Tables. Keep all of the former selections and also select:

hg19 pfamDesc
hg19 ucscScop

Hit the "allow selection" button again and select one more table:

hg19 scopDesc

then hit the "allow selection" button a final time. Now you can select
the fields you want in your output. You don't need to include a
selection from every table . . . some are just selected because the
desired info is in a related table. I suggest something like the
following (but you can include or exclude whatever fields you like):

from hg19.knownGene:
name
chrom
cdsStart (or txStart, for transcription start instead of coding start)
cdsEnd (or txEnd)

from hg19.kgXref:
geneSymbol

from hg19.pfamDesc:
pfamAC
pfamID
description

from hg19.scopDesc:
acc
name
description

Now hit "get output." You should get output that looks something like
this (this is from the default region on hg19, for the SOD1 gene):

> #hg19.knownGene.name hg19.knownGene.chrom hg19.knownGene.cdsStart hg19.knownGene.cdsEnd hg19.kgXref.geneSymbol hg19.pfamDesc.pfamAC hg19.pfamDesc.pfamID hg19.pfamDesc.description hg19.scopDesc.acc hg19.scopDesc.name hg19.scopDesc.description
> uc002ypa.3 chr21 33032082 33040891 SOD1 PF00080 Sod_Cu Copper/zinc superoxide dismutase (SODC) 49329 b.1.8 Cu,Zn superoxide dismutase-like

Not every isoform in UCSC Genes will have domain information. Many will
have "n/a" listed for the domain columns.
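
If you prefer to post-process the Table Browser file locally, here is a minimal Python sketch that reshapes it into the "Chrm startPos endPos geneName DomainNAME databaseNAME" layout asked for above. It assumes the file is tab-separated with the header shown in the example, and that a field may hold a comma-separated list. The second sample row (ucXXXXXX.1 / GENEX) is hypothetical, invented only to show the "n/a" case. Note that the coordinates are those of the coding region of the gene, not of the individual domain.

```python
import csv

HEADER = [
    "hg19.knownGene.name", "hg19.knownGene.chrom", "hg19.knownGene.cdsStart",
    "hg19.knownGene.cdsEnd", "hg19.kgXref.geneSymbol", "hg19.pfamDesc.pfamAC",
    "hg19.pfamDesc.pfamID", "hg19.pfamDesc.description", "hg19.scopDesc.acc",
    "hg19.scopDesc.name", "hg19.scopDesc.description",
]
# Row copied from the SOD1 example output.
SOD1_ROW = [
    "uc002ypa.3", "chr21", "33032082", "33040891", "SOD1", "PF00080",
    "Sod_Cu", "Copper/zinc superoxide dismutase (SODC)", "49329", "b.1.8",
    "Cu,Zn superoxide dismutase-like",
]
# Hypothetical row (not real data) for an isoform without domain information.
NO_DOMAIN_ROW = ["ucXXXXXX.1", "chr1", "100", "200", "GENEX"] + ["n/a"] * 6

SAMPLE = "\n".join(
    ["#" + "\t".join(HEADER), "\t".join(SOD1_ROW), "\t".join(NO_DOMAIN_ROW)]
) + "\n"

# Which column holds the domain name for each source database.
DOMAIN_COLUMNS = (("hg19.pfamDesc.pfamID", "Pfam"), ("hg19.scopDesc.name", "SCOP"))


def domain_rows(text):
    """Yield (chrom, start, end, gene, domain, database) tuples."""
    lines = text.splitlines()
    # The header line starts with "#"; drop it to get plain field names.
    fields = lines[0].lstrip("#").split("\t")
    reader = csv.DictReader(
        lines[1:], fieldnames=fields, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        base = (
            row["hg19.knownGene.chrom"],
            int(row["hg19.knownGene.cdsStart"]),
            int(row["hg19.knownGene.cdsEnd"]),
            row["hg19.kgXref.geneSymbol"],
        )
        for column, database in DOMAIN_COLUMNS:
            # A field may be a comma-separated list; skip empty and "n/a" items.
            for item in row[column].split(","):
                item = item.strip()
                if item and item != "n/a":
                    yield base + (item, database)


rows = list(domain_rows(SAMPLE))
for r in rows:
    print(*r, sep="\t")
```

To use it on a real download, read the file into a string and pass it to domain_rows in place of SAMPLE.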

To get InterPro domains, you can connect to our MySQL server as
described here:

http://genome.ucsc.edu/goldenPath/help/mysql.html

and then use a MySQL query like the following:

SELECT kgId, chrom, cdsStart, cdsEnd, geneSymbol, extAcc1, extAcc2
FROM knownGene g
JOIN kgXref x ON x.kgId = g.name
JOIN uniProt.extDbRef ux ON ux.acc = x.spId
JOIN uniProt.extDb ue ON ue.id = ux.extDb
WHERE ue.val = 'Interpro';

You should get output like the following:

> +------------+-------+----------+----------+------------+-----------+-----------------------+
> | kgId       | chrom | cdsStart | cdsEnd   | geneSymbol | extAcc1   | extAcc2               |
> +------------+-------+----------+----------+------------+-----------+-----------------------+
> | uc002ypa.3 | chr21 | 33032082 | 33040891 | SOD1       | IPR024134 | SOD_Cu/Zn_/chaperones |
> | uc002ypa.3 | chr21 | 33032082 | 33040891 | SOD1       | IPR018152 | SOD_Cu/Zn_BS          |
> | uc002ypa.3 | chr21 | 33032082 | 33040891 | SOD1       | IPR001424 | SOD_Cu_Zn_dom         |
> +------------+-------+----------+----------+------------+-----------+-----------------------+

Note that in this output the same UCSC Genes identifier (e.g., uc002ypa.3)
appears on a separate line for each domain (unlike the Table Browser
output, which has a comma-separated list for each UCSC Genes identifier).
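
If you would rather have one line per UCSC Genes identifier here as well, the rows can be collapsed after the fact. The following is a rough Python sketch, not part of any UCSC tool; it assumes the MySQL rows have already been read into tuples in the column order shown above, and the function name collapse_by_gene is made up for illustration.

```python
# Rows copied from the example output:
# (kgId, chrom, cdsStart, cdsEnd, geneSymbol, extAcc1, extAcc2)
ROWS = [
    ("uc002ypa.3", "chr21", 33032082, 33040891, "SOD1", "IPR024134", "SOD_Cu/Zn_/chaperones"),
    ("uc002ypa.3", "chr21", 33032082, 33040891, "SOD1", "IPR018152", "SOD_Cu/Zn_BS"),
    ("uc002ypa.3", "chr21", 33032082, 33040891, "SOD1", "IPR001424", "SOD_Cu_Zn_dom"),
]


def collapse_by_gene(rows):
    """Merge rows that share the first five columns into one row each,
    joining the InterPro accessions and names with commas."""
    grouped = {}  # dicts keep insertion order, so input order is preserved
    for kg_id, chrom, cds_start, cds_end, symbol, acc, name in rows:
        key = (kg_id, chrom, cds_start, cds_end, symbol)
        accs, names = grouped.setdefault(key, ([], []))
        accs.append(acc)
        names.append(name)
    return [
        key + (",".join(accs), ",".join(names))
        for key, (accs, names) in grouped.items()
    ]


collapsed = collapse_by_gene(ROWS)
for row in collapsed:
    print(*row, sep="\t")
```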

If you would like to join the output from the two different queries into
one large table, I recommend the Galaxy website
(https://main.g2.bx.psu.edu/), which is run by Penn State and works in
conjunction with the Table Browser. You can send your Table Browser
output to the site by selecting the "Send output to Galaxy" checkbox
instead of entering an output file name. Then you can upload a file
with your MySQL output, and use the "Join two Datasets" tool (under
"Join, Subtract and Group") to join your queries together. There are
also several tools in the "Text Manipulation" section that you might find
useful. If you have any questions about how to use the Galaxy site,
their helpdesk address is galax...@lists.bx.psu.edu.
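
As a local alternative to Galaxy, the two result sets can also be joined on the UCSC Genes identifier with a few lines of Python. This is only a sketch under stated assumptions: both outputs are already loaded as tuples with the identifier in the first column, the function name join_on_gene_id is hypothetical, and keeping unmatched Table Browser rows with "n/a" placeholders is a choice made here, not necessarily what Galaxy's join tool does.

```python
# Table Browser side: (kgId, chrom, cdsStart, cdsEnd, geneSymbol, pfamID, scopName)
TABLE_BROWSER_ROWS = [
    ("uc002ypa.3", "chr21", 33032082, 33040891, "SOD1", "Sod_Cu", "b.1.8"),
]
# MySQL side, reduced to (kgId, extAcc1, extAcc2)
INTERPRO_ROWS = [
    ("uc002ypa.3", "IPR024134", "SOD_Cu/Zn_/chaperones"),
    ("uc002ypa.3", "IPR018152", "SOD_Cu/Zn_BS"),
    ("uc002ypa.3", "IPR001424", "SOD_Cu_Zn_dom"),
]


def join_on_gene_id(table_browser_rows, interpro_rows):
    """Join the two row sets on the UCSC Genes ID in the first column.

    Emits one output row per InterPro domain; Table Browser rows without
    a match are kept, with "n/a" in the InterPro columns.
    """
    interpro = {}
    for kg_id, acc, name in interpro_rows:
        interpro.setdefault(kg_id, []).append((acc, name))
    joined = []
    for row in table_browser_rows:
        for acc, name in interpro.get(row[0], [("n/a", "n/a")]):
            joined.append(row + (acc, name))
    return joined


joined = join_on_gene_id(TABLE_BROWSER_ROWS, INTERPRO_ROWS)
for row in joined:
    print(*row, sep="\t")
```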

If you have any further questions for UCSC, please reply to
gen...@soe.ucsc.edu.

--
Brooke Rhead
UCSC Genome Bioinformatics Group