GO-Terms and gene descriptions for mm9 RefSeq genes


Andy Rampersaud

Dec 8, 2015, 3:41:37 PM
to gen...@soe.ucsc.edu
Hi,

I'm working with the mm9 assembly and a list of 24K RefSeq genes. For each of these gene symbols, I would like to get the corresponding <Gene Description> and <GO Terms>.

Example:
Gene symbol: Stat5b
<Gene Description>: Signal transducer and activator of transcription 6
<GO Terms>: negative regulation of type 2 immune response (GO:0002829); DNA binding (GO:0003677); transcription factor activity, sequence-specific DNA

I tried using the Table Browser with the "kgXref" table linked to the "goaPart" and "term" tables of the "go" database, selecting fields and then clicking "get output". But the query runs for a couple of minutes and then times out.

What would be an alternative way to retrieve this information (preferably in the format specified above for GO Terms)?

Andy


--
Andy Rampersaud
Graduate Student, Bioinformatics
Waxman Lab, Boston University

Christopher Lee

Dec 10, 2015, 12:12:50 PM
to Andy Rampersaud, gen...@soe.ucsc.edu
Dear Andy,

Thank you for your question about retrieving the GO terms for a list of corresponding RefSeq Genes.
Unfortunately it is not possible to use the Table Browser for this query, as the request will time
out, but there are a few different ways you can still obtain that information.

If you have MySQL client libraries installed on your computer, you can connect to our public MySQL database.
https://genome.ucsc.edu/goldenPath/help/mysql.html

Below is a query which should provide the gene symbol, description and associated GO terms all on one line, with the terms as a comma-separated list:

mysql -A -u genome -h genome-mysql.soe.ucsc.edu -Ne 'select l.name,k.description,t.name from refLink l,refGene g, go.term t, go.goaPart p, kgXref k where l.mrnaAcc=g.name and l.name=k.geneSymbol and k.spId=p.dbObjectId and p.goId = t.acc' mm9 | awk 'BEGIN {FS="\t"} {if ($1 == last) printf "%s,",$3; else {printf "\n%s\t%s\t%s,", $1,$2,$3} last=$1 } END {printf "\n"}'
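
If the GO IDs are wanted alongside the term names, in the "term (GO:nnnnnnn); ..." form from the original question, one option is to add `t.acc` as a fourth selected column in the query above and do the grouping in a small script instead of awk. The following is a minimal Python sketch under that assumption. It expects four tab-separated columns (symbol, description, term name, GO accession), and it assumes `t.acc` holds IDs of the form `GO:0003677`. The function name `group_go_terms` is illustrative and not part of any UCSC tool.

```python
def group_go_terms(lines):
    """Collapse (symbol, description, term, acc) rows into one row per symbol.

    Assumes tab-separated input with exactly four columns per line, e.g. the
    query output with `t.acc` added as a fourth selected column (an
    assumption; the query as posted selects only three columns).
    """
    genes = {}  # symbol -> (description, list of unique formatted terms)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        symbol, description, term, acc = line.split("\t")
        entry = genes.setdefault(symbol, (description, []))
        formatted = "{} ({})".format(term, acc)
        # The join through refGene can return the same row more than once,
        # so keep only the first occurrence of each term.
        if formatted not in entry[1]:
            entry[1].append(formatted)
    # Emit one tab-separated row per symbol, terms joined with "; ".
    return [
        "{}\t{}\t{}".format(symbol, desc, "; ".join(terms))
        for symbol, (desc, terms) in genes.items()
    ]
```

With the query output saved to a file (the name `mm9_go.tsv` is hypothetical), `group_go_terms(open("mm9_go.tsv"))` returns the merged rows. Unlike the awk one-liner, this does not depend on rows for the same gene being adjacent in the output.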

You can also download GO annotations directly using QuickGO. Note that the link below filters on tax=9606 (human); for mouse, the taxon ID is 10090:
https://www.ebi.ac.uk/QuickGO/GAnnotation?tax=9606&a=&termUse=ancestor&relType=IPO%3D&customRelType=IPOR%2B-%3F%3D&protein=&qualifier=&goid=&ref=&evidence=&with=&source=&q=&col=proteinDB%2CproteinID%2CproteinSymbol%2Cqualifier%2CgoID%2CgoName%2Caspect%2Cevidence%2Cref%2Cwith%2CproteinTaxon%2Cdate%2Cfrom%2Csplice&select=normal

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Christopher Lee
UCSC Genome Bioinformatics Group


Andy Rampersaud

Dec 11, 2015, 5:16:19 PM
to Christopher Lee, gen...@soe.ucsc.edu
Hi Christopher,

Thank you for your helpful response. I installed MySQL on my computer and ran the query successfully (it took only 20 seconds), and the output is properly formatted (thanks for the pipe into awk). I have an additional issue I was hoping to address:

I'm working with a list of 24K RefSeq gene symbols, but some of them (about 3800) are missing from the Table Browser output.  Here's my observation: For gene symbols with a "description", when the user searches for the name, there's a description drop-down.  For gene symbols without a "description", when the user searches for the name, there's no drop-down.

gene symbols with a "description":
Marveld1
Rad17
Stat5a
Tekt5
Zap70

gene symbols without a "description":
Babam1
Dcstamp
Rubie
Supt4a

However, when the user clicks on any gene symbol, there's always a label present for the "Description" and "PubMed on Gene" fields.  I don't know why the 3800 would be missing from the Table Browser output. My question: is there a way to get a full list of gene symbols with corresponding <Gene Descriptions>?  I tried running the following query:

mysql -A -u genome -h genome-mysql.soe.ucsc.edu -Ne 'select DISTINCT l.name,k.description from refLink l, kgXref k  where l.name=k.geneSymbol' mm9

But I'm still missing the above set of [gene symbols without a "description"].  Is there another query I can run to get the full list?

Thanks,
Andy

Christopher Lee

Dec 16, 2015, 12:37:24 PM
to Andy Rampersaud, gen...@soe.ucsc.edu

Dear Andy,

Thank you for your question about the gene descriptions missing
from the Table Browser output and the drop-down search. The
discrepancies you have noted arise because the underlying tables
are updated on different schedules.

The kgXref table is missing certain gene descriptions because it
was built along with the knownGene table, which for mm9 has not
been updated for a few years. As a result, kgXref is out of sync
with the refGene and refLink tables, both of which are updated
almost weekly. Thus, when your MySQL query uses kgXref.description,
some gene descriptions will be missing.

Descriptions appear in the drop-down for some items and not for
others because the drop-down description comes from the
knownCanonical table, which does not contain the items you have
mentioned. However, when you click on a refGene item in the
RefSeq track and see the description, that description comes from
the refLink table. refLink and refGene are updated frequently,
as mentioned above, and so they contain the genes you are
looking for.

I hope this is helpful. If you have any further questions, please reply
to gen...@soe.ucsc.edu. All messages sent to that address are archived
on a publicly-accessible Google Groups forum. If your question includes
sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Andy Rampersaud

Jan 6, 2016, 11:19:27 AM
to Christopher Lee, gen...@soe.ucsc.edu
Hi Christopher,

Thank you for the information regarding the table update frequency. Just to conclude the issue, here's the MySQL query that I ended up using:

mysql -A -u genome -h genome-mysql.soe.ucsc.edu -Ne 'select DISTINCT g.name2, l.product from refGene g, refLink l  where g.name=l.mrnaAcc' mm9 > UCSC_Table_Query_4.txt

I figured I would just use the most up-to-date tables.  The resulting file is sufficient for my needs.  It has descriptions for most of the gene symbols listed above:

#---------------------------------------------------------------------------------
#grep 'Babam1' UCSC_Table_Query_4.txt
#Babam1    BRISC and BRCA1-A complex member 1
#grep 'Dcstamp' UCSC_Table_Query_4.txt
#Multiple listings
#grep 'Rubie' UCSC_Table_Query_4.txt
#Still has blank description
#grep 'Supt4a' UCSC_Table_Query_4.txt
#Supt4a    transcription elongation factor SPT4-A
#---------------------------------------------------------------------------------
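
The grep checks above match substrings, so a short symbol would also match any longer symbol or description that contains it. A minimal Python sketch of an exact-match check follows, assuming the two-column, tab-separated layout that the mysql client writes to UCSC_Table_Query_4.txt (symbol, then description). The function name `check_symbols` and the four status labels are illustrative only.

```python
from collections import defaultdict


def check_symbols(lines, symbols):
    """Classify each requested symbol by what the query output holds for it.

    `lines` is an iterable of "symbol<TAB>description" strings; `symbols` is
    the list of gene symbols to look up. Returns a dict mapping each symbol
    to "missing", "blank", "single", or "multiple" (hypothetical labels).
    """
    descriptions = defaultdict(set)  # symbol -> set of distinct descriptions
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # skip lines without a symbol
        # A row with no second column counts as an empty description.
        desc = fields[1].strip() if len(fields) > 1 else ""
        descriptions[fields[0]].add(desc)

    status = {}
    for symbol in symbols:
        found = descriptions.get(symbol)
        if found is None:
            status[symbol] = "missing"    # symbol not in the file at all
        elif found == {""}:
            status[symbol] = "blank"      # present, but description is empty
        elif len(found - {""}) == 1:
            status[symbol] = "single"     # exactly one non-empty description
        else:
            status[symbol] = "multiple"   # several distinct descriptions
    return status
```

For example, `check_symbols(open("UCSC_Table_Query_4.txt"), ["Babam1", "Dcstamp", "Rubie", "Supt4a"])` would report one status per symbol, which makes it easier to run the same check over the full 24K list.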

Thanks for your help,
Andy