How to convert UCSC-Gene-IDs into Gene Symbols

Achim Bell

Apr 18, 2014, 2:41:46 PM

to gen...@soe.ucsc.edu, Di

2014-04-18

Subject: How to convert UCSC-Gene-IDs into Gene Symbols

From: Achim Bell, Ph.D.

Dear people at UC Santa Cruz,

I need to convert my list of 62500 UCSC-Gene-IDs (which I have saved as a file in csv-format)

into the official gene symbols.

On Google Group I found a procedure from Pauline Fujita, from UCSC Genome Bioinformatics Group,

see at: https://groups.google.com/a/soe.ucsc.edu/forum/#!topic/genome/V_HsjEvCv1k.

I followed this procedure using the UCSC table browser site at: http://genome.ucsc.edu/cgi-bin/hgTables.

Unfortunately, this doesn't work correctly for me.

When I submitted my list of 62500 UCSC-gene-IDs in csv-format,

it reported that it did not recognize around 210 of the UCSC identifiers, which is a pity, but I can accept that.

Here comes the real problem:

At the end, your UCSC server sends me a results file;

this only shows one result column with a list of 63612 official gene symbols.

That is 1112 more genes than I originally asked to be converted.

It also contradicts your program's message, shown when I submitted my data, that it could not identify around 210 of the UCSC-gene-IDs.

Since the UCSC result file doesn't include a column with my original UCSC-gene-IDs aligned to the result column with the official gene symbols, there is no way for me to figure out what went wrong here.

I would be happy if you can help me in this matter.

P.S.

Yesterday I already tried to send you my original file with the UCSC-gene-IDs (733KB) I need to convert,

and also the UCSC result list (1.6MB) with the official gene symbols I received,

but my email was sent back to me saying that you cannot accept e-mails with sizes above 100KB.

I am happy to send you my original tables if you give me an email address that can accept these file sizes.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Detailed settings I used on the UCSC Table Browser site

at: http://genome.ucsc.edu/cgi-bin/hgTables for my gene ID conversion:

'clade:' "Mammal"

'genome:' "Human"

'assembly:' "Feb. 2009 (GRCh37/hg19)"

'group:' "Genes and Gene Predictions"

'track:' "UCSC Genes"

'table:' "kgXref"

'region:' "genome"

In the "identifiers (names/accessions)" section,

I uploaded the list of my 62500 UCSC-gene-IDs as a csv file.

Then I selected: 'output format:' "selected fields from primary and related tables"

I named my 'output file:' "UCSC Gene Name Conversion"

I selected 'file type returned:' "plain text"

Then I clicked on "get output", which opened a new menu page.

Here I checked the fields "gene symbol" and "gene description" in the different tables.

I scrolled to the bottom and clicked on "allow selection from checked tables"

Then I moved up and clicked on "get output".

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

I appreciate your help,

Thanks,

Achim Bell

Jonathan Casper

Apr 21, 2014, 8:20:49 PM

to Achim Bell, gen...@soe.ucsc.edu, Di

Hello Achim,

Thank you for your question about converting UCSC gene IDs into gene names. I agree, it does not sound like using the "identifiers" is quite appropriate for your needs. You can add the UCSC ID number to the results of your query by also checking the field "kgID" in addition to "gene symbol" and "gene description", which may help with your troubleshooting, but I am also quite confused why you received more than 1000 extra lines. You can send the data files to me privately if you would like some help looking into it.
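
Once "kgID" is included in the output, one way to look for the source of the extra lines is to check whether any ID appears on more than one row. The following is only a minimal sketch: the function name `repeated_ids` is hypothetical, and it assumes the Table Browser result is tab-separated, has kgID as its first column, and marks its header line with a leading `#`.

```shell
# Hypothetical helper: list every kgID that occurs on more than one row
# of a tab-separated Table Browser result (assumed: kgID in column 1).
#   $1 = path to the downloaded result file
repeated_ids() {
    awk -F '\t' '
        /^#/ { next }            # skip the header/comment line
        { count[$1]++ }          # tally rows per kgID
        END {
            for (id in count)
                if (count[id] > 1)
                    printf "%s\t%d\n", id, count[id]
        }
    ' "$1" | sort
}
```

Running `repeated_ids results.txt` (with `results.txt` standing in for your downloaded file) prints each repeated ID followed by its row count; an empty result would suggest the extra lines have some other cause.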

One of our engineers notes that you should also be able to obtain results from our public MySQL server (http://genome.ucsc.edu/goldenPath/help/mysql.html) with the following UNIX command, shown here for the hg19 database to match your assembly:

mysql -A -u genome -h genome-mysql.soe.ucsc.edu -D hg19 -e 'select name,geneSymbol from knownGene,kgXref where kgID = name order by name' > ucscGeneSymbols.txt

Then, if you have a text file in which each line contains one UCSC gene ID, you can extract the matching lines using the grep command like this:

grep -Fwf myUcscGenes.txt ucscGeneSymbols.txt
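
Note that grep prints only the lines that match, so IDs without a symbol silently disappear from the output. If you would rather keep one output line per submitted ID, in the original order, an awk lookup along the following lines may help. This is a sketch under assumptions, not a tested UCSC tool: the function name `lookup_symbols` and the `NOT_FOUND` placeholder are made up for illustration, and the ID list is assumed to hold one ID per line (a one-column CSV export with quotes, trailing commas, or Windows line endings is tolerated).

```shell
# Hypothetical helper: print each requested UCSC ID next to its gene symbol,
# in the order of the ID list; IDs without a match get "NOT_FOUND".
#   $1 = ID list (one ID per line, or a one-column CSV export)
#   $2 = tab-separated "ID<TAB>symbol" table, e.g. ucscGeneSymbols.txt
lookup_symbols() {
    awk -F '\t' '
        NR == FNR {                     # first file read: the ID-to-symbol table
            symbol[$1] = $2
            next
        }
        {
            id = $0
            gsub(/[", \t\r]/, "", id)   # strip CSV quotes, commas, blanks, CRs
            if (id == "") next          # ignore empty lines
            if (id in symbol) print id "\t" symbol[id]
            else              print id "\tNOT_FOUND"
        }
    ' "$2" "$1"
}
```

For example, `lookup_symbols myUcscGenes.txt ucscGeneSymbols.txt > aligned.txt` would write one line per submitted ID, and `grep -c NOT_FOUND aligned.txt` would then count the IDs that had no match.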

If you do not have access to a UNIX command line, you may instead be interested in using the tools at Galaxy (https://usegalaxy.org). Galaxy allows you to take the text from queries to the UCSC Table Browser and interact with it directly. You would start by loading a UCSC Table Browser query with the kgID and geneSymbol columns from the kgXref table, and then use Galaxy's "Join two Datasets" tool (part of the "Join, Subtract, and Group" tool set) to collect only lines that match your list of UCSC ID numbers. Your ID list would be the first data set, and the results of the UCSC Table Browser query would be the second data set. If you would like to follow this path, please note that you should not specify your identifiers in the UCSC Table Browser. The Galaxy Join tool will take care of imposing that limit.

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu or genome...@soe.ucsc.edu. Questions sent to those addresses will be archived in publicly-accessible forums for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group
