Grep not working properly on kgXref.txt

Mei San Tang

Jun 26, 2015, 6:17:01 PM

to gen...@soe.ucsc.edu

Dear UCSC team,

I have a list of gene symbols (from a microarray experiment) that I would like to convert to the matching UCSC IDs. I downloaded the mm9 kgXref.txt (from http://hgdownload.soe.ucsc.edu/downloads.html) and was trying to do the conversion on Unix:

grep -Fwf gene_symbols kgXref.txt > gene_symbols_key

However, I noticed that my grep command pulled out many other genes from the kgXref.txt file that were not in the original list of gene_symbols. I also ran a reverse grep (with -v) to identify the genes that were pulled out of kgXref.txt but were not on the original list.

I have tried several methods, including regenerating the list of gene symbols, re-downloading the kgXref.txt file and using grep -wf, but none of them helped. I'd appreciate any input on this. The files described are all attached.

Thanks,

Mei San

gene_symbols_key.txt

gene_symbols

unmatched

Matthew Speir

Jun 26, 2015, 7:22:10 PM

to Mei San Tang, gen...@soe.ucsc.edu

Hi Mei,

Thank you for your question about matching your gene symbols to UCSC Genes identifiers. The kgXref table contains many fields besides the UCSC Genes ID and the gene symbol; you can see a full description of its contents here: http://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&hgta_group=genes&hgta_track=knownGene&hgta_table=kgXref&hgta_doSchema=describe+table+schema. The grep command you are using prints the entire line wherever a match is found. This means that if you have an entry in your file "gene_symbols" like

    Oprk1

Then you will get lines like this in your output file "gene_symbols_key":

    uc007afo.1    NM_011011    Q14AL5    Q14AL5_MOUSE    Oprk1    NM_011011    NP_035141    kappa-type opioid receptor
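As a small illustration of that behavior, here is a sketch using a made-up two-line table. The first row is the example line above; the second row, its `uc999zzz.1` ID, and its `Xyz1` symbol are hypothetical and not taken from kgXref.txt. Because grep tests the whole line, a symbol that appears only in the description column still selects the row, which is one way rows for genes outside the original list can end up in the output:

```shell
# Toy data: tab-separated, gene symbol in column 5.
# The second row is invented; its description happens to mention Oprk1.
tmp=$(mktemp -d)
printf 'uc007afo.1\tNM_011011\tQ14AL5\tQ14AL5_MOUSE\tOprk1\tNM_011011\tNP_035141\tkappa-type opioid receptor\n' > "$tmp/toy.txt"
printf 'uc999zzz.1\tNM_000000\tP00000\tP00000_MOUSE\tXyz1\tNM_000000\tNP_000000\thypothetical regulator of Oprk1\n' >> "$tmp/toy.txt"
printf 'Oprk1\n' > "$tmp/symbols"

# -F: fixed strings, -w: whole words only, -f: read patterns from a file.
# Both rows are selected, although only the first has Oprk1 as its symbol.
grep -Fwf "$tmp/symbols" "$tmp/toy.txt" > "$tmp/hits"
cat "$tmp/hits"
```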

I assume what you are looking for is a file that contains just a mapping of your gene symbols to the UCSC Genes IDs, like so:

    uc007afo.1 Oprk1

To get just these two columns in your output, you will need to trim out the excess columns in your "kgXref.txt" file. To do this, use the following command:

    awk '{print $1,$5}' kgXref.txt > kgXref.ucId.geneSymbol.txt
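One caveat with the command above: by default awk splits on any run of whitespace, so if a row has an empty column before the gene symbol, `$5` would shift to a later column. A hedged variant, assuming kgXref.txt is tab-delimited with the symbol in column 5 (as in the example line in this thread), sets the field separator explicitly. The second toy row below is invented for illustration:

```shell
tmp=$(mktemp -d)
# Row 1 is the example line from this thread; row 2 is hypothetical
# and has empty columns 3 and 4.
printf 'uc007afo.1\tNM_011011\tQ14AL5\tQ14AL5_MOUSE\tOprk1\tNM_011011\tNP_035141\tkappa-type opioid receptor\n' > "$tmp/toy.txt"
printf 'uc999zzz.1\tNM_000000\t\t\tXyz1\tNM_000000\tNP_000000\tinvented example\n' >> "$tmp/toy.txt"

# -F'\t' splits on single tabs, so empty columns are kept as empty fields.
# OFS='\t' keeps the two-column output tab-separated as well.
awk -F'\t' -v OFS='\t' '{print $1, $5}' "$tmp/toy.txt" > "$tmp/two_cols.txt"
cat "$tmp/two_cols.txt"
```

With the default separator, the second row would report a value from a later column instead of `Xyz1`.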

This will give you a file that contains the UCSC Genes IDs in column one and the gene symbols in column two. You can then feed that file into your grep command to map your gene symbols to UCSC Genes IDs like so:

    grep -Fwf gene_symbols kgXref.ucId.geneSymbol.txt > gene_symbols_key.V2.out
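Note that `grep -w` treats characters such as `-` and `.` as word boundaries and searches both columns, so a symbol can still match a longer, hyphenated name. If exact matches are required, one alternative sketch is to compare column 5 against the symbol list directly in awk. This is not the command from the reply above; it assumes a tab-delimited table with the symbol in column 5, and the `Oprk1-ps` row is hypothetical toy data:

```shell
tmp=$(mktemp -d)
printf 'Oprk1\n' > "$tmp/symbols"
printf 'uc007afo.1\tNM_011011\tQ14AL5\tQ14AL5_MOUSE\tOprk1\tNM_011011\tNP_035141\tkappa-type opioid receptor\n' > "$tmp/toy.txt"
printf 'uc999zzz.1\tNM_000000\tP00000\tP00000_MOUSE\tOprk1-ps\tNM_000000\tNP_000000\tinvented example\n' >> "$tmp/toy.txt"

# While reading the first file (NR == FNR), remember each symbol.
# While reading the second, print ID and symbol only when column 5
# equals a remembered symbol exactly.
awk -F'\t' -v OFS='\t' 'NR == FNR { want[$1]; next } $5 in want { print $1, $5 }' \
    "$tmp/symbols" "$tmp/toy.txt" > "$tmp/key.txt"
cat "$tmp/key.txt"
```

Here `grep -Fw Oprk1` would also select the `Oprk1-ps` row, while the awk comparison keeps only the exact symbol.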

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group