Hi Mei,
Thank you for your questions about matching your gene symbols to
UCSC Genes identifiers. The kgXref table contains many fields other
than the UCSC Genes ID and the gene symbol. You can see a full
description of its contents here:
http://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&hgta_group=genes&hgta_track=knownGene&hgta_table=kgXref&hgta_doSchema=describe+table+schema
The grep command you are using will print out the entire line where
a match is found. This means that if you have an entry in your
"gene_symbols" file like
Oprk1
then you will get lines like this in your "gene_symbols_key"
output file:
uc007afo.1 NM_011011 Q14AL5 Q14AL5_MOUSE Oprk1 NM_011011 NP_035141 kappa-type opioid receptor
I assume what you are looking for is a file that contains just a
mapping of your gene symbols to the UCSC Genes IDs, like so:
uc007afo.1 Oprk1
To get just these two columns in your output, you will need to trim
out the excess columns in your "kgXref.txt" file. To do this, use
the following command:
awk '{print $1,$5}' kgXref.txt > kgXref.ucId.geneSymbol.txt
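One caveat with the awk command above: by default awk splits on any run of whitespace, so a row with an empty field would shift the columns and $5 would no longer be the gene symbol. Here is a minimal sketch of a tab-aware variant, assuming your kgXref.txt is tab-separated. The sample file name and its second row (uc000xxx.1, GeneX) are made up for illustration and are not real UCSC entries.

```shell
# Hypothetical two-row sample standing in for kgXref.txt (tab-separated).
# The second row is invented and has empty third and fourth fields.
printf 'uc007afo.1\tNM_011011\tQ14AL5\tQ14AL5_MOUSE\tOprk1\tNM_011011\tNP_035141\tkappa-type opioid receptor\n' > kgXref.sample.txt
printf 'uc000xxx.1\tNM_000000\t\t\tGeneX\tNM_000000\tNP_000000\texample description\n' >> kgXref.sample.txt

# -F'\t' splits on tabs only, so empty fields keep their position
# and column 5 is still the gene symbol on every row.
awk -F'\t' '{print $1, $5}' kgXref.sample.txt > kgXref.ucId.geneSymbol.sample.txt

cat kgXref.ucId.geneSymbol.sample.txt
```

On your real data, the same idea would be the original command with -F'\t' added after awk.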
This will give you a file that contains the UCSC Genes IDs in column
one and the gene symbols in column two. You can then feed that file
into your grep command to map your gene symbols to UCSC Genes IDs
like so:
grep -Fwf gene_symbols kgXref.ucId.geneSymbol.txt > gene_symbols_key.V2.out
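Note that grep -w treats any character other than a letter, digit, or underscore as a word boundary, so a symbol can also match a longer hyphenated symbol that contains it. If that matters for your list, the following sketch shows one possible alternative that compares the second column exactly. The file names ending in ".sample" and the symbols Abc and Abc-1 are hypothetical, chosen only to show the difference.

```shell
# Hypothetical sample inputs; every symbol and ID except Oprk1/uc007afo.1 is made up.
printf 'Oprk1\nAbc\n' > gene_symbols.sample
printf 'uc007afo.1 Oprk1\nuc000aaa.1 Abc\nuc000bbb.1 Abc-1\n' > ucId.geneSymbol.sample.txt

# grep -Fw also reports the "Abc-1" row, because "-" ends a word.
grep -Fwf gene_symbols.sample ucId.geneSymbol.sample.txt > grep.sample.out

# awk: while reading the first file (NR == FNR), store each symbol as an
# array key; for the second file, print only rows whose column 2 is
# exactly one of the stored symbols.
awk 'NR == FNR { wanted[$1]; next } $2 in wanted' \
    gene_symbols.sample ucId.geneSymbol.sample.txt > exact.sample.out

cat exact.sample.out
```

Both approaches are case-sensitive, so the symbols in your list need to use the same capitalization as the table.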
I hope this is helpful. If you have any further questions, please
reply to gen...@soe.ucsc.edu. All messages sent to that address are
archived on a publicly-accessible Google Groups forum. If your
question includes sensitive data, you may send it instead to
genom...@soe.ucsc.edu.
Matthew Speir
UCSC Genome Bioinformatics Group