Hi Leland,
Thank you for your question about UCSC Genes identifiers in the TCGA
data.
It looks like these IDs are from an older version of the UCSC Genes
track (knownGene 5) that was released back in 2009. We don't have a
table to accurately map each UCSC ID to an Ensembl ID. However, you
can update these IDs to the most recent ones available for hg19
using a combination of our table dumps and UNIX commands. Hopefully,
having more up-to-date IDs will help with your conversion in DAVID. I
was able to take the IDs you provided and convert them to more
current ones using these steps:
1. Place your list of unmapped IDs from DAVID into a text file. I
named mine "tcga.ucscIds.unmapped.txt".
2. Obtain the dumps for the "kg5ToKg6" and "kg6ToKg7" tables for the
hg19 human assembly from our downloads server here:
http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/.
3. Find the kg6 IDs for your kg5 IDs. To do so, use the UNIX commands
grep and awk and put these IDs into a file like so:
    grep -f tcga.ucscIds.unmapped.txt kg5ToKg6.txt | awk '{print $5}' > tcga.ucscIds.unmapped.5to6.txt
Note that the file "kg5ToKg6.txt" is the dump of the kg5ToKg6
table that you obtained in step 2.
Those IDs with no match in the new gene prediction set will
show "none" in the output.
4. Filter out those items with no matches in the new set. The file
from step 3 holds a single column, so grep alone is enough:
    grep -v "none" tcga.ucscIds.unmapped.5to6.txt > tcga.ucscIds.unmapped.6only.txt
5. Find the kg7 IDs for your kg6 IDs and then filter out those with
no matches in the new set:
    grep -f tcga.ucscIds.unmapped.6only.txt kg6ToKg7.txt | grep -v "none" | awk '{print $5}' > tcga.ucscIds.unmapped.6to7.txt
Some of these steps (especially 3 and 5) might take some time to
complete, so ensure you have the time to run them before starting
them.
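If these steps run slowly, or if grep -f picks up extra rows (it
treats each ID as a pattern, so an ID such as "uc001aaa.2" would also
match "uc001aaa.20"), an exact-match lookup in awk is one alternative.
This is only a sketch: the function name map_ids and the example IDs
are made up for illustration, and it assumes the dumps are
decompressed, tab-separated, and have the old ID in column 1 and the
new ID in column 5, as the commands above imply.

```shell
# Sketch: exact-match lookup of old IDs in a kgXToKgY table dump.
# Assumption: tab-separated dump, old ID in column 1, new ID in column 5.
# Usage: map_ids <file of IDs, one per line> <table dump>
map_ids() {
    awk -F'\t' '
        # First file (the ID list): remember every ID we are looking for.
        NR == FNR { want[$1] = 1; next }
        # Second file (the dump): print the new ID when the old ID is wanted
        # and the new ID is neither empty nor "none".
        ($1 in want) && $5 != "" && $5 != "none" { print $5 }
    ' "$1" "$2" | sort -u
}
```

With that function defined, steps 3 to 5 would reduce to:
    map_ids tcga.ucscIds.unmapped.txt kg5ToKg6.txt > tcga.ucscIds.unmapped.6only.txt
    map_ids tcga.ucscIds.unmapped.6only.txt kg6ToKg7.txt > tcga.ucscIds.unmapped.6to7.txt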
If you input these updated IDs into DAVID, you should hopefully see
some more mappings between these UCSC IDs and Ensembl Gene IDs,
although I would be surprised if you ever saw a 100% conversion of
these IDs. There may be some cases where a UCSC ID has no
corresponding Ensembl ID, or vice versa.
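To get a rough sense of how many IDs survive each step before going
back to DAVID, you could count the distinct IDs in each file. The
helper below is a hypothetical sketch (count_ids is not part of any
UCSC tool), and the counts are only approximate, since one old ID may
correspond to more than one new ID.

```shell
# Sketch: print the number of distinct, non-empty IDs in a file.
# Only the first whitespace-separated field of each line is counted.
count_ids() {
    awk 'NF { seen[$1] = 1 } END { n = 0; for (id in seen) n++; print n }' "$1"
}
```

For example:
    for f in tcga.ucscIds.unmapped.txt tcga.ucscIds.unmapped.6only.txt tcga.ucscIds.unmapped.6to7.txt; do
        printf '%s\t%s\n' "$f" "$(count_ids "$f")"
    done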
Lastly, I don't really have much advice for working with TCGA data
as that's outside the scope of this mailing list. This mailing list
is intended to provide assistance with the UCSC Genome Browser
software (http://genome.ucsc.edu/) and the data that we provide. If
you have specific questions about working with TCGA data, I would
recommend searching the web for answers or asking them on a more
general bioinformatics help forum, such as Biostars
(https://www.biostars.org/) or SeqAnswers (http://seqanswers.com/).
I hope this is helpful. If you have any further questions, please
reply to gen...@soe.ucsc.edu. All messages sent to that address are
archived on a publicly accessible Google Groups forum. If your
question includes sensitive data, you may send it instead to
genom...@soe.ucsc.edu.
Matthew Speir
UCSC Genome Bioinformatics Group