TCGA UCSC IDs Help


Leland Dunwoodie

Aug 18, 2016, 5:37:40 PM
to gen...@soe.ucsc.edu
Hello!

Leland Dunwoodie here. I'm an undergraduate researcher at Clemson University in South Carolina. My lab is working with Level 3 RNASeqV2 data from the Cancer Genome Atlas (TCGA). We are having difficulties determining which annotation of UCSC was used to create this data and how to map this data to other IDs, specifically Ensembl IDs. I tried using DAVID to map these IDs to UCSC IDs and then to Ensembl IDs, but about 6,000 of about 73,000 IDs were unable to map to either UCSC or Ensembl. Attached are the Gene IDs we are using. My questions are as follows:
1) Which annotation of UCSC is this?
2) Does a mapping table exist for this annotation?
3) Do you have any other advice in working with the publicly-available RNASeqV2 data from the Cancer Genome Atlas?

Thanks!
Best,
Leland

--
Leland Dunwoodie
Clemson '18 | Biochemistry | Calhoun Honors College
Academic Affairs Committee Chair | Undergraduate Student Senate
Pronouns: He, Him, His

"Do your best and forget the rest!"
TCGA UCSC IDs text.txt

Matthew Speir

Aug 24, 2016, 6:17:46 PM
to Leland Dunwoodie, gen...@soe.ucsc.edu
Hi Leland,

Thank you for your question about UCSC Genes identifiers in the TCGA data.

It looks like these IDs are from an older version of the UCSC Genes track (knownGene 5) that was released back in 2009. We don't have a table to accurately map each UCSC ID to an Ensembl ID. However, you can update these IDs to the most recent ones available for hg19 using a combination of our table dumps and UNIX commands. Hopefully, having more up-to-date IDs will help with your conversion in DAVID. I was able to take the IDs you provided and convert them to more current ones using these steps:

1. Place your list of unmapped IDs from DAVID into a text file. I named mine something like "tcga.ucscIds.unmapped.txt".

2. Obtain the dumps for the "kg5ToKg6" and "kg6ToKg7" tables for the hg19 human assembly from our downloads server here: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/.

3. Find the kg6 IDs for your kg5 IDs. To do so, use the UNIX commands grep and awk and put these IDs into a file like so:
    
        grep -f tcga.ucscIds.unmapped.txt kg5ToKg6.txt | awk '{print $5}' > tcga.ucscIds.unmapped.5to6.txt
   
        Note that the file "kg5ToKg6.txt" is the dump of the kg5ToKg6 table that you obtained in step 2.
        Those IDs with no match in the new gene prediction set will have "none" in this column.

4. Filter out those items with no matches in the new set:

        grep -v "none" tcga.ucscIds.unmapped.5to6.txt | awk '{print $1}' > tcga.ucscIds.unmapped.6only.txt

5. Find the kg7 IDs for your kg6 IDs and then filter out those with no matches in the new set:

        grep -f tcga.ucscIds.unmapped.6only.txt kg6ToKg7.txt | grep -v "none" | awk '{print $5}' > tcga.ucscIds.unmapped.6to7.txt

Some of these steps (especially 3 and 5) might take some time to complete, so ensure you have the time to run them before starting them.
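
As a rough alternative sketch, steps 3 through 5 can also be done with exact matching on the first column in a single pass per table, which avoids grep treating each ID as a pattern that may match anywhere in a line. This assumes the decompressed dumps are tab-separated with the old ID in column 1 and the new ID in column 5, and that unmatched rows have an empty new ID or "none" there; please check the matching .sql files on the downloads server before relying on it. The function name map_ids and the toy IDs in the demo are made up for illustration:

```shell
#!/bin/sh
# map_ids IDS_FILE TABLE_FILE  (hypothetical helper, not a UCSC tool)
# Prints the new ID (column 5) for every row of TABLE_FILE whose old ID
# (column 1) is listed in IDS_FILE. Rows with no successor are skipped.
# ASSUMPTION: tab-separated columns, old ID in column 1, new ID in column 5.
map_ids() {
    awk -F '\t' '
        NR == FNR { wanted[$1] = 1; next }                   # first file: remember the IDs
        ($1 in wanted) && $5 != "" && $5 != "none" { print $5 }
    ' "$1" "$2"
}

# Toy demo with invented IDs and coordinates, only to show the chaining.
tmp=$(mktemp -d)
printf 'uc001aaa.1\nuc001bbb.1\n' > "$tmp/ids.txt"
printf 'uc001aaa.1\tchr1\t100\t200\tuc001aaa.2\texact\t\n'  > "$tmp/kg5ToKg6.txt"
printf 'uc001bbb.1\tchr1\t300\t400\t\tnone\t\n'            >> "$tmp/kg5ToKg6.txt"
printf 'uc001ccc.1\tchr1\t500\t600\tuc001ccc.2\texact\t\n' >> "$tmp/kg5ToKg6.txt"
printf 'uc001aaa.2\tchr1\t100\t200\tuc001aaa.3\tcompatible\t\n' > "$tmp/kg6ToKg7.txt"

map_ids "$tmp/ids.txt" "$tmp/kg5ToKg6.txt" > "$tmp/kg6.txt"   # kg5 -> kg6
map_ids "$tmp/kg6.txt" "$tmp/kg6ToKg7.txt" > "$tmp/kg7.txt"   # kg6 -> kg7
cat "$tmp/kg7.txt"
```

For the real data, you would call map_ids on tcga.ucscIds.unmapped.txt and kg5ToKg6.txt, then on that result and kg6ToKg7.txt.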

If you input these updated IDs into DAVID, you should hopefully see some more mappings between these UCSC IDs and Ensembl Gene IDs, although I would be surprised if you ever see a complete 100% conversion of these IDs. There may be some cases where UCSC IDs have no corresponding Ensembl IDs, or vice versa.

Lastly, I don't really have much advice for working with TCGA data as that's outside the scope of this mailing list. This mailing list is intended to provide assistance with the UCSC Genome Browser software, http://genome.ucsc.edu/, and data that we provide. If you have specific questions about working with TCGA data, I would recommend searching around the web for answers or asking them on a more general bioinformatics help forum, such as Biostars (https://www.biostars.org/) or SeqAnswers (http://seqanswers.com/).

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group