Standardized gene names and dates in PANCAN_mutation


daniel.hi...@gmail.com

Jul 18, 2016, 12:26:24 PM
to UCSC Xena and Cancer Genomics Browser
Gene names have been converted to dates in `PANCAN_mutation`. See https://github.com/cognoma/cancer-data/issues/4

As a potential workaround, do you have a mapping of mutations to standardized gene identifiers? In general, I think it makes sense to use standardized gene identifiers rather than symbols in datasets, since symbols can be ambiguous and are prone to formatting errors.

We're also interested in mapping the gene symbols in `HiSeqV2` to standardized identifiers.
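
A minimal Python sketch for flagging affected values, assuming the corrupted symbols look like spreadsheet-style dates such as `1-Sep` or `2-Mar`. The exact format found in `PANCAN_mutation` is an assumption here, and `find_date_like_symbols` is a hypothetical helper name.

```python
import re

# Assumed formats: "1-Sep" (day-month) or "Sep-01" (month-day), using
# three-letter English month abbreviations. Adjust if the data differ.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
DATE_LIKE = re.compile(
    rf"^(?:\d{{1,2}}-(?:{MONTHS})|(?:{MONTHS})-\d{{1,2}})$",
    re.IGNORECASE,
)

def find_date_like_symbols(symbols):
    """Return the sorted, de-duplicated values that look like dates."""
    return sorted({s for s in symbols if DATE_LIKE.match(s.strip())})

# Illustrative input only; these are not rows from the real dataset.
example = ["TP53", "1-Sep", "KRAS", "2-Mar", "1-Sep", "DEC1"]
print(find_date_like_symbols(example))
```

Flagged values cannot always be reversed from the string alone: a value such as `1-Mar` may have started out as either MARCH1 or MARC1, which is one reason a stable identifier column is useful alongside the symbol.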

Mary Goldman

Jul 22, 2016, 2:43:43 PM
to Daniel Himmelstein, UCSC Xena and Cancer Genomics Browser
Hi Daniel,

Thank you for your patience this week!

I am looking into the gene names being converted into dates and will get back to you.

While we agree that gene identifiers are more standardized, the target users of our website are biologists, who in general tend to be more familiar with symbols than with identifiers.

Please feel free to map any of our identifiers to a more standardized set of gene identifiers.

Best,
Mary
-------------
Mary Goldman
UCSC Xena Browser
http://xena.ucsc.edu/




Mary Goldman

Jul 22, 2016, 4:41:36 PM
to Daniel Himmelstein, UCSC Xena and Cancer Genomics Browser
Hi Daniel,

We just checked our files: the gene names that were converted to dates are already present in the input MAF data file we got from the TCGA DCC, which in turn comes from a sequencing center such as Broad.

Unfortunately, we can't point you to the exact place where we obtained the file, since NCI recently moved all the TCGA data to the GDC. We are currently changing our scripts to extract data from the GDC instead.


Best,
Mary
-------------
Mary Goldman
UCSC Xena Browser
http://xena.ucsc.edu/



Jing Zhu

Jul 22, 2016, 4:51:01 PM
to Mary Goldman, Daniel Himmelstein, UCSC Xena and Cancer Genomics Browser
We are interested in incorporating gene IDs (the numeric form) into our data wrangling process and including the mapping in the "probeMap" files. We are still working on all-new scripts to download and wrangle TCGA data from the GDC. It will take a bit of time, and I think it will be a good idea to re-run your notebook code on the new data once we have it.

Jing

Jing Zhu

Jul 22, 2016, 5:32:14 PM
to Daniel Himmelstein, Mary Goldman, UCSC Xena and Cancer Genomics Browser


On Fri, Jul 22, 2016 at 2:21 PM, Daniel Himmelstein <daniel.hi...@gmail.com> wrote:
> Hi Jing,
>
> Thanks for diving deep into the gene identification issue.
>
> I think it's fine to use symbols as long as the mapping is reversible to standardized identifiers. For example, if there is a file that contains a (one-to-one) Symbol–ID mapping, we can easily convert the dataset without fear of corruption.
>
> If I understand correctly: Currently there is no file that will allow us to map `HiSeqV2` from symbols to IDs. We can always use a third-party mapping, but should expect some information loss due to symbol ambiguity and version differences. In the future, Xena plans to modify its data wrangling to become more aware of gene IDs and will be able to provide a symbol-to-ID map?

There were many such files available until about this summer. I don't know where you can publicly download the older UNC data, but we have a local copy from before the DCC transition. I will send one of the expression data files, which you can use to find the mapping.

Jing

> If that's the case, then I think our current approach will be to adopt a lossy conversion to Entrez Gene followed by a lossless conversion when the Xena updates arrive. Does that make sense?
>
> Best,
> Daniel

Jing Zhu

Jul 22, 2016, 6:02:50 PM
to Daniel Himmelstein, Mary Goldman, UCSC Xena and Cancer Genomics Browser
Use the first column of the attached file to map symbols to numeric gene IDs.

Jing
unc.edu.f72bfbe6-411d-412e-aaab-1a2414e544ec.2146068.rsem.genes.normalized_results
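
A minimal Python sketch of extracting that mapping. It assumes the file is tab-separated with one header row, and that each first-column value has the form `SYMBOL|ID` (for example `A1BG|1`), with `?` standing in for a missing symbol. Neither detail is stated in this thread, so check the file before relying on it. The function name `read_symbol_to_id` is hypothetical.

```python
import csv
import io
from collections import defaultdict

def read_symbol_to_id(handle):
    """Map each symbol to the set of numeric IDs seen in the first column.

    Assumes tab-separated input with a header row and first-column values
    shaped like "SYMBOL|ID" (an assumption, not confirmed in the thread).
    """
    mapping = defaultdict(set)
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    for row in reader:
        if not row or "|" not in row[0]:
            continue  # ignore blank or unexpectedly formatted lines
        symbol, gene_id = row[0].rsplit("|", 1)
        if symbol != "?":  # "?" is assumed to mark a missing symbol
            mapping[symbol].add(gene_id)
    return dict(mapping)

# Tiny in-memory stand-in for the attached file (illustrative rows only).
sample = io.StringIO(
    "gene_id\tnormalized_count\n"
    "?|100130426\t0.0\n"
    "A1BG|1\t95.2\n"
    "A2M|2\t5716.3\n"
)
print(read_symbol_to_id(sample))
```

For the real file, the same function could be called on a handle from `open(path, newline="")`. Keeping a set of IDs per symbol, rather than a single value, makes any symbol that appears with more than one ID visible instead of silently overwriting it.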

Daniel Himmelstein

Jul 25, 2016, 12:45:36 PM
to Jing Zhu, Mary Goldman, UCSC Xena and Cancer Genomics Browser
Hi Jing,

Thanks for diving deep into the gene identification issue.

I think it's fine to use symbols as long as the mapping is reversible to standardized identifiers. For example, if there is a file that contains a (one-to-one) Symbol–ID mapping, we can easily convert the dataset without fear of corruption.

If I understand correctly: Currently there is no file that will allow us to map `HiSeqV2` from symbols to IDs. We can always use a third-party mapping, but should expect some information loss due to symbol ambiguity and version differences. In the future, Xena plans to modify its data wrangling to become more aware of gene IDs and will be able to provide a symbol-to-ID map?

If that's the case, then I think our current approach will be to adopt a lossy conversion to Entrez Gene followed by a lossless conversion when the Xena updates arrive. Does that make sense?
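
To make the lossy versus lossless distinction concrete, here is a minimal Python sketch that splits a symbol-to-ID table into the part that converts losslessly (each symbol has exactly one ID, and that ID belongs to exactly one symbol) and the ambiguous remainder that a lossy conversion would have to drop or guess at. The input shape (a dict from symbol to a set of IDs) and the function name `split_one_to_one` are assumptions for illustration, and the symbols and IDs in the example are made up.

```python
from collections import defaultdict

def split_one_to_one(symbol_to_ids):
    """Return (clean, ambiguous).

    clean maps symbol -> ID only for pairs that are one-to-one in both
    directions; ambiguous is the sorted list of all other symbols.
    """
    # Invert the table so shared IDs can be detected.
    id_to_symbols = defaultdict(set)
    for symbol, ids in symbol_to_ids.items():
        for gene_id in ids:
            id_to_symbols[gene_id].add(symbol)

    clean, ambiguous = {}, []
    for symbol, ids in symbol_to_ids.items():
        if len(ids) == 1:
            (gene_id,) = ids
            if len(id_to_symbols[gene_id]) == 1:
                clean[symbol] = gene_id
                continue
        ambiguous.append(symbol)
    return clean, sorted(ambiguous)

# Made-up data: GENE_B has two IDs, and GENE_C and GENE_D share one ID.
table = {
    "GENE_A": {"101"},
    "GENE_B": {"202", "203"},
    "GENE_C": {"304"},
    "GENE_D": {"304"},
}
clean, ambiguous = split_one_to_one(table)
print(clean)      # {'GENE_A': '101'}
print(ambiguous)  # ['GENE_B', 'GENE_C', 'GENE_D']
```

Reporting the ambiguous symbols rather than discarding them silently makes it easier to see how much information the interim conversion loses.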

Best,
Daniel