Inconsistencies in the compound structure data for the CMap Drug Safety Challenge

Anika Liu

Apr 1, 2019, 9:08:32 AM

to CAMDAforum

Hello,

We found that some compounds are assigned multiple SMILES, which sometimes differ significantly, and we are unsure how the compound structure information can be used in the challenge. For example, emetine appears twice: once with the correct SMILES and once with a significantly different one (beta-chloroalanine). For daunorubicin, four of the SMILES describe different stereoisomers, but in the fifth the atom positions seem to be switched.

In addition, we found aminomethyltransferase among the compounds and were surprised to see that it was assigned a typical small-molecule SMILES.

Most of the mismatches were found by chance during manual inspection, so we don't know how frequently they occur. To proceed with the structure information, it would be very helpful to know how the SMILES were derived and, in case of a mismatch, whether we should trust the compound name or the compound structure.

We also found a range of compounds for which no SMILES are available. As the challenge description states that SMILES are provided for all drugs, we wanted to double-check whether information on those might be available. This includes well-known small molecules like vemurafenib, but also names like 'EI-346-erlotinib-analog', for which a structure is not easy to derive.

Best wishes and thank you,

Anika

Maciej Kańduła

Apr 1, 2019, 9:18:17 AM

to CAMDAforum

Hi, Anika!

Thanks for your message and for reporting the issues!

I'm looking into it.

Best regards,

Maciek

Maciej Kańduła

Apr 1, 2019, 10:32:32 AM

to Anika Liu, CAMDAforum

Dear Anika,

These particular SMILES are part of what the Broad Institute provides in its official GEO data set entry [1], specifically the file

GSE92742_Broad_LINCS_pert_info.txt
The issue comes directly from this mapping file. When I search for 'emetine', I get three different entries with 'canonical_smiles'.
However, this file (GSE92742_Broad_LINCS_pert_info.txt) also holds additional information.

You could try using it to filter out the unwanted or false SMILES.
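
As a starting point for such filtering, here is a minimal sketch in R that flags common names mapped to more than one distinct SMILES string. It runs on a small toy table instead of the real file; the 'pert_id' values, drug names and SMILES strings are made up for illustration, and only the column names 'pert_id', 'pert_iname' and 'canonical_smiles' come from the mapping file.

```r
# Toy stand-in for GSE92742_Broad_LINCS_pert_info.txt. The real file could
# presumably be loaded with read.delim(path, stringsAsFactors = FALSE).
# All values below are made up for illustration.
pert_info <- data.frame(
  pert_id          = c("BRD-X0001", "BRD-X0002", "BRD-X0003", "BRD-X0004"),
  pert_iname       = c("drugA", "drugA", "drugB", "drugB"),
  canonical_smiles = c("CCO", "NC(CCl)C(O)=O", "CCN", "CCN"),
  stringsAsFactors = FALSE
)

# Count the distinct SMILES strings recorded for each common name.
n_smiles <- tapply(pert_info$canonical_smiles, pert_info$pert_iname,
                   function(s) length(unique(s)))

# Names with more than one distinct string are candidates for manual review.
ambiguous <- names(n_smiles)[n_smiles > 1]
print(ambiguous)

# Pull out the full rows so the conflicting entries can be compared.
flagged <- pert_info[pert_info$pert_iname %in% ambiguous, ]
print(flagged)
```

Note that this compares strings only: two different strings can still encode the same molecule, and stereoisomers will be flagged as well, so the result is a review list and not a verdict on which entry is wrong.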

Broad provides a guide [2] (search for "SMILES"), but I'm afraid it is not very specific about this part of the data set, and it seems they don't really know how the SMILES in this data set were generated.

We mapped the expression data via the 'pert_iname' entry, as it holds the common drug names.
Alternatively, you could map the drugs via 'pert_id' instead of 'pert_iname' (remapping the whole data set on 'pert_id'). These IDs are unique but Broad-specific, so in the end you still need to get the common drug names somehow. It is a bit confusing, but that is how the original data set is constructed.
We do have a corresponding data set generated on 'pert_id', and I'm happy to provide it as well.
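
To illustrate the 'pert_id' route, a rough sketch in R: keep 'pert_id' as the key and attach the common name only at the end via a join. Both tables are toy examples with made-up values, and 'sig_id' and 'expr_meta' are hypothetical names used here for illustration.

```r
# Hypothetical per-signature metadata keyed on the Broad-specific pert_id.
expr_meta <- data.frame(
  sig_id  = c("s1", "s2", "s3"),
  pert_id = c("BRD-X0001", "BRD-X0002", "BRD-X0003"),
  stringsAsFactors = FALSE
)

# Toy excerpt of the mapping file: one row per pert_id.
pert_info <- data.frame(
  pert_id    = c("BRD-X0001", "BRD-X0002", "BRD-X0003"),
  pert_iname = c("drugA", "drugA", "drugB"),
  stringsAsFactors = FALSE
)

# The join is only unambiguous if pert_id is unique in the mapping table.
stopifnot(anyDuplicated(pert_info$pert_id) == 0)

# Left join: every signature keeps its row and gains the common drug name.
annotated <- merge(expr_meta, pert_info, by = "pert_id", all.x = TRUE)
print(annotated)
```

In this toy case two different 'pert_id' values share the name 'drugA', which is the kind of situation where name-based and ID-based mapping give different groupings.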

It might be better for you to use the original data set [1], though, and work with it in whatever way you find most useful.
The one important detail is to filter on the selected cell lines, as they yield the largest number of drugs common to all of them:
sel.celllines <- c("A375", "A549", "HT29", "MCF7", "PC3", "VCAP", "ASC", "SKB", "PHH", "NPC", "HCC515", "HA1E", "HEPG2")
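
A minimal sketch of how this vector might be applied, assuming a signature-level table with a cell line column. The column name 'cell_id', the table 'sig_info' and all its values are assumptions for illustration; check them against the files in [1].

```r
sel.celllines <- c("A375", "A549", "HT29", "MCF7", "PC3", "VCAP", "ASC",
                   "SKB", "PHH", "NPC", "HCC515", "HA1E", "HEPG2")

# Hypothetical signature table; 'cell_id' is an assumed column name.
sig_info <- data.frame(
  sig_id     = c("s1", "s2", "s3", "s4"),
  pert_iname = c("drugA", "drugA", "drugB", "drugB"),
  cell_id    = c("MCF7", "JURKAT", "PC3", "A375"),
  stringsAsFactors = FALSE
)

# Keep only signatures measured in one of the selected cell lines.
sig_sel <- sig_info[sig_info$cell_id %in% sel.celllines, ]

# Number of selected cell lines each drug was profiled in.
n_lines <- tapply(sig_sel$cell_id, sig_sel$pert_iname,
                  function(x) length(unique(x)))
print(n_lines)
```

The per-drug counts make it easy to check afterwards which drugs are actually covered in all selected cell lines.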

Looking forward to hearing from you!

Kind regards,
Maciek

[1] https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE92742
[2] https://docs.google.com/document/d/1q2gciWRhVCAAnlvF2iRLuJ7whrGP6QjpsCMq1yWz7dU/edit#heading=h.dbcy47xxrged

Anika Liu

Apr 2, 2019, 5:50:14 AM

to CAMDAforum

Thanks, Maciek,

We will look into it and see how to proceed.

Best wishes,

Anika
