These particular SMILES are part of what the Broad Institute provides in their official GEO data set entry [1], specifically the file
Maybe you can try using it to filter out the unwanted / false SMILES.
Broad provides a guide [2] (search for "SMILES"), but I'm afraid it is not very specific about this particular part of the data set, and it seems they don't really know how the SMILES in this data set were generated.
We have mapped the expression data via the 'pert_iname' entry as it holds the common drug names.
Alternatively, you could map the drugs via the 'pert_id' entry instead of 'pert_iname' (i.e. remap the whole data set on 'pert_id'). These IDs are unique but Broad-specific. So... in the end you need to get the common drug names somehow anyway. It is a bit confusing, but that is how the original data set is constructed.
We do have a corresponding data set generated on 'pert_id' and I'm happy to provide it as well.
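To illustrate the 'pert_id' / 'pert_iname' relationship described above, here is a minimal sketch in R. It uses small made-up tables rather than the real files; the IDs, drug names, and the 'score' column are hypothetical, and the column names 'pert_id' and 'pert_iname' are assumed to match the metadata in [1].

```r
# Toy perturbagen metadata (hypothetical values, not taken from the real data set).
pert.info <- data.frame(
  pert_id    = c("BRD-0001", "BRD-0002", "BRD-0003"),
  pert_iname = c("drugA", "drugB", "drugC"),
  stringsAsFactors = FALSE
)

# Toy table keyed on the Broad-specific 'pert_id' (the 'score' column is made up).
expr <- data.frame(
  pert_id = c("BRD-0002", "BRD-0001", "BRD-0002"),
  score   = c(0.5, -1.2, 0.8),
  stringsAsFactors = FALSE
)

# Named lookup vector: pert_id -> common drug name.
id2name <- setNames(pert.info$pert_iname, pert.info$pert_id)

# Attach the common name to each row without changing the row order.
expr$pert_iname <- unname(id2name[expr$pert_id])
```

A named vector is used here instead of merge() only because it keeps the original row order; either approach should work.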
It might be better for you to use the original data set [1] though, and just play with it in a way that you find most useful.
The one important detail is to filter on the selected cell lines, as they yield the largest number of drugs common to all of them:
sel.celllines <- c("A375", "A549", "HT29", "MCF7", "PC3", "VCAP", "ASC", "SKB", "PHH", "NPC", "HCC515", "HA1E", "HEPG2")
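As a rough sketch of that filtering step, the snippet below applies the selection to a toy metadata table and then lists the drugs shared by all cell lines that remain. The rows are made up, and the column name 'cell_id' is an assumption about the metadata in [1]; please adjust it to whatever the actual file uses.

```r
# Cell lines selected above.
sel.celllines <- c("A375", "A549", "HT29", "MCF7", "PC3", "VCAP", "ASC",
                   "SKB", "PHH", "NPC", "HCC515", "HA1E", "HEPG2")

# Toy stand-in for the signature metadata (hypothetical rows).
sig.info <- data.frame(
  pert_iname = c("drugA", "drugA", "drugB", "drugB", "drugC"),
  cell_id    = c("A375", "MCF7", "A375", "JURKAT", "MCF7"),
  stringsAsFactors = FALSE
)

# Keep only the signatures measured in the selected cell lines.
sig.sel <- sig.info[sig.info$cell_id %in% sel.celllines, ]

# Drugs common to every cell line that is left after filtering:
# split the drug names by cell line, then intersect the groups.
common.drugs <- Reduce(intersect, split(sig.sel$pert_iname, sig.sel$cell_id))
```

In this toy example only "drugA" is measured in both remaining cell lines, so it is the only entry in common.drugs.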
Looking forward to hearing from you!
Kind regards,
Maciek
[1] https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE92742
[2] https://docs.google.com/document/d/1q2gciWRhVCAAnlvF2iRLuJ7whrGP6QjpsCMq1yWz7dU/edit#heading=h.dbcy47xxrged