canonicalization of query molecules

Saurabh Srivastava

Nov 11, 2012, 8:26:53 PM
to indig...@googlegroups.com
Hi Indigo dev,

What is the technical limitation in canonicalizing query SMILES? Ideally, the functionality I need is to load a query reaction and get its canonicalSmiles.

If for non-query molecules the canonicalSmiles algorithm is at all similar to InChI's (http://depth-first.com/articles/2006/08/12/inchi-canonicalization-algorithm/) then it appears one should be able to just treat R groups as another "atom type" (maybe a level below C) and run the same algorithm over it.
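To make the "just another atom type" idea concrete, here is a minimal Python sketch of the class-refinement step that canonicalization schemes commonly start from. It is not InChI's or Indigo's actual code: the function name `refine_classes`, the label "R" for an R group, and the (i, j, order) bond tuples are all assumptions made for illustration. It only partitions atoms into equivalence classes; a full canonical numbering would still need tie-breaking on top of it.

```python
def refine_classes(labels, bonds):
    """Partition atoms into classes by iterative neighborhood refinement.

    labels: atom labels, e.g. ["R", "N", "C"]; "R" is treated like any
            other atom type (hypothetical convention for this sketch).
    bonds:  (i, j, order) tuples with 0-based atom indices.
    Returns one integer class id per atom.
    """
    n = len(labels)
    nbrs = [[] for _ in range(n)]
    for i, j, order in bonds:
        nbrs[i].append((order, j))
        nbrs[j].append((order, i))

    colors = list(labels)  # initial invariant: the atom label itself
    while True:
        # Signature = own class plus the sorted (bond order, class) of neighbors.
        sigs = [
            (colors[i], tuple(sorted((o, colors[j]) for o, j in nbrs[i])))
            for i in range(n)
        ]
        rank = {s: r for r, s in enumerate(sorted(set(sigs)))}
        new = [rank[s] for s in sigs]
        # Classes can only split, so an unchanged count means we are done.
        stable = len(set(new)) == len(set(colors))
        colors = new
        if stable:
            return colors
```

With this, two R groups attached to the same N by single bonds end up in one class, while an R attached by a double bond is separated from them, exactly as two ordinary atoms of the same element would be.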

If there is no technical limitation, I might be willing to work on the code to get this extension going.
Thanks.
Saurabh

Mikhail Rybalkin

Nov 12, 2012, 3:38:06 AM
to indig...@googlegroups.com
Hello Saurabh,

The technical limitation with canonicalizing query SMILES is that in general we are dealing with SMARTS, and the same feature constraint can be represented in different ways by different SMARTS. For example, we can specify [N;v3] or [N+0;X3], which seem to be the same. The problem is how to find a canonical SMARTS for a single atom. Is this important in your case?

In the case of query SMILES I think this is not a big problem and we can do that. Could you give some examples of your query SMILES?

The main issue with canonical SMILES for reactions is how to treat the atom-to-atom mapping numbers. Can we renumber them?
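As a rough illustration of what renumbering could mean, the sketch below rewrites atom-map numbers in order of first appearance in the reaction string. The function name `renumber_atom_maps` is hypothetical and this is plain string processing, not an Indigo API. Note the limitation: the result depends on the textual atom order, so it only yields a canonical numbering if the atoms themselves are already written in a canonical order.

```python
import re

# An atom-map number is the ":<digits>" right before the closing bracket.
_ATOM_MAP = re.compile(r":(\d+)\]")


def renumber_atom_maps(rxn):
    """Renumber atom maps as 1, 2, 3, ... in order of first appearance.

    The same old number always gets the same new number, so the
    reactant-to-product correspondence is preserved.
    """
    mapping = {}

    def replace(match):
        old = match.group(1)
        if old not in mapping:
            mapping[old] = str(len(mapping) + 1)
        return ":" + mapping[old] + "]"

    return _ATOM_MAP.sub(replace, rxn)


# Two reactions that differ only in their map numbers become identical.
print(renumber_atom_maps("[C:7][O:3]>>[O:3][C:7]"))
print(renumber_atom_maps("[C:3][O:9]>>[O:9][C:3]"))
```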

Mikhail

Saurabh Srivastava

Nov 21, 2012, 8:41:01 PM
to indig...@googlegroups.com
Hi Mikhail, 

Sorry for the delayed response. I completely understand that the general problem is difficult (but not undecidable, right? As far as I can see it is the regex equivalence problem: http://en.wikipedia.org/wiki/Regular_expression#Deciding_equivalence_of_regular_expressions; we cannot encode loops in SMARTS to make it undecidable, right?). My problem is somewhat simpler: I only need to check equivalence of reactions under a fully permissive SMARTS wildcard, i.e., any atom. (I can even assume that every wildcard is just a match against "any atom".)

My requirements are best illustrated by example. Below are three SMARTS that I need to be able to identify as equivalent:
- [H,*:1]O[H].[H,*:2]N([H,*:4])C(=[H,*:3])[H,*:5]>>[H,*:1]OC(=[H,*:3])[H,*:5].[H,*:2]N([H,*:4])[H]
- [H,*:1]C(=[H,*:2])N([H,*:4])[H,*:5].[H,*:3]O[H]>>[H,*:1]C(=[H,*:2])O[H,*:3].[H,*:4]N([H,*:5])[H]
- [H,*:1]O[H].[H,*:2]N([H,*:5])C([H,*:3])=[H,*:4]>>[H,*:1]OC([H,*:3])=[H,*:4].[H,*:2]N([H,*:5])[H]

As you can see, they are syntactically different but identical if you visualize them. And all wildcards are just "any atom". This is not as difficult as the general problem.
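To show what I mean, here is a minimal Python sketch of such an equivalence check. Everything in it is an assumption for illustration, not Indigo code: it parses only the tiny subset used in my examples (one-letter atoms, bracket atoms, branches, "=" and ".", no rings), treats every [H,*:n] as one wildcard type "R", links each mapped wildcard on the reactant side to its product counterpart with a pseudo-bond of order 0, and then finds a canonical key by brute force over atoms with the same label. The names `parse_side`, `reaction_key` and `same_reaction` are made up, and the brute force is only workable for small reactions like these.

```python
import re
from itertools import permutations, product

TOKEN = re.compile(r"\[[^\]]*\]|[A-Z]|[().=]")
WILDCARD = re.compile(r"\[H,\*:(\d+)\]")


def parse_side(text, side):
    """Parse one side of the reaction into labels, bonds and map -> atom index."""
    atoms, bonds, maps = [], [], {}
    stack, prev, order = [], None, 1
    for tok in TOKEN.findall(text):
        if tok == "(":
            stack.append(prev)       # remember the branch point
        elif tok == ")":
            prev = stack.pop()       # return to the branch point
        elif tok == "=":
            order = 2                # next bond is a double bond
        elif tok == ".":
            prev = None              # start a new component
        else:
            idx = len(atoms)
            m = WILDCARD.fullmatch(tok)
            if m:
                atoms.append((side, "R"))   # every wildcard is "any atom"
                maps[m.group(1)] = idx
            else:
                atoms.append((side, tok.strip("[]")))
            if prev is not None:
                bonds.append((prev, idx, order))
            prev, order = idx, 1
    return atoms, bonds, maps


def reaction_key(rxn):
    """Canonical key that ignores atom order and atom-map numbering."""
    left, right = rxn.split(">>")
    r_atoms, r_bonds, r_maps = parse_side(left, "reactant")
    p_atoms, p_bonds, p_maps = parse_side(right, "product")
    off = len(r_atoms)
    labels = r_atoms + p_atoms
    edges = r_bonds + [(i + off, j + off, o) for i, j, o in p_bonds]
    # Order-0 pseudo-bonds stand in for the atom-to-atom mapping.
    edges += [(r_maps[k], p_maps[k] + off, 0) for k in r_maps if k in p_maps]

    classes = {}
    for idx, lab in enumerate(labels):
        classes.setdefault(lab, []).append(idx)
    groups = [classes[lab] for lab in sorted(classes)]

    best = None
    # Try every renumbering that only permutes atoms with identical labels.
    for combo in product(*(permutations(g) for g in groups)):
        new = {}
        for group in combo:
            for old in group:
                new[old] = len(new)
        key = tuple(sorted(
            (min(new[i], new[j]), max(new[i], new[j]), o) for i, j, o in edges
        ))
        if best is None or key < best:
            best = key
    return tuple(sorted(labels)), best


def same_reaction(a, b):
    return reaction_key(a) == reaction_key(b)
```

Two reactions compare equal exactly when there is a label-preserving relabeling of atoms that maps the bonds and the reactant-to-product correspondences of one onto the other, which is the notion of "identical if you visualize them" that I am after.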

Thanks a lot for your help;
saurabh