extractCommonScaffold questions

Andrew Dalke

Apr 16, 2012, 9:55:21 AM

to indigo-...@googlegroups.com

Hi all,

I've been developing a new MCS algorithm to find the MCS of a set
of structures. It is (currently) called 'fcmcs' and uses RDKit for
the chemistry. It uses a rather different approach than the standard
clique approach. It's currently written in Python.

I'm at the stage where I'm testing it against other MCS tools. I tested
my code against the pairwise MCS algorithm of SMSD
http://www.ebi.ac.uk/thornton-srv/software/SMSD/
last Friday and Saturday, and yesterday I started testing both the
pairwise and the multiple structure cases with Indigo.

I have some questions (marked with "Q:") regarding the Indigo
implementation of extractCommonScaffold. They are inline with
my description of the comparison I've been doing. I figured
that some of the Indigo developers may be interested in the
comparison numbers.

My protocol is:
- take ChEMBL-13 and make a new data set with only the largest fragments
- save the new structures as SMILES (Indigo points out there is
contradictory stereo information in some of the ChEMBL-13 data
files, and refuses to load those structures. I used another tool
for the conversion, and I presume it ignores that problem.)

For the pairwise comparison:
- extract two structures at random
- find the MCS using the two algorithms under test
- set a limit of 30s of CPU time

For the group comparison:
- generate RDKit fingerprints for the ChEMBL-13 data set
- pick a fingerprint at random
- find the k=10 (or later, 100) nearest structures with at
least 0.80 Tanimoto similarity; there must be at least
5 compounds in the data set. There may be fewer than k.
- find the MCS of that data set
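
The group selection above can be sketched roughly as follows. This is a
minimal illustration, not the code used for the comparison: fingerprints
are modelled as plain Python sets of on-bit indices rather than RDKit
fingerprint objects, the names tanimoto() and pick_group() are
hypothetical, and it assumes the query structure itself counts toward the
minimum of 5.

```python
import random

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)

def pick_group(fingerprints, k=10, threshold=0.80, min_size=5, rng=random):
    """Pick a random query; return the ids in its group, or None if too small.

    fingerprints: dict mapping structure id -> set of on-bit indices.
    Assumption: the query is part of the group and counts toward min_size.
    """
    query_id = rng.choice(sorted(fingerprints))
    query_fp = fingerprints[query_id]
    scored = []
    for other_id, other_fp in fingerprints.items():
        if other_id == query_id:
            continue
        score = tanimoto(query_fp, other_fp)
        if score >= threshold:
            scored.append((score, other_id))
    # Keep at most k neighbors, most similar first; there may be fewer than k.
    scored.sort(reverse=True)
    group = [query_id] + [other_id for _, other_id in scored[:k]]
    return group if len(group) >= min_size else None
```

A rejected pick (None) would simply be retried with another random query.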

I am pleased to report that Indigo's extractCommonScaffold is
usually very fast, which I think you all already knew.

Using the pairwise protocol above, I tested 3219 pairs (excluding
the one Indigo case and two fcmcs cases which timed out at 30s).

- Indigo took 60.1 seconds, of which the longest was 0.83 seconds
- fcmcs took 350. seconds, of which 5 took over 10 seconds and 38
took over 1 second.

(This time is biased. I have removed 0.109s that it takes for
Indigo to find the MCS of "CC" against itself, and the 0.219s
it takes for fcmcs to do the same. This should minimize the impact
of startup costs. I processed one MCS calculation per run so I
could use rlimit to enforce an MCS search timeout for each run.
The program I used is at the end of this email.)
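
For readers who want to reproduce the hard timeout, here is one way an
rlimit-based CPU limit can be applied per run on a Unix system. It is a
sketch under assumptions: the wrapper name run_with_cpu_limit() is
hypothetical, and the original runs may have set the limit differently
(for example from the shell).

```python
import resource
import subprocess
import sys

def run_with_cpu_limit(argv, cpu_seconds=30):
    """Run argv as a child process that the OS stops after cpu_seconds of CPU time.

    Returns (timed_out, stdout_text). Unix only.
    """
    def limit_cpu():
        # Runs in the child just before exec, so only the child is limited.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))

    proc = subprocess.run(argv, preexec_fn=limit_cpu,
                          capture_output=True, text=True)
    # A negative return code means the child was killed by a signal
    # (SIGXCPU or SIGKILL once the CPU limit is exceeded).
    timed_out = proc.returncode < 0
    return timed_out, proc.stdout

if __name__ == "__main__":
    # Example: one MCS calculation per process, as in the protocol above.
    print(run_with_cpu_limit([sys.executable, "-c", "print('ok')"], 30))
```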

Of those 3219 cases:
- Indigo found a larger MCS (more atoms) 177 times (5.5%),
on average 2.34 atoms larger than what fcmcs found
- fcmcs found a larger MCS 618 times (19%),
on average 2.36 atoms larger

I can think of several possible reasons for these differences:

1) RDKit and Indigo have different aromaticity models. I observed
with SMSD that 7.2% of the time it found an average 2.1 atom larger MCS,
and 6.7% of the time fcmcs found an average 2.1 atom larger MCS.
SMSD uses CDK for its chemistry model.

I set the "deconvolution-aromatization" flag to True, so both Indigo
and RDKit are perceiving their own respective aromaticity. In my
next analysis I'll have RDKit generate the SMILES and let Indigo use
the processed SMILES as input, because I believe Indigo does not modify
the chemistry upon input.

Q: If I let RDKit generate the SMILES then how do I get Indigo
to use the RDKit aromaticity expressed in the SMILES? Am I
correct in thinking that I don't need to do anything for that case?

2) I believe I read in an email posting that extractCommonScaffold()
maximizes the number of rings in the MCS, while I've configured
fcmcs here to maximize the number of atoms, with ties broken on the
number of bonds. There are obvious cases where the largest MCS does
not contain a ring while a smaller common substructure does.

Q: What subgraph property does Indigo maximize for the MCS search?
How are ties broken?

3) I configured my code to match atoms based on element and match
bonds based on bond type (single, aromatic, double, and triple).
That gave the best match to the Indigo results. However, I was not
able to find documentation for this.

Q: How does Indigo tell if two atoms/bonds are equal?

BTW, some of the other variations I've come across are:
- ignore bond types completely (this is the default for SMSD)
- ignore bond types and elements (this is a topology-only MCS)
- assume that aromatic bonds can also match single and double bonds
- ring bonds can only match ring bonds; non-ring bonds can only
match non-ring bonds
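
A minimal sketch of how such bond-match variations could be written as
interchangeable comparison functions. Bond types are represented by their
SMILES symbols; the function names are hypothetical and are not taken from
fcmcs, SMSD, or Indigo. The ring/non-ring rule is left out because it
needs ring membership in addition to the bond type.

```python
SINGLE, DOUBLE, TRIPLE, AROMATIC = "-", "=", "#", ":"

def bonds_match_exact(a, b):
    """Bond types must be identical (aromatic matches only aromatic)."""
    return a == b

def bonds_match_any(a, b):
    """Ignore bond types completely."""
    return True

def bonds_match_loose_aromatic(a, b):
    """Like the exact rule, but an aromatic bond may also match a single or double bond."""
    if a == b:
        return True
    pair = {a, b}
    return AROMATIC in pair and (SINGLE in pair or DOUBLE in pair)
```

An MCS search could then take one of these functions as a parameter
instead of hard-coding the comparison.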

Q: Is there some way to specify an alternate match method
in Indigo? If so, what match variations are supported?

With the pair-wise comparison complete, I went on to the
group MCS comparison. The method I chose gives highly similar
structures.

Out of 201 comparisons:
- Average MCS size: 22.1 atoms for Indigo, 22.2 atoms for fcmcs

- Indigo time: 178.7 seconds with 9 timeouts (giving no MCSes)
adding in 30 seconds per timeout gives 448.7 seconds (2.23 sec/MCS)
- fcmcs time: 443.7 seconds with 52 timeouts (but a large common substructure is known)
adding in the timeouts gives 2008. seconds (10. sec/MCS)
==> Indigo is about 5x faster than fcmcs

- Indigo finds a larger substructure (more atoms) 3 times, with 1 atom each time
- fcmcs finds a larger substructure 8 times, with an average of 3.6
more atoms
=> 95% of the time they report the same values

My algorithm supports a 'timeout' value. When it notices that the
timeout has been reached, it will report the best structure seen
to date. I observed that my algorithm often finds the MCS within
the first few seconds then spends a lot of time proving that alternatives
cannot be better, or only increasing the MCS slightly.

Thus, I could use a soft timeout (meaning, implemented by the code
and not by the OS) of 30 seconds and always get some decent common
structure.
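
The idea of a soft timeout can be illustrated with a generic
"best seen so far" loop. This is only a sketch of the pattern, not the
fcmcs search itself: the candidates iterable and the score function are
placeholders for whatever the real algorithm enumerates and maximizes.

```python
import time

def best_within(candidates, score, timeout_seconds):
    """Return (best, finished): the best-scoring candidate seen before the deadline.

    finished is False when the deadline cut the enumeration short, in which
    case best is a good candidate but not necessarily the optimum.
    """
    deadline = time.monotonic() + timeout_seconds
    best = None
    best_score = None
    for cand in candidates:
        if time.monotonic() >= deadline:
            return best, False  # soft timeout: report the best seen to date
        s = score(cand)
        if best_score is None or s > best_score:
            best, best_score = cand, s
    return best, True
```

Because the check happens inside the program, the caller always gets an
answer back, unlike an OS-level limit that kills the process.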

Q: Does Indigo support a timeout setting? Is there some way to
get its current best match once some time has been exceeded?

I tested soft timeouts of 0.2, 0.5, and 5.0 seconds for fcmcs,
rather than the hard rlimit-based solution I used for Indigo.
This gives me a large common substructure, but not necessarily the
MCS.

I found that 5.0 seconds was a good compromise between performance
and size. After running 1,000 comparisons I found:

- Indigo time: 605.7 seconds with 56 timeouts (giving no MCSes)
adding in 30 seconds per timeout gives 2286 seconds (2.29 sec/MCS)

@0.2 second timeout for fcmcs:
- Indigo is larger 59 times, with an average of 4.80 more atoms
- fcmcs is larger 25 times, with an average of 4.1 more atoms
- estimated time: 140 seconds (0.14 sec/MCS)

@0.5 seconds:
- Indigo is larger 37 times, with an average of 4.24 more atoms
- fcmcs is larger 27 times, with an average of 4.0 more atoms
- measured time: 355.84 seconds (0.356 sec/MCS)

@5.0 seconds:
- Indigo is larger 16 times, with an average of 2.5 more atoms
- fcmcs is larger 35 times, with an average of 3.4 more atoms
- rough estimated time: 2500 seconds (2.5 sec/MCS)

However, because I am now approximating the MCS solution, I
should also compare my solution against Indigo's approximate
solution. I have not yet done so.

I also tried 311 cases where the search set contains 100 nearest
neighbors with Tanimoto similarity of at least 0.8 to the randomly
selected fingerprint. Of those, Indigo timed out (after 30 seconds)
some 21 times.

Of the remaining 290 cases:

@0.2 seconds timeout for fcmcs:
- Indigo is larger 59 times, average 4.8 atoms larger
- fcmcs is larger 25 times, average 4.1 atoms larger

@0.5 seconds timeout for fcmcs:
- Indigo is larger 38 times, average 4.4 atoms larger
- fcmcs is larger 27 times, average 4.0 atoms larger

@5.0 seconds:
- Indigo is larger 16 times, average 2.5 atoms larger
- fcmcs is larger 35 times, average 3.4 atoms larger

I can provide any of the data sets if desired. I am also
in the process of setting up a test/regression set.

Cheers,

Andrew
da...@dalkescientific.com

This is the program I used to generate my timing data.

#!/usr/bin/env python
import sys
sys.path.insert(0, "/Users/dalke/ftps/indigo-python-1.1-beta10-mac")
filename, = sys.argv[1:]

import indigo
indigo = indigo.Indigo()
indigo.setOption("deconvolution-aromatization", True)

arr = indigo.createArray()

if filename.endswith(".smi"):
    reader = indigo.iterateSmilesFile(filename)
elif filename.endswith(".sdf"):
    reader = indigo.iterateSDFile(filename)
else:
    raise AssertionError(filename)
for item in reader:
    #item.aromatize()
    arr.arrayAdd(item)

scaf = indigo.extractCommonScaffold(arr, "exact")

if scaf:
    print scaf.smiles(), scaf.countAtoms(), "atoms", scaf.countBonds(), "bonds"
else:
    print "No MCS found"

Savelyev Alexander

Apr 17, 2012, 3:32:50 AM

to indigo-...@googlegroups.com

Hello Andrew,

Yes, of course, such a test comparison is very interesting for us. We
will be happy to provide any information required.

> Q: If I let RDKit generate the SMILES then how do I get Indigo
> to use the RDKit aromaticity expressed in the SMILES? Am I
> correct in thinking that I don't need to do anything for that case?

With "deconvolution-aromatization" set to false, Indigo does not
modify the chemistry. With the flag set to true, Indigo aromatizes
molecules. In this case, if an input molecule already contains
aromaticity, the library tries to 'enlarge' the aromaticity information.
Therefore, if a SMILES contains aromaticity information, Indigo will use
it in either case, so yes, you don't need to do anything.

> Q: What subgraph property does Indigo maximize for the MCS search?
> How are ties broken?

Indigo searches for the MCS as a largest induced subgraph of molecule 1
that is isomorphic to an induced subgraph of molecule 2. "Largest" means
that the resulting subgraph cannot be extended by adding a vertex (atom).
E.g. there are three possible MCSes for the molecules 'NCOCCCS' and
'NCCOCCS' which cannot be extended: 'CCOC', 'NC' and 'CCS'. Indigo stores
all the intermediate subgraphs which can be a resulting MCS. Thus, after
the search finishes, an array of submolecules (all the solutions) is
available. At the end, this array is sorted by the following rule:
maximize the number of rings; if two molecules contain an equal number of
rings, then maximize the number of bonds. The sorted array (all the
maximum common submolecules) can be iterated with the allScaffolds()
method:

....

scaf = indigo.extractCommonScaffold(arr, "exact")

# print all the mcs scaffolds smiles
for scaffold in scaf.allScaffolds().iterateArray():
    print(scaffold.smiles())
# print largest mcs scaffold smiles
print(scaf.smiles())
...

If you call scaf.allScaffolds() then the sorted result array is returned.
If you call scaf.smiles() (or any method accepted for a molecule) then
the first element (maximum element, since the array is sorted) from the
array is returned

More examples can be found here:
http://ggasoftware.com/opensource/indigo/concepts/deco
(the RGroup Deco examples are outdated, but the scaffold extraction
examples (Example 3) are current)

> Q: How does Indigo tell if two atoms/bonds are equal?

The atom/bond matching is similar to yours. Indigo matches atoms based
on element and bonds based on order (single, double, triple, aromatic).
One note: an aromatic bond can match only another aromatic bond. I
suppose this match rule was one of the possible reasons for the
MCS differences you saw with your model.

>
> Q: Is there some way to specify an alternate match method
> in Indigo? If so, what match variations are supported?
>

Unfortunately, there is no way to specify the match method, but it could
easily be added in a future release. I am thinking about new parameters
for setting up the above options for the extractCommonScaffold() method,
e.g. extractCommonScaffold(arr, "exact ignore_bond_types")

>
> Q: Does Indigo support a timeout setting? Is there some way to
> get its current best match once some time has been exceeded?

In the current Indigo version (1.1 beta 11) you can set the timeout option:

indigo.setOption("timeout", milliseconds)

When extractCommonScaffold() times out, it throws an exception, and
there is no way to get its current best match. The algorithm does not
allow this, because it is based on refining the solution (shrinking
the MCS) as it screens the molecules. Therefore, in contrast with your
algorithm, the current matches are exact solutions only for part of the
input molecules, and may be incorrect for the rest.
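
A small sketch of how a caller might wrap this behaviour. Only the
"timeout" option and extractCommonScaffold() come from the description
above; the helper name scaffold_or_none() and its timeout_error parameter
are hypothetical. In real use one would pass the exception class raised by
the Indigo Python binding (assumed here to be IndigoException) instead of
the broad default.

```python
def scaffold_or_none(indigo, arr, mode="exact", timeout_ms=30000,
                     timeout_error=Exception):
    """Run extractCommonScaffold() with a timeout; return None if it raises.

    indigo: an Indigo instance (anything with setOption/extractCommonScaffold).
    timeout_error: exception class to treat as a timeout (assumption: the
    binding's own exception type should be passed here).
    """
    indigo.setOption("timeout", timeout_ms)
    try:
        return indigo.extractCommonScaffold(arr, mode)
    except timeout_error:
        # No partial result is available, as explained above.
        return None
```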

> I tested soft timeouts of 0.2, 0.5, and 5.0 seconds for fcmcs,
> rather than the hard rlimit-based solution I used for Indigo.
> This gives me a large common substructure, but not necessarily the
> MCS.
>
> I found that 5.0 seconds was a good compromise between performance
> and size. After running 1,000 comparisons I found:
>
> - Indigo time: 605.7 seconds with 56 timeouts (giving no MCSes)
> adding in 30 seconds per timeout gives 2286 seconds (2.29 sec/MCS)
>
> @0.2 second timeout for fcmcs:
> - Indigo is larger 59 times, with an average of 4.80 more atoms
> - fcmcs is larger 25 times, with an average of 4.1 more atoms
> - estimated time: 140 seconds (0.14 sec/MCS)
>
> @0.5 seconds:
> - Indigo is larger 37 times, with an average of 4.24 more atoms
> - fcmcs is larger 27 times, with an average of 4.0 more atoms
> - measured time: 355.84 seconds (0.356 sec/MCS)
>
> @5.0 seconds:
> - Indigo is larger 16 times, with an average of 2.5 more atoms
> - fcmcs is larger 35 times, with an average of 3.4 more atoms
> - rough estimated time: 2500 seconds (2.5 sec/MCS)
>
> However, because I am now approximating the MCS solution, I
> should also compare my solution against Indigo's approximate
> solution. I have not yet done so.

The Indigo approximate algorithm should be much faster than the exact
algorithm. You can adjust the iteration limit for the approximate
search. The limit affects the MCS 'completeness'. The default
iteration limit is 1000. If you want to increase the iteration limit
(e.g. set it to 2000) you just need to add a number to the parameters string.

extractCommonScaffold(arr, "approx 2000")

With best regards,
Alexander

Andrew Dalke

Apr 17, 2012, 6:56:39 AM

to indigo-...@googlegroups.com

Hi Alexander,

> With the "deconvolution-aromatization" set to false, Indigo does not modify the chemistry. With the flag set true, Indigo aromatizes molecules. In this case, if an input molecule contains aromaticity, the library tries to 'enlarge' the aromaticity information. Therefore, if a SMILES contains arom-info, Indigo will use it in any case, so yes, you don't need to do anything.

Thank you for confirming that.

On the topic of aromaticity, I noticed that the MCS of two structures like

c1ccccc1OC1CCCCC1
c1ccccc1N

returns a scaffold.smiles() of

C1:C:C:C:C:C:1

The output might be better written as 'c1:c:c:c:c:c:1'. In
that way the output SMILES can also be used as a SMARTS.

>> Q: What subgraph property does Indigo maximize for the MCS search?
>> How are ties broken?

> Indigo searches MCS in terms of a largest induced molecule1 subgraph isomorphic to an induced molecule2 subgraph. Largest means that a result subgraph cannot be extended by adding a vertex (atom).

...

> after a searching is finished, there is an array (all the solutions) of submolecules available. In the end, the result molecules array is sorted by the following rule: maximize rings number; if two molecules contain equal ring number, then maximize bonds number. The result sorted molecules array (all the maximum common submolecules) can be iterated by the allScaffolds() method:

Very clear explanation, thank you.

>> Q: How does Indigo tell if two atoms/bonds are equal?
> The atoms/bonds matching is similar to yours. Indigo matches atoms based on element and bonds based on order (single, double, triple, aromatic). There is a note here: an aromatic bond can match only aromatic bond. I suppose such a match rule was one of the possible reasons for the specified MCS differences in your model.

Yes, I have the same belief. Silly chemistry gets in the way of perfectly good graph theory. :)

I also have options to ignore bond type completely, or to allow ":" to also match both '-' and '=' bonds.

> Unfortunately, there is no way to specify match method. But it can be simply added in the future releases. I am thinking about new parameters for setting up the above options for extractCommonScaffold() method, e.g. extractCommonScaffold(arr, "exact ignore_bond_types")

The MultiMCS code published last year in JCIM describes a few different options which you might find of interest.

* Whether the discovered MCS must be connected (single component) or
may be disconnected (multicomponent).

* Whether or not ring bonds are allowed to match chain bonds in this MCS.

* Whether or not rings are allowed to match partially in this MCS.

* Whether or not the MCS found should include a specified seed
substructure, i.e., a substructural fragment that is required
by the user to be a part of the MCS.

* Different atom-typing schemes: a default scheme, an element-type-based
scheme, or any other user-specified scheme.

In addition, the SMSD paper describes some properties they found useful:

a. Specific matching of the chemical functional groups, bond types (aromatic,
non-aromatic, double, single etc.) and stereochemistry, e.g. phosphate, -SH,
nitrite etc. are identified and matched.

b. The resulting solutions are sorted in ascending order of the total bond
breaking energy (energy required to break the bonds between matched parts
and unmatched parts) required by this MCS match (i.e. lowest energy is
highest ranked).

c. The best set of solutions are chosen based on the above two steps and
the solutions are then sorted in decreasing order according to the number
of fragments generated if the matched part of the molecule is removed from
the reference structure (i.e. if a three member ring is matched to a single
ring structure then solutions which match the rings on the periphery are
preferred over the central ring).

Lastly, the OEChem algorithm lets you flag which atom and bond properties to compare.

http://www.eyesopen.com/docs/toolkits/current/html/OEChem_TK-c++/patternmatch.html#maximum-common-substructure-search

>>
>> Q: Does Indigo support a timeout setting? Is there some way to
>> get its current best match once some time has been exceeded?
> In the current Indigo version (1.1 beta 11) you can set timeout option:
>
> indigo.setOption("timeout", milliseconds)

Ah, I hadn't realized that. That option is listed in the Changelog for beta10, which probably explains why the API documentation for the older, stable version doesn't mention it.

> After extractCommonScaffold() is timed out, it throws an exception. But there is no way to get its current best match. The algorithm does not allow this, because it is based on the solution refining (decreasing a MCS) during the molecules screening. Therefore, in contrast with your algorithm, current matches are exact solutions only for a part of input molecules, but may be not correct for the other part.

Mmmm, yes, I see the problem. I've only ever worked on the MCS problem using a backtracking algorithm, with the goal of only finding large MCSes and not all induced subgraphs. I can understand how that goal affects my thinking.

> Indigo approximate algorithm should be much faster than the exact algorithm. You can regulate iteration limit for the approximate searching. The limit affects the MCS 'completeness'. The default iteration limit is 1000. If you want to increase the iteration limit (e.g. set to 2000) you just need to add a number into the parameters string.
>
> extractCommonScaffold(arr, "approx 2000")

Speaking again of documentation, I saw no mention of the 'approx' option in the documentation. I only saw that "approx" was available by looking at the option parser in the code.

It looks like I'll be spending a few days to write a blog post with the results of my analysis. :) But first I have to finish my algorithm.

Andrew
da...@dalkescientific.com

Savelyev Alexander

Apr 18, 2012, 1:06:07 AM

to indigo-...@googlegroups.com

Hi Andrew,

> The output might be better written as 'c1:c:c:c:c:c:1'. In
> that way the output SMILES can also be used as a SMARTS.
>

Yes, you are right. I will add this issue to our todo list.

Thank you very much for the useful links. I will consider the above
descriptions while implementing new match rules.

>
> Speaking again of documentation, I saw no mention of the 'approx' option in the documentation. I only saw that "approx" was available by looking at the option parser in the code.

Sorry about that. The documentation is usually updated after a beta
period (perhaps after receiving feedback from the community).
I am going to update the docs very soon, adding the new RGroup
Decomposition API and other implemented features.

> It looks like I'll be spending a few days to write a blog post with the results of my analysis. :) But first I have to finish my algorithm.
>

I wish you good luck in your work.

With best regards,
Alexander

Andrew Dalke

Apr 20, 2012, 7:39:25 AM

to indigo-...@googlegroups.com

Hi all,

I've rerun my comparisons between the Indigo MCS and my new algorithm.

The summary version is that we both identify almost always the same MCS size. The reported differences in my previous analysis were due to using the wrong Indigo method to find the largest MCS and to a bug in my software. In this new comparison, out of 2622 cases where neither program timed out, there were 36 differences in size. 33 of those differences were because of the fullerene derivative CHEMBL439119, which I ignored, one was because of aromaticity perception, and I gave up trying to figure out the subtle chemistry differences in the other three. Also, Indigo is about three times faster than my code.

The protocol, as before, was to pick a structure at random from ChEMBL-13 and find the MCS between it and its k=10 neighbors with at least 0.8 Tanimoto similarity. I rejected groups with fewer than 5 structures total. I set a hard-cutoff of 30 CPU seconds on both programs, and for my algorithm did not report the best size when that time limit was reached.

I changed my Indigo code so it reports the scaffold with the largest number of atoms. Using Alexander's snippet as the base, I now iterate through all of the scaffolds to find the one with the largest number of atoms, with ties broken on the number of bonds. My previous code used the default method, which optimizes the number of rings.
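
The selection step can be sketched as below. It is an illustration rather
than the exact code used: largest_by_atoms() is a hypothetical helper name,
and it relies only on the countAtoms() and countBonds() methods already
used in the timing program.

```python
def largest_by_atoms(scaffolds):
    """Return the scaffold with the most atoms, ties broken on the number of bonds.

    scaffolds: any iterable of objects providing countAtoms() and countBonds().
    Raises ValueError if the iterable is empty.
    """
    return max(scaffolds, key=lambda s: (s.countAtoms(), s.countBonds()))

# Intended use with Indigo, following Alexander's snippet:
#   scaf = indigo.extractCommonScaffold(arr, "exact")
#   best = largest_by_atoms(scaf.allScaffolds().iterateArray())
```

Sorting on the tuple (atoms, bonds) replaces the default ordering, which
ranks by rings first and bonds second.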

The difference in maximization algorithms was the reason for much of the difference in the two MCS sizes. There was also a bug in my code where I used the wrong method to test if a new common substructure was bigger than the current best size.

I also made some performance improvements to my code, which made it some 25% or so faster.

With those fixed, I am pleased to report that, of 3,000 test sets, there are:

2622 cases when both algorithms report the same size
3 cases when Indigo reports a smaller MCS (average, 2 atoms)
33 cases when fcmcs reports a smaller MCS (average, 1.6 atoms)
128 cases when both algorithms time out
60 cases when Indigo times out but fcmcs does not
154 cases when fcmcs times out but Indigo does not
===========
Total: 3000 cases

Indigo took 1700. wall-clock seconds
fcmcs took 4718 wall-clock seconds
ratio = 4718/1700. = 2.8

Of the 33 cases where fcmcs found a smaller structure, 31 of them contained CHEMBL439119, which is a fullerene derivative. I wasn't going to stare at that many cycles. Also, one of the cases where Indigo reports a smaller MCS also includes that structure.

This leaves us with 2 cases where Indigo reports a larger MCS and 2 cases where fcmcs reports a larger MCS.

In one case, Indigo reports an MCS which is 4 atoms larger. Those structures are:

% cat off_by_4.smi
N1(C(Cc2c(cccc2)C1)C(=O)NCc3nc4c([nH]3)cccc4)C(=O)C(Cc5c(cc(cc5C)O)C)N CHEMBL109390
C(=O)(C(N)Cc1c(cc(cc1C)O)C)N2C(Cc3c(cccc3)C2)C(=O)NCc4nc5c(n4CC6CC6)cccc5 CHEMBL173884
C(=O)(C(N)Cc1c(cc(cc1C)O)C)N2C(Cc3c(cccc3)C2)C(=O)NCc4nc5c(n4CC(=O)OCC)cccc5 CHEMBL197183
c12c(cccc1)nc([nH]2)C(CCCCNC(=O)OCc3ccccc3)NC(=O)C4N(Cc5c(cccc5)C4)C(=O)C(Cc6c(cc(cc6C)O)C)N CHEMBL212172
c1cc(c2c3c1CC4C5(C3(C(O2)c6n(c7c(c6C5)CCCC7)CC8CCCCC8)CCN4C)O)O CHEMBL222579
c1cc(c2c3c1CC4C5(C3(C(O2)c6n(c7c(c6C5)CCCC7)CC)CCN4C)O)O CHEMBL222704
c1cc(c2c3c1CC4C5(C3(C(O2)c6n(c7c(c6C5)CCCC7)CC8CC8)CCN4CC9CC9)O)O CHEMBL222597
c1cc(c2c3c1CC4C5(C3(C(O2)c6n(c7c(c6C5)CCCC7)CCc8ccccc8)CCN4C)O)O CHEMBL222814
C(=O)(C(N)Cc1c(cc(cc1C)O)C)N2C(Cc3c(cccc3)C2)C(=O)NCc4nc5c(n4CC(=O)N)cccc5 CHEMBL322300
[Pd]123[n+]4c5c(c6n1c(c(c7[n+]2c(c(c8n3c(c(c4C=C5)c9cc[n+](cc9)C)cc8)c1cc[n+](cc1)C)C=C7)c1cc[n+](cc1)C)cc6)c1cc[n+](cc1)C CHEMBL344583

% python indigo_mcs.py off_by_4.smi
N(C(C)CC(:C:C):C:C)CCC 12 atoms 11 bonds

% python fcmcs.py off_by_4.smi
c(:c):c(:c:c)-C-C-C 8 atoms 7 bonds
% python fcmcs.py --compare ignore-atom-aromaticity off_by_4.smi
[#6](:[#6]):[#6](-[#6]-[#6]-[#6]):[#6]:[#6] 8 atoms 7 bonds

The difference comes down to a difference in chemistry perception of the organometallic CHEMBL344583

>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles("[Pd]123[n+]4c5c(c6n1c(c(c7[n+]2c(c(c8n3c(c(c4C=C5)c9cc[n+](cc9)C)cc8)c1cc[n+](cc1)C)C=C7)c1cc[n+](cc1)C)cc6)c1cc[n+](cc1)C")
>>> fcmcs_pat = Chem.MolFromSmarts("c(:c):c(:c:c)-C-C-C")
>>> indigo_pat = Chem.MolFromSmarts("N(C(C)Cc(:c:c):c:c)CCC")
>>> mol.HasSubstructMatch(fcmcs_pat)
True
>>> mol.HasSubstructMatch(indigo_pat)
False

As Indigo's scaffold search ignores atom aromaticity, the correct SMARTS equivalent is

>>> pat2 = Chem.MolFromSmarts("[#7]([#6]([#6])[#6][#6](:[#6]:[#6]):[#6]:[#6])[#6][#6][#6]")
>>> mol.HasSubstructMatch(pat2)
True
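
The hand conversion above, where each element symbol is replaced by its
atomic-number form, can be sketched as a small helper. This is a
hypothetical convenience function, not part of Indigo or RDKit, and it
only handles unbracketed organic-subset atoms written in upper case, as
in the scaffold output shown. Note also that an unmarked bond in SMARTS
means "single or aromatic", which is looser than in SMILES.

```python
import re

# Atomic numbers for the unbracketed organic-subset symbols handled here.
_ATOMIC_NUMBER = {"B": 5, "C": 6, "N": 7, "O": 8, "F": 9,
                  "P": 15, "S": 16, "Cl": 17, "Br": 35, "I": 53}

# Two-letter symbols come first so "Cl" is not read as "C" followed by "l".
_SYMBOL = re.compile(r"Cl|Br|[BCNOFPSI]")

def to_atomic_number_smarts(scaffold_smiles):
    """Rewrite e.g. 'C:C' as '[#6]:[#6]', leaving bonds, branches and ring digits alone."""
    if "[" in scaffold_smiles:
        raise ValueError("bracket atoms are not handled by this sketch")
    return _SYMBOL.sub(lambda m: "[#%d]" % _ATOMIC_NUMBER[m.group()],
                       scaffold_smiles)

if __name__ == "__main__":
    print(to_atomic_number_smarts("N(C(C)CC(:C:C):C:C)CCC"))
```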

The difference arises from aromaticity assignment. I'll change the '[#7]' to an 'N' and 'n' to show the difference according to RDKit:

>>> pat2 = Chem.MolFromSmarts("N([#6]([#6])[#6][#6](:[#6]:[#6]):[#6]:[#6])[#6][#6][#6]")
>>> mol.HasSubstructMatch(pat2)
False
>>> pat2 = Chem.MolFromSmarts("n([#6]([#6])[#6][#6](:[#6]:[#6]):[#6]:[#6])[#6][#6][#6]")
>>> mol.HasSubstructMatch(pat2)
True

Since my own MCS code ignores atom aromaticity, the difference between an N and n shouldn't matter. The actual difference is the bond type between the N and the first C (I use a '-' here and then a ':')

>>> pat2 = Chem.MolFromSmarts("n(-[#6]([#6])[#6][#6](:[#6]:[#6]):[#6]:[#6])[#6][#6][#6]")
>>> mol.HasSubstructMatch(pat2)
False
>>> pat2 = Chem.MolFromSmarts("n(:[#6]([#6])[#6][#6](:[#6]:[#6]):[#6]:[#6])[#6][#6][#6]")
>>> mol.HasSubstructMatch(pat2)
True

While in Indigo, with the same structure and patterns,

>>> mol = indigo.loadMolecule("[Pd]123[n+]4c5c(c6n1c(c(c7[n+]2c(c(c8n3c(c(c4C=C5)c9cc[n+](cc9)C)cc8)c1cc[n+](cc1)C)C=C7)c1cc[n+](cc1)C)cc6)c1cc[n+](cc1)C CHEMBL344583")
>>> matcher = indigo.substructureMatcher(mol)
>>> qmol = indigo.loadSmarts("[#7]([#6]([#6])[#6][#6](:[#6]:[#6]):[#6]:[#6])[#6][#6][#6]")
>>> matcher.countMatches(qmol)
16
>>> qmol = indigo.loadSmarts("[#7](:[#6]([#6])[#6][#6](:[#6]:[#6]):[#6]:[#6])[#6][#6][#6]")
>>> matcher.countMatches(qmol)
8
>>> qmol = indigo.loadSmarts("[#7](-[#6]([#6])[#6][#6](:[#6]:[#6]):[#6]:[#6])[#6][#6][#6]")
>>> matcher.countMatches(qmol)
8
>>>

The other case where Indigo reported a larger MCS (two atoms) was with

C123C4(C(N(CC1)CC5CC5)Cc6c2c(c(cc6)O)OC3c7c(c8c([nH]7)ccc(c8)CN=C(NC#N)N)C4)O CHEMBL92022
C123C4(C(N(CC1)CC5CC5)Cc6c2c(c(cc6)O)OC3c7c(c8c([nH]7)ccc(c8)CCNC(=O)NCCCC)C4)O CHEMBL107655
C123C4(C(N(CC1)CC5CC5)Cc6c2c(c(cc6)O)OC3c7c(c8c([nH]7)ccc(c8)CCNC(=O)CC)C4)O CHEMBL110816
C123C4(C(N(CC1)CC5CC5)Cc6c2c(c(cc6)O)OC3c7c(c8c([nH]7)ccc(c8)CCNC(=O)CCC)C4)O CHEMBL110487
n1(cc(c2c1cccc2)CC(NC(=O)CCCCCCCCCCCCCCC)C(CP(=O)(O)O)O)Cc3ncc(c(c3C)OC)C CHEMBL256855
n1(cc(c2c1cccc2)CC(NC(=O)CCCCCCCCCCCCCCC)C(CP(=O)(O)O)O)Cc3ccccc3 CHEMBL272955
C123c4c(c-5c6c(nc5c(n4CC1C2)O)c(c(c(c6)OC)OC)OC)C(=O)c7c3c(c([nH]7)C)C(=O)OC CHEMBL303615
C123C4(C(N(CC1)CC5CC5)Cc6c2c(c(cc6)O)OC3c7c(c8c([nH]7)ccc(c8)CCNC(=O)NCC)C4)O CHEMBL321408
C123C4(C(N(CC1)CC5CC5)Cc6c2c(c(cc6)O)OC3c7c(c8c([nH]7)ccc(c8)CCNC(=O)CCCC)C4)O CHEMBL324644
C123C4(C(N(CC1)CC5CC5)Cc6c2c(c(cc6)O)OC3c7c(c8c([nH]7)ccc(c8)CCNC(=N)N)C4)O CHEMBL328151

% python indigo_mcs.py off_by_2.smi
C:C(CC(CC)NCCCC):C:C 13 atoms 12 bonds
% fcmcs.py --compare ignore-atom-aromaticity off_by_2.smi
[#6](-[#6]-[#6]-[#6]:[#6]:[#6]:[#6]:[#6]:[#6]:[#6])-[#6] 11 atoms 10 bonds
% python fcmcs.py --compare ignore-aromaticity off_by_2.smi --verbose
Read 10 structures from off_by_2.smi
[#6]-,:[#6](-,:[#6]-,:[#6](-,:,=[#7]-,:,=[#6]-,:,=[#6]-,:,=[#6]-,:,=[#6]-,:,=[#6]-,:,=[#6])-,:,=[#6])-,:[#6]-,:[#6](-,:[#7]-,:[#6]-,:[#6]-,:[#6])-,:[#6]-,:[#6]-,:,=[#6]-,:,=[#6] 22 atoms 21 bonds
Times: load 0.02 simplify 0.01 select 0.00 mcs 64.44 seconds

This again appears to be aromaticity related (the view is strengthened by the much larger MCS if aromaticity is completely ignored), but I didn't track it down like I did the previous case.

In any case, with only 3 differences remaining, I think it's okay to say that the differences are due to chemistry perception and not fundamental differences in the MCS perception algorithms or comparison methodology.

Cheers,

Andrew
da...@dalkescientific.com

Savelyev Alexander

Apr 20, 2012, 9:50:56 AM

to indigo-...@googlegroups.com

Hi Andrew,

> % python indigo_mcs.py off_by_4.smi
> N(C(C)CC(:C:C):C:C)CCC 12 atoms 11 bonds
>
> % python fcmcs.py off_by_4.smi
> c(:c):c(:c:c)-C-C-C 8 atoms 7 bonds
> % python fcmcs.py --compare ignore-atom-aromaticity off_by_4.smi
> [#6](:[#6]):[#6](-[#6]-[#6]-[#6]):[#6]:[#6] 8 atoms 7 bonds
>
> The difference comes down to a difference in chemistry perception of the organometallic CHEMBL344583
>
> >>> from rdkit import Chem
> >>> mol = Chem.MolFromSmiles("[Pd]123[n+]4c5c(c6n1c(c(c7[n+]2c(c(c8n3c(c(c4C=C5)c9cc[n+](cc9)C)cc8)c1cc[n+](cc1)C)C=C7)c1cc[n+](cc1)C)cc6)c1cc[n+](cc1)C")
> >>> fcmcs_pat = Chem.MolFromSmarts("c(:c):c(:c:c)-C-C-C")
> >>> indigo_pat = Chem.MolFromSmarts("N(C(C)Cc(:c:c):c:c)CCC")
> >>> mol.HasSubstructMatch(fcmcs_pat)
> True
> >>> mol.HasSubstructMatch(indigo_pat)
> False
>
> As Indigo's scaffold search ignores atom aromaticity, the correct SMARTS equivalent is

That is not true. The extractCommonScaffold() method considers atom aromaticity (':' means aromatic bond) and returns the query scaffold as a SMILES, not as a SMARTS! So you should load it as a SMILES (loadQueryMolecule() or loadQueryMoleculeFromFile() in Indigo). For example:

--------------------------------------------------------------------------

from indigo import Indigo
indigo = Indigo()

mol = indigo.loadMolecule("[Pd]123[n+]4c5c(c6n1c(c(c7[n+]2c(c(c8n3c(c(c4C=C5)c9cc[n+](cc9)C)cc8)c1cc[n+](cc1)C)C=C7)c1cc[n+](cc1)C)cc6)c1cc[n+](cc1)C")

matcher = indigo.substructureMatcher(mol, "")

# SMARTS
qmol1 = indigo.loadSmarts("N(C(C)CC(:C:C):C:C)CCC")
match = matcher.match(qmol1)

if match is not None:
    print "true"
else:
    print "false"

>>>>>> "false"

# SMILES
qmol2 = indigo.loadQueryMolecule("N(C(C)CC(:C:C):C:C)CCC")

match = matcher.match(qmol2)

if match is not None:
    print "true"
else:
    print "false"

>>>>>> "true"
-------------------------------------------------------------------------------

Unfortunately, Indigo does not have a smarts() method at the moment (but it will be added someday).

By the way, in the previous letter:

> The output might be better written as 'c1:c:c:c:c:c:1'. In
> that way the output SMILES can also be used as a SMARTS.

I think we will keep the current output as it is, because there are examples where using the same string as both SMARTS and SMILES is unsafe. The simplest example is:

mol1 = 'c1ccc(cc1)-c1ccccc1'
mol2 = 'CC'

The MCS is 'CC'. If you save it as 'cc' (whether it is read as SMARTS or as SMILES) then it does not match mol2. If you save it as 'CC' and read it as SMARTS, then it does not match mol1. The example is correct if you save it as 'CC' (SMILES) or '[#6]-[#6]' (SMARTS).

Therefore, until a smarts() method appears in the API, an MCS scaffold is correct only as a SMILES or molfile (loaded by loadQueryMolecule())
----------------------------------------------------------------------

....
scaf = indigo.extractCommonScaffold(arr, "exact")

scaf_smiles = scaf.smiles()
scaf_mol = scaf.molfile()
....
qmol1 = indigo.loadQueryMolecule(scaf_smiles)
.....
qmol2 = indigo.loadQueryMolecule(scaf_mol)
....

----------------------------------------------------------------------
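
The 'CC' versus 'cc' example above can be sketched without any toolkit. The following is only a toy model, not Indigo or RDKit code: the graph representation, the function names and the QUERIES table are all made up for illustration. It assumes the usual SMARTS reading (lowercase means aromatic atom, uppercase means aliphatic atom, an unwritten bond means "single or aromatic") and shows why only '[#6]-[#6]' matches both input molecules.

```python
# Toy model (hypothetical, not Indigo or RDKit code).
# Atoms are (element, is_aromatic); bonds are (index_1, index_2, kind),
# where kind is "single" or "aromatic".

def benzene_ring(offset):
    """Six aromatic carbons joined by six aromatic bonds."""
    atoms = [("C", True)] * 6
    bonds = [(offset + i, offset + (i + 1) % 6, "aromatic") for i in range(6)]
    return atoms, bonds

def biphenyl():
    """c1ccc(cc1)-c1ccccc1: two benzene rings joined by one single bond."""
    atoms1, bonds1 = benzene_ring(0)
    atoms2, bonds2 = benzene_ring(6)
    return atoms1 + atoms2, bonds1 + bonds2 + [(0, 6, "single")]

def ethane():
    """CC: two aliphatic carbons joined by one single bond."""
    return [("C", False)] * 2, [(0, 1, "single")]

def is_single_or_aromatic(kind):
    """SMARTS default for a bond that is not written explicitly."""
    return kind in ("single", "aromatic")

# Each two-atom query is (atom_predicate, bond_predicate).
QUERIES = {
    "SMARTS 'cc'": (lambda a: a == ("C", True), is_single_or_aromatic),
    "SMARTS 'CC'": (lambda a: a == ("C", False), is_single_or_aromatic),
    "SMARTS '[#6]-[#6]'": (lambda a: a[0] == "C", lambda k: k == "single"),
}

def matches(mol, query):
    """True if any bond in mol satisfies the two-atom query."""
    atoms, bonds = mol
    atom_ok, bond_ok = query
    return any(
        bond_ok(kind) and atom_ok(atoms[i]) and atom_ok(atoms[j])
        for i, j, kind in bonds
    )

for name, query in QUERIES.items():
    print(name, "mol1:", matches(biphenyl(), query),
          "mol2:", matches(ethane(), query))
```

Under these assumptions 'cc' matches only mol1, 'CC' matches only mol2, and '[#6]-[#6]' matches both, through the single bond that joins the two rings in mol1.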

With best regards,
Alexander

Andrew Dalke

Apr 20, 2012, 10:17:27 AM

to indigo-...@googlegroups.com

Hi Alexander,

>> As Indigo's scaffold search ignores atom aromaticity, the correct SMARTS equivalent is
>>
>>
>
> It is not true. The extractCommonScaffold() considers atom aromaticity (':' means aromatic bond) and returns query scaffold as a SMILES but not as a SMARTS!

I should have explained that better. In the cheminformatics toolkits I have the most experience with, atoms have an aromaticity flag, and bonds have an aromaticity flag.

It looks like Indigo's scaffold detection code says that two atoms match if the element matches, but it doesn't check to see if that atom aromaticity matches. It absolutely does check that bond aromaticities match.

In my default comparison, I check that both the atoms' element and aromaticity flag match.

Here's an example of the difference:

% cat > x.smi
c1ccccc1O phenol
CO methanol

% python indigo_mcs.py x.smi
CO 2 atoms 1 bonds

% python fcmcs.py --compare ignore-atom-aromaticity x.smi
[#6]-[#8] 2 atoms 1 bonds

% python fcmcs.py x.smi
No MCS found

I don't know if this is chemically meaningful or appropriate. It's something I'm going to ask my client.

For myself, I've been using the terms 'atom aromaticity' and 'bond aromaticity' to distinguish between the two.
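
To make the two comparison rules concrete, here is a minimal sketch. It is not the fcmcs or Indigo implementation, and every name in it is hypothetical. It only asks whether the two molecules share a single bond, which is enough for this example: the SMILES 'c1ccccc1O' is phenol and the SMILES 'CO' denotes methanol. Bond aromaticity is always compared; atom aromaticity is compared only when asked.

```python
# Toy model (hypothetical, not fcmcs or Indigo code).
# Atoms are (element, is_aromatic); bonds are (index_1, index_2, kind).

def phenol():
    """c1ccccc1O: aromatic ring, with the oxygen on ring atom 5."""
    atoms = [("C", True)] * 6 + [("O", False)]
    bonds = [(i, (i + 1) % 6, "aromatic") for i in range(6)]
    return atoms, bonds + [(5, 6, "single")]

def methanol():
    """CO: aliphatic carbon singly bonded to oxygen."""
    return [("C", False), ("O", False)], [(0, 1, "single")]

def atoms_match(a, b, compare_atom_aromaticity):
    """Elements must agree; aromaticity flags only if requested."""
    if a[0] != b[0]:
        return False
    return a[1] == b[1] if compare_atom_aromaticity else True

def common_bonds(mol1, mol2, compare_atom_aromaticity):
    """Pairs of bonds (one per molecule) that are compatible.

    The bond kind (single vs. aromatic) is always compared, which
    corresponds to 'bond aromaticity' in the discussion above.
    """
    atoms1, bonds1 = mol1
    atoms2, bonds2 = mol2
    pairs = []
    for i, j, kind1 in bonds1:
        for k, l, kind2 in bonds2:
            if kind1 != kind2:
                continue
            # Try both orientations of the second bond.
            for x, y in ((k, l), (l, k)):
                if (atoms_match(atoms1[i], atoms2[x], compare_atom_aromaticity)
                        and atoms_match(atoms1[j], atoms2[y], compare_atom_aromaticity)):
                    pairs.append(((i, j), (x, y)))
                    break
    return pairs

print("ignoring atom aromaticity:", common_bonds(phenol(), methanol(), False))
print("comparing atom aromaticity:", common_bonds(phenol(), methanol(), True))
```

With atom aromaticity ignored, the C-O bond is shared, which corresponds to the 2-atom, 1-bond result above. With atom aromaticity compared, the aromatic ring carbon cannot pair with the aliphatic carbon, so no shared bond is found.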

Do you have a suggestion for an alternative, less confusing phrase?

>> The output might be better written as 'c1:c:c:c:c:c:1'. In
>> that way the output SMILES can also be used as a SMARTS.
> I think, we will keep the current saving as it is. Because, there are examples where the above SMARTS and SMILES using at the same time can be unsafe. The simplest example is:
>
> mol1 = 'c1ccc(cc1)-c1ccccc1'
> mol2 = 'CC'.

Yes, I agree with you.

In looking at MCS outputs from some other programs, I see other cases where the output fragment SMILES cannot correctly be interpreted as a SMARTS because of the ambiguity in how to interpret the bond between two adjacent atoms.

In any case, I believe the difference between Indigo and RDKit regarding that one Pd structure is caused by differences in chemistry perception.

Cheers,

Andrew
da...@dalkescientific.com

Savelyev Alexander

Apr 24, 2012, 11:03:49 AM

to indigo-...@googlegroups.com

Hi Andrew,

Thanks for the explanations. I cannot suggest any alternatives. I had
not noticed the word 'atom' in 'atom aromaticity' because I have only
met an atom aromaticity constraint in substructure searching with
SMARTS queries. For me, scaffold searching is something different,
since it is defined for simple molecules. But you are right, we can
easily add an additional matching rule for atoms here. I think such a
rule is chemically meaningful, but it is a very strict one.

I can share my experience of how scaffold extraction results are used.
The usual next operation is to decompose the molecules into a scaffold
and R-groups (RGroup Decomposition). It is very important to have the
largest possible scaffold, so that the most common core of the molecules
is extracted for further chemical research. Therefore, a lot of chemical
information (e.g. stereocenters, charges, etc.) is ignored. Moreover, I
have seen a chemist use scaffold detection with all electronic
information discarded (atom and bond types were ignored, and only the
scaffold skeleton was used).

Please let us know if you discover appropriate examples. Anyway, thanks
for raising the issue. I think the atom aromaticity constraint will
appear as an option.

With best regards,
Alexander

Andrew Dalke

Apr 28, 2012, 6:17:32 PM

to indigo-...@googlegroups.com

On Apr 17, 2012, at 9:32 AM, Savelyev Alexander wrote:
> In the current Indigo version (1.1 beta 11) you can set timeout option:
>
> indigo.setOption("timeout", milliseconds)
>
> After extractCommonScaffold() is timed out, it throws an exception. But there is no way to get its current best match. The algorithm does not allow to do this, because it is based on the solution refining (decreasing a MCS) during the molecules screening. Therefore, in contrast with your algorithm, current matches are exact solutions only for a part of input molecules, but may be not correct for the other part.

I'm having some problems with the timeout setting. It doesn't always seem to work.

In one test case, I set the timeout to 5 seconds but it took 30 seconds until it raised an exception. However, if I used a 60 second timeout then it stopped at 60.02 seconds. I'm going to hazard that there's a setup stage which doesn't have a timeout, and a search stage which does have a timeout?
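
That guess can be illustrated with a toy model. This is not Indigo's code, and I don't know how its timeout is actually implemented; the phase names, step costs and the fake clock are all invented so that the numbers are deterministic. The point is only that a cooperative timeout fires where the code checks the deadline, so a phase without checks runs to completion regardless of the setting.

```python
# Toy model of a cooperative timeout (hypothetical, not Indigo code).

class SearchTimeout(Exception):
    """Raised when a deadline check sees that the time limit has passed."""
    def __init__(self, elapsed):
        super().__init__("timed out after %.2f s" % elapsed)
        self.elapsed = elapsed

class FakeClock:
    """Deterministic clock: time moves only when advance() is called."""
    def __init__(self):
        self.now = 0.0
    def advance(self, seconds):
        self.now += seconds

def run_search(clock, timeout, setup_costs, search_costs, check_during_setup):
    """Run a setup phase, then a search phase, on the fake clock.

    The deadline is checked after every search step, but after a setup
    step only if check_during_setup is True.
    """
    start = clock.now

    def check_deadline():
        elapsed = clock.now - start
        if elapsed >= timeout:
            raise SearchTimeout(elapsed)

    for cost in setup_costs:
        clock.advance(cost)          # pretend to do some setup work
        if check_during_setup:
            check_deadline()
    for cost in search_costs:
        clock.advance(cost)          # pretend to do some search work
        check_deadline()
    return clock.now - start

def time_until_timeout(timeout, check_during_setup):
    """Elapsed fake seconds when SearchTimeout fires, or None if it never does."""
    setup_costs = [1.0] * 30         # assumed: 30 s of setup in 1 s steps
    search_costs = [0.5] * 200       # assumed: 100 s of search in 0.5 s steps
    try:
        run_search(FakeClock(), timeout, setup_costs, search_costs,
                   check_during_setup)
    except SearchTimeout as err:
        return err.elapsed
    return None

print("5 s timeout, unchecked setup: ", time_until_timeout(5, False))
print("60 s timeout, unchecked setup:", time_until_timeout(60, False))
print("5 s timeout, checked setup:   ", time_until_timeout(5, True))
```

In this model a 5 second timeout with an unchecked 30 second setup phase fires only once the search phase starts, while a 60 second timeout fires on time, which is the same pattern as the two measurements above.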

In the other case I've found, I have a 60 second timeout and it's been working on finding the scaffold for the last few minutes.

The structures come from random pairings of ChEMBL data. (I do realize this is not the expected use of the scaffold code.) This one, which has taken 30+ minutes, comes from trying to find the scaffold between:

CHEMBL1091306 - https://www.ebi.ac.uk/chembldb/compound/inspect/CHEMBL1091306
CHEMBL446018 - https://www.ebi.ac.uk/chembldb/compound/inspect/CHEMBL446018

They are admittedly some rather large structures. My own MCS program times out after 60 seconds with a common structure containing 61 atoms and 60 bonds.

Still, would it be possible to have the timeout checking work for this case?

Cheers,

Andrew
da...@dalkescientific.com

Andrew Dalke

Apr 29, 2012, 12:42:05 AM

to indigo-...@googlegroups.com

On Apr 29, 2012, at 12:17 AM, Andrew Dalke wrote:
> This one which has taken 30+ minutes comes from trying to find the scaffold between:
>
> CHEMBL1091306 - https://www.ebi.ac.uk/chembldb/compound/inspect/CHEMBL1091306
> CHEMBL446018 - https://www.ebi.ac.uk/chembldb/compound/inspect/CHEMBL446018

It finished! It took 23087 seconds to time out, rather than the 60 I specified.

Andrew
da...@dalkescientific.com

Savelyev Alexander

May 2, 2012, 3:17:21 AM

to indigo-...@googlegroups.com, indigo-...@googlegroups.com

Hi Andrew,

Thanks for the bug report. The fixed version (1.1 rc) is already
available to download.

With best regards,
Alexander

Andrew Dalke

May 2, 2012, 1:26:52 PM

to indigo-...@googlegroups.com

Hi Alexander,

On May 2, 2012, at 9:17 AM, Savelyev Alexander wrote:
> Thanks for the bug report. The fixed version (1.1 rc) is already available to download.

I've tested it and confirmed that it now times out exactly when expected.

Thanks!

Andrew
da...@dalkescientific.com
