I've been developing a new MCS algorithm to find the MCS of a set of structures. It is (currently) called 'fcmcs' and uses RDKit for the chemistry. It uses a rather different approach than the standard clique approach. It's currently written in Python.
I'm at the stage where I'm testing it against other MCS tools. I tested my code against the pairwise MCS algorithm of SMSD http://www.ebi.ac.uk/thornton-srv/software/SMSD/ last Friday and Saturday, and yesterday I started testing both the pairwise and the multiple structure cases with Indigo.
I have some questions (marked with "Q:") regarding the Indigo implementation of extractCommonScaffold. They are inline with my description of the comparison I've been doing. I figured that some of the Indigo developers may be interested in the comparison numbers.
My protocol is:
  - take ChEMBL-13 and make a new data set with only the largest fragments
  - save the new structures as SMILES (Indigo points out that there is contradictory stereo information in some of the ChEMBL-13 data files, and refuses to load those structures. I used another tool for the conversion, and I presume it ignores that problem.)
For the pairwise comparison:
  - extract two structures at random
  - find the MCS using the two algorithms under test
  - set a limit of 30s of CPU time
For the group comparison:
  - generate RDKit fingerprints for the ChEMBL-13 data set
  - pick a fingerprint at random
  - find the k=10 (or later, 100) nearest structures with at least 0.80 Tanimoto similarity; there must be at least 5 compounds in the data set. There may be fewer than k.
  - find the MCS of that data set
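The group-selection step above can be sketched in plain Python. This is only an illustration, not the script used for the tests: fingerprints are modelled as sets of on-bit indices rather than RDKit fingerprint objects, the names `tanimoto` and `pick_group` are made up for this sketch, and it assumes that the randomly picked structure itself counts toward the minimum of 5.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def pick_group(fingerprints, query_id, k=10, threshold=0.80, min_size=5):
    """Return the query id plus up to k nearest neighbors, or None if too few.

    fingerprints maps a structure id to its set of on bits.
    Assumption: the query itself counts toward min_size.
    """
    query = fingerprints[query_id]
    hits = []
    for other_id, fp in fingerprints.items():
        if other_id == query_id:
            continue
        score = tanimoto(query, fp)
        if score >= threshold:
            hits.append((score, other_id))
    # Most similar first; there may be fewer than k hits.
    hits.sort(key=lambda hit: hit[0], reverse=True)
    group = [query_id] + [other_id for _, other_id in hits[:k]]
    return group if len(group) >= min_size else None
```

A group that comes back as None would simply be skipped and another fingerprint picked at random.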
I am pleased to report that Indigo's extractCommonScaffold is usually very fast, which I think you all already knew.
Using the pairwise protocol above, I tested 3219 pairs (excluding the one Indigo case and two fcmcs cases which timed out at 30s).
  - Indigo took 60.1 seconds, of which the longest was 0.83 seconds
  - fcmcs took 350. seconds, of which 5 took over 10 seconds and 38 took over 1 second.
(This time is biased. I have removed 0.109s that it takes for Indigo to find the MCS of "CC" against itself, and the 0.219s it takes for fcmcs to do the same. This should minimize the impact of startup costs. I processed one MCS calculation per run so I could use rlimit to enforce an MCS search timeout for each run. The program I used is at the end of this email.)
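For readers who have not used rlimit this way, here is a minimal sketch of a one-calculation-per-run harness with an OS-enforced CPU limit. It is not the program referred to above; the function name `run_with_cpu_limit` is invented for this sketch, and it is Unix-only because it relies on RLIMIT_CPU.

```python
import resource
import signal
import subprocess
import sys


def run_with_cpu_limit(argv, cpu_seconds=30):
    """Run argv as a child process limited to cpu_seconds of CPU time.

    Returns (timed_out, stdout). The kernel sends SIGXCPU when the soft
    limit is reached, which terminates a child that does not handle it.
    """
    def limit_cpu():
        # Executed in the child just before exec(); keep the existing hard limit.
        _, hard = resource.getrlimit(resource.RLIMIT_CPU)
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, hard))

    proc = subprocess.run(argv, preexec_fn=limit_cpu,
                          stdout=subprocess.PIPE, universal_newlines=True)
    timed_out = proc.returncode in (-signal.SIGXCPU, -signal.SIGKILL)
    return timed_out, proc.stdout


if __name__ == "__main__":
    # Stand-in for "one MCS calculation per run": any command line will do.
    print(run_with_cpu_limit([sys.executable, "-c", "print('done')"], cpu_seconds=30))
```

Because the limit is on CPU time rather than wall-clock time, a child that is waiting on I/O is not charged for the wait.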
Of those 3219 cases:
  - Indigo found a larger MCS (more atoms) 177 times (5.5%), on average 2.34 atoms larger than what fcmcs found
  - fcmcs found a larger MCS 618 times (19%), on average 2.36 atoms larger
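The tallies above (how often each program finds the larger MCS, and by how many atoms on average) can be computed with a few lines like the following. This is a sketch only: the function name `summarize` and the sample sizes are invented for illustration and are not taken from the test data.

```python
def summarize(sizes):
    """Summarize paired MCS sizes.

    sizes is a list of (atoms_a, atoms_b) pairs, one per test case.
    Returns {"a_larger": (count, percent, mean_excess), "b_larger": (...)}.
    """
    total = len(sizes)

    def describe(diffs):
        count = len(diffs)
        percent = 100.0 * count / total if total else 0.0
        mean_excess = sum(diffs) / count if count else 0.0
        return count, percent, mean_excess

    a_larger = [a - b for a, b in sizes if a > b]
    b_larger = [b - a for a, b in sizes if b > a]
    return {"a_larger": describe(a_larger), "b_larger": describe(b_larger)}


# Made-up sizes: one tie, one case where A is 2 atoms larger,
# and two cases where B is larger (by 3 atoms and by 1 atom).
example = [(10, 10), (12, 10), (8, 11), (9, 10)]
print(summarize(example))
```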
I can think of several possible reasons for these differences:
1) RDKit and Indigo have different aromaticity models. I observed with SMSD that 7.2% of the time it found an MCS that was on average 2.1 atoms larger, and 6.7% of the time fcmcs found an MCS that was on average 2.1 atoms larger. SMSD uses CDK for its chemistry model.
I set the "deconvolution-aromatization" flag to True, so both Indigo and RDKit are perceiving their own respective aromaticity. In my next analysis I'll have RDKit generate the SMILES and let Indigo use the processed SMILES as input, because I believe Indigo does not modify the chemistry upon input.
Q: If I let RDKit generate the SMILES then how do I get Indigo to use the RDKit aromaticity expressed in the SMILES? Am I correct in thinking that I don't need to do anything for that case?
2) I believe I read in an email posting that extractCommonScaffold() maximizes the number of rings in the MCS, while I've configured fcmcs here to maximize the number of atoms, with ties broken on the number of bonds. There are obvious cases where the largest MCS does not contain a ring while a smaller common substructure does.
Q: What subgraph property does Indigo maximize for the MCS search? How are ties broken?
3) I configured my code to match atoms based on element and match bonds based on bond type (single, aromatic, double, and triple). That gave the best match to the Indigo results. However, I was not able to find documentation for this.
Q: How does Indigo tell if two atoms/bonds are equal?
BTW, some of the other variations I've come across are:
  - ignore bond types completely (this is the default for SMSD)
  - ignore bond types and elements (this is a topology-only MCS)
  - assume that aromatic bonds can also match single and double bonds
  - ring bonds can only match ring bonds; non-ring bonds can only match non-ring bonds
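To make the bond-matching variations above concrete, here is a small sketch of how they might be expressed as a single comparison function. The mode names are hypothetical and do not correspond to options in fcmcs, Indigo, or SMSD, and combining the ring rule with strict bond-order matching is an assumption of this sketch.

```python
FLEXIBLE = {"single", "double"}


def bonds_match(bond_a, bond_b, mode="strict"):
    """Decide whether two bonds may be paired in a common substructure.

    Each bond is (order, in_ring), with order one of
    "single", "double", "triple", "aromatic".
    """
    order_a, ring_a = bond_a
    order_b, ring_b = bond_b
    if mode == "strict":
        # Bond orders must be identical; aromatic only matches aromatic.
        return order_a == order_b
    if mode == "ignore":
        # Bond types are ignored completely.
        return True
    if mode == "aromatic-flexible":
        # Aromatic bonds may also match single and double bonds.
        if order_a == order_b:
            return True
        return ((order_a == "aromatic" and order_b in FLEXIBLE) or
                (order_b == "aromatic" and order_a in FLEXIBLE))
    if mode == "ring-matches-ring":
        # Ring bonds only match ring bonds, chain bonds only chain bonds.
        return order_a == order_b and ring_a == ring_b
    raise ValueError("unknown mode: %r" % (mode,))
```

The topology-only variation would additionally need an atom comparison that always returns True.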
Q: Is there some way to specify an alternate match method in Indigo? If so, what match variations are supported?
With the pair-wise comparison complete, I went onto the group MCS comparison. The method I chose gives highly similar structures.
Out of 201 comparisons:
  - Average MCS size: 22.1 atoms for Indigo, 22.2 atoms for fcmcs
  - Indigo time: 178.7 seconds with 9 timeouts (giving no MCSes); adding in 30 seconds per timeout gives 448.7 seconds (2.23 sec/MCS)
  - fcmcs time: 443.7 seconds with 52 timeouts (but a large common substructure is known); adding in the timeouts gives 2008. seconds (10. sec/MCS)
    ==> Indigo is about 5x faster than fcmcs
  - Indigo finds a larger substructure (more atoms) 3 times, with 1 atom each time
  - fcmcs finds a larger substructure 8 times, with an average of 3.6 more atoms
    => 95% of the time they report the same values
My algorithm supports a 'timeout' value. When it notices that the timeout has been reached, it will report the best structure seen to date. I observed that my algorithm often finds the MCS within the first few seconds, then spends a lot of time proving that alternatives cannot be better, or only increasing the MCS slightly.
Thus, I could use a soft timeout (meaning, implemented by the code and not by the OS) of 30 seconds and always get some decent common structure.
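The soft-timeout idea can be sketched generically: the search loop itself checks the clock and hands back the best answer seen so far. This is not the fcmcs search; `best_within` and its arguments are names invented for this illustration, and any iterable of candidate solutions stands in for the real enumeration.

```python
import itertools
import time


def best_within(candidates, score, timeout):
    """Return (best, finished) for a search with a soft timeout in seconds.

    candidates is any iterable of partial answers and score ranks them.
    The deadline is checked after each candidate, so at least one candidate
    is always examined, and finished is False when the deadline cut the
    search short.
    """
    deadline = time.monotonic() + timeout
    best = None
    best_score = None
    for candidate in candidates:
        value = score(candidate)
        if best_score is None or value > best_score:
            best, best_score = candidate, value
        if time.monotonic() >= deadline:
            return best, False
    return best, True


# A finite search finishes normally; an endless one is cut off but still answers.
print(best_within([3, 1, 4, 1, 5], lambda x: x, timeout=10.0))
print(best_within(itertools.count(), lambda x: x, timeout=0.05)[1])
```

The contrast with a hard rlimit is that the process survives and can still report something useful.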
Q: Does Indigo support a timeout setting? Is there some way to get its current best match once some time has been exceeded?
I tested soft timeouts of 0.2, 0.5, and 5.0 seconds for fcmcs, rather than the hard rlimit-based solution I used for Indigo. This gives me a large common substructure, but not necessarily the MCS.
I found that 5.0 seconds was a good compromise between performance and size. After running 1,000 comparisons I found:
  - Indigo time: 605.7 seconds with 56 timeouts (giving no MCSes); adding in 30 seconds per timeout gives 2286 seconds (2.29 sec/MCS)

@0.2 second timeout for fcmcs:
  - Indigo is larger 59 times, with an average of 4.80 more atoms
  - fcmcs is larger 25 times, with an average of 4.1 more atoms
  - estimated time: 140 seconds (0.14 sec/MCS)

@0.5 seconds:
  - Indigo is larger 37 times, with an average of 4.24 more atoms
  - fcmcs is larger 27 times, with an average of 4.0 more atoms
  - measured time: 355.84 seconds (0.356 sec/MCS)

@5.0 seconds:
  - Indigo is larger 16 times, with an average of 2.5 more atoms
  - fcmcs is larger 35 times, with an average of 3.4 more atoms
  - rough estimated time: 2500 seconds (2.5 sec/MCS)
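The Indigo totals quoted above charge every timed-out run the full 30-second limit. As a worked check of that bookkeeping, here is the arithmetic as a small function; `penalized_time` is a name chosen for this sketch only.

```python
def penalized_time(measured_seconds, timeouts, total_runs, penalty=30.0):
    """Add a fixed penalty per timed-out run and return (total, seconds per run)."""
    total = measured_seconds + timeouts * penalty
    return total, total / total_runs


# 1,000 comparisons: 605.7 s measured plus 56 timeouts at 30 s each.
total, per_mcs = penalized_time(605.7, 56, 1000)
print(round(total), round(per_mcs, 2))

# 201 comparisons: 178.7 s measured plus 9 timeouts at 30 s each.
total, per_mcs = penalized_time(178.7, 9, 201)
print(round(total, 1), round(per_mcs, 2))
```

These reproduce the 2286 seconds (2.29 sec/MCS) and 448.7 seconds (2.23 sec/MCS) figures reported for Indigo.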
However, because I am now approximating the MCS solution, I should also compare my solution against Indigo's approximate solution. I have not yet done so.
I also tried 311 cases where the search set contains the 100 nearest neighbors with a Tanimoto similarity of at least 0.8 to the randomly selected fingerprint. Of those, Indigo timed out (after 30 seconds) 21 times.
Of the remaining 290 cases:
@0.2 seconds timeout for fcmcs:
  - Indigo is larger 59 times, average 4.8 atoms larger
  - fcmcs is larger 25 times, average 4.1 atoms larger

@0.5 seconds timeout for fcmcs:
  - Indigo is larger 38 times, average 4.4 atoms larger
  - fcmcs is larger 27 times, average 4.0 atoms larger

@5.0 seconds:
  - Indigo is larger 16 times, average 2.5 atoms larger
  - fcmcs is larger 35 times, average 3.4 atoms larger
I can provide any of the data sets if desired. I am also in the process of setting up a test/regression set.
Yes, of course, such a test comparison is very interesting for us. We will be happy to provide any information required.
> Q: If I let RDKit generate the SMILES then how do I get Indigo
> to use the RDKit aromaticity expressed in the SMILES? Am I
> correct in thinking that I don't need to do anything for that case?
With "deconvolution-aromatization" set to false, Indigo does not modify the chemistry. With the flag set to true, Indigo aromatizes molecules. In this case, if an input molecule contains aromaticity, the library tries to 'enlarge' the aromaticity information. Therefore, if a SMILES contains aromaticity information, Indigo will use it in either case, so yes, you don't need to do anything.
> Q: What subgraph property does Indigo maximize for the MCS search?
> How are ties broken?
Indigo searches for the MCS as a largest induced subgraph of molecule1 that is isomorphic to an induced subgraph of molecule2. "Largest" here means that the resulting subgraph cannot be extended by adding another vertex (atom). E.g. there are three possible MCSes for the molecules 'NCOCCCS' and 'NCCOCCS' which cannot be extended: 'CCOC', 'NC' and 'CCS'. Indigo stores all the intermediate subgraphs which could be a result MCS. Thus, after the search is finished, an array of submolecules (all the solutions) is available. At the end, this array is sorted by the following rule: maximize the number of rings; if two molecules contain an equal number of rings, then maximize the number of bonds. The sorted array (all the maximum common submolecules) can be iterated with the allScaffolds() method:
    ....
    scaf = indigo.extractCommonScaffold(arr, "exact")
    # print all the mcs scaffolds smiles
    for scaffold in scaf.allScaffolds().iterateArray():
        print(scaffold.smiles())
    # print largest mcs scaffold smile
    print(scaf.smiles())
    ...
If you call scaf.allScaffolds() then the sorted result array is returned. If you call scaf.smiles() (or any method accepted for a molecule) then the first element of the array (the maximum element, since the array is sorted) is used.
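The effect of the sort rule described above can be illustrated without Indigo by ranking plain (atoms, bonds, rings) records. This is only a sketch of the ordering, not Indigo's code; the `Candidate` record, both function names, and the ring/chain pair are invented for illustration, while the three chain fragments are the ones from the 'NCOCCCS' / 'NCCOCCS' example.

```python
from collections import namedtuple

Candidate = namedtuple("Candidate", "smiles atoms bonds rings")


def rings_first(candidates):
    """Order as described above: most rings first, ties broken on bonds."""
    return sorted(candidates, key=lambda c: (c.rings, c.bonds), reverse=True)


def atoms_first(candidates):
    """Alternative order: most atoms first, ties broken on bonds."""
    return sorted(candidates, key=lambda c: (c.atoms, c.bonds), reverse=True)


# The three non-extendable common subgraphs from the example above.
chains = [Candidate("NC", 2, 1, 0), Candidate("CCS", 3, 2, 0), Candidate("CCOC", 4, 3, 0)]
print([c.smiles for c in rings_first(chains)])

# Hypothetical pair where the two rules disagree: a 6-atom ring
# versus a 7-atom chain with the same number of bonds.
mixed = [Candidate("six-ring", 6, 6, 1), Candidate("seven-chain", 7, 6, 0)]
print(rings_first(mixed)[0].smiles, atoms_first(mixed)[0].smiles)
```

With no rings involved the two rules agree, which is why the difference only shows up on some inputs.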
> Q: How does Indigo tell if two atoms/bonds are equal?
The atom/bond matching is similar to yours. Indigo matches atoms based on element and bonds based on order (single, double, triple, aromatic). One note: an aromatic bond can match only an aromatic bond. I suppose this match rule was one of the possible reasons for the MCS differences you saw with your model.
> Q: Is there some way to specify an alternate match method
> in Indigo? If so, what match variations are supported?
Unfortunately, there is no way to specify the match method. But it can easily be added in a future release. I am thinking about new parameters for setting up the above options for the extractCommonScaffold() method, e.g. extractCommonScaffold(arr, "exact ignore_bond_types")
> Q: Does Indigo support a timeout settings? Is there some way to
> get its current best match once some time has been exceeded?
In the current Indigo version (1.1 beta 11) you can set timeout option:
indigo.setOption("timeout", milliseconds)
After extractCommonScaffold() times out, it throws an exception. But there is no way to get its current best match. The algorithm does not allow this, because it is based on refining the solution (shrinking the MCS) as it screens the molecules. Therefore, in contrast with your algorithm, the current matches are exact solutions only for part of the input molecules, and may not be correct for the rest.
> I tested soft timeouts of 0.2, 0.5, and 5.0 seconds for fcmcs,
> rather than the hard rlimit-based solution I used for Indigo.
> This gives me a large common substructure, but not necessarily the
> MCS.
>
> I found that 5.0 seconds was a good compromise between performance
> and size. After running 1,000 comparisons I found:
>
> - Indigo time: 605.7 seconds with 56 timeouts (giving no MCSes)
>   adding in 30 seconds per timeout gives 2286 seconds (2.29 sec/MCS)
>
> @0.2 second timeout for fcmcs:
> - Indigo is larger 59 times, with an average of 4.80 more atoms
> - fcmcs is larger 25 times, with an average of 4.1 more atoms
> - estimated time: 140 seconds (0.14 sec/MCS)
>
> @0.5 seconds:
> - Indigo is larger 37 times, with an average of 4.24 more atoms
> - fcmcs is larger 27 times, with an average of 4.0 more atoms
> - measured time: 355.84 seconds (0.356 sec/MCS)
>
> @5.0 seconds:
> - Indigo is larger 16 times, with an average of 2.5 more atoms
> - fcmcs is larger 35 times, with an average of 3.4 more atoms
> - rough estimated time: 2500 seconds (2.5 sec/MCS)
>
> However, because I am now approximating the MCS solution, I
> should also compare my solution against Indigo's approximate
> solution. I have not yet done so.
The Indigo approximate algorithm should be much faster than the exact algorithm. You can regulate the iteration limit for the approximate search. The limit affects the MCS 'completeness'. The default iteration limit is 1000. If you want to increase the iteration limit (e.g. set it to 2000) you just need to add a number to the parameters string:

    extractCommonScaffold(arr, "approx 2000")
> With the "deconvolution-aromatization" set to false, Indigo does not modify the chemistry. With the flag set true, Indigo aromatizes molecules. In this case, if an input molecule contains aromaticity, the library tries to 'enlarge' the aromaticity information. Therefore, if a SMILES contains arom-info, Indigo will use it in any cases, so yes, you don't need to do anything.
Thank you for confirming that.
On the topic of aromaticity, I noticed that the MCS of two structures like
    c1ccccc1OC1CCCCC1
    c1ccccc1N
returns a scaffold.smiles() of
C1:C:C:C:C:C:1
The output might be better written as 'c1:c:c:c:c:c:1'. In that way the output SMILES can also be used as a SMARTS.
>> Q: What subgraph property does Indigo maximize for the MCS search?
>> How are ties broken?
> Indigo searches MCS in terms of a largest induced molecule1 subgraph isomorphic to an induced molecule2 subgraph. Largest means that a result subgraph can not be extended by the adding a vertex (atom). ...
> after a searching is finished, there is an array (all the solutions) of submolecules available. In the end, the result molecules array is sorted by the following rule: maximize rings number; if two molecules contain equal ring number, then maximize bonds number. The result sorted molecules array (all the maximum common submolecules) can be iterated by the allScaffolds() method:
Very clear explanation, thank you.
>> Q: How does Indigo tell if two atoms/bonds are equal?
> The atoms/bonds matching is similar to yours. Indigo matches atoms based on element and bonds based on order (single, double, triple, aromatic). There is a note here: an aromatic bond can match only aromatic bond. I suppose such a match rule was one of the possible reasons for the specified MCS differences in your model.
Yes, I have the same belief. Silly chemistry gets in the way of perfectly good graph theory. :)
I also have options to ignore bond type completely, or to allow ":" to also match both '-' and '=' bonds.
> Unfortunately, there is no way to specify match method. But it can be simply added in the future releases. I am thinking about new parameters for setting up the above options for extractCommonScaffold() method, e.g. extractCommonScaffold(arr, "exact ignore_bond_types")
The MultiMCS code published last year in JCIM describes a few different options which you might find of interest.
* Whether the discovered MCS must be connected (single component) or may be disconnected (multicomponent).
* Whether or not ring bonds are allowed to match chain bonds in this MCS.
* Whether or not rings are allowed to match partially in this MCS.
* Whether or not the MCS found should include a specified seed substructures, i.e., a substructural fragment that is required by the user to be a part of the MCS.
* Different atom-typing schemes: a default scheme, an element-type-based scheme, or any other user-specified scheme.
In addition, the SMSD paper describes some properties they found useful:
a. Specific matching of the chemical functional groups, bond types (aromatic, non-aromatic, double, single etc.) and stereochemistry, e.g. phosphate, -SH, nitrite etc. are identified and matched.
b. The resulting solutions are sorted in ascending order of the total bond breaking energy (energy required to break the bonds between matched parts and unmatched parts) required by this MCS match (i.e. lowest energy is highest ranked).
c. The best set of solutions are chosen based on the above two steps and the solutions are then sorted in decreasing order according to the number of fragments generated if the matched part of the molecule is removed from the reference structure (i.e. if a three member ring is matched to a single ring structure then solutions which match the rings on the periphery are preferred over the central ring).
Lastly, the OEChem algorithm lets you flag which atom and bond properties to compare.
>> Q: Does Indigo support a timeout settings? Is there some way to
>> get its current best match once some time has been exceeded?
> In the current Indigo version (1.1 beta 11) you can set timeout option:
> indigo.setOption("timeout", milliseconds)
Ah, I hadn't realized that. That option is listed in the Changelog for beta10, which probably explains why API documentation for the older, stable version doesn't mention it.
> After extractCommonScaffold() is timed out, it throws an exception. But there is no way to get its current best match. The algorithm does not allow to do this, because it is based on the solution refining (decreasing a MCS) during the molecules screening. Therefore, in contrast with your algorithm, current matches are exact solutions only for a part of input molecules, but may be not correct for the other part.
Mmmm, yes, I see the problem. I've only ever worked on the MCS problem using a backtracking algorithm, with the goal of only finding large MCSes and not all induced subgraphs. I can understand how that goal affects my thinking.
> Indigo approximate algorithm should be much faster then the exact algorithm. You can regulate iteration limit for the approximate searching. The limit affects the MCS 'completeness'. The default iteration limit is 1000. If you want to increase the iteration limit (e.g. set to 2000) you just need to add a number into the parameters string.
> extractCommonScaffold(arr, "approx 2000")
Speaking again of documentation, I saw no mention of the 'approx' option in the documentation. I only saw that "approx" was available by looking at the option parser in the code.
It looks like I'll be spending a few days to write a blog post with the results of my analysis. :) But first I have to finish my algorithm.
> The MultiMCS code published last year in JCIM describes a few different options which you might find of interest.
> * Whether the discovered MCS must be connected (single component) or may be disconnected (multicomponent).
> * Whether or not ring bonds are allowed to match chain bonds in this MCS.
> * Whether or not rings are allowed to match partially in this MCS.
> * Whether or not the MCS found should include a specified seed substructures, i.e., a substructural fragment that is required by the user to be a part of the MCS.
> * Different atom-typing schemes: a default scheme, an element-type-based scheme, or any other user-specified scheme.
> In addition, the SMSD paper describes some properties they found useful:
> a. Specific matching of the chemical functional groups, bond types (aromatic, non-aromatic, double, single etc.) and stereochemistry, e.g. phosphate, -SH, nitrite etc. are identified and matched.
> b. The resulting solutions are sorted in ascending order of the total bond breaking energy (energy required to break the bonds between matched parts and unmatched parts) required by this MCS match (i.e. lowest energy is highest ranked).
> c. The best set of solutions are chosen based on the above two steps and the solutions are then sorted in decreasing order according to the number of fragments generated if the matched part of the molecule is removed from the reference structure (i.e. if a three member ring is matched to a single ring structure then solutions which match the rings on the periphery are preferred over the central ring).
> Lastly, the OEChem algorithm lets you flag which atom and bond properties to compare.
Thank you very much for the useful links. I will consider the above descriptions while implementing new match rules.
> Speaking again of documentation, I saw no mention of the 'approx' option in the documentation. I only saw that "approx" was available by looking at the option parser in the code.
Sorry for that. The documentation is usually updated after some beta version period (maybe after receiving a response from the community). I am going to update the docs very soon, adding the new RGroup Decomposition API and other implemented features.
> It looks like I'll be spending a few days to write a blog post with the results of my analysis. :) But first I have to finish my algorithm.
I've rerun my comparisons between the Indigo MCS and my new algorithm.
The summary version is that we almost always identify the same MCS size. The reported differences in my previous analysis were due to using the wrong Indigo method to find the largest MCS and to a bug in my software. In this new comparison, out of the 2658 cases where neither program timed out, 2622 gave the same size and 36 differed. 33 of those differences were because of the fullerene derivative CHEMBL439119, which I ignored, one was because of aromaticity perception, and I gave up trying to figure out the subtle chemistry differences in the other three. Also, Indigo is about three times faster than my code.
The protocol, as before, was to pick a structure at random from ChEMBL-13 and find the MCS between it and its k=10 neighbors with at least 0.8 Tanimoto similarity. I rejected groups with fewer than 5 structures total. I set a hard-cutoff of 30 CPU seconds on both programs, and for my algorithm did not report the best size when that time limit was reached.
I changed my Indigo code so it reports the scaffold with the largest number of atoms. Using Alexander's snippet as the base, I now iterate through all of the scaffolds to find the one with the largest number of atoms, with ties broken on the number of bonds. My previous code used the default method, which optimizes the number of rings.
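A sketch of that selection step follows. It is not the code used for the comparison. It assumes each scaffold object exposes countAtoms() and countBonds(), which I believe Indigo molecule objects do; with Indigo the input would be something like scaf.allScaffolds().iterateArray() from Alexander's snippet. The `StandIn` class exists only so the sketch runs without Indigo installed.

```python
def largest_by_atoms(scaffolds):
    """Return the scaffold with the most atoms, ties broken on the number of bonds.

    Assumption: each item has countAtoms() and countBonds() methods.
    """
    return max(scaffolds, key=lambda s: (s.countAtoms(), s.countBonds()))


class StandIn:
    """Hypothetical stand-in for a scaffold object, for demonstration only."""

    def __init__(self, name, atoms, bonds):
        self.name = name
        self.atoms = atoms
        self.bonds = bonds

    def countAtoms(self):
        return self.atoms

    def countBonds(self):
        return self.bonds


candidates = [
    StandIn("six atoms, one ring", 6, 6),
    StandIn("seven atoms, no ring", 7, 6),
    StandIn("seven atoms, one ring", 7, 7),
]
best = largest_by_atoms(candidates)
print(best.name)
```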
The difference in maximization algorithms was the reason for much of the difference in the two MCS sizes. There was also a bug in my code where I used the wrong method to test if a new common substructure was bigger than the current best size.
I also did some performance improvements to my code, which increased the performance by some 25% or so.
With those fixed, I am pleased to report that, of 3,000 test sets, there are:
  2622 cases when both algorithms report the same size
     3 cases when Indigo reports a smaller MCS (average, 2 atoms)
    33 cases when fcmcs reports a smaller MCS (average, 1.6 atoms)
   128 cases when both algorithms time out
    60 cases when Indigo times out but fcmcs does not
   154 cases when fcmcs times out but Indigo does not
  ===========
  Total: 3000 cases

  Indigo took 1700. wall-clock seconds
  fcmcs took 4718 wall-clock seconds
  ratio = 4718/1700. = 2.8
Of the 33 cases where fcmcs found a larger structure, 31 of them contained CHEMBL439119, which is a fullerene derivative. I wasn't going to stare at that many cycles. Also, one of the cases where Indigo reports a smaller MCS also includes that structure.
This leaves us with 2 cases where Indigo reports a larger MCS and 2 cases where fcmcs reports a larger MCS.
In one case, Indigo reports an MCS which is 4 atoms larger. Those structures are:
Since my own MCS code ignores atom aromaticity, the difference between an N and n shouldn't matter. The actual difference is the bond type between the N and the first C (I use a '-' here and then a ':')
This again appears to be aromaticity related (the view is strengthened by the much larger MCS if aromaticity is completely ignored), but I didn't track it down like I did the previous case.
In any case, with only 3 differences remaining, I think it's okay to say that the differences are due to chemistry perception and not fundamental differences in the MCS perception algorithms or comparison methodology.
> As Indigo's scaffold search ignores atom aromaticity, the correct SMARTS equivalent is
It is not true. The extractCommonScaffold() considers atom aromaticity (':' means aromatic bond) and returns query scaffold as a SMILES but not as a SMARTS! So you should use loading from a SMILES (loadQueryMolecule() or loadQueryMoleculeFromFile() in indigo). For example:
Unfortunately, Indigo does not have a 'smarts()' method at the moment (but it will be added someday).
By the way, in the previous letter:
> The output might be better written as 'c1:c:c:c:c:c:1'. In
> that way the output SMILES can also be used as a SMARTS.
I think we will keep the current output as it is, because there are examples where using the same string as both a SMARTS and a SMILES is unsafe. The simplest example is:
    mol1 = 'c1ccc(cc1)-c1ccccc1'
    mol2 = 'CC'.
The MCS is 'CC'. If you save it as 'cc' (no difference SMARTS or SMILES) then it does not match mol2. If you save it as 'CC' (SMARTS), then it does not match mol1. The example is correct if you save it as 'CC' (SMILES) or '[#6]-[#6]' (SMARTS).
Therefore, until a smarts() method appears in the API, an MCS scaffold is correct only as a SMILES or molfile (loaded by loadQueryMolecule())
----------------------------------------------------------------------
>> As Indigo's scaffold search ignores atom aromaticity, the correct SMARTS equivalent is
> It is not true. The extractCommonScaffold() considers atom aromaticity (':' means aromatic bond) and returns query scaffold as a SMILES but not as a SMARTS!
I should have explained that better. In the cheminformatics toolkits I'm most experienced with, atoms have an aromaticity flag, and bonds have an aromaticity flag.
It looks like Indigo's scaffold detection code says that two atoms match if the element matches, but it doesn't check to see if that atom aromaticity matches. It absolutely does check that bond aromaticities match.
In my default comparison, I check that both the atoms' element and aromaticity flag match.
I don't know if this is chemically meaningful or appropriate. It's something I'm going to ask my client.
For myself I've been using the term 'atom aromaticity' vs. 'bond aromaticity' to distinguish between the two.
Do you have a suggestion for an alternative, less confusing phrase?
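To pin down what is meant by 'atom aromaticity' here, the two atom comparison rules discussed above can be written side by side. This is a sketch for illustration only: atoms are modelled as (element, is_aromatic) pairs, and `atoms_match` is a made-up name, not part of either toolkit.

```python
def atoms_match(atom_a, atom_b, check_aromaticity):
    """Compare two atoms given as (element, is_aromatic) pairs.

    check_aromaticity=False: elements must agree (the behaviour Indigo's
    scaffold search appears to have).
    check_aromaticity=True: element and aromaticity flag must both agree
    (my default comparison).
    """
    element_a, aromatic_a = atom_a
    element_b, aromatic_b = atom_b
    if element_a != element_b:
        return False
    if check_aromaticity:
        return aromatic_a == aromatic_b
    return True


ring_carbon = ("C", True)    # a carbon in the ring of c1ccccc1O
chain_carbon = ("C", False)  # the carbon in CO
oxygen = ("O", False)        # the oxygen in either structure

print(atoms_match(ring_carbon, chain_carbon, check_aromaticity=False))
print(atoms_match(ring_carbon, chain_carbon, check_aromaticity=True))
print(atoms_match(oxygen, oxygen, check_aromaticity=True))
```

Under the element-only rule the ring carbon and its oxygen can pair with both atoms of CO, while under the stricter rule only the oxygens can pair, so the two rules give common substructures of different sizes for c1ccccc1O and CO.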
>> The output might be better written as 'c1:c:c:c:c:c:1'. In
>> that way the output SMILES can also be used as a SMARTS.
> I think, we will keep the current saving as it is. Because, there are examples where the above SMARTS and SMILES using at the same time can be unsafe. The simplest example is:
> mol1 = 'c1ccc(cc1)-c1ccccc1'
> mol2 = 'CC'.
Yes, I agree with you.
In looking at MCS outputs from some other programs, I see other cases where the output fragment SMILES cannot correctly be interpreted as a SMARTS because of the ambiguity in how to interpret the bond between two adjacent atoms.
In any case, I believe the difference between Indigo and RDKit regarding that one Pt structure is caused by differences in chemistry perception.
Thanks for the explanations. I cannot suggest any alternatives. I hadn't noticed the word 'atom' in 'atom aromaticity' because I have only encountered an atom aromaticity constraint in substructure searching with SMARTS queries. For me, scaffold searching is something different, since it is defined for simple molecules. But you are right, here we can easily add an additional matching rule for atoms. I think such a rule is chemically meaningful but it is a very strict rule.
I can share my experience of using scaffold extraction results. The usual next chemical operation is to decompose molecules into a scaffold and RGroups (RGroup Decomposition). It is very important to have the largest possible scaffold, to extract the most common kernel of the molecules for further chemical research. Therefore, a lot of chemical information (e.g. stereocenters, charges, etc.) is ignored. Moreover, I have seen a chemist use scaffold detection in practice with the electronic information discarded completely (ignoring atoms and bonds, so only the scaffold skeleton was used).
Please, let us know if you discover appropriate examples. Anyway, thanks for the issue. I think, the atom aromaticity constraint will appear as an option.
>>> As Indigo's scaffold search ignores atom aromaticity, the correct SMARTS equivalent is
>> It is not true. The extractCommonScaffold() considers atom aromaticity (':' means aromatic bond) and returns query scaffold as a SMILES but not as a SMARTS!
> I should have explained that better. In the cheminformatics toolkits I've most experienced with, atoms have an aromaticity flag, and bonds have an aromaticity flag.
> It looks like Indigo's scaffold detection code says that two atoms match if the element matches, but it doesn't check to see if that atom aromaticity matches. It absolutely does check that bond aromaticities match.
> In my default comparison, I check that both the atoms' element and aromaticity flag match.
> Here's an example of the difference:
> % cat > x.smi
> c1ccccc1O phenol
> CO carbon-monoxide
> I don't know if this is chemically meaningful or appropriate. It's something I'm going to ask my client.
> For myself I've been using the term 'atom aromaticity' vs. 'bond aromaticity' to distinguish between the two.
> Do you have a suggesting for an alternative, less confusing phrase?
>>> The output might be better written as 'c1:c:c:c:c:c:1'. In
>>> that way the output SMILES can also be used as a SMARTS.
>> I think, we will keep the current saving as it is. Because, there are examples where the above SMARTS and SMILES using at the same time can be unsafe. The simplest example is:
>> mol1 = 'c1ccc(cc1)-c1ccccc1'
>> mol2 = 'CC'.
> Yes, I agree with you.
> In looking at MCS outputs from some other programs, I see other cases where the output fragment SMILES cannot correctly be interpreted as a SMARTS because of the ambiguity in how to interpret the bond between two adjacent atoms.
> In any case, I believe the difference between Indigo and RDKit regarding that one Pt structure is caused by differences in chemistry perception.
On Apr 17, 2012, at 9:32 AM, Savelyev Alexander wrote:
> In the current Indigo version (1.1 beta 11) you can set timeout option:
> indigo.setOption("timeout", milliseconds)
> After extractCommonScaffold() is timed out, it throws an exception. But there is no way to get its current best match. The algorithm does not allow to do this, because it is based on the solution refining (decreasing a MCS) during the molecules screening. Therefore, in contrast with your algorithm, current matches are exact solutions only for a part of input molecules, but may be not correct for the other part.
I'm having some problems with the timeout setting. It doesn't always seem to work.
In one test case, I set the timeout to 5 seconds but it took 30 seconds until it raised an exception. However, if I used a 60 second timeout then it stopped at 60.02 seconds. I'm going to hazard that there's a setup stage which doesn't have a timeout, and a search stage which does have a timeout?
In the other case I've found, I have a 60 second timeout and it's been working on finding the scaffold for the last few minutes.
The structures come up from random pairings of ChEMBL data. (I do realize this is not the expected use of the scaffold code.) This one which has taken 30+ minutes comes from trying to find the scaffold between:
They are admittedly some rather large structures. My own MCS program times out after 60 seconds with a common structure containing 61 atoms and 60 bonds.
Still, would it be possible to have the timeout checking work for this case?