comparison of exact vs. approx extractCommonScaffold

20 views
Skip to first unread message

Andrew Dalke

unread,
May 5, 2012, 11:06:36 AM5/5/12
to indigo-...@googlegroups.com
On Apr 17, 2012, at 9:32 AM, Savelyev Alexander wrote:
> Indigo approximate algorithm should be much faster then the exact algorithm. You can regulate iteration limit for the approximate searching. The limit affects the MCS 'completeness'. The default iteration limit is 1000. If you want to increase the iteration limit (e.g. set to 2000) you just need to add a number into the parameters string.

I've started doing some benchmark comparisons. My test case is to pick two structures at random from a desalted subset of ChEMBL and find the MCS between them using extractCommonScaffold. (My next test case will be to pick structures based on similarity, which is more like the expected use case for extractCommonScaffold.)

Using 10,000 such pairwise comparisons, I found:


exact took 161 seconds (62.0 /second) to find the scaffold for 9997 pairs.
That's with a timeout of 15 seconds. Without those three cases, it took
116 seconds (85.9 /second). Overall it matched 82877 atoms and 78936 bonds,
for an average of average 8.3 atoms and 7.9 bonds per largest scaffold.

approx 1000 (the default number of iterations) took 233 seconds (43.0 /second)
and in total found 68455 atoms and 65095 bonds; average: 6.8 atoms 6.5 bonds.
There were no timeouts.

I find this interesting because it means that for this test set it's actually
better - faster and more accurate - to use the exact setting, with a timeout,
and for those rare timeout cases, to use the approx solution.

For comparison:

approx 1 took 19 seconds (527 /second) and matched 42549 atoms and
33730 bonds; average 4.3 atoms 3.4 bonds

approx 500 took 128 seconds (78.4 /second) and matched 66449 atoms and
62882 bonds; average 6.6 atoms 6.3 bonds

approx 1000 took 233 seconds (43.0 /second) and matched 68455 atoms and
65095 bonds; average: 6.8 atoms 6.5 bonds. (Copied from above)

approx 2000 took 445 seconds (22.5 /second) and matched 70019 atoms and
66721 bonds; average of 7.0 atoms and 6.7 bonds

approx 5000 took 1070 seconds (9.3 /second) and matched 71063 atoms and
67801 bonds; average 7.1 atoms 6.8 bonds

approx 10000 took 2099 seconds (4.8 /second) and matched 71758 atoms and
68519 bonds; average of 7.2 atoms and 6.9 bonds

approx 50000 took 10950 seconds (0.9 /second) and matched 72456 atoms and
69251 bonds; average 7.2 atoms 6.9 bonds



Andrew
da...@dalkescientific.com


Savelyev Alexander

unread,
May 6, 2012, 10:10:13 AM5/6/12
to indigo-...@googlegroups.com, indigo-...@googlegroups.com
Hi Andrew,


> I find this interesting because it means that for this test set it's
> actually better - faster and more accurate - to use the exact setting,
> with a timeout, and for those rare timeout cases, to use the approx
> solution.
Yes, you are right. An internal MCS usage (mainly for AAM) has quite
the same strategy: launching the approximate algorithm only after the
exact algorithm exceeds an iteration limit. Moreover, the 'approx' was
basically created for this purpose. The approximate method (in contrast
to the exact method) works polynomial time for all given inputs. It was
tested on molecules contained ~100-300 atoms, which, unfortunately, the
exact algorithm may calculate "forever". Anyway, thank you very much for
the approximate test comparison.

With best regards,
Alexander

Andrew Dalke

unread,
May 8, 2012, 10:47:55 PM5/8/12
to indigo-...@googlegroups.com
I have done a much more extensive set of benchmark comparisons. My conclusion is that for similar structures (0.95 similar or better), the 'approx' method is 50-200 times faster than the exact solution, it's smaller by about 0.5 atoms/0.5 bonds, and it never, ever times out.

For similarities around 0.8 or 100 nearest neighbors, the approx method is only about 10x faster, and is too small by about 1-1.5 atoms and 1-1.5 bonds.

For randomly chosen pairs of structures, the 'approx' method is slower and much less accurate than 'exact', except that it never times out.



To make the comparison, I have developed a new "MCS Benchmark" format. It's a simple line-oriented format which looks like this:

#MCS-Benchmark/1
#File chembl13_knearest_2.smi
# Test cases found using 2-nearest Tanimoto search. Seed=2486356142
# query=CHEMBL526291 num hits=2 minimum score=0.97
1 CHEMBL526291 CHEMBL498211
# query=CHEMBL79165 num hits=2 minimum score=0.87
2 CHEMBL79165 CHEMBL282433
# query=CHEMBL283052 num hits=2 minimum score=0.99
3 CHEMBL283052 CHEMBL26288
# query=CHEMBL357551 num hits=2 minimum score=1.00

In this case it says that test "1" finds the MCS between CHEMBL526291 and CHEMBL498211, test "2" between CHEMBL79165 and CHEMBL282433, and so on. The file "chembl13_knearest_2.smi" contains the structure data.


Format details are in the README in the "benchmark" section of the fmcs source code at
https://bitbucket.org/dalke/fmcs/src
https://bitbucket.org/dalke/fmcs/src/b8b5bcfeae44/benchmark/README


I've written a program to generate benchmark files, in one of two different modes.

There's a "random" option, which selects "k" elements from a structure file (without replacement) and makes a test to find the MCS of those k elements.

There's a "neighbors" option, which selects similar neighbors to a randomly selected fingerprint. For example, the 10 nearest neighbors, or up to 100 neighbors with at least 0.95 similarity.

Details are also in that README.


I then made a test set containing the ChEMBL-13 data set, desalted, unique-ified, and normalized using RDKit. There are about 1,000,000 structures in this set.

From this I generated several different benchmarks, and I ran it against my fmcs algorithm as well as Indigo's "exact" and "approx" methods, using a timeout of 30 seconds.

Detailed results, including the actual structures used (as SMILES), benchmark drivers, etc. is at

https://bitbucket.org/dalke/fmcs/src/b8b5bcfeae44/benchmark

in files ending *.mcs_out .


Here I give the summary for the two Indigo algorithms, which I thought would interest people here.

a) chembl13_pairs: Use 10,000 randomly selected pairs

approx:
# Total 10000 in 235.36 seconds (42.5/second)
# 67850 atoms 64706 bonds; average 6.8 atoms 6.5 bonds
(Note: for the 'approx' method, Complete and Total are identical
because it always finished well within the timeout.)

exact:
# Total 10000 in 168.11 seconds (59.5/second)
# 81984 atoms 78185 bonds; average 8.2 atoms 7.8 bonds
# Complete 9998 in 108.10 seconds (92.5/second)
# 81984 atoms 78185 bonds; average 8.2 atoms 7.8 bonds
# Fail 2 in 60.01 seconds (average 30.0 sec)

This implies that the 'approx' method is not good for randomly
selected pairs. The 'exact' method is faster and more accurate.


b) chembl13_threshold_95: select 1,000 fingerprints at random. For
each, generate a test using the top 100 structures with at least 0.95
similarity to the randomly selected fingerprint.

approx:
# Total 1000 in 18.72 seconds (53.4/second)
# 24721 atoms 26721 bonds; average 24.7 atoms 26.7 bonds

exact:
# Total 1000 in 2739.69 seconds (0.4/second)
# 23473 atoms 25414 bonds; average 23.5 atoms 25.4 bonds
# Complete 934 in 759.58 seconds (1.2/second)
# 23473 atoms 25414 bonds; average 25.1 atoms 27.2 bonds
# Fail 66 in 1980.11 seconds (average 30.0 sec)

Here you see the 'approx' method doing a lot better. The
average sizes are only a bit smaller (0.4 atoms and 0.5 bonds)
than the exact search ... and with a lot better performance.

The "failures" are where the exact search times out. These
are likely large structures, which would raise the overall
atom and bond counts, but I don't know by how much.

c) chembl13_threshold_90: select 1,000 fingerprints at random. For
each, generate a test using the top 100 structures with at least 0.90
similarity to the randomly selected fingerprint.

approx:
# Total 1000 in 27.03 seconds (37.0/second)
# 21314 atoms 22839 bonds; average 21.3 atoms 22.8 bonds

exact:
# Total 1000 in 1703.93 seconds (0.6/second)
# 21403 atoms 22934 bonds; average 21.4 atoms 22.9 bonds
# Complete 962 in 563.83 seconds (1.7/second)
# 21403 atoms 22934 bonds; average 22.2 atoms 23.8 bonds
# Fail 38 in 1140.10 seconds (average 30.0 sec)

This is a more diverse data set. The approx method takes a
performance hit, and averages about 1 atom and 1 bond smaller
than the exact search, for those cases where an exact search
completed.


d) chembl13_threshold_80: select 1,000 fingerprints at random. For
each, generate a test using the top 100 structures with at least 0.80
similarity to the randomly selected fingerprint.

approx:
# Total 1000 in 52.49 seconds (19.1/second)
# 15673 atoms 16437 bonds; average 15.7 atoms 16.4 bonds

exact:
# Total 1000 in 1180.78 seconds (0.8/second)
# 16751 atoms 17594 bonds; average 16.8 atoms 17.6 bonds
# Complete 977 in 490.71 seconds (2.0/second)
# 16751 atoms 17594 bonds; average 17.1 atoms 18.0 bonds
# Fail 23 in 690.07 seconds (average 30.0 sec)

Again, there's increasing diversity here, which makes the approx
search slower and less accurate. Now it's 1.4 atoms smaller
and 1.6 bonds smaller than the 'exact' searches that completed.


e) chembl13_knearest_2: select 1,000 fingerprints at random. For
each, generate a test using the k=2 nearest neighbors.

approx:
# Total 1000 in 8.23 seconds (121.5/second)
# 25137 atoms 27050 bonds; average 25.1 atoms 27.1 bonds

exact:
# Total 1000 in 1543.73 seconds (0.6/second)
# 24573 atoms 26466 bonds; average 24.6 atoms 26.5 bonds
# Complete 961 in 373.65 seconds (2.6/second)
# 24573 atoms 26466 bonds; average 25.6 atoms 27.5 bonds
# Fail 39 in 1170.08 seconds (average 30.0 sec)

High similarity structures is where the approx search shines.
It's *200* times faster than the exact code and it's back to
being off by only 0.5 atoms and 0.5 bonds.


f) chembl13_knearest_10: select 1,000 fingerprints at random. For
each, generate a test using the k=10 nearest neighbors.

approx:
# Total 1000 in 20.85 seconds (48.0/second)
# 18674 atoms 19855 bonds; average 18.7 atoms 19.9 bonds

exact:
# Total 1000 in 1877.00 seconds (0.5/second)
# 18856 atoms 20031 bonds; average 18.9 atoms 20.0 bonds
# Complete 954 in 496.89 seconds (1.9/second)
# 18856 atoms 20031 bonds; average 19.8 atoms 21.0 bonds
# Fail 46 in 1380.11 seconds (average 30.0 sec)

Again the same trend - more diverse structures makes the
approx method slower and less accurate.

e) chembl13_knearest_100: select 1,000 fingerprints at random. For
each, generate a test using the k=100 nearest neighbors.

approx:
# Total 1000 in 79.40 seconds (12.6/second)
# 9127 atoms 9044 bonds; average 9.1 atoms 9.0 bonds

exact:
# Total 1000 in 852.13 seconds (1.2/second)
# 10518 atoms 10464 bonds; average 10.5 atoms 10.5 bonds
# Complete 985 in 402.11 seconds (2.4/second)
# 10518 atoms 10464 bonds; average 10.7 atoms 10.6 bonds
# Fail 15 in 450.02 seconds (average 30.0 sec)

Again the same trend - more diverse structures makes the
approx method slower and less accurate, though it's still 10x
faster than the exact method.


Best regards,

Andrew
da...@dalkescientific.com


Reply all
Reply to author
Forward
0 new messages