I have done a much more extensive set of benchmark comparisons. My conclusion is that for similar structures (0.95 similar or better), the 'approx' method is 50-200 times faster than the exact solution, it's smaller by about 0.5 atoms/0.5 bonds, and it never, ever times out.
For similarities around 0.8 or 100 nearest neighbors, the approx method is only about 10x faster, and is too small by about 1-1.5 atoms and 1-1.5 bonds.
For randomly chosen pairs of structures, the 'approx' method is slower and much less accurate than 'exact', except that it never times out.
To make the comparison, I have developed a new "MCS Benchmark" format. It's a simple line-oriented format which looks like this:
#MCS-Benchmark/1
#File chembl13_knearest_2.smi
# Test cases found using 2-nearest Tanimoto search. Seed=
2486356142
# query=CHEMBL526291 num hits=2 minimum score=0.97
1 CHEMBL526291 CHEMBL498211
# query=CHEMBL79165 num hits=2 minimum score=0.87
2 CHEMBL79165 CHEMBL282433
# query=CHEMBL283052 num hits=2 minimum score=0.99
3 CHEMBL283052 CHEMBL26288
# query=CHEMBL357551 num hits=2 minimum score=1.00
In this case it says that test "1" finds the MCS between CHEMBL526291 and CHEMBL498211, test "2" between CHEMBL79165 and CHEMBL282433, and so on. The file "chembl13_knearest_2.smi" contains the structure data.
Format details are in the README in the "benchmark" section of the fmcs source code at
https://bitbucket.org/dalke/fmcs/src
https://bitbucket.org/dalke/fmcs/src/b8b5bcfeae44/benchmark/README
I've written a program to generate benchmark files, in one of two different modes.
There's a "random" option, which selects "k" elements from a structure file (without replacement) and makes a test to find the MCS of those k elements.
There's a "neighbors" option, which selects similar neighbors to a randomly selected fingerprint. For example, the 10 nearest neighbors, or up to 100 neighbors with at least 0.95 similarity.
Details are also in that README.
I then made a test set containing the ChEMBL-13 data set, desalted, unique-ified, and normalized using RDKit. There are about 1,000,000 structures in this set.
From this I generated several different benchmarks, and I ran it against my fmcs algorithm as well as Indigo's "exact" and "approx" methods, using a timeout of 30 seconds.
Detailed results, including the actual structures used (as SMILES), benchmark drivers, etc. is at
https://bitbucket.org/dalke/fmcs/src/b8b5bcfeae44/benchmark
in files ending *.mcs_out .
Here I give the summary for the two Indigo algorithms, which I thought would interest people here.
a) chembl13_pairs: Use 10,000 randomly selected pairs
approx:
# Total 10000 in 235.36 seconds (42.5/second)
# 67850 atoms 64706 bonds; average 6.8 atoms 6.5 bonds
(Note: for the 'approx' method, Complete and Total are identical
because it always finished well within the timeout.)
exact:
# Total 10000 in 168.11 seconds (59.5/second)
# 81984 atoms 78185 bonds; average 8.2 atoms 7.8 bonds
# Complete 9998 in 108.10 seconds (92.5/second)
# 81984 atoms 78185 bonds; average 8.2 atoms 7.8 bonds
# Fail 2 in 60.01 seconds (average 30.0 sec)
This implies that the 'approx' method is not good for randomly
selected pairs. The 'exact' method is faster and more accurate.
b) chembl13_threshold_95: select 1,000 fingerprints at random. For
each, generate a test using the top 100 structures with at least 0.95
similarity to the randomly selected fingerprint.
approx:
# Total 1000 in 18.72 seconds (53.4/second)
# 24721 atoms 26721 bonds; average 24.7 atoms 26.7 bonds
exact:
# Total 1000 in 2739.69 seconds (0.4/second)
# 23473 atoms 25414 bonds; average 23.5 atoms 25.4 bonds
# Complete 934 in 759.58 seconds (1.2/second)
# 23473 atoms 25414 bonds; average 25.1 atoms 27.2 bonds
# Fail 66 in 1980.11 seconds (average 30.0 sec)
Here you see the 'approx' method doing a lot better. The
average sizes are only a bit smaller (0.4 atoms and 0.5 bonds)
than the exact search ... and with a lot better performance.
The "failures" are where the exact search times out. These
are likely large structures, which would raise the overall
atom and bond counts, but I don't know by how much.
c) chembl13_threshold_90: select 1,000 fingerprints at random. For
each, generate a test using the top 100 structures with at least 0.90
similarity to the randomly selected fingerprint.
approx:
# Total 1000 in 27.03 seconds (37.0/second)
# 21314 atoms 22839 bonds; average 21.3 atoms 22.8 bonds
exact:
# Total 1000 in 1703.93 seconds (0.6/second)
# 21403 atoms 22934 bonds; average 21.4 atoms 22.9 bonds
# Complete 962 in 563.83 seconds (1.7/second)
# 21403 atoms 22934 bonds; average 22.2 atoms 23.8 bonds
# Fail 38 in 1140.10 seconds (average 30.0 sec)
This is a more diverse data set. The approx method takes a
performance hit, and averages about 1 atom and 1 bond smaller
than the exact search, for those cases where an exact search
completed.
d) chembl13_threshold_80: select 1,000 fingerprints at random. For
each, generate a test using the top 100 structures with at least 0.80
similarity to the randomly selected fingerprint.
approx:
# Total 1000 in 52.49 seconds (19.1/second)
# 15673 atoms 16437 bonds; average 15.7 atoms 16.4 bonds
exact:
# Total 1000 in 1180.78 seconds (0.8/second)
# 16751 atoms 17594 bonds; average 16.8 atoms 17.6 bonds
# Complete 977 in 490.71 seconds (2.0/second)
# 16751 atoms 17594 bonds; average 17.1 atoms 18.0 bonds
# Fail 23 in 690.07 seconds (average 30.0 sec)
Again, there's increasing diversity here, which makes the approx
search slower and less accurate. Now it's 1.4 atoms smaller
and 1.6 bonds smaller than the 'exact' searches that completed.
e) chembl13_knearest_2: select 1,000 fingerprints at random. For
each, generate a test using the k=2 nearest neighbors.
approx:
# Total 1000 in 8.23 seconds (121.5/second)
# 25137 atoms 27050 bonds; average 25.1 atoms 27.1 bonds
exact:
# Total 1000 in 1543.73 seconds (0.6/second)
# 24573 atoms 26466 bonds; average 24.6 atoms 26.5 bonds
# Complete 961 in 373.65 seconds (2.6/second)
# 24573 atoms 26466 bonds; average 25.6 atoms 27.5 bonds
# Fail 39 in 1170.08 seconds (average 30.0 sec)
High similarity structures is where the approx search shines.
It's *200* times faster than the exact code and it's back to
being off by only 0.5 atoms and 0.5 bonds.
f) chembl13_knearest_10: select 1,000 fingerprints at random. For
each, generate a test using the k=10 nearest neighbors.
approx:
# Total 1000 in 20.85 seconds (48.0/second)
# 18674 atoms 19855 bonds; average 18.7 atoms 19.9 bonds
exact:
# Total 1000 in 1877.00 seconds (0.5/second)
# 18856 atoms 20031 bonds; average 18.9 atoms 20.0 bonds
# Complete 954 in 496.89 seconds (1.9/second)
# 18856 atoms 20031 bonds; average 19.8 atoms 21.0 bonds
# Fail 46 in 1380.11 seconds (average 30.0 sec)
Again the same trend - more diverse structures makes the
approx method slower and less accurate.
e) chembl13_knearest_100: select 1,000 fingerprints at random. For
each, generate a test using the k=100 nearest neighbors.
approx:
# Total 1000 in 79.40 seconds (12.6/second)
# 9127 atoms 9044 bonds; average 9.1 atoms 9.0 bonds
exact:
# Total 1000 in 852.13 seconds (1.2/second)
# 10518 atoms 10464 bonds; average 10.5 atoms 10.5 bonds
# Complete 985 in 402.11 seconds (2.4/second)
# 10518 atoms 10464 bonds; average 10.7 atoms 10.6 bonds
# Fail 15 in 450.02 seconds (average 30.0 sec)
Again the same trend - more diverse structures makes the
approx method slower and less accurate, though it's still 10x
faster than the exact method.
Best regards,
Andrew
da...@dalkescientific.com