Hello Vincent,
We definitely need a more detailed description of the fingerprints. I
will try to explain our fingerprints here.
A fingerprint is a binary array. It is built by exhaustively
enumerating connected subgraphs. In the beginning we had only one
homogeneous fingerprint, built from subgraphs of up to 7 bonds. To
calculate this fingerprint, we compute a hash of each subgraph and
set 2 bits chosen pseudorandomly from this hash. This fingerprint now
forms the “substructure” (ordinary) part of our whole fingerprint. Its
size is specified by the “fp-ord-qwords” option, and by default it is
the longest part.
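
As a rough illustration only (this is not the actual Indigo code), the
"hash a subgraph, then set 2 bits" step could be sketched like this in
Python. The string key for a subgraph, the use of SHA-256, and the
helper names are all assumptions made for the example.

```python
import hashlib


def new_fingerprint(qwords: int) -> bytearray:
    """Allocate an all-zero fingerprint of `qwords` 64-bit words."""
    return bytearray(qwords * 8)


def set_subgraph_bits(fp: bytearray, subgraph_key: str, nbits_to_set: int = 2) -> list:
    """Set `nbits_to_set` bits in `fp`, chosen pseudorandomly from a hash of the key.

    `subgraph_key` stands in for whatever canonical description of the
    subgraph is hashed; SHA-256 is used here only for convenience.
    Returns the chosen bit positions (two positions may coincide).
    """
    total_bits = len(fp) * 8
    digest = hashlib.sha256(subgraph_key.encode("utf-8")).digest()
    positions = []
    for i in range(nbits_to_set):
        # Take a different 4-byte slice of the digest for every bit.
        chunk = digest[4 * i:4 * i + 4]
        pos = int.from_bytes(chunk, "big") % total_bits
        fp[pos // 8] |= 1 << (pos % 8)
        positions.append(pos)
    return positions


fp = new_fingerprint(2)                  # 2 qwords = 128 bits
chosen = set_subgraph_bits(fp, "C-C-N")  # hypothetical subgraph key
```

The same key always maps to the same bits, which is what makes the
fingerprint usable for screening.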
Later we decided to use our fingerprints for similarity search as
well, but our current algorithm is not very fast when the fingerprint
has a lot of bits. So we added a separate part of the fingerprint
specifically for similarity search, with fewer bits. It is built from
subgraphs of up to 5 bonds. The size of this part is specified by the
“fp-sim-qwords” option.
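
To show why the number of bits matters for similarity search, here is
a minimal sketch of a Tanimoto comparison between two fingerprints.
Tanimoto is used here only as a common example of a similarity
measure, and the function names are assumptions made for the
illustration. The work per comparison grows with the fingerprint
length.

```python
def popcount(data: bytes) -> int:
    """Number of set bits in a byte string."""
    return sum(bin(byte).count("1") for byte in data)


def tanimoto(fp_a: bytes, fp_b: bytes) -> float:
    """Tanimoto coefficient: |A and B| / |A or B| over the set bits."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    common = popcount(bytes(a & b for a, b in zip(fp_a, fp_b)))
    union = popcount(bytes(a | b for a, b in zip(fp_a, fp_b)))
    # Two empty fingerprints are treated as dissimilar by convention here.
    return common / union if union else 0.0


score = tanimoto(bytes([0b1100]), bytes([0b0110]))  # one common bit out of three
```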
One of the challenging problems was how to provide a good fingerprint
for substructure searching with queries that contain query features
such as “any bond”, “any atom”, “atom lists”, or even SMARTS
expressions. Let’s suppose that we have the following
query: [#7]~1~[#6]CC[#6]~1. If we use ordinary fingerprints,
then we can only use the part without query features to build the
fingerprint: CCCC. Such a fingerprint will be very inaccurate. Our
idea was to build not only the ordinary fingerprint but also an
additional part that helps in such cases. We enumerate all small
subgraphs and calculate the hash of the original subgraph, of the
subgraph with all bonds replaced by “any bond”, of the subgraph with
all atoms replaced by “any atom”, and of a pure skeleton subgraph with
both bonds and atoms replaced by “any bond” and “any atom”. These
subgraphs set bits in a special fingerprint part, called the “any”
part. For this example, the following subgraphs will also be
processed: [#6]~1~[#6]~[#6]~[#7]~[#6]~1, [#6]~[#7]~[#6],
*~1~*~*~*~*~1. The size of the “any” part is determined by the
“fp-any-qwords” option.
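
The four variants hashed for each small subgraph can be sketched as
follows. This is only a toy model: the Subgraph type, the key format,
and the function names are assumptions made for the illustration, and
the key is not a canonical form.

```python
from typing import List, NamedTuple, Tuple

ANY_ATOM = "*"  # SMARTS-style wildcard atom
ANY_BOND = "~"  # SMARTS-style wildcard bond


class Subgraph(NamedTuple):
    atoms: Tuple[str, ...]                   # atom labels, e.g. ("C", "N", "C")
    bonds: Tuple[Tuple[int, int, str], ...]  # (begin index, end index, bond symbol)


def any_variants(sg: Subgraph) -> List[Subgraph]:
    """Return the original, any-bond, any-atom, and skeleton variants."""
    any_bonds = tuple((i, j, ANY_BOND) for i, j, _ in sg.bonds)
    any_atoms = tuple(ANY_ATOM for _ in sg.atoms)
    return [
        sg,                              # original subgraph
        Subgraph(sg.atoms, any_bonds),   # all bonds replaced by "any bond"
        Subgraph(any_atoms, sg.bonds),   # all atoms replaced by "any atom"
        Subgraph(any_atoms, any_bonds),  # pure skeleton
    ]


def subgraph_key(sg: Subgraph) -> str:
    """Flatten a subgraph into a string that could then be hashed."""
    atoms = ",".join(sg.atoms)
    bonds = ",".join(f"{i}{symbol}{j}" for i, j, symbol in sg.bonds)
    return f"{atoms}|{bonds}"


chain = Subgraph(("C", "N", "C"), ((0, 1, "-"), (1, 2, "-")))
keys = [subgraph_key(v) for v in any_variants(chain)]
```

A query with “any bond” features can then still hit the bits set by
the any-bond and skeleton variants of the target subgraphs.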
Additionally, we have provided a tautomer substructure search, and
similar techniques are used to build its fingerprint.
We also have a 3-byte “ext” part for substructure search, where each
bit corresponds to some property: has charges, has 2 halogens, has
isotopes, etc. This part comes first in our full fingerprint. It is
not built if the “fp-ext-enabled” option is false.
So our fingerprint consists of 5 parts: the small “ext” part, the
ordinary part, the similarity part, the “any” part, and the tautomer
part.
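
A small sketch of how the byte layout of the full fingerprint could be
computed from the part sizes. Only the facts that the “ext” part is 3
bytes and comes first are taken from the description above; the order
of the remaining parts, the tau_qwords parameter name, and the sizes
in the usage line are assumptions chosen for illustration, not
documented defaults.

```python
EXT_BYTES = 3    # the "ext" part is 3 bytes
QWORD_BYTES = 8  # one qword = 8 bytes


def fingerprint_layout(ord_qwords, sim_qwords, any_qwords, tau_qwords, ext_enabled=True):
    """Return ({part: (byte offset, byte size)}, total size in bytes).

    Assumes the parts are stored in the order ext, ord, sim, any, tau.
    """
    sizes = [
        ("ext", EXT_BYTES if ext_enabled else 0),
        ("ord", ord_qwords * QWORD_BYTES),
        ("sim", sim_qwords * QWORD_BYTES),
        ("any", any_qwords * QWORD_BYTES),
        ("tau", tau_qwords * QWORD_BYTES),
    ]
    layout = {}
    offset = 0
    for name, size in sizes:
        layout[name] = (offset, size)
        offset += size
    return layout, offset


# Illustrative sizes only.
layout, total_bytes = fingerprint_layout(25, 8, 15, 10)
```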
Now, let’s look at the fingerprint method of an Indigo molecule. It
takes a parameter that specifies the type:
“sim” – builds the similarity part
“full” – builds all parts
“sub” – builds the “ext”, “sub”, and “any” parts
“sub-res” – builds the “ext” and “any” parts
“sub-tau” – builds the “ext”, “any”, and “tau” parts
The idea is to use this function to build fingerprints for
substructure screening for a specific type of search. For example, if
you are building fingerprints for fast screening before the tautomer
substructure matching test, then you cannot use the ordinary
fingerprint, because bond orders can change during tautomerization.
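
The mapping from the type parameter to the parts that get built can be
written down as a small lookup table. The type names come from the
list above; the short part labels ("ord" for the ordinary part, etc.)
and the helper name are assumptions made for this sketch.

```python
# Parts built for each fingerprint type, as listed in this message.
PARTS_BY_TYPE = {
    "sim": ("sim",),
    "full": ("ext", "ord", "sim", "any", "tau"),
    "sub": ("ext", "ord", "any"),
    "sub-res": ("ext", "any"),
    "sub-tau": ("ext", "any", "tau"),
}


def parts_for(fp_type: str) -> tuple:
    """Return the fingerprint parts built for the given type string."""
    try:
        return PARTS_BY_TYPE[fp_type]
    except KeyError:
        raise ValueError(f"unknown fingerprint type: {fp_type!r}") from None


# The tautomer screening fingerprint leaves out the ordinary part,
# because bond orders can change during tautomerization.
tautomer_parts = parts_for("sub-tau")
```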
All these types of fingerprints are used for screening in Bingo, our
cartridge with different search methods: similarity, substructure,
tautomer substructure, and others.
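
For reference, the usual substructure screening test on fingerprints
is a bit-subset check: a target can only match if every bit set in the
query fingerprint is also set in the target fingerprint. The sketch
below shows that generic check; it is not taken from the Bingo
sources, and the function name is an assumption.

```python
def passes_screen(query_fp: bytes, target_fp: bytes) -> bool:
    """True if every bit set in the query is also set in the target.

    Targets that fail can be rejected without running the full (and
    much slower) substructure matching; targets that pass still need it.
    """
    if len(query_fp) != len(target_fp):
        raise ValueError("fingerprints must have the same length")
    return all((q & t) == q for q, t in zip(query_fp, target_fp))


candidate = passes_screen(bytes([0b0101]), bytes([0b1101]))  # query bits are a subset
rejected = passes_screen(bytes([0b0101]), bytes([0b1001]))   # query bit 2 is missing
```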
PS: Instead of enumerating all small subgraphs, we enumerate small
subtrees and cycles.
PPS: We do not always set exactly 2 bits for each subgraph; in some
cases we set 3, or even 5 bits. This happens if the small subgraph is
very symmetric.
With best regards,
Mikhail
On Nov 11, 12:58 pm, Vincent Le Guilloux <
vince.leguill...@gmail.com>
wrote: