Fingerprint sub-parts

Vincent Le Guilloux

Nov 11, 2011, 3:58:51 AM
to indigo-general
Hello there (again),

Regarding fingerprints, what is the meaning of the "Ordinary" and "Any"
parts of the fingerprints in Indigo? Searching the documentation and
the discussion group, I couldn't find any answer to that (sorry if I
missed it somehow...).

For example, which part of the fingerprint does the "sub" fingerprint
type use? And what happens if I decrease the size of any part of the
fingerprint? Would this lead to the removal of some bytes, or is
there a hashing technique behind the scenes?

Thanks in advance :)
Vincent.

Mikhail Rybalkin

Nov 11, 2011, 7:02:43 AM
to indigo-general
Hello Vincent,

We definitely need a more detailed description of the fingerprints. I
will try to explain our fingerprints here.

A fingerprint is a binary array. It is built by exhaustively
enumerating connected subgraphs. At the beginning we had only one
homogeneous fingerprint, built from subgraphs of up to 7 bonds. To
calculate this fingerprint, we compute a hash of each subgraph and set
2 bits chosen pseudorandomly based on this hash. This fingerprint now
forms the “substructure” (ordinary) part of our whole fingerprint. Its
size is specified by the “fp-ord-qwords” option, and by default it is
the longest part.
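
A minimal sketch of the "hash each subgraph, set 2 bits" idea described above. The hash function (SHA-256), the string keys standing in for subgraphs, and the name set_bits_for_subgraph are all assumptions made for illustration; this is not Indigo's actual hashing code.

```python
import hashlib

def set_bits_for_subgraph(fp, subgraph_key, bits_per_subgraph=2):
    """Set bits in the bytearray `fp` at positions derived from a hash of the key.

    Returns the list of bit positions that were set (they may coincide).
    """
    nbits = len(fp) * 8
    digest = hashlib.sha256(subgraph_key.encode("utf-8")).digest()
    positions = []
    for i in range(bits_per_subgraph):
        # Take 4 bytes of the digest per bit and fold them into the bit range.
        pos = int.from_bytes(digest[4 * i:4 * i + 4], "big") % nbits
        fp[pos // 8] |= 1 << (pos % 8)
        positions.append(pos)
    return positions

ORD_QWORDS = 25                 # default size of the ordinary part, per this thread
fp = bytearray(ORD_QWORDS * 8)  # a qword is 8 bytes
for key in ["C-C", "C-C-N", "C-N"]:  # hypothetical canonical subgraph keys
    set_bits_for_subgraph(fp, key)
```

Because the positions depend only on the hash, the same subgraph always sets the same bits, and shrinking the array folds the positions into a smaller range rather than dropping bytes.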

After that we decided to use our fingerprints for similarity search as
well, but our current algorithm is not very fast when the fingerprint
has a lot of bits. So we decided to add another part of the
fingerprint, specifically for similarity search, with a lower number
of bits. It is built from subgraphs of up to 5 bonds. The size of this
part is specified by the “fp-sim-qwords” option.

One of the challenging problems was how to provide a good fingerprint
for substructure searching with queries that contain query features,
such as “any bond”, “any atom”, “atom lists”, or even SMARTS
expressions. Let’s suppose that we have the following query:
[#7]~1~[#6]CC[#6]~1. If we use only the ordinary fingerprint, then
only the part without query features can be used to build the
fingerprint: CCCC. Such a fingerprint will be very inaccurate. Our
idea was to build not only the ordinary fingerprint, but also an
additional part that helps in such cases. We enumerate all small
subgraphs and calculate a hash of the original subgraph, of the
subgraph with all bonds replaced by “any bond”, of the subgraph with
all atoms replaced by “any atom”, and of a pure skeleton subgraph with
both bonds and atoms replaced by “any bond” and “any atom”. These
subgraphs set bits in a special fingerprint part, called the “any”
part. For this example, the following subgraphs will also be
processed: [#6]~1~[#6]~[#6]~[#7]~[#6]~1, [#6]~[#7]~[#6],
*~1~*~*~*~*~1. The size of the “any” part is determined by the
“fp-any-qwords” option.
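
As a rough illustration of the four variants hashed for the “any” part, here is a toy sketch that works on a linear chain only. The list-based representation and the function names render and generalized_variants are hypothetical; Indigo works on real molecular graphs, including rings.

```python
def render(atoms, bonds):
    """Join a linear chain of atoms and bonds into a SMARTS-like string."""
    out = [atoms[0]]
    for bond, atom in zip(bonds, atoms[1:]):
        out.append(bond)
        out.append(atom)
    return "".join(out)

def generalized_variants(atoms, bonds):
    """Return the original chain plus its three generalized forms."""
    any_atoms = ["*"] * len(atoms)   # every atom replaced by "any atom"
    any_bonds = ["~"] * len(bonds)   # every bond replaced by "any bond"
    return [
        render(atoms, bonds),          # original subgraph
        render(atoms, any_bonds),      # bonds generalized
        render(any_atoms, bonds),      # atoms generalized
        render(any_atoms, any_bonds),  # pure skeleton
    ]

variants = generalized_variants(["[#6]", "[#7]", "[#6]"], ["-", "-"])
for v in variants:
    print(v)
```

The second variant printed is [#6]~[#7]~[#6], which is one of the subgraphs listed in the example above.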

Additionally, we have provided a tautomer substructure search, and
similar techniques are used to build its fingerprint.

We also have a 3-byte “ext” part for substructure search, where each
bit corresponds to some property: has charges, has 2 halogens, has
isotopes, etc. This part comes first in our full fingerprint. It is
not built if the “fp-ext-enabled” option is false.

So, our fingerprint consists of 5 parts: the small “ext” part, the
ordinary part, the similarity part, the “any” part, and the tautomer
part.

Now, let’s look at the fingerprint method of an Indigo molecule. It
has a parameter describing the type:
“sim” – builds the similarity part
“full” – builds all parts
“sub” – builds the “ext”, ordinary (“sub”), and “any” parts
“sub-res” – builds the “ext” and “any” parts
“sub-tau” – builds the “ext”, “any”, and “tau” parts
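
The list above can be written down as a small lookup table. This is only a restatement of the list in Python; the names FINGERPRINT_PARTS and parts_for are made up for illustration and are not part of the Indigo API.

```python
# Parts built for each fingerprint type, as listed in this message.
# "ord" stands for the ordinary ("sub") part.
FINGERPRINT_PARTS = {
    "sim": ["sim"],
    "full": ["ext", "ord", "sim", "any", "tau"],
    "sub": ["ext", "ord", "any"],
    "sub-res": ["ext", "any"],
    "sub-tau": ["ext", "any", "tau"],
}

def parts_for(fp_type):
    """Return the list of parts built for a given fingerprint type."""
    try:
        return FINGERPRINT_PARTS[fp_type]
    except KeyError:
        raise ValueError("unknown fingerprint type: %r" % fp_type)

print(parts_for("sub-tau"))
```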

The idea is to use this function to build fingerprints for
substructure screening for a specific type of search. If you are
building fingerprints for the fast pre-test that runs before the
tautomer substructure matching test, then you cannot build the
ordinary fingerprint, because bond orders can change during
tautomerization.
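
For readers unfamiliar with fingerprint screening, the usual test is that every bit set in the query fingerprint must also be set in the target fingerprint. The sketch below shows that generic check on raw bytes; the function name may_contain is hypothetical, and this is the textbook technique rather than Bingo's actual code.

```python
def may_contain(query_fp, target_fp):
    """Return True if every bit set in query_fp is also set in target_fp.

    A False result rules the target out. A True result only means the
    full substructure match still has to be run.
    """
    if len(query_fp) != len(target_fp):
        raise ValueError("fingerprints must have the same length")
    return all((q & t) == q for q, t in zip(query_fp, target_fp))

print(may_contain(b"\x05", b"\x07"))  # query bits 0 and 2 are both in the target
print(may_contain(b"\x05", b"\x03"))  # query bit 2 is missing from the target
```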

All these types of fingerprints are used for screening in Bingo, our
cartridge that provides several search methods: similarity,
substructure, tautomer substructure, and others.

PS: Instead of enumerating all small subgraphs, we enumerate small
subtrees and cycles.

PPS: We do not always set exactly 2 bits for each subgraph; in some
cases we set 3, or even 5 bits. This happens if the small subgraph is
very symmetric.

With best regards,
Mikhail




Vincent Le Guilloux

Nov 11, 2011, 8:09:07 AM
to indigo-general
Mikhail,

Thank you very much for this fast and accurate answer! It's perfectly
clear now...

I should definitely take a look at Bingo and see if I could eventually
integrate it in my application (by the way, is there no MySQL version
of it?).

Thanks again for all :)

Vincent.

Mikhail Rybalkin

Nov 11, 2011, 8:37:40 AM
to indigo-general
Hello Vincent,

We had a conversation with a MySQL developer and found out that it is
not possible to add a suitable extension to the database, as we did
with Oracle and MS SQL Server. PostgreSQL provides many more
opportunities to extend its functionality with external indexing
methods, so we have developed Bingo for PostgreSQL and published the
first beta release about a month ago. There are some issues that have
to be fixed soon. So, the short answer is that Bingo for MySQL is not
planned.

And feel free to ask any questions about integration. You can explain
what kind of application you have, and we can suggest the best way (in
terms of performance or simplicity) to use Indigo and Bingo with it.

With best regards,
Mikhail

Vincent Le Guilloux

Nov 11, 2011, 8:52:13 AM
to indigo-general
As for MySQL, that is not a real problem in my case, as I was actually
planning to integrate PostgreSQL as an alternative engine.

It is not a problem either if Bingo is still in beta. I currently
don't have the time to switch to PostgreSQL, but this is something I
have been thinking about for several months now, and I think I will
get into it next year.

But anyway, the release of Bingo for PostgreSQL is great news. I'm
looking forward to testing it :)

Vincent.

Ernst-Georg Schmid

Nov 24, 2011, 9:35:55 AM
to indigo-general
Hello,

> So, our fingerprint consists of 5 parts: small ext part, ordinary
> part, similarity part, “any” part, and tautomer part.

so, judging from this explanation and the size descriptions in the
options documentation, if I want to use only the similarity part from
the full fingerprint I'd have to look at the bytes from byte 204 to
byte 268. Is that correct?

best regards,

Ernst-Georg

Mikhail Rybalkin

Nov 24, 2011, 10:38:57 AM
to indigo-general
By default, to get the similarity part you need to take bytes 203
through 266, including 266 (if numbering starts at 0). 203 = 3 + 8 *
25: 3 bytes for the “ext” part and 25 qwords for the ordinary part.
The similarity part takes 8 qwords by default, and a qword is 8 bytes
long.
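
The arithmetic above can be checked with a few lines of Python. This sketch only covers the first three parts, in the order given in this thread (ext, ordinary, similarity); the function name part_ranges and the placeholder buffer are hypothetical, and the default sizes are the ones quoted in this message.

```python
QWORD_BYTES = 8
EXT_BYTES = 3  # size of the "ext" part

def part_ranges(ord_qwords=25, sim_qwords=8):
    """Return (start, end) byte offsets, 0-based with end exclusive."""
    ord_start = EXT_BYTES
    sim_start = ord_start + ord_qwords * QWORD_BYTES
    sim_end = sim_start + sim_qwords * QWORD_BYTES
    return {
        "ext": (0, EXT_BYTES),
        "ord": (ord_start, sim_start),
        "sim": (sim_start, sim_end),
    }

ranges = part_ranges()
sim_start, sim_end = ranges["sim"]
print(sim_start, sim_end - 1)  # first and last byte of the similarity part

# Slicing a placeholder buffer of zeros; a real buffer would come from Indigo.
full_fp = bytes(sim_end)
sim_part = full_fp[sim_start:sim_end]
```

With the defaults this gives 203 and 266 as the first and last byte, and a 64-byte slice.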

Vincent Le Guilloux

Nov 28, 2011, 10:16:00 AM
to indigo-general
Or, instead of computing the whole fingerprint and taking only the
part that one is interested in, why not just set the Indigo options
accordingly?

indigo.setOption("fp-ord-qwords", 0);
indigo.setOption("fp-any-qwords", 0);
indigo.setOption("fp-tau-qwords", 0);

This way, in my experience, one can directly retrieve the similarity
part (with the additional “ext” part), which is 512 bits by default if
I'm correct.
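
A quick sanity check of these sizes, assuming the total length is simply the 3-byte “ext” part plus 8 bytes per configured qword with no padding (that layout is an assumption based on the numbers given earlier in this thread). The function name fingerprint_bytes is hypothetical.

```python
QWORD_BYTES = 8
EXT_BYTES = 3  # the "ext" part, as described earlier in this thread

def fingerprint_bytes(ord_qwords, sim_qwords, any_qwords, tau_qwords):
    """Total fingerprint length in bytes, assuming no padding between parts."""
    qwords = ord_qwords + sim_qwords + any_qwords + tau_qwords
    return EXT_BYTES + QWORD_BYTES * qwords

# Options from the snippet above: everything except the similarity part set to 0.
reduced_len = fingerprint_bytes(ord_qwords=0, sim_qwords=8, any_qwords=0, tau_qwords=0)
sim_bits = 8 * QWORD_BYTES * 8  # 8 qwords * 8 bytes * 8 bits

print(reduced_len, sim_bits)
```

Under that assumption the reduced fingerprint is 67 bytes long, of which the similarity part accounts for 512 bits.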

Vincent.

Mikhail Rybalkin

Nov 29, 2011, 1:28:06 AM
to indigo-general
Vincent,

Yes, you are right. But someone might need to compute all parts, and
then take the similarity part for similarity search and the
substructure part for substructure search.

In addition, for similarity search you can also take the substructure
part, and everything will work fine. The only difference is that the
similarity part has a lower number of bits.
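
To make the point concrete: a bit-based similarity measure works on any fingerprint part, only the number of bits differs. The sketch below uses the Tanimoto coefficient, which is a common choice but an assumption here, since the thread does not name the metric. The function names are hypothetical.

```python
def popcount(data):
    """Count the bits set in a bytes-like object."""
    return sum(bin(b).count("1") for b in data)

def tanimoto(a, b):
    """Tanimoto coefficient of two equal-length fingerprints."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    common = popcount(bytes(x & y for x, y in zip(a, b)))
    total = popcount(a) + popcount(b) - common
    # Convention chosen here: two empty fingerprints count as identical.
    return common / total if total else 1.0

print(tanimoto(b"\x0f", b"\x03"))
```

The same function can be applied to a 64-byte similarity part or a 200-byte ordinary part; the longer part simply costs more work per comparison.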

Best regards,
Mikhail
