new API for RGroup Decomposition


Savelyev Alexander

Mar 1, 2012, 9:04:19 AM
to indigo-general
Dear all,

We have found that the existing Indigo API is not sufficient for all
the requests and issues raised about RGroup Decomposition.
I have decided to create a separate topic so that interested users
can take part in designing a useful and convenient API.
There are several main requests for RGroup Decomposition:
- resolve the memory usage issue with a large number of input molecules
- add the possibility to iterate over all matches
- add the possibility to set predefined query scaffolds with Rsites
- add the possibility to save attachment point bond orders
The last item has been implemented and will be available in the
upcoming release. A boolean option 'deco-save-ap-bond-orders' was
added. With this flag set, Indigo saves attachment points as separate
pseudo atoms.

The following Python script (the same applies to Java and C#) shows
typical RGroup Decomposition usage with the current API.

----------------------------------------------------------------------------
# load molecules
mols = []
for smiles in structures:
    mol = indigo.loadMolecule(smiles)
    mols.append(mol)

# prepare query scaffold
scaffold = indigo.loadQueryMolecule("some structure")

# perform decomposition
deco = indigo.decomposeMolecules(scaffold, mols)

# get full scaffold with Rsites
full_scaf = deco.decomposedMoleculeScaffold()

# iterate through all the structures and get decomposed molecules
for item in deco.iterateDecomposedMolecules():
    high_mol = item.decomposedMoleculeHighlighted()
    rg_mol = item.decomposedMoleculeWithRGroups()
----------------------------------------------------------------------------

First, we are going to add new API methods and functions. For
compatibility, the current API will be kept (it will only be marked
as deprecated).
The memory issue (it appears in the KNIME nodes) comes from the need
to keep all the molecules in the same array. Therefore, two new
methods are introduced:
decomposeMoleculesInit() and processMolecule(), which handle molecule
structures one by one (function names and parameters may change in
the final release). The example above can be replaced by:

----------------------------------------------------------------------------
# prepare query scaffold
scaffold = indigo.loadQueryMolecule("some structure")

# init decomposition
deco = indigo.decomposeMoleculesInit(scaffold)

# iterate through all the structures (an iterator may be non-obvious)
for smiles in structures:
    # handle current molecule and add Rsites to full scaffold
    item = deco.processMolecule(indigo.loadMolecule(smiles))

    # get decomposed molecules
    high_mol = item.decomposedMoleculeHighlighted()
    rg_mol = item.decomposedMoleculeWithRGroups()

# get full scaffold with Rsites at the end of iteration
full_scaf = deco.decomposedMoleculeScaffold()
----------------------------------------------------------------------------
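
A minimal sketch of why the one-by-one scheme keeps memory flat, written in plain Python with no Indigo calls. ToyDecomposer, read_molecules() and the dict-based "molecules" are hypothetical stand-ins, not Indigo API. The decomposer keeps only the set of Rsite positions between calls, so each per-molecule result can be dropped immediately, and the full scaffold is complete only after the last molecule.

```python
# Toy model only: a "molecule" is a dict {scaffold_position: substituent}.
# ToyDecomposer is a hypothetical stand-in, not an Indigo class.

class ToyDecomposer:
    def __init__(self):
        self.rsite_positions = set()  # the only state kept between calls

    def process(self, molecule):
        """Decompose one molecule and fold its Rsites into the full scaffold."""
        self.rsite_positions.update(molecule.keys())
        return dict(molecule)  # per-molecule result; the caller may discard it

    def full_scaffold(self):
        """Sorted scaffold positions that carry an Rsite so far."""
        return sorted(self.rsite_positions)


def read_molecules():
    """Generator: yields one molecule at a time instead of building a list."""
    yield {0: "N"}
    yield {0: "O", 3: "N"}
    yield {2: "Cl"}


deco = ToyDecomposer()
for mol in read_molecules():
    item = deco.process(mol)  # use `item` here, then let it go

# Only now does the scaffold reflect every molecule seen.
print(deco.full_scaffold())
```

Asking for the full scaffold inside the loop would return a partial answer, which is why the example above reads it after the iteration.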

As you can see, the main problem is generating the so-called
'full scaffold' (a scaffold that gathers all the Rsites together).
I will come back to it with a general question below, after describing
the next issue, iterating over matches.
Two items remain. For iteration, we are going to add a new
method, iterateMatches(), to the deco_item object. Also, the user
can set a predefined query scaffold with RSites, which may have no
matches with a target molecule. Thus, the processMolecule() function
will throw an exception if there are no embeddings. The example above
then looks like this inside the iteration loop:

----------------------------------------------------------------------------
...
# preparations

# iterate through all the structures
for smiles in structures:
    # handle current molecule and handle exceptions
    try:
        item = deco.processMolecule(indigo.loadMolecule(smiles))

        # get decomposed molecules (first match)
        high_mol = item.decomposedMoleculeHighlighted()
        rg_mol = item.decomposedMoleculeWithRGroups()

        # iterate through all the matches
        for q_match in item.iterateMatches():
            # get decomposed molecules (current match)
            rg_mol = item.decomposedMoleculeWithRGroups()

    except Exception as e:
        # error handlers
        ...
----------------------------------------------------------------------------
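
The try/except inside the loop matters because one molecule without an embedding should not stop the whole run. A small sketch of that pattern in plain Python; NoEmbeddingError and toy_process() are hypothetical names with a toy matching rule, not Indigo API.

```python
class NoEmbeddingError(Exception):
    """Hypothetical error type for 'scaffold does not match this molecule'."""


def toy_process(scaffold_atoms, molecule_atoms):
    # Toy matching rule: the scaffold "matches" if all its atoms are present.
    if not set(scaffold_atoms) <= set(molecule_atoms):
        raise NoEmbeddingError("no embeddings obtained")
    # Whatever is left over plays the role of the R-groups.
    return sorted(set(molecule_atoms) - set(scaffold_atoms))


scaffold = ["a", "b", "c"]
molecules = {
    "mol1": ["a", "b", "c", "x"],
    "mol2": ["a", "b"],  # cannot contain the scaffold
    "mol3": ["a", "b", "c", "y", "z"],
}

decomposed, skipped = {}, []
for name, atoms in molecules.items():
    try:
        decomposed[name] = toy_process(scaffold, atoms)
    except NoEmbeddingError as err:
        skipped.append((name, str(err)))  # record it and continue with the rest

print(decomposed)
print(skipped)
```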

The last but most important part is the full scaffold processing. If
the user iterates over all matches, then the scaffold gathers all Rsites
for all the query embeddings. This can generate lots of unnecessary variants.
We can imagine at least two use cases. The first is all the possible
RGroups and one scaffold. The second is a user-defined scaffold,
e.g. with a minimum number of Rsites. So we should give the user the
possibility to choose which matches are used to build the resulting full
scaffold. We are thinking about two methods for the deco_item object,
addAllMatchesToScaffold() and addMatchToScaffold(). The functions may
be used separately (if neither is called, then only the first match
is added):

----------------------------------------------------------------------------
...
# preparations

# iterate through all the structures
for smiles in structures:
    # handle current molecule and handle exceptions
    try:
        item = deco.processMolecule(indigo.loadMolecule(smiles))

        # get decomposed molecules (first match)
        rg_mol = item.decomposedMoleculeWithRGroups()

        # add all the matches
        item.addAllMatchesToScaffold()

        # iterate through all the matches
        for q_match in item.iterateMatches():
            # add current match
            item.addMatchToScaffold()

            # get decomposed molecules (current match)
            rg_mol = item.decomposedMoleculeWithRGroups()

    except Exception as e:
        # error handlers
        ...
----------------------------------------------------------------------------
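
To make the effect of the choice concrete, here is a toy sketch in plain Python (no Indigo calls). Each match is modelled as a set of scaffold positions that would become Rsites, and the full scaffold as the union of the chosen sets. The function names and the data are hypothetical; only the three policies (all matches, first match, a user rule such as fewest Rsites) come from the discussion above.

```python
def build_full_scaffold(matches_per_molecule, choose):
    """Union of the Rsite positions of the matches picked by `choose`."""
    rsites = set()
    for matches in matches_per_molecule:
        for match in choose(matches):
            rsites |= match
    return sorted(rsites)


def all_matches(matches):
    return matches  # add every embedding


def first_match(matches):
    return matches[:1]  # default behaviour: only the first embedding


def fewest_rsites(matches):
    return [min(matches, key=len)]  # user rule: minimum number of Rsites


# Rsite positions per match, per molecule (made-up numbers).
data = [
    [{0, 3, 4}, {2}],  # molecule 1 has two possible embeddings
    [{0, 1}],          # molecule 2 has one
]

print(build_full_scaffold(data, all_matches))
print(build_full_scaffold(data, first_match))
print(build_full_scaffold(data, fewest_rsites))
```

The three policies give three different full scaffolds for the same input, which is the reason for leaving the selection to the caller.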

PS. I hope my examples are complete enough to understand the problem.
If not, please let me know and I will try to create examples with
pictures.
If you have any questions, comments or additional requirements,
please let us know.


With best regards,
Alexander
GGA Software Services LLC

Savelyev Alexander

Mar 2, 2012, 8:02:27 AM
to indigo-general
Hi everybody,

By the way, an example where the 'possibility to select the matching
for the full scaffold' would be a useful feature can be found in the
neighbouring topic:

https://groups.google.com/group/indigo-general/browse_thread/thread/6d77029359364dd8

With best regards,
Alexander

mederich...@gmail.com

Mar 5, 2012, 11:43:05 AM
to indigo-general
Dear Alexander,

I look forward to your new version so that I can test the new methods/functions.

With best regards.

Médérich


Gerhard en-Naser

Mar 20, 2012, 5:31:19 AM
to indigo-general
Hi @all,
sorry for my late response.

> As you can notice, a principal problem is the problem with generating
> so named 'full scaffold' (a scaffold gathers all the Rsites together).

Sorry, I do not understand what 'full scaffold' means.
Is it a scaffold generated by a maximum common substructure search?
If so, should these two tasks be separated into different APIs?
mols = []
for smiles in structures:
    mol = indigo.loadMolecule(smiles)
    mols.append(mol)
scaffolds = indigo.generateMCSS(mols, other options)
(Perhaps generate an MCSS that does not have to match every molecule in
mols, only a user-defined fraction...)

> ----------------------------------------------------------------------------
> ...
> # preparations
>
> The last but the most important part. The full scaffold processing. If
> user iterates all matches than the scaffold gathers all Rsites for all
> the query embeddings. This can generate lots of unnecessary variants.
I think it would be nice to know whether results match different atom
sets.
If the user only wants one result per atom set, or one result in total, how
about prioritizing mappings/atom sets by the canonical SMILES of the
R-Groups?
E.g.:
Query: c1(R1)cc(R2)ccc1
Molecule: c1(-NH)cc(-OH)ccc1
Results:
Mapping 1: R1-NH R2-OH
Mapping 2: R1-OH R2-NH
Generate a canonical SMILES for each R-Group. Sort mappings/atom sets
by R-Group (R1 has the highest priority).
for smiles in structures:
    # handle current molecule and handle exceptions
    try:
        # sorted list of R-Groups (R1, R2 .. Rn)
        rGroupList = deco.getRgroups()
        matches = deco.processMolecule(indigo.loadMolecule(smiles))
        # loop over all matches
        for atomSet in matches:
            for match in atomSet:
                s = ''
                for rGroupName in rGroupList:
                    s = s + rGroupName + ' :' + match.getCanonicalSmilesFor(rGroupName) + ' '
                print(s)
        # loop over first match on all atom sets
        for atomSet in matches:
            s = ''
            match = atomSet.getFirstMatch()
            for rGroupName in rGroupList:
                s = s + rGroupName + ' :' + match.getCanonicalSmilesFor(rGroupName) + ' '
            print(s)
        # get first match on first atomSet
        match = matches.getFirstAtomSet().getFirstMatch()
        s = ''
        for rGroupName in rGroupList:
            s = s + rGroupName + ' :' + match.getCanonicalSmilesFor(rGroupName) + ' '
        print(s)
    except Exception as e:
        # error handlers
        pass
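
The prioritization itself needs nothing from the toolkit once each R-Group has a canonical SMILES string. A minimal sketch in plain Python, assuming the strings below are already canonical; priority_key() and the data layout are hypothetical and only illustrate the ordering rule (R1 compared first, then R2, and so on).

```python
def priority_key(mapping, rgroup_names):
    """Sort key: the R-Group SMILES in R1, R2, ... order."""
    return tuple(mapping[name] for name in rgroup_names)


rgroup_names = ["R1", "R2"]

# Two mappings of the same molecule onto the query (made-up canonical SMILES).
mappings = [
    {"R1": "O", "R2": "N"},
    {"R1": "N", "R2": "O"},
]

ranked = sorted(mappings, key=lambda m: priority_key(m, rgroup_names))
for m in ranked:
    print(" ".join(f"{name}:{m[name]}" for name in rgroup_names))

best = ranked[0]  # the single result to keep when only one is wanted
```

Because the key is a tuple, Python compares R1 first and falls back to R2 only on ties, which is the "R1 highest priority" rule.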

With best regards.
Gerhard

Savelyev Alexander

Mar 20, 2012, 9:46:23 AM
to indigo-...@googlegroups.com
Dear Gerhard,

Thank you for the response.


> Hi @all,
> sorry for my late response.
>
>> As you can notice, a principal problem is the problem with generating
>> so named 'full scaffold' (a scaffold gathers all the Rsites together).
> Sorry I do not understand what 'full scaffold' means.
> A Scaffold generated by maximum common substructure search?
> If so, should these two tasks separated in different API?
> mols = []
> for smiles in structures:
> mol = indigo.loadMolecule(smiles)
> mols.append(mol)
> scaffolds = indigo.generateMCSS(mols,other options)
> (Perhaps generate MCSS which did not match any Molecule in mols, but a
> user defined fraction...)

No, a full scaffold is not a scaffold generated by MCSS. The scaffold
detection API will NOT be changed. MCSS is not considered in this topic.
The issue applies ONLY to RGroup Decomposition.

There are two different types of input queries:
- a user-defined molecule with RGroups (e.g. c1(R1)cc(R2)ccc1 in your
example)
- a simple query molecule, which can be passed from the Scaffold
Detection (e.g. c1ccccc1 in your example)

In the first case the full scaffold is the user-defined molecule
itself, and it cannot be changed during RGroup Decomposition.
In the second case the full scaffold has to be generated by the library,
and it is returned by the decomposedMoleculeScaffold() method.

I think the examples below will clarify the logic.
PS. I will use your SMILES notation with R1, R2... atoms, which is not
supported by SMILES readers. If someone wants to read it, the
R<Number> string can be replaced by [*:<Number>] with mapping, e.g.
R1 >> [*:1], R2 >> [*:2], etc., so that
c1(R1)cc(R2)ccc1 >> c1([*:1])cc([*:2])ccc1

Query: c1(R1)cc(R2)ccc1
Molecule: c1(N)cc(O)ccc1
Full scaffold: c1(R1)cc(R2)ccc1

Query: c1ccccc1
Molecule: c1(N)cc(O)ccc1
Full scaffold: c1(R1)cc(R2)ccc1
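
The R<Number> to [*:<Number>] replacement described in the PS above can be done mechanically. A small sketch in plain Python, assuming R labels appear only as an uppercase R followed by digits; rsites_to_mapped() is a hypothetical helper name, and whether a given SMILES reader accepts the result is not checked here.

```python
import re


def rsites_to_mapped(pattern):
    """Rewrite R1, R2, ... as [*:1], [*:2], ... in an R-labelled SMILES string."""
    return re.sub(r"R(\d+)", r"[*:\1]", pattern)


print(rsites_to_mapped("c1(R1)cc(R2)ccc1"))
print(rsites_to_mapped("(R1)C1CCNC(R2)C1"))
```

Element symbols that start with R (Ru, Rh, Re, ...) are followed by a lowercase letter rather than a digit, so the pattern leaves them alone.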

As I mentioned before, the full scaffold gathers all the Rsites together
and should match all input molecules.

Query: C1CCNCC1
Molecule: NC1CCNCC1
Full scaffold: (R1)C1CCNCC1

Query: C1CCNCC1
Molecule1: OC1CCNCC1
Molecule2: C1CCNC(N)C1
Full scaffold: (R1)C1CCNC(R2)C1

The next step is iterating over matches. I should note that for
user-defined scaffolds (first case) there are no problems. But in the second
case (and all the API description earlier in this topic was for this case),
the full scaffold can differ between matches. The example
below shows such a possibility.

Query:C1CCCCC1
Molecule:C1CCC(CC1)C1CCC2OC2C1
Full scaffold(possibility 1): ([*:1])C1CCCCC1
Full scaffold(possibility 2): ([*:1])C1CCC2[*:2]C2C1

So we should give the user a way to select the match. In
the example below we select possibility 2, with the maximum RGroup count.
(PS. I have noticed a typo in the code examples in my first message:
inside the match iteration loop, item should be q_match.)

# iterate through all the structures
for smiles in structures:

    try:
        item = deco.processMolecule(indigo.loadMolecule(smiles))
        max_r = 0
        selected_match = None
        # loop over all the matches
        for q_match in item.iterateMatches():
            # find the match with the maximum RGroup count
            rg_mol = q_match.decomposedMoleculeWithRGroups()
            if rg_mol.countRSites() > max_r:
                max_r = rg_mol.countRSites()
                selected_match = q_match

        # add the selected match to the full scaffold (possibility 2)
        if selected_match is not None:
            selected_match.addMatchToScaffold()
    except Exception as e:
        # error handlers
        pass

Suppose we have a second molecule. If we use the code above, the full
scaffold will be:

Query:C1CCCCC1
Molecule1:C1CCC(CC1)C1CCC2OC2C1
Molecule2:OC1CCCCC1O
Full scaffold: ([*:1])C1CC2[*:2]C2CC1([*:3])

But if we do not use the code above:

Query:C1CCCCC1
Molecule1:C1CCC(CC1)C1CCC2OC2C1
Molecule2:OC1CCCCC1O
Full scaffold: ([*:1])C1CCCCC1([*:2])


> I think it would be nice to know whether results match different atom
> sets.
> If the user only wants one result per atom set, or one result in total, how
> about prioritizing mappings/atom sets by the canonical SMILES of the
> R-Groups?
> E.g.:
> Query: c1(R1)cc(R2)ccc1
> Molecule: c1(-NH)cc(-OH)ccc1
> Results:
> Mapping 1: R1-NH R2-OH
> Mapping 2: R1-OH R2-NH
> Generate a canonical SMILES for each R-Group. Sort mappings/atom sets
> by R-Group (R1 has the highest priority).
> for smiles in structures:
>     # handle current molecule and handle exceptions
>     try:
>         # sorted list of R-Groups (R1, R2 .. Rn)
>         rGroupList = deco.getRgroups()
>         matches = deco.processMolecule(indigo.loadMolecule(smiles))
>         # loop over all matches
>         for atomSet in matches:
>             for match in atomSet:
>                 s = ''
>                 for rGroupName in rGroupList:
>                     s = s + rGroupName + ' :' + match.getCanonicalSmilesFor(rGroupName) + ' '
>                 print(s)
>         # loop over first match on all atom sets
>         for atomSet in matches:
>             s = ''
>             match = atomSet.getFirstMatch()
>             for rGroupName in rGroupList:
>                 s = s + rGroupName + ' :' + match.getCanonicalSmilesFor(rGroupName) + ' '
>             print(s)
>         # get first match on first atomSet
>         match = matches.getFirstAtomSet().getFirstMatch()
>         s = ''
>         for rGroupName in rGroupList:
>             s = s + rGroupName + ' :' + match.getCanonicalSmilesFor(rGroupName) + ' '
>         print(s)

I think I understand what you meant. Yes, the algorithm will loop only
over the matches that generate molecules with different canonical SMILES
(taking the RGroup numbers into account). Suppose we have the following example:

Query:C1CCCCC1
Molecule:NC1CCCCC1

There will be only one match:
(R1)C1CCCCC1, R1=N
The algorithm will skip the matches C1(R1)CCCCC1, C1C(R1)CCCC1,
C1CC(R1)CCC1, etc., because they give the same molecule.

If we have the example:

Query:C1CCCCC1
Molecule:NC1CCCC(O)C1

There will be two matches:
(R1)C1CCCC(R2)C1, R1=OH, R2=NH2
(R1)C1CCCC(R2)C1, R1=NH2, R2=OH
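
Why one substituent gives a single match while two different substituents give two can be checked with a toy model. The sketch below is plain Python and is not Indigo's algorithm: the six-membered ring is a tuple of substituent labels, every rotation and reflection of the ring stands for one embedding of the query, Rsites are numbered in query-atom order, and two embeddings count as the same result when their R-labelled rings are identical up to ring symmetry. All names are hypothetical.

```python
def ring_images(ring):
    """Yield all 2*n rotations and reflections of a ring (tuple of labels)."""
    n = len(ring)
    for start in range(n):
        for step in (1, -1):
            yield tuple(ring[(start + step * i) % n] for i in range(n))


def unique_matches(target_ring):
    """Return (number of raw embeddings, sorted distinct R-group assignments)."""
    distinct = {}
    raw = 0
    for embedded in ring_images(target_ring):  # one embedding of the ring query
        raw += 1
        labelled, count = [], 0
        for substituent in embedded:  # number Rsites in query-atom order
            if substituent:
                count += 1
                labelled.append(f"R{count}={substituent}")
            else:
                labelled.append("")
        key = min(ring_images(tuple(labelled)))  # symmetry-independent form
        distinct[key] = tuple(label for label in labelled if label)
    return raw, sorted(distinct.values())


# NC1CCCCC1: one substituent on the ring
print(unique_matches(("N", "", "", "", "", "")))
# NC1CCCC(O)C1: two different substituents in a 1,3 relationship
print(unique_matches(("N", "", "", "", "O", "")))
```

In both cases the ring query has twelve raw embeddings; what differs is how many of them survive once symmetric duplicates are merged.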

One last thing: your code contains

#sorted list of R-Groups (R1,R2..Rn)
rGroupList=deco.getRgroups()


If we have a simple scaffold query (without RGroups), we cannot get the
RGroups, because we do not know them all at the current iteration.
The same example can be implemented using the proposed API:

Query: c1(R1)cc(R2)ccc1
Molecule: c1(-NH)cc(-OH)ccc1


# loop over all the structures
for smiles in structures:
    try:
        item = deco.processMolecule(indigo.loadMolecule(smiles))

        # get decomposed molecules (first match)
        rg_mol = item.decomposedMoleculeWithRGroups()

        # add all the matches
        item.addAllMatchesToScaffold()

        # iterate over all the matches
        for q_match in item.iterateMatches():
            # get decomposed molecules (current match)
            rg_mol = q_match.decomposedMoleculeWithRGroups()
            s = ''
            # iterate over RGroups
            for rg in rg_mol.iterateRGroups():
                s = s + 'R' + str(rg.index()) + ':'
                if rg.iterateRGroupFragments().hasNext():
                    rg_frag = rg.iterateRGroupFragments().next()
                    # append canonical smiles for an Rgroup fragment
                    s = s + rg_frag.canonicalSmiles() + '\n'

    except Exception as e:
        # error handlers
        pass

If I have missed something please let me know.

With best regards,
Alexander


mederich...@gmail.com

Mar 21, 2012, 9:12:30 AM
to indigo-general
Hi all,

>I should notice, that for user
>defined scaffolds (first case) there are no problems.

I cannot manage to define my own RgroupScaffold for Rgroup
decomposition:
IndigoObject RgScaff = session.loadQueryMolecule("C1CCNCC1[*:1]C1=CC=CC=C1");
rGroupDecomposition = session.decomposeMolecules(RgScaff, indigoMolList);

I get this exception:
com.ggasoftware.indigo.IndigoException: R-Group deconvolution: no
embeddings obtained
at com.ggasoftware.indigo.Indigo.checkResult(Indigo.java:57)
at com.ggasoftware.indigo.Indigo.decomposeMolecules(Indigo.java:421)

My scaffold should match all the molecules in indigoMolList.
Maybe my query molecule is not correct.

Could someone help me?

With best regards.

Médérich.


Savelyev Alexander

Mar 21, 2012, 9:34:24 AM
to indigo-...@googlegroups.com
Hello,

The new decomposition algorithm and API are under development at the
moment. Unfortunately, user-defined scaffolds are not supported in the
current version. I will announce the release in this topic.

With best regards,
Alexander

Savelyev Alexander

Apr 12, 2012, 11:39:25 AM
to indigo-...@googlegroups.com, mederich...@gmail.com, gerhard.e...@googlemail.com
Hello all,

We are glad to present the new RGroup Decomposition API. The Indigo
library version (1.1-beta11) with the new API is already available for
download.
http://ggasoftware.com/accept?file=indigo-1.1-beta11%2Findigo-java-1.1-beta11-universal.zip

PS: http://ggasoftware.com/download/indigo_next will be updated soon.

There are several important changes:

- support for iterating over decomposition matches
- support for user-defined scaffolds
- improved memory usage during decomposition

All the issues raised have been resolved. The version is still in beta
testing, however, so we are happy to receive any feedback or comments.

Below are example scripts in Python.

PS. Function names have changed since the first message.

simple usage
-----------------------------------------------------------------------------------------------------------


# prepare query scaffold
scaffold = indigo.loadQueryMolecule("some structure")

# init decomposition
deco = indigo.createDecomposer(scaffold)

# iterate over all the structures (an iterator may be non-obvious)
for smiles in structures:
    # handle current molecule
    item = deco.processMolecule(indigo.loadMolecule(smiles))

    # get decomposed molecules and add Rsites to full scaffold
    high_mol = item.decomposedMoleculeHighlighted()
    rg_mol = item.decomposedMoleculeWithRGroups()

# get full scaffold with Rsites at the end of iteration
full_scaf = deco.decomposedMoleculeScaffold()

-----------------------------------------------------------------------------------------------------------


usage with iterating all matches
-----------------------------------------------------------------------------------------------------------

...
# preparations

# iterate over all the structures
for smiles in structures:
    # handle current molecule and handle exceptions
    try:
        item = deco.processMolecule(indigo.loadMolecule(smiles))

        # iterate over all the decompositions
        for q_match in item.iterateDecompositions():
            # get decomposed molecules (current match)
            rg_mol = q_match.decomposedMoleculeWithRGroups()

            # add current match
            deco.addDecomposition(q_match)

    except Exception as e:
        # error handlers
        ...
-----------------------------------------------------------------------------------------------------------

User-defined scaffolds with predefined RSites are supported by the
algorithm. You can load a scaffold from a molfile. The SMILES format is
not supported at the moment, but I think we will add SMILES support in
the near future (e.g. with the ...[*:1]...[*:2]... notation).

example with user-defined scaffold
-----------------------------------------------------------------------------------------------------------
# prepare query scaffold (e.g. '(R1)C1CCCC(R2)C1')
scaffold = indigo.loadQueryMoleculeFromFile("query_mol")

# init decomposition
deco = indigo.createDecomposer(scaffold)

# load molecule
mol = indigo.loadMolecule('NC1CCCC(O)C1')

# create deco item
item = deco.processMolecule(mol)

# iterate over all the decompositions
for q_match in item.iterateDecompositions():
    # get decomposed molecule (current match)
    rg_mol = q_match.decomposedMoleculeWithRGroups()

    # print molfile
    print(rg_mol.molfile())

-----------------------------------------------------------------------------------------------------------
In the example above there will be two matches:


(R1)C1CCCC(R2)C1, R1=OH, R2=NH2
(R1)C1CCCC(R2)C1, R1=NH2, R2=OH

If you have any questions, please let us know.
