Getting canonical SMILES for large molecule hangs indefinitely


j.n.a...@gmail.com

Jul 23, 2020, 9:44:02 AM
to indigo-bugs
Hi -

I've found that Indigo correctly parses CTfile strings containing many small molecules and correctly generates SMILES strings for them. However, when the canonical SMILES is retrieved using the molecule.canonicalSmiles() method of the Python API, the library appears to hang indefinitely, using 100% of one CPU core. It never seems to complete or throw an error.

Unfortunately, I cannot share the exact chemical structures, as they're protected by an NDA, but the total atom count is C835 H946 Cl19 F5 N108 O111 S37 across 30 small molecules. I run aromatization, which completes fine, before attempting to get the canonical SMILES:

molecule.aromatize()
molecule_smiles = molecule.canonicalSmiles()

I can reproduce this on Indigo 1.2.3 and Indigo 4.0 Beta, on both Windows and Linux. This is using the Python wrappers to call the pre-built Indigo libraries.

All the best,

Jonathan Ayling



indigo-bugs

Jul 23, 2020, 10:01:40 AM
to indigo-bugs
Dear Jonathan Ayling

The canonical SMILES algorithm computes and enumerates the automorphism groups of the molecular graph, along with other NP-hard steps. This means that canonicalization can take a very long time for large molecules. I assume your molecules are peptides, oligonucleotides, or similar classes of biomolecules. If you use canonical SMILES to obtain unique strings, the recommendation is to use an appropriate monomer-based format (e.g. HELM).
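
Since the call runs inside the native library and may not return in a reasonable time, one possible workaround is to run it in a separate Python process and give up after a time limit. The following is only a minimal sketch: the names CHILD_SCRIPT, run_child and canonical_smiles_with_timeout are made up for illustration, and it assumes the Indigo Python wrapper is importable as `from indigo import Indigo` in the child interpreter.

```python
import subprocess
import sys

# Hypothetical child script: reads a molfile from stdin and prints the
# canonical SMILES. Assumes the Indigo Python wrapper is installed.
CHILD_SCRIPT = """
import sys
from indigo import Indigo
indigo = Indigo()
mol = indigo.loadMolecule(sys.stdin.read())
mol.aromatize()
print(mol.canonicalSmiles())
"""


def run_child(script, stdin_text, timeout_s):
    """Run `script` in a fresh interpreter; return its stdout, or None on timeout."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", script],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed if it exceeds this limit
            check=True,         # raise CalledProcessError if the child fails
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout.strip()


def canonical_smiles_with_timeout(molfile_text, timeout_s=30.0):
    """Return the canonical SMILES, or None if it did not finish in time."""
    return run_child(CHILD_SCRIPT, molfile_text, timeout_s)
```

A separate process is used rather than a thread because a thread blocked inside a native call cannot be interrupted from Python, whereas a child process can be terminated. A None result then signals that the structure should be handled some other way.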

With best regards
Aleksandr

j.n.a...@gmail.com

Jul 24, 2020, 5:47:19 AM
to indigo-bugs
Thanks Aleksandr, and many thanks for the quick and detailed response.

OK, that makes sense, and it's good to have that confirmed. Splitting into individual monomers is probably the simplest approach on our side as well.
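
A rough sketch of what splitting could look like with the Python API, canonicalizing each connected component separately and joining the results in sorted order. The function name componentwise_canonical_smiles is hypothetical, and the resulting string is only an assumption-based stand-in: it is deterministic for a given set of components, but it is not guaranteed to match what canonicalSmiles() would return for the whole mixture, and it has not been verified to avoid the slowdown in every case.

```python
def componentwise_canonical_smiles(molecule):
    """Canonicalize each connected component on its own.

    `molecule` is expected to be an Indigo molecule object (already
    aromatized, as in the original snippet). Each component is cloned
    into a standalone molecule, canonicalized, and the per-component
    SMILES are sorted and joined with '.' to give a stable string.
    """
    parts = []
    for component in molecule.iterateComponents():
        fragment = component.clone()  # standalone copy of this component
        parts.append(fragment.canonicalSmiles())
    # Sorting makes the result independent of component order in the input.
    return ".".join(sorted(parts))
```

With the wrapper installed, usage would be along the lines of creating indigo = Indigo(), loading the structure with indigo.loadMolecule(...), calling molecule.aromatize(), and then passing the molecule to componentwise_canonical_smiles(molecule).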

All the best,

Jonathan