Fwd: Grammars for retrosynthetic analysis (organic chemistry)

Bryan Bishop

Jul 9, 2009, 11:44:26 AM

to diytrans...@googlegroups.com, kan...@gmail.com, opensourc...@googlegroups.com

(this was originally sent to an advisor of mine at a university)

Route designer - a retrosynthetic analysis tool utilizing automated
retrosynthetic rule generation
http://heybryan.org/books/papers/Route%20designer%20-%20a%20retrosynthetic%20analysis%20tool%20utilizing%20automated%20retrosynthetic%20rule%20generation.pdf

That link may be somewhat slow. Try this one instead:

http://adl.serveftp.org/papers/Route%20designer%20-%20a%20retrosynthetic%20analysis%20tool%20utilizing%20automated%20retrosynthetic%20rule%20generation.pdf

Retrosynthetic analysis is where you start with a target organic
compound and use about 130 different transformation rules
(transforms) to work back to possible starting compounds. These are
essentially the reaction mechanisms that every student in organic
chemistry ends up hating, or fearing because everyone mentions
something about forced memorization.

So in the following email I proposed a toolchain, where in the
software you could go from "input chemical compound", and the output
would be some microfluidic circuit design for either immediate
printing or implementation with Marangoni flows or EWOD routing
algorithms on some grid. I prefer the print-the-circuit-to-the-task
model at the moment. The microfluidic device in this case should be
thought of as a miniature chemical factory, except on your desktop and
costing significantly less.

"Route Designer" is actually a piece of software that the paper talks
about. But the problem is that it's funded by some company, and it's
proprietary, so it's generally going to be inaccessible. To make our
own, we could construct grammar rules for the reaction mechanisms in
organic chemistry, maybe in the GRXML format for GraphSynth, which
might lead to getting back the synthesis steps, but not the overall
microfluidic circuit (that part comes later).
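
As a rough illustration of what "grammar rules for the reaction mechanisms" could look like in code, here is a minimal Python sketch. It is not GRXML and it is not how GraphSynth or Route Designer work; the `RetroRule` class, the two rules, and the regex-on-SMILES matching are all assumptions made up for this example. A real system would match on molecular graphs, not on strings.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RetroRule:
    """Hypothetical retrosynthetic rule: a SMILES regex plus precursor templates."""
    name: str
    pattern: str          # regex with two groups: acyl part and the remainder
    templates: List[str]  # precursor SMILES built from the matched groups

    def apply(self, smiles: str) -> Optional[List[str]]:
        """Return the precursor SMILES if the rule matches, else None."""
        match = re.fullmatch(self.pattern, smiles)
        if match is None:
            return None
        return [match.expand(template) for template in self.templates]


# Two toy rules (assumed for illustration only).
RULES = [
    # ester  <-  carboxylic acid + alcohol
    RetroRule("ester disconnection", r"(.*C\(=O\))O(C.*)", [r"\g<1>O", r"O\g<2>"]),
    # amide  <-  carboxylic acid + amine
    RetroRule("amide disconnection", r"(.*C\(=O\))N(C.*)", [r"\g<1>O", r"N\g<2>"]),
]


def disconnect(smiles: str) -> List[Tuple[str, List[str]]]:
    """Apply every rule and collect (rule name, precursors) for each match."""
    results = []
    for rule in RULES:
        precursors = rule.apply(smiles)
        if precursors is not None:
            results.append((rule.name, precursors))
    return results


if __name__ == "__main__":
    print(disconnect("CC(=O)OCC"))  # ethyl acetate
    print(disconnect("CC(=O)NC"))   # N-methylacetamide
```

String matching like this breaks on rings, branches, and alternative SMILES spellings of the same molecule, which is one reason a graph grammar looks like the better fit.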

Maybe I can get someone interested in this enough to want to help me
out. I welcome any collaborators. You can find me on #hplusroadmap on
irc.freenode.net usually.

- Bryan

---------- Forwarded message ----------
From: Bryan Bishop <kan...@gmail.com>
Date: Wed, May 13, 2009 at 10:15 AM
Subject: Grammars for retrosynthetic analysis (organic chemistry)

We were talking about possibly using one of the four undergrads for
retrosynthetic analysis / organic chemistry. In o-chem, there are
the 130 retrosynthetic analysis operations, and there are also the ~20
"reaction mechanisms" that nearly every undergrad must memorize. Maybe
this would be a suitable project. I've attached the "Route Designer"
2009 paper.

I would find it especially awesome to start from a "final organic chemical
structure" (in SMILES) and then, from a library of components (molecules),
find a path back to basic single- or two-atom compounds. Following
this, it should be possible to automatically design a MEMS
factory to manufacture that chemical in
small volumes (or at least qualitatively using reaction chambers and
labware found in a typical chem lab).
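
The "find a path back" step can be sketched as a small recursive search. This is only a toy under stated assumptions: the `RETRO_STEPS` table, the `STOCK` set, and the `plan` function are hypothetical names invented here, and the table is hand-written, not generated from real transforms.

```python
from typing import Dict, List, Optional, Set, Tuple

Step = Tuple[str, Tuple[str, ...]]  # (product SMILES, precursor SMILES)

# Hypothetical retro-steps: product -> alternative precursor sets.
RETRO_STEPS: Dict[str, List[Tuple[str, ...]]] = {
    "CC(=O)OCC": [("CC(=O)O", "OCC")],  # ethyl acetate <- acetic acid + ethanol
    "CC(=O)O": [("CC=O",)],             # acetic acid <- acetaldehyde (oxidation)
    "CC=O": [("OCC",)],                 # acetaldehyde <- ethanol (oxidation)
}

# Hypothetical library of compounds we already have on the shelf.
STOCK: Set[str] = {"OCC"}  # ethanol


def plan(target: str, stock: Set[str],
         steps: Dict[str, List[Tuple[str, ...]]],
         max_depth: int = 5) -> Optional[List[Step]]:
    """Depth-first search from the target back to stock compounds.

    Returns the steps in forward (synthesis) order, or None if no route
    is found within max_depth.
    """
    if target in stock:
        return []
    if max_depth == 0:
        return None
    for precursors in steps.get(target, []):
        route: List[Step] = []
        for precursor in precursors:
            sub_route = plan(precursor, stock, steps, max_depth - 1)
            if sub_route is None:
                break  # this alternative fails, try the next one
            route.extend(sub_route)
        else:
            return route + [(target, precursors)]
    return None


if __name__ == "__main__":
    for product, precursors in plan("CC(=O)OCC", STOCK, RETRO_STEPS) or []:
        print(" + ".join(precursors), "->", product)
```

The search returns the first route it finds, not the best one; ranking routes by cost or yield would be a separate problem.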

- Bryan
http://heybryan.org/
1 512 203 0507

Route designer - a retrosynthetic analysis tool utilizing automated retrosynthetic rule generation.pdf

Bryan Bishop

Jul 9, 2009, 2:00:47 PM

to diy...@googlegroups.com, diytrans...@googlegroups.com, kan...@gmail.com, opensourc...@googlegroups.com

On Thu, Jul 9, 2009 at 12:40 PM, Nathan McCorkle <nmz...@gmail.com> wrote:
>> Retrosynthetic analysis is where you start with a target organic
>> compound and use about 130 different transformation rules
>> (transforms) to work back to possible starting compounds. These are
>> essentially the reaction mechanisms that every student in organic
>> chemistry ends up hating, or fearing because everyone mentions
>> something about forced memorization.
>>
>> So in the following email I proposed a toolchain, where in the
>> software you could go from "input chemical compound", and the output
>> would be some microfluidic circuit design for either immediate
>> printing or implementation with Marangoni flows or EWOD routing
>> algorithms on some grid. I prefer the print-the-circuit-to-the-task
>> model at the moment. The microfluidic device in this case should be
>> thought of as a miniature chemical factory, except on your desktop and
>> costing significantly less.
>

> Something I have thought a lot about, except more in the way of: I have a
> chemical, get me a DNA sequence that will produce the chemical when
> transformed into species X, or another DNA seq for when transforming into
> species Y.

There are certain reaction pathways in the genomes of different
species that lead up to some chemical. But not all chemicals are
synthesized in those pathways. So, the set from which you can select
is limited in size. I think that the DNA sequence that you are
asking for depends more on which plasmid vectors work, although there
might be other incompatibilities that I am not currently aware of. Any
hints?

>> "Route Designer" is actually a piece of software that the paper talks
>> about. But the problem is that it's funded by some company, and it's
>> proprietary, so it's generally going to be inaccessible. To make our
>> own, we could construct grammar rules for the reaction mechanisms in
>> organic chemistry, maybe in the GRXML format for GraphSynth, which
>> might lead to getting back the synthesis steps, but not the overall
>> microfluidic circuit (that part comes later).
>>
>> Maybe I can get someone interested in this enough to want to help me
>> out. I welcome any collaborators. You can find me on #hplusroadmap on
>> irc.freenode.net usually.
>

> Umm, maybe I can figure out how to get on there...

You can access IRC by using a client like Chatzilla if you use Firefox.

https://addons.mozilla.org/en-US/firefox/addon/16

You can get help for IRC here:

http://irchelp.org/

Once you install an IRC client, connect to irc.freenode.net either
through some selection menu in the GUI or by typing:

/connect irc.freenode.net

Then type:

/nick some_nickname_that_you_want

Then type:

/join #hplusroadmap

>> Does anyone know all their chemical reaction mechanisms by heart? I
>> could go looking through a book, but I'll miss a lot of details that I
>> could otherwise be aware of if only there was someone who could tell
>> me when I'm transcribing the totally wrong idea. And other errors of
>> the typical data entry monkey.
>
> Man I wish, I start Organic chem this fall, but it will be a while until "by
> heart" describes my knowledge of chem.

For some reason I have more books on organic chemistry than on any
other subject on my (physical) bookshelves. One of the books that I
found a few years ago was one that describes organic chemistry
reaction mechanisms entirely in terms of electron flow, which actually
helped my understanding out a lot.

Bryan Bishop

Jul 9, 2009, 4:29:50 PM

to diybio, diytrans...@googlegroups.com, kan...@gmail.com

On Thu, Jul 9, 2009 at 3:19 PM, Eugen Leitl <eu...@leitl.org> wrote:
> Nathan, I know you're having trouble with mobiles, but missing attribution
> makes it really difficult to tell who said what.

Most clients automatically implement it. I would find it peculiar to
see one that doesn't... that's just bad/unfriendly business practice.

> On Thu, Jul 09, 2009 at 01:40:03PM -0400, Nathan McCorkle wrote:
>>
>> Retrosynthetic analysis is where you start with a target organic
>> compound and use about 130 different transformation rules
>

> This strikes me as way too low. Name reactions alone
> https://themerckindex.cambridgesoft.com/TheMerckIndex/NameReactions/TOC.asp
> are quite a bit.

You're right, yes, there's way more than 130 there. I see 449 there. I
got the 130 number from the Nobel Lecture on retrosynthetic analysis,
which was perhaps too long ago to count on.

>> (transforms) to work back to possible starting compounds. These are
>> essentially the reaction mechanisms that every student in organic
>> chemistry ends up hating, or fearing because everyone mentions
>> something about forced memorization.
>

> If you have trouble with memorizing things, then you shouldn't study
> chemistry. Or medicine, for that matter.

Right, and you should also understand. Apparently many students are
forced to take organic chemistry classes even though they have no clue
what's going on before and after.

>> So in the following email I proposed a toolchain, where in the
>> software you could go from "input chemical compound", and the output
>> would be some microfluidic circuit design for either immediate
>> printing or implementation with Marangoni flows or EWOD routing
>> algorithms on some grid. I prefer the print-the-circuit-to-the-task
>> model at the moment. The microfluidic device in this case should be
>> thought of as a miniature chemical factory, except on your desktop and
>> costing significantly less.
>

> I think just making an open source retrosyn package which doesn't
> take 8+ cores would be a Very Nice Thing already. I know pharma is
> highly interested in the crappy merchandise we're peddling.

In one of the labs that I work in, we do lots of graph grammars
running for many hours and weeks, thrashing the hard drives like
crazy. This is why I was thinking of grammar rules for a
retrosynthesis system, although there are some ideas for improving it
(like not using .NET (the in-house app uses .NET extensively, even
coupling the generation process to the graphics handler sometimes
(wtf))).

>> Something I have thought a lot about, except more in the way of: I
>> have a chemical, get me a DNA sequence that will produce the chemical
>> when transformed into species X, or another DNA seq for when
>> transforming into species Y.
>

> Real life biochemistry is unfortunately a lot messier than a graph
> of transformations.

Yes, and you shouldn't trust the output of the retrosynthesis package
immediately either. There are always side reactions that have to be
considered.

>> Does anyone know all their chemical reaction mechanisms by heart? I
>

> You're way underestimating the magnitude of your task.

Isn't there a distribution curve for the use of different reaction
mechanisms? I mean, the likelihood that anything more complicated than
a basic handful of reaction mechanisms would be needed for the
majority (say 80%) of the tasks is low. So the probability distribution
needs to be taken into account, I guess.
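
To make the 80% idea concrete, here is a small Python sketch that asks how many mechanisms you need before you cover a given share of uses. The usage counts are invented purely for illustration (I have no real frequency data), and `rules_needed` is a hypothetical helper name.

```python
from typing import Dict, List

# INVENTED usage counts, for illustration only; not measured data.
USAGE_COUNTS: Dict[str, int] = {
    "amide coupling": 40,
    "esterification": 20,
    "Suzuki coupling": 15,
    "reductive amination": 10,
    "Wittig": 5,
    "Diels-Alder": 4,
    "Claisen": 3,
    "Cope": 2,
    "ene reaction": 1,
}


def rules_needed(usage_counts: Dict[str, int], coverage: float = 0.8) -> List[str]:
    """Return the most-used rules, in order, until `coverage` of all uses is reached."""
    total = sum(usage_counts.values())
    running = 0
    chosen: List[str] = []
    ranked = sorted(usage_counts.items(), key=lambda item: item[1], reverse=True)
    for name, count in ranked:
        chosen.append(name)
        running += count
        if running >= coverage * total:
            break
    return chosen


if __name__ == "__main__":
    needed = rules_needed(USAGE_COUNTS, coverage=0.8)
    print(len(needed), "of", len(USAGE_COUNTS), "rules:", needed)
```

With a real frequency table in place of the made-up one, the same function would say how long the tail of rarely used mechanisms actually is.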

>> could go looking through a book, but I'll miss a lot of details that I
>> could otherwise be aware of if only there was someone who could tell
>> me when I'm transcribing the totally wrong idea. And other errors of
>> the typical data entry monkey.
>>
>> Man I wish, I start Organic chem this fall, but it will be a while
>> until "by heart" describes my knowledge of chem.
>

> It's been about a decade since I've last done chemistry professionally.
> There are many open source chemistry efforts, which unfortunately tend
> to get stuck at the early-sketch phase. The proprietary stuff is... interesting.
> In the same way a trainwreck is.

Sounds all too familiar.

Bryan Bishop

Jul 9, 2009, 4:52:07 PM

to diy...@googlegroups.com, diytrans...@googlegroups.com, kan...@gmail.com

On Thu, Jul 9, 2009 at 3:47 PM, Eugen Leitl wrote:

> On Thu, Jul 09, 2009 at 03:29:50PM -0500, Bryan Bishop wrote:
>> You're right, yes, there's way more than 130 there. I see 449 there. I
>> got the 130 number from the Nobel Lecture on retrosynthetic analysis,
>> which was perhaps too long ago to count on.
>

> If you want to be exhaustive, the numbers can get quite ridiculous.
> I can ask tomorrow what the total estimate for number of reactions is.

Yes please. Also, in the meantime, here's the rip:

http://adl.serveftp.org/reactions.zip

Too bad all of the diagrams are in some irretrievable format. Why not
just use SMILES plus some sort of standard text-based reaction
representation? Argh.

Bryan Bishop

Jul 9, 2009, 6:14:45 PM

to diy...@googlegroups.com, kan...@gmail.com, diytrans...@googlegroups.com

On Thu, Jul 9, 2009 at 4:05 PM, Cory Tobin wrote:

> On Thu, Jul 9, 2009 at 1:52 PM, Bryan Bishop wrote:
>> Too bad all of the diagrams are in some irretrievable format. Why not
>> just use SMILES plus some sort of standard text-based reaction
>> representation? Argh.
>

> As for the text-based reaction representation, check out SMIRKS
> http://www.daylight.com/dayhtml/doc/theory/theory.smirks.html

Thank you. I like how they have a transform grammar on the page. Does
anyone have a database of SMIRKS transform files lying around
somewhere for the majority of known name reactions?

[*:1][C@:2]([*:3])([*:4])[*:5]>>[*:1][C@@:2]([*:3])([*:4])[*:5]
[*:1][C@:2]([*:3])([*:4])[*:5]>>[*:1][C@:2]([*:4])([*:3])[*:5]

Get it? Ha. Ha.
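
One reading of the two transforms above is that they say the same thing twice: flipping `@` to `@@`, or keeping `@` and swapping two neighbors, both invert the center. The Python sketch below checks that with a parity count. It is not a SMIRKS parser; it assumes exactly this shape of pattern (one mapped chiral carbon with four mapped wildcard neighbors), and the function names are made up for this example.

```python
import re
from typing import List


def inversions(order: List[int]) -> int:
    """Count out-of-order pairs; an odd count means an odd permutation."""
    return sum(
        1
        for i in range(len(order))
        for j in range(i + 1, len(order))
        if order[i] > order[j]
    )


def stereo_parity(side: str) -> int:
    """Combine the chirality tag with the neighbor-order parity (0 or 1)."""
    center = re.search(r"\[C(@@?):\d+\]", side)
    if center is None:
        raise ValueError("no mapped chiral carbon found")
    tag = 0 if center.group(1) == "@" else 1
    # Atom-map numbers of the wildcard neighbors, in written order.
    neighbors = [int(n) for n in re.findall(r"\[\*:(\d+)\]", side)]
    return (tag + inversions(neighbors)) % 2


def inverts_center(smirks: str) -> bool:
    """True if reactant and product sides describe opposite configurations."""
    reactant, product = smirks.split(">>")
    return stereo_parity(reactant) != stereo_parity(product)


if __name__ == "__main__":
    flip_tag = "[*:1][C@:2]([*:3])([*:4])[*:5]>>[*:1][C@@:2]([*:3])([*:4])[*:5]"
    swap_two = "[*:1][C@:2]([*:3])([*:4])[*:5]>>[*:1][C@:2]([*:4])([*:3])[*:5]"
    print(inverts_center(flip_tag), inverts_center(swap_two))
```

Doing both at once (flip the tag and swap two neighbors) cancels out, which is why a database of transforms would need some canonical form to avoid storing duplicates.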
