SMILES and SMARTS...


Egon Willighagen

Sep 10, 2010, 5:07:21 AM
to cheminf-...@googlegroups.com
Hi all,

another question... I noted that SMILES (isomeric, canonical,
CHEMINF_000018) is under CHEMINF_000123... now the idea behind the
latter is that it can be used in mathematical modeling (directly), and
in particular it must have these two criteria:

* fixed length
* each number in the descriptor must have the same meaning

Now, a SMILES does not adhere to these requirements...

A more appropriate place would be under CHEMINF_000035, where it
actually is available as CHEMINF_000020, but excluding the isomeric
and canonical subclasses.

Can I go ahead, and 'merge' CHEMINF_000018 with CHEMINF_000020 to be
placed as full tree under CHEMINF_000035?

A second thing is that the SMARTS line notation (CHEMINF_000021) is
now enlisted under CHEMINF_000035, which is factually incorrect... can
I add this entity:

"molecular query information format specification" ?

SMARTS would fall under that, as well as MQL.

(And it seems to me CHEMINF_000100 and CHEMINF_000058 are duplicates?)

Egon

--
Dr E.L. Willighagen
Post-doc @ Uppsala University (only until 2010-09-30)
Proteochemometrics / Bioclipse Group of Prof. Jarl Wikberg
Homepage: http://egonw.github.com/
LinkedIn: http://se.linkedin.com/in/egonw
Blog: http://chem-bla-ics.blogspot.com/
PubList: http://www.citeulike.org/user/egonw/tag/papers

Janna Hastings

Sep 10, 2010, 5:23:17 AM
to cheminf-...@googlegroups.com
Hi Egon,

I think you can go ahead with these changes, and maybe add comments or clarify the definitions of the terms to indicate in the ontology the reasons for the changes?

Cheers, Janna

Leonid Chepelev

Sep 10, 2010, 1:50:57 PM
to cheminf-...@googlegroups.com
Hi, nice to see you looking at it again after the long pause!


> another question... I noted that SMILES (isomeric, canonical,
> CHEMINF_000018) is under CHEMINF_000123... now the idea behind the
> latter is that it can be used in mathematical modeling (directly), and
> in particular it must have these two criteria:

> * fixed length
> * each number in the descriptor must have the same meaning

> Now, a SMILES does not adhere to these requirements...


Just as an aside, could you use the actual labels on these? In Protege, you can actually set the preferences to show labels, so that we don't need to look up what CHEMINF_XXXXXX is.

Now, I do not quite see why a chemical descriptor (CHEMINF_000123) has to have a fixed length. One algorithm may modestly report a descriptor to two decimal places, while another may produce a value with five decimal places, both belonging to the same descriptor class.

As for each number in the descriptor having the same meaning, I don't really quite understand what you are trying to say here.

> A more appropriate place would be under CHEMINF_000035, where it
> actually is available as CHEMINF_000020, but excluding the isomeric
> and canonical subclasses.
> Can I go ahead, and 'merge' CHEMINF_000018 with CHEMINF_000020 to be
> placed as full tree under CHEMINF_000035?

I really don't think so - one is a descriptor, and the other is a format specification. Michel, Janna, correct me if I'm wrong, but the two have always been conceived as separate. A descriptor adheres to a specification.

>
> A second thing is that the SMARTS line notation (CHEMINF_000021) is
> now enlisted under CHEMINF_000035, which is factually incorrect... can
> I add this entity:
>
> "molecular query information format specification" ?
>
> SMARTS would fall under that, as well as MQL.

OK, so how would the description of molecular query information format specification read? Because the current description is "An atomic connectivity molecular structure encoding format specification is a molecular structure encoding format specification which specifies a format in which the connectivity of atoms within a molecule may be specified." Could you please qualify your statements instead of asserting them?

> (And it seems to me CHEMINF_000100 and CHEMINF_000058 are duplicates?)

Again, NO, as far as I understand, these are entirely different, one is a format specification and the other is an actual file.

Michel Dumontier

Sep 10, 2010, 2:55:04 PM
to cheminf-...@googlegroups.com
On Fri, Sep 10, 2010 at 1:50 PM, Leonid Chepelev
<leonid_...@hotmail.com> wrote:
> Hi, nice to see you looking at it again after the long pause!
>
>> another question... I noted that SMILES (isomeric, canonical,
>> CHEMINF_000018) is under CHEMINF_000123... now the idea behind the
>> latter is that it can be used in mathematical modeling (directly), and
>> in particular it must have these two criteria:
>>
>> * fixed length
>> * each number in the descriptor must have the same meaning
>>
>> Now, a SMILES does not adhere to these requirements...
>>
>
> Just as an aside, could you use the actual labels on these?
+1

> In Protege, you
> can actually set the preferences to show labels, so that we don't need to
> look up what CHEMINF_XXXXXX is.
> Now, I do not quite see how a chemical descriptor, CHEMINF_000123 has to
> have a fixed length. One algorithm may modestly report a descriptor to two
> decimal places, while another may result in a value with five decimal
> places, all the while belonging to the same descriptor class.
> As for each number in the descriptor having the same meaning, I don't really
> quite understand what you are trying to say here.
>
>> A more appropriate place would be under CHEMINF_000035, where it
>> actually is available as CHEMINF_000020, but excluding the isomeric
>> and canonical subclasses.
>> Can I go ahead, and 'merge' CHEMINF_000018 with CHEMINF_000020 to be
>> placed as full tree under CHEMINF_000035?
>
> I really don't think so - one is a descriptor, and the other is format
> specification. Michel, Janna, correct me if I'm wrong, but the two have
> always been conceived as separate. A descriptor adheres to a specification.

exactly as Leo states it.

>>
>> A second thing is that the SMARTS line notation (CHEMINF_000021) is
>> now enlisted under CHEMINF_000035, which is factually incorrect... can
>> I add this entity:
>>
>> "molecular query information format specification" ?
>>
>> SMARTS would fall under that, as well as MQL.

SMARTS strings represent substructure patterns - they can be used as
queries using appropriate software.

>>
> OK, so how would the description of molecular query information format
> specification read? Because the current description is "An atomic
> connectivity molecular structure encoding format specification is a
> molecular structure encoding format specification which specifies a format
> in which the connectivity of atoms within a molecule may be specified."
> Could you please qualify your statements instead of asserting them?
>> (And it seems to me CHEMINF_000100 and CHEMINF_000058 are duplicates?)
>
> Again, NO, as far as I understand, these are entirely different, one is a
> format specification and the other is an actual file.
>>
>> Egon
>>
>> --
>> Dr E.L. Willighagen
>> Post-doc @ Uppsala University (only until 2010-09-30)
>> Proteochemometrics / Bioclipse Group of Prof. Jarl Wikberg
>> Homepage: http://egonw.github.com/
>> LinkedIn: http://se.linkedin.com/in/egonw
>> Blog: http://chem-bla-ics.blogspot.com/
>> PubList: http://www.citeulike.org/user/egonw/tag/papers
>

--
Michel Dumontier
Associate Professor of Bioinformatics
Carleton University
http://dumontierlab.com

Egon Willighagen

Sep 10, 2010, 2:58:52 PM
to cheminf-...@googlegroups.com
Hi Leonid,

On Fri, Sep 10, 2010 at 7:50 PM, Leonid Chepelev
<leonid_...@hotmail.com> wrote:
> Hi, nice to see you looking at it again after the long pause!

Yeah, with the ACS now behind me, and a few book chapters almost done, ....

>> Now, a SMILES does not adhere to these requirements...
>
> Just as an aside, could you use the actual labels on these? In Protege, you
> can actually set the preferences to show labels, so that we don't need to
> look up what CHEMINF_XXXXXX is.

I still need to look up where exactly I can find this setting... I
feel about Protege the way others feel about Git, I think: it has
somewhat of a learning curve...

> Now, I do not quite see how a chemical descriptor, CHEMINF_000123 has to
> have a fixed length. One algorithm may modestly report a descriptor to two
> decimal places, while another may result in a value with five decimal
> places, all the while belonging to the same descriptor class.

The 'fixed length' refers to the number of numbers, not the precision...

mol1: 1.5678, 2.3, 4, 6, 8
mol2: 1.4, 2.765774, 4, 7, 8

These are both descriptors with the same, fixed length, even though
one has 6 atoms, the other has 7 atoms (4th column)... the number of
digits is indeed not important.

> As for each number in the descriptor having the same meaning, I don't really
> quite understand what you are trying to say here.

Each column must mean the same thing. So, a descriptor must not return
two numbers for one molecule, and four numbers for another. For each
'column' you must be able to calculate a meaningful distance...
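
As a minimal sketch of these two criteria (the function name `euclidean_distance` is made up for illustration and is not part of CHEMINF or BODO), the two example vectors can be compared column by column only because they have the same length and each position means the same thing:

```python
import math

# The two example descriptor vectors from above; column 4 is the atom count.
mol1 = [1.5678, 2.3, 4, 6, 8]
mol2 = [1.4, 2.765774, 4, 7, 8]


def euclidean_distance(a, b):
    """Column-wise distance; only defined for vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("descriptor vectors must have the same fixed length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


distance = euclidean_distance(mol1, mol2)
```

A descriptor that returned two numbers for one molecule and four for another would hit the `ValueError` branch, which is the practical meaning of "fixed length" here.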

>> A more appropriate place would be under CHEMINF_000035, where it
>> actually is available as CHEMINF_000020, but excluding the isomeric
>> and canonical subclasses.
>> Can I go ahead, and 'merge' CHEMINF_000018 with CHEMINF_000020 to be
>> placed as full tree under CHEMINF_000035?
>
> I really don't think so - one is a descriptor, and the other is format
> specification. Michel, Janna, correct me if I'm wrong, but the two have
> always been conceived as separate.

Ah, OK. Sorry about that. Yes, 000035 is the specification...

OK, I will look at it again... I thought it was falling under
'molecular descriptor', or something equivalent to that, which is not
correct...

Let me reboot back into a working machine, with Protege, and I will
explain myself more clearly... apologies for the lack of enough
detail.

> A descriptor adheres to a specification.

A SMILES is not a molecular descriptor: it does not have a fixed
length, and position 7 in the string does not have a well-defined
meaning; a SMILES cannot be used in computation.

SMILES is a line notation, and as such is a format specification.

>> A second thing is that the SMARTS line notation (CHEMINF_000021) is
>> now enlisted under CHEMINF_000035, which is factually incorrect... can
>> I add this entity:
>>
>> "molecular query information format specification" ?
>>
>> SMARTS would fall under that, as well as MQL.
>>
> OK, so how would the description of molecular query information format
> specification read?

Something like:

"An encoding specifying a molecular search query."

> Because the current description is "An atomic
> connectivity molecular structure encoding format specification is a
> molecular structure encoding format specification which specifies a format
> in which the connectivity of atoms within a molecule may be specified."

Indeed. And a query does not describe one molecule or one molecular
substructure; a SMARTS is a fuzzy description of what a substructure
may look like, but it does not encode a molecular structure, it
encodes a query.

> Could you please qualify your statements instead of asserting them?

Please explain.

>> (And it seems to me CHEMINF_000100 and CHEMINF_000058 are duplicates?)
>
> Again, NO, as far as I understand, these are entirely different, one is a
> format specification and the other is an actual file.

OK. I had considered that, but it was not clear to me.

I'll get back on this asap...

Michel Dumontier

Sep 10, 2010, 3:08:18 PM
to cheminf-...@googlegroups.com

SMILES is a specification, but the SMILES descriptor is a value that
adheres to SMILES, and provides information about molecular structure.
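
One way to picture this distinction in code, as a rough sketch only (the class and field names are hypothetical and do not mirror the actual CHEMINF OWL axioms):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormatSpecification:
    """A specification such as the SMILES line notation (hypothetical model)."""
    name: str


@dataclass(frozen=True)
class Descriptor:
    """A value about a molecular entity that conforms to a specification."""
    value: str
    conforms_to: FormatSpecification


smiles_spec = FormatSpecification(name="SMILES")
ethanol = Descriptor(value="CCO", conforms_to=smiles_spec)
benzene = Descriptor(value="c1ccccc1", conforms_to=smiles_spec)
```

The specification exists once, while each descriptor is a separate value that points back to it.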

m.


Leonid Chepelev

Sep 10, 2010, 4:15:19 PM
to cheminf-...@googlegroups.com
Hi,

> I still need to look up where exactly I can find this setting... I
> feel about Protege the way others feel about Git, I think: it has
> somewhat of a learning curve...

Really? For a guy like you, should be no problem! :P

File>Preferences>Renderer(Tab)>Render entities using annotation values

> The 'fixed length' refers to the number of numbers, not the precision...
>
> mol1: 1.5678, 2.3, 4, 6, 8
> mol2: 1.4, 2.765774, 4, 7, 8
>
> These are both descriptors with the same, fixed length, even though
> one has 6 atoms, the other has 7 atoms (4th column)... the number of
> digits is indeed not important.

You are still not explaining how SMILES is not adhering to this. If you mean we have SMILES that are fragmented and fragment order can change, to me it's a complex that the SMILES string refers to, and as such does not constitute an array of values. Other than that, I'm just not seeing how this is contrary to the definition of a descriptor. Besides, what you are citing is a complex descriptor, which we have been modeling as a collection of simple descriptors, each of which would have a value and a MEANING.

> Each column must mean the same thing. So, a descriptor must not return
> two numbers for one molecule, and four numbers for another. For each
> 'column' you must be able to calculate a meaningful distance...

Again, I'm not seeing how that makes the existing classification invalid.

> Ah, OK. Sorry about that. Yes, 000035 is the specification...
>
> OK, I will look at it again... I thought it was falling under
> 'molecular descriptor', or something equivalent to that, which is not
> correct...
>
> Let me reboot back into a working machine, with Protege, and I will
> explain myself more clearly... apologies for the lack of enough
> detail.

Thanks ;D



> A SMILES is not a molecular descriptor: it does not have a fixed
> length, and position 7 in the string does not have a well-defined
> meaning; a SMILES cannot be used in computation.
>
> SMILES is a line notation, and as such is a format specification.

I would second what Michel said regarding the relationship of a descriptor and the specification. Plus, I would second what I said just above, if we are referring to the SMILES descriptor with its xsd:string value.

> Something like:
>
> "An encoding specifying a molecular search query."
>
> Indeed. And a query does not describe one molecule or one molecular
> substructure; a SMARTS is a fuzzy description of what a substructure
> may look like, but it does not encode a molecular structure, it
> encodes a query.

Hmm, that's an interesting way to look at it, and is probably a worthwhile addition. Can we reach a consensus, Michel, Janna? You could also look at SMARTS describing the structure of one generalized class of compounds.


> > Could you please qualify your statements instead of asserting them?
>
> Please explain.

I was referring to your statement "A second thing is that the SMARTS line notation (CHEMINF_000021) is now enlisted under CHEMINF_000035, which is factually incorrect..."

Asserting that something is "factually incorrect" is quite different from, and much less informative than, qualifying your statement by explicitly specifying your arguments for or against a case. In other words, it's what makes the difference between absolutism and democracy.


> OK. I had considered that, but it was not clear to me.

It could have been confusing because we only have one representative in that class. The idea was to grow it to all the file formats, say, that OpenBabel reads. Actually, come to think of it, this is extremely easy, and I could get to it soon (right after I finish this paper that I've said I would have by the end of this week).
 
> I'll get back on this asap...

May the force be with you. :D


Janna Hastings

Sep 10, 2010, 4:23:13 PM
to cheminf-...@googlegroups.com
On Fri, Sep 10, 2010 at 9:15 PM, Leonid Chepelev <leonid_...@hotmail.com> wrote:

>> Something like:
>>
>> "An encoding specifying a molecular search query."
>>
>> Indeed. And a query does not describe one molecule or one molecular
>> substructure; a SMARTS is a fuzzy description of what a substructure
>> may look like, but it does not encode a molecular structure, it
>> encodes a query.
>
> Hmm, that's an interesting way to look at it, and is probably a worthwhile addition. Can we reach a consensus, Michel, Janna? You could also look at SMARTS describing the structure of one generalized class of compounds.

InChI and SMILES can be used as queries too -- consider exact matches and similarity matches.

Something like MQL is definitely falling on the side of querying, but I suppose could be used to describe classes as well.

But despite this ambiguity I am not against including a term for query encodings and putting SMARTS and MQL beneath, on the grounds that these formats are *mostly* used in that way. If that is true, then I think it warrants the distinction.

Michel?

Cheers, Janna

Egon Willighagen

Sep 10, 2010, 4:45:25 PM
to cheminf-...@googlegroups.com
On Fri, Sep 10, 2010 at 10:15 PM, Leonid Chepelev
<leonid_...@hotmail.com> wrote:
> Hi,
>> I still need to look up where exactly I can find this setting... I
>> feel about Protege the way others feel about Git, I think: it has
>> somewhat of a learning curve...
>
> Really? For a guy like you, should be no problem! :P

Well... I'm sort of used to source code... not GUI clicking...
different worlds... really, I prefer looking at the OWL XML much more
than Protege...

> File>Preferences>Renderer(Tab)>Render entities using annotation values

Yeah, tried that again and again... but that labeling is not working
for me... it shows up for many items (rdfs:label), but fails for all
in CHEMINF... neither in 4.0 nor in 4.1 beta... no clue about that...

>> The 'fixed length' refers to the number of numbers, not the precision...
>>
>> mol1: 1.5678, 2.3, 4, 6, 8
>> mol2: 1.4, 2.765774, 4, 7, 8
>>
>> These are both descriptors with the same, fixed length, even though
>> one has 6 atoms, the other has 7 atoms (4th column)... the number of
>> digits is indeed not important.
>
> You are still not explaining how SMILES is not adhering to this.

c1ccccc1 (8 length descriptor)
CCO (3 length descriptor)
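
To make the point concrete, a small sketch in Python (only the built-in `len` is used; `OCC` is another valid SMILES for ethanol, added here for illustration):

```python
benzene = "c1ccccc1"
ethanol = "CCO"
ethanol_alt = "OCC"  # same molecule, atoms written in a different order

# Character counts differ from molecule to molecule.
lengths = {s: len(s) for s in (benzene, ethanol, ethanol_alt)}

# The first character differs between two strings for the same molecule,
# so a given string position carries no fixed meaning.
first_chars = (ethanol[0], ethanol_alt[0])
```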

> If you mean
> we have SMILES that are fragmented and fragment order can change, to me it's
> a complex that the SMILES string refers to, and as such does not constitute
> an array of values.

Indeed... OK, I tried to make sense of my earlier observations...

here's the deal... 000123 is a 'chemical descriptor' and has
alternativeID BODO:Descriptor... now the latter is fixed length...
Now, there is also the equivalentClass 000065 ('molecular entity
descriptor'), which is a 000123 about a 'molecular entity', and which
also has this fixed length hard-coded...

And since 'molecular entity' does not imply this, my assumption
was that it was inherited from 000123...

> Other than that, I'm just not seeing how this is
> contrary to the definition of a descriptor. Besides, what you are citing is
> a complex descriptor, which we have been modeling as a collection of simple
> descriptors, each of which would have a value and a MEANING.

How/Where can I see this in the OWL?

>> Each column must mean the same thing. So, a descriptor must not return
>> two numbers for one molecule, and four numbers for another. For each
>> 'column' you must be able to calculate a meaningful distance...
>
> Again, I'm not seeing how that makes the existing classification invalid.

Why not? The first characters of two or more SMILES do not refer to
the same thing and cannot be compared, and are therefore unsuitable for
mathematical treatment... like 000065 and BODO:Descriptor...

So, what I am struggling with is the lack of clear distinction
(hierarchy) between mathematically/statistically usable descriptors
(BODO:Descriptor, CHEMINF:000065) and those that are not
(CHEMINF:000123)...

>> A SMILES is not a molecular descriptor: it does not have a fixed
>> length, and position 7 in the string does not have a well-defined
>> meaning; a SMILES cannot be used in computation.
>>
>> SMILES is a line notation, and as such is a format specification.
>
> I would second what Michel said regarding the relationship of a descriptor
> and the specification. Plus, I would second what I said just above, if we
> are referring to the SMILES descriptor with its xsd:string value.

Yes, got that and fully agree.

>> Something like:
>>
>> "An encoding specifying a molecular search query."
>>
>> Indeed. And a query does not describe one molecule or one molecular
>> substructure; a SMARTS is a fuzzy description of what a substructure
>> may look like, but it does not encode a molecular structure, it
>> encodes a query.
>
> Hmm, that's an interesting way to look at it, and is probably a worthwhile
> addition. Can we reach a consensus, Michel, Janna? You could also look at
> SMARTS describing the structure of one generalized class of compounds.

see below.

>> > Could you please qualify your statements instead of asserting them?
>>
>> Please explain.
>
> I was referring to your statement "A second thing is that the SMARTS line
> notation (CHEMINF_000021) is now enlisted under CHEMINF_000035, which is
> factually incorrect..."

The thing here is that they are supposed to apply to a 'chemical
entity' while SMARTS refers to a set of chemical entities.

> Asserting that something is "factually incorrect" is quite different from, and
> much less informative than, qualifying your statement by explicitly
> specifying your arguments for or against a case. In other words, it's what
> makes the difference between absolutism and democracy.

OK :)

>> OK. I had considered that, but it was not clear to me.
>
> It could have been confusing because we only have one representative in that
> class. The idea was to grow it to all the file formats, say, that OpenBabel
> reads. Actually, come to think of it, this is extremely easy, and I could
> get to it soon (right after I finish this paper that I've said I would have
> by the end of this week).

The CDK defines quite a lot of formats too, to which you have
programmatic access...

>> I'll get back on this asap...
>
> May the force be with you. :D

Well, I'd rather have Protege cooperate a bit more :)

Egon Willighagen

Sep 10, 2010, 4:55:05 PM
to cheminf-...@googlegroups.com
On Fri, Sep 10, 2010 at 10:23 PM, Janna Hastings
<janna.h...@gmail.com> wrote:
> On Fri, Sep 10, 2010 at 9:15 PM, Leonid Chepelev <leonid_...@hotmail.com> wrote:
>> Hmm, that's an interesting way to look at it, and is probably a worthwhile
>> addition. Can we reach a consensus, Michel, Janna? You could also look at
>> SMARTS describing the structure of one generalized class of compounds.
>
> InChI and SMILES can be used as queries too -- consider exact matches and
> similarity matches.

Any SMILES is actually a valid SMARTS, but... any search with InChI
would be a similarity match by default....

But that was not even what I was referring to, though it is an interesting
point... an InChI and a SMILES refer to one chemical entity
(polyatomic entity, to be precise), but MQL and SMARTS do not refer
to a single polyatomic entity, but to a setOf...

> Something like MQL is definitely falling on the side of querying, but I
> suppose could be used to describe classes as well.
>
> But despite this ambiguity I am not against including a term for query
> encodings and putting SMARTS and MQL beneath, on the grounds that these
> formats are *mostly* used in that way.

In what way?

Janna Hastings

Sep 10, 2010, 4:58:02 PM
to cheminf-...@googlegroups.com
On Fri, Sep 10, 2010 at 9:55 PM, Egon Willighagen <egon.wil...@gmail.com> wrote:
>> InChI and SMILES can be used as queries too -- consider exact matches and
>> similarity matches.
>
> Any SMILES is actually a valid SMARTS, but... any search with InChI
> would be a similarity match by default....
>
> But that was not even what I was referring to, though it is an interesting
> point... an InChI and a SMILES refer to one chemical entity
> (polyatomic entity, to be precise), but MQL and SMARTS do not refer
> to a single polyatomic entity, but to a setOf...

Agreed, this is a valid distinction.

>> Something like MQL is definitely falling on the side of querying, but I
>> suppose could be used to describe classes as well.
>>
>> But despite this ambiguity I am not against including a term for query
>> encodings and putting SMARTS and MQL beneath, on the grounds that these
>> formats are *mostly* used in that way.
>
> In what way?


I meant, used as queries.

Janna

Egon Willighagen

Sep 10, 2010, 5:00:56 PM
to cheminf-...@googlegroups.com
On Fri, Sep 10, 2010 at 10:45 PM, Egon Willighagen
<egon.wil...@gmail.com> wrote:
> On Fri, Sep 10, 2010 at 10:15 PM, Leonid Chepelev
>> File>Preferences>Renderer(Tab)>Render entities using annotation values
>
> Yeah, tried that again and again... but that labeling is not working
> for me... it shows up for many items (rdfs:label), but fails for all
> in CHEMINF... neither in 4.0 nor in 4.1 beta... no clue about that...

Can't get it more informative than what's on the screenshot...

E.

protegeFail.png

Michel Dumontier

Sep 10, 2010, 5:22:45 PM
to cheminf-...@googlegroups.com
try the following (see image)

m.

On Fri, Sep 10, 2010 at 5:00 PM, Egon Willighagen

--

annotation_renderer.png

Leonid Chepelev

Sep 10, 2010, 5:29:16 PM
to cheminf-...@googlegroups.com
Hi,

 
> Yeah, tried that again and again... but that labeling is not working
> for me... it shows up for many items (rdfs:label), but fails for all
> in CHEMINF... neither in 4.0 nor in 4.1 beta... no clue about that...

Interesting. In my case it's working perfectly and shows all annotations. Try the different annotations, see if that works.

 
> c1ccccc1 (8 length descriptor)
> CCO (3 length descriptor)

What does a SMILES descriptor refer to? A molecular entity. Because this is a string value, it has nothing to do with the arrays of descriptors you've been showing earlier. The 8 length descriptor is no different from the 3 length descriptor in principle: it is ONE string. Although the individual SMILES tokens refer to the different atoms, we are operating on molecules. So, assuming that we are on the same page and that when you say 'values' you REALLY mean array entries, I am arguing that there is only ONE array entry, and therefore the descriptor is of length ONE.
 
I'll repeat: because we are referring to a molecular entity (that is, an actual molecule/ion), there is only a descriptor of length one - that one SMILES string.

 
> > If you mean
> > we have SMILES that are fragmented and fragment order can change, to me it's
> > a complex that the SMILES string refers to, and as such does not constitute
> > an array of values.
>
> Indeed... OK, I tried to make sense of my earlier observations...
>
> here's the deal... 000123 is a 'chemical descriptor' and has
> alternativeID BODO:Descriptor... now the latter is fixed length...
> Now, there is also the equivalentClass 000065 ('molecular entity
> descriptor') which is 000123 about a 'molecular entity', which also
> has this fixed-length hard-coded...
>
> And as the 'molecular entity' does not imply this, so my assumption
> was that this was inherited from 000123...
 
Let's see, taking into consideration what I said above, I don't see any problem here.


> > Other than that, I'm just not seeing how this is
> > contrary to the definition of a descriptor. Besides, what you are citing is
> > a complex descriptor, which we have been modeling as a collection of simple
> > descriptors, each of which would have a value and a MEANING.
>
> How/Where can I see this in the OWL?

That's what I recall from the discussions regarding CHEMINF design principles. If I'm wrong, please correct me Janna and Michel.
We personally have been using a functional hasValue attribute.


> >> Each column must mean the same thing. So, a descriptor must not return
> >> two numbers for one molecule, and four numbers for another. For each
> >> 'column' you must be able to calculate a meaningful distance...
> >
> > Again, I'm not seeing how that makes the existing classification invalid.
>
> Why not? Each first character of two or more SMILES does not refer to
> the same thing and cannot be compared, and therefore unsuitable for
> mathematical treatment... like 000065 and BODO:Descriptor...
> So, what I am struggling with is the lack of clear distinction
> (hierarchy) between mathematically/statistically usable descriptors
> (BODO:Descriptor, CHEMINF:000065) and those that are not
> (CHEMINF:000123)...
 
At the same time, each SMILES string can be compared to another SMILES string, as I said above, making both useable, and both fixed-length. However, I may be missing something: are you planning to refer to a character in a particular position in the SMILES string without tokenizing it first and constructing a chemical graph? Where would you directly, and without further processing of the molecular graph, use the exact characters of the SMILES string in a mathematically meaningful context other than simple text analysis? I just don't see how it's possible to perceive the one SMILES string as an array without invoking the concept of a molecular graph.
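
For illustration, tokenizing could look roughly like the sketch below. It is a deliberately simplified tokenizer (the regular expression and the name `tokenize_smiles` are assumptions, and it ignores much of the SMILES grammar, such as two-digit ring closures); it only shows that atoms emerge after processing the string rather than from fixed character positions:

```python
import re

# Simplified pattern: bracket atoms, two-letter halogens, then any other
# single character (organic-subset atoms, bonds, branches, ring digits).
_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens (toy version, no validation)."""
    return _TOKEN.findall(smiles)


tokens = tokenize_smiles("CC(Cl)[NH3+]")
```

Building the chemical graph from these tokens is a further step that this sketch does not attempt.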

 

> The thing here is that they are supposed to apply to a 'chemical
> entity' while SMARTS refers to a set of chemical entities.

I think we've been over this now and have reached an agreement.


> The CDK defines quite a lot of formats too, to which you have
> programmatic access...

My choice of OpenBabel is dictated by the ease of programmatic access of the formats with OpenBabel, and possibly the biggest collection of interconvertible format types, while CDK has only a handful of readers with, at times, clunky/fussy access.

 
> Well, I'd rather have Protege cooperate a bit more :)

That's what I meant, may the force be with you! :D

Egon Willighagen

Sep 10, 2010, 5:54:53 PM
to cheminf-...@googlegroups.com
On Fri, Sep 10, 2010 at 11:29 PM, Leonid Chepelev
<leonid_...@hotmail.com> wrote:
> What does a SMILES descriptor refer to? A molecular entity. Because this is
> a string value, this has nothing to do with the arrays of descriptors you've
> been showing earlier. The 8 length descriptor is no different from the 3
> length descriptor in principle - it is ONE string, and although the
> individual SMILES tokens refer to the different atoms, we are operating on
> molecules, as such I am arguing, assuming that we are on the same page here
> and that when you say 'values' you REALLY mean array entries, there is only
> ONE array entry, and therefore the descriptor is of length ONE.

Fair, but there is still no mathematical means of comparing two
values... which is required for mathematical treatment, as required by
CHEMINF:000065...

>> here's the deal... 000123 is a 'chemical descriptor' and has
>> alternativeID BODO:Descriptor... now the latter is fixed length...
>> Now, there is also the equivalentClass 000065 ('molecular entity
>> descriptor') which is 000123 about a 'molecular entity', which also
>> has this fixed-length hard-coded...
>>
>> And as the 'molecular entity' does not imply this, so my assumption
>> was that this was inherited from 000123...
>
> Let's see, taking into consideration what I said above, I don't see any
> problem here.

And I do see a problem here.

>> > Other than that, I'm just not seeing how this is
>> > contrary to the definition of a descriptor. Besides, what you are citing
>> > is
>> > a complex descriptor, which we have been modeling as a collection of
>> > simple
>> > descriptors, each of which would have a value and a MEANING.
>>
>> How/Where can I see this in the OWL?
>
> That's what I recall from the discussions regarding CHEMINF design
> principles. If I'm wrong, please correct me Janna and Michel.
> We personally have been using a functional hasValue attribute.

Sorry, you lost me...

>> Why not? Each first character of two or more SMILES does not refer to
>> the same thing and cannot be compared, and therefore unsuitable for
>> mathematical treatment... like 000065 and BODO:Descriptor...
>> So, what I am struggling with is the lack of clear distinction
>> (hierarchy) between mathematically/statistically usable descriptors
>> (BODO:Descriptor, CHEMINF:000065) and those that are not
>> (CHEMINF:000123)...
>
> At the same time, each SMILES string can be compared to another SMILES
> string, as I said above, making both useable, and both fixed-length.
> However, I may be missing something: are you planning to refer to a
> character in a particular position in the SMILES string without tokenizing
> it first and constructing a chemical graph?

What other mathematical means did you have in mind of comparing two
character strings?

But this is the whole point... BODO:Descriptor and CHEMINF:000065 talk
about mathematical treatment... for a SMILES string this is undefined.

> My choice of OpenBabel is dictated by the ease of programmatic access of the
> formats with OpenBabel, and possibly the biggest collection of
> interconvertible format types, while CDK has only a handful of readers with,
> at times, clunky/fussy access.

The CDK defines 99 file formats... how many formats does OB define?

Leonid Chepelev

Sep 10, 2010, 6:08:48 PM
to cheminf-...@googlegroups.com
> Fair, but there is still no mathematical means of comparing two
> values... which is required for mathematical treatment, as required by
> CHEMINF:000065...

Graph theory is now excluded from mathematics?

 
>
> >> here's the deal... 000123 is a 'chemical descriptor' and has
> >> alternativeID BODO:Descriptor... now the latter is fixed length...
> >> Now, there is also the equivalentClass 000065 ('molecular entity
> >> descriptor) which is 0000123 about a 'molecular entity', which also
> >> has this fixed-length hard-coded...
> >>
> >> And as the 'molecular entity' does not imply this, so my assumption
> >> was that this was inherited from 000123...
> >
> > Let's see, taking into consideration what I said above, I don't see any
> > problem here.
>
> And I do see a problem here.

One string, one value, an evenly sized descriptor. I don't know what I can add to make it more clear. Maybe Michel and Janna can help me here.
That was my question. And you haven't answered it. Unless you can answer it, my point stands: a SMILES string is one solid value instead of an array of values, and there is no contradiction to BODO:Descriptor.
 
Is there something else that you are not telling us? Like some parser somewhere that's hard-coded to not accept strings or something that you don't want to change over?

>
> But this is the whole point... BODO:Descriptor and CHEMINF:000065 talk
> about mathematical treatment... for a SMILES string this is undefined.
 
Graph theory.

 
>
> > My choice of OpenBabel is dictated by the ease of programmatic access of the
> > formats with OpenBabel, and possibly the biggest collection of
> > interconvertible format types, while CDK has only a handful of readers with,
> > at times, clunky/fussy access.
>
> The CDK defines 99 file formats... how many formats does OB define?

OpenBabel READS AND WRITES over 90 file formats with plugins available. CDK has only about 30 non-overlapping readers.
 
In any case, that's not important. What is important is that wherever the formats are taken from, it's easy to add them.

Michel_Dumontier

Sep 10, 2010, 6:18:35 PM
to cheminf-...@googlegroups.com

> Fair, but there is still no mathematical means of comparing two
> values... which is required for mathematical treatment, as required by
> CHEMINF:000065...

I've modified the definition of 'chemical descriptor' and 'molecular entity descriptor'

'A chemical descriptor is a data item (quantity or value) about a chemical entity that conforms to a specification for how it is calculated, measured or recorded.'

'a molecular entity descriptor is a chemical descriptor that provides information about a molecular entity.'

> >> here's the deal... 000123 is a 'chemical descriptor' and has
> >> alternativeID BODO:Descriptor... now the latter is fixed length...
> >> Now, there is also the equivalentClass 000065 ('molecular entity
> >> descriptor) which is 0000123 about a 'molecular entity', which also
> >> has this fixed-length hard-coded...
> >>
> >> And as the 'molecular entity' does not imply this, so my assumption
> >> was that this was inherited from 000123...
> >
> > Let's see, taking into consideration what I said above, I don't see
> any
> > problem here.
>
> And I do see a problem here.

There is no 'fixed-length' requirement for CHEMINF descriptors. If this is the case for BODO:Descriptor, then they are not the same. I can add a "fixed-length" descriptor, if required.

> >> > Other than that, I'm just not seeing how this is
> >> > contrary to the definition of a descriptor. Besides, what you are
> citing
> >> > is
> >> > a complex descriptor, which we have been modeling as a collection
> of
> >> > simple
> >> > descriptors, each of which would have a value and a MEANING.
> >>
> >> How/Where can I see this in the OWL?
> >
> > That's what I recall from the discussions regarding CHEMINF design
> > principles. If I'm wrong, please correct me Janna and Michel.
> > We personally have been using a functional hasValue attribute.
>
> Sorry, you lost me...

The idea is that complex descriptors are chemical descriptors that are composed of two or more chemical descriptors.

I added the axiom that a chemical descriptor:
('has direct part' min 2 'chemical descriptor' or 'has value' some Literal)
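
As a rough illustration only, the axiom can be mirrored in plain Python. This is an analogy, not OWL: the class and field names are hypothetical, and Python's closed-world check does not reproduce OWL's open-world semantics. A descriptor passes if it either carries a literal value or is composed of at least two descriptors.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union

# Hypothetical stand-in for an OWL Literal.
LiteralValue = Union[str, int, float]


@dataclass
class ChemicalDescriptor:
    label: str
    value: Optional[LiteralValue] = None          # 'has value' some Literal
    parts: List["ChemicalDescriptor"] = field(default_factory=list)  # 'has direct part'

    def satisfies_axiom(self) -> bool:
        # ('has direct part' min 2 'chemical descriptor') or ('has value' some Literal)
        return len(self.parts) >= 2 or self.value is not None


# A SMILES string is a simple descriptor: one literal value.
smiles = ChemicalDescriptor("SMILES", value="CCO")
# Made-up numeric value, for illustration only.
mass = ChemicalDescriptor("mass", value=46.07)
# A complex descriptor is composed of two or more descriptors.
combined = ChemicalDescriptor("combined", parts=[smiles, mass])
# Neither a value nor enough parts.
empty = ChemicalDescriptor("empty")
```

Here `smiles`, `mass` and `combined` satisfy the check, while `empty` does not.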


> >> Why not? Each first character of two or more SMILES does not refer
> to
> >> the same thing and cannot be compared, and therefore unsuitable for
> >> mathematical treatment... like 000065 and BODO:Descriptor...
> >> So, what I am struggling with is the lack of clear distinction
> >> (hierarchy) between mathematically/statistically usable descriptors
> >> (BODO:Descriptor, CHEMINF:000065) and those that are not
> >> (CHEMINF:000123)...
> >
> > At the same time, each SMILES string can be compared to another
> SMILES
> > string, as I said above, making both useable, and both fixed-length.
> > However, I may be missing something: are you planning to refer to a
> > character in a particular position in the SMILES string without
> tokenizing
> > it first and constructing a chemical graph?
>
> What other mathematical means did you have in mind of comparing two
> character strings?
>
> But this is the whole point... BODO:Descriptor and CHEMINF:000065 talk
> about mathematical treatment... for a SMILES string this is undefined.

Right - so I broadened this definition to contain SMILES strings.


m.

Egon Willighagen

Sep 11, 2010, 2:32:31 AM
to cheminf-...@googlegroups.com
On Sat, Sep 11, 2010 at 12:08 AM, Leonid Chepelev
<leonid_...@hotmail.com> wrote:
>> Fair, but there is still no mathematical means of comparing two
>> values... which is required for mathematical treatment, as required by
>> CHEMINF:000065...
>
> Graph theory is now excluded from mathematics?

True, but the kind of mathematical treatment BODO:Descriptor is about
is not graph theory; it is numerical, statistical
treatment.

>> >> here's the deal... 000123 is a 'chemical descriptor' and has
>> >> alternativeID BODO:Descriptor... now the latter is fixed length...
>> >> Now, there is also the equivalentClass 000065 ('molecular entity
>> >> descriptor) which is 0000123 about a 'molecular entity', which also
>> >> has this fixed-length hard-coded...
>> >>
>> >> And as the 'molecular entity' does not imply this, so my assumption
>> >> was that this was inherited from 000123...
>> >
>> > Let's see, taking into consideration what I said above, I don't see any
>> > problem here.
>>
>> And I do see a problem here.
>
> One string, one value, an evenly sized descriptor. I don't know what I can
> add to make it more clear. Maybe Michel and Janna can help me here.

CCO - CCN = ??

Statistical modeling is what 'molecular descriptors' in cheminformatics
are about... I cannot make it more clear than that either.

There is no answer to that... there is no mathematical operator that
tells you the outcome of CCO - CCN (the distance between two
molecules).

That is the whole point I am trying to make...
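
A minimal Python sketch of the contrast, assuming made-up descriptor values (the numbers are purely illustrative and not computed by any toolkit): subtraction and a Euclidean distance are defined for numeric vectors, while subtracting two SMILES strings is not defined at all.

```python
import math

# Hypothetical, made-up numeric descriptor vectors for two molecules.
ethanol = [9.0, 3.0, -0.31]
ethylamine = [10.0, 3.0, -0.13]


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Defined: a plain number comes out.
distance = euclidean(ethanol, ethylamine)

# Not defined: Python has no '-' operator for strings.
smiles_a, smiles_b = "CCO", "CCN"
try:
    smiles_a - smiles_b
    smiles_subtraction_defined = True
except TypeError:
    smiles_subtraction_defined = False
```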

> my point stands: a SMILES string is one solid value instead of an array of
> values, and there is no contradiction to BODO:Descriptor.

Yes, it is, but that is so because the description of BODO:Descriptor
has never been clear enough, apparently.

> Is there something else that you are not telling us? Like some parser
> somewhere that's hard-coded to not accept strings or something that you
> don't want to change over?

No, just the above. Just statistical modeling.

>> But this is the whole point... BODO:Descriptor and CHEMINF:000065 talk
>> about mathematical treatment... for a SMILES string this is undefined.
>
> Graph theory.

Which is not numerical, and which does not give you meaningful distances...

See my reply to Michel, about how we could proceed...

Egon Willighagen

Sep 11, 2010, 2:37:48 AM
to cheminf-...@googlegroups.com
On Sat, Sep 11, 2010 at 12:18 AM, Michel_Dumontier
<Michel_D...@carleton.ca> wrote:
>> Fair, but there is still no mathematical means of comparing two
>> values... which is required for mathematical treatment, as required by
>> CHEMINF:000065...
>
> I've modified the definition of 'chemical descriptor' and 'molecular entity descriptor'
>
> 'A chemical descriptor is a data item (quantity or value) about a chemical entity that conforms to a specification for how it is calculated, measured or recorded.'
>
> 'a molecular entity descriptor is a chemical descriptor that provides information about a molecular entity.'

Thanx!

<snip>

> There is no 'fixed-length' requirement for CHEMINF descriptors. If this is the case for BODO:Descriptor, then they are not the same. I can add a "fixed-length" descriptor, if required.

Indeed. So, what I would like to see is a separate class of descriptors
that can be used directly in statistical modeling (numerical, or ordinal,
values, which are the equivalent of BODO:Descriptor), or perhaps a
role for that... And BODO:Descriptor must be removed from
CHEMINF:000123 as oboInOwl:hasAlternativeId, as a SMILES is not a
BODO:Descriptor.

From this discussion it has become clear to me that fixed length is
not the minimal requirement. As I hope my answer to Leonid just
minutes ago shows, the requirement is that a distance measure is
defined for each value too (e.g. Euclidean for numerical values,
Tanimoto for sets of 0/1 values...). Otherwise, statistical modeling is
not possible, which is the underlying idea behind BODO:Descriptor...

So, different indeed from CHEMINF:000123...
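
For the 0/1 case, a minimal Python sketch of the Tanimoto coefficient on two equal-length bit lists. The bit patterns are made up, and returning 1.0 when no bit is set in either list is a convention assumed here, not something taken from BODO or CHEMINF.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two equal-length 0/1 sequences."""
    if len(a) != len(b):
        raise ValueError("bit lists must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)    # bits set in both
    either = sum(1 for x, y in zip(a, b) if x or y)   # bits set in at least one
    # Assumed convention: two all-zero lists count as identical.
    return both / either if either else 1.0


# Hypothetical fingerprints, for illustration only.
fp_a = [1, 1, 0, 1, 0, 0]
fp_b = [1, 0, 0, 1, 1, 0]

similarity = tanimoto(fp_a, fp_b)   # 2 shared bits / 4 bits set in either
distance = 1.0 - similarity
```

With these bit lists, two bits are shared and four are set in at least one list, so the similarity is 0.5.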

>> >> > Other than that, I'm just not seeing how this is
>> >> > contrary to the definition of a descriptor. Besides, what you are
>> citing
>> >> > is
>> >> > a complex descriptor, which we have been modeling as a collection
>> of
>> >> > simple
>> >> > descriptors, each of which would have a value and a MEANING.
>> >>
>> >> How/Where can I see this in the OWL?
>> >
>> > That's what I recall from the discussions regarding CHEMINF design
>> > principles. If I'm wrong, please correct me Janna and Michel.
>> > We personally have been using a functional hasValue attribute.
>>
>> Sorry, you lost me...
>
> The idea is that complex descriptors are chemical descriptors that are composed of two or more chemical descriptors.

And one descriptor consists of one 'field', one 'column' ? OK.

So, that leaves the question how and if the BODO:Descriptor equivalent
can be put in somewhere...

Egon Willighagen

Sep 11, 2010, 3:17:25 AM
to cheminf-...@googlegroups.com
Hi Leonid,

I'm taking a few steps back...

On Fri, Sep 10, 2010 at 11:07 AM, Egon Willighagen
<egon.wil...@gmail.com> wrote:
> another question... I noted that SMILES (isomeric, canonical,
> CHEMINF_000018) is under CHEMINF_000123... now the idea behind the
> latter is that it can be used in mathematical modeling (directly), and
> in particular it must have these two criteria:
>
> * fixed length
> * each number in the descriptor must have the same meaning

This was a very lousy definition, I now understand... perhaps an
operational definition may be more appropriate here:

* must be usable as direct input to multilinear regression and/or
decision trees.
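
A rough Python sketch of that operational test, under the assumption that "usable as direct input" at least means a table whose rows all have the same length and contain only numbers. This is only a necessary condition, not a full test, and the function name and example values are hypothetical.

```python
from numbers import Real


def usable_as_direct_input(rows):
    """Rough necessary condition: non-empty table, equal-length rows, numeric values only."""
    if not rows:
        return False
    width = len(rows[0])
    return all(
        len(row) == width and all(isinstance(v, Real) for v in row)
        for row in rows
    )


# Made-up numeric descriptor table: one row per molecule.
numeric_table = [[9.0, 3, -0.31], [10.0, 3, -0.13]]
# One SMILES string per molecule: fixed length (one column), but not numeric.
smiles_table = [["CCO"], ["CCN"]]
```

The SMILES table has a fixed length of one column, yet still fails the check, because its values are not numbers.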

(I'll have to update the chapter for the upcoming book for the
www.pharmbio.org course too regarding this... and (without sarcasm)
thank you very much for pointing out the lousiness of the first
criteria I gave...)

> Now, a SMILES does not adhere to these requirements...
>
> A more appropriate place would be under CHEMINF_000035, where it
> actually is available as CHEMINF_000020, but excluding the isomeric
> and canonical subclasses.

Here I was plain wrong, that should be clear now. That other group is
about formats, not about instances of those formats.

Greetings,

Leonid Chepelev

Sep 11, 2010, 12:24:42 PM
to cheminf-...@googlegroups.com
Hi Egon,

Excellent. Well, we mustn't forget that ontologies are explicit specifications of conceptualizations, and as such are often subjective reflections of one's reality. Given that you are mostly concerned with statistical and, for lack of a better word, arithmetic manipulation of descriptor values, the descriptors in your reality have a somewhat different specification than the more general set of descriptors that we included in CHEMINF.

Now, this doesn't mean that our paths necessarily have to diverge, and I am certain that your definition of a descriptor will find its place in the CHEMINF ontology, we just have to think of its place as a subclass of the general class of descriptors.

Another thing that you seem to have sidestepped is that it is not possible to convert the SMILES string that is an attribute of a molecule into an array of values without tokenization, invoking graph theory, or explicit reconstruction of a chemical graph. A SMILES string is therefore not _natively_ an array of values (or, as you would say, to some confusion of mine, numbers) but rather a single value. Thus, though it does not satisfy the arithmetic manipulation requirement, it does satisfy your particular length requirement. We could agree or disagree, or even agree to disagree on this matter, but this doesn't really change much thanks to the common ontology to which you are now making another small contribution.
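
To illustrate the tokenization step, here is a deliberately simplified Python sketch. The regular expression is an assumption covering only common SMILES tokens; it is not a complete SMILES parser and does no chemical validation. The string is a single value until it is tokenized, and the number of tokens differs from molecule to molecule.

```python
import re

# Simplified token pattern (assumption): bracket atoms, Br/Cl, other organic-subset
# atoms, aromatic atoms, two-digit and one-digit ring closures, bonds and branches.
TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|%\d{2}|\d|[=#\-+/\\().:~@*$]"
)


def tokenize(smiles):
    """Split a SMILES string into tokens; reject anything the pattern cannot cover."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError("unrecognised characters in SMILES: %r" % smiles)
    return tokens


ethanol_tokens = tokenize("CCO")
benzene_tokens = tokenize("c1ccccc1")
acetyl_chloride_tokens = tokenize("CC(=O)Cl")
```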

What's really important here is that thanks to a common ontology, we are able to formally and explicitly state what we mean, reproducibly perform computational analysis of our chemical data, and set up a language and dictionary of chemistry that could eventually be clearly and unambiguously spoken by the various software implementations. 

What's also important is that we don't necessarily have to agree for all the concepts to be present in a common ontology, there is a proper place for every concept! This unity in discourse, synergy in diversity afforded by ontologies, to me, is beautiful.

Cheers,

Leonid L. Chepelev

Janna Hastings

Sep 12, 2010, 6:02:37 AM
to cheminf-...@googlegroups.com
Egon,

I think it will be well worth including a term to capture the sense of descriptor that you are getting at. What do you think of 'quantitative descriptor' as a possible name?

Cheers, Janna

Egon Willighagen

Sep 12, 2010, 6:09:19 AM
to cheminf-...@googlegroups.com
On Sun, Sep 12, 2010 at 12:02 PM, Janna Hastings
<janna.h...@gmail.com> wrote:
> I think it will be well worth including a term to capture the sense of
> descriptor that you are getting at. What do you think of 'quantitative
> descriptor' as a possible name?

Well, I think I slightly disagree with Leonid on this... I believe the
cheminformatics community actually refers to these things as
'molecular descriptors'... I mean, they have a whole 'Handbook of
Molecular Descriptors' just about these things... or, as Wikipedia
writes: "The molecular descriptor is the final result of a logic and
mathematical procedure which transforms chemical information encoded
within a symbolic representation of a molecule into a useful number or
the result of some standardized experiment."

Anyway... I am not going to make a point out of this, as that was
never my intention...

I'd settle for 'numerical molecular descriptor'. I would prefer
a name that reflects the ability to calculate distances with a
distance measure directly on the values, but I
cannot think of one right now...
