another question... I noted that SMILES (isomeric, canonical,
CHEMINF_000018) is under CHEMINF_000123... now the idea behind the
latter is that it can be used in mathematical modeling (directly), and
in particular it must meet these two criteria:
* fixed length
* each number in the descriptor must have the same meaning
Now, a SMILES does not adhere to these requirements...
A more appropriate place would be under CHEMINF_000035, where it
actually is available as CHEMINF_000020, but excluding the isomeric
and canonical subclasses.
Can I go ahead, and 'merge' CHEMINF_000018 with CHEMINF_000020 to be
placed as full tree under CHEMINF_000035?
A second thing is that the SMARTS line notation (CHEMINF_000021) is
now listed under CHEMINF_000035, which is factually incorrect... can
I add this entity:
"molecular query information format specification" ?
SMARTS would fall under that, as well as MQL.
(And it seems to me CHEMINF_000100 and CHEMINF_000058 are duplicates?)
Egon
--
Dr E.L. Willighagen
Post-doc @ Uppsala University (only until 2010-09-30)
Proteochemometrics / Bioclipse Group of Prof. Jarl Wikberg
Homepage: http://egonw.github.com/
LinkedIn: http://se.linkedin.com/in/egonw
Blog: http://chem-bla-ics.blogspot.com/
PubList: http://www.citeulike.org/user/egonw/tag/papers
exactly as Leo states it.
>>
>> A second thing is that the SMARTS line notation (CHEMINF_000021) is
>> now enlisted under CHEMINF_000035, which is factually incorrect... can
>> I add this entity:
>>
>> "molecular query information format specification" ?
>>
>> SMARTS would fall under that, as well as MQL.
SMARTS strings represent substructure patterns - they can be used as
queries using appropriate software.
>>
> OK, so what would the description of molecular query information format
> specification read? Because the current description is "An atomic
> connectivity molecular structure encoding format specification is a
> molecular structure encoding format specification which specifies a format
> in which the connectivity of atoms within a molecule may be specified."
> Could you please qualify your statements instead of asserting them?
>> (And it seems to me CHEMINF_000100 and CHEMINF_000058 are duplicates?)
>
> Again, NO, as far as I understand, these are entirely different, one is a
> format specification and the other is an actual file.
>>
>
--
Michel Dumontier
Associate Professor of Bioinformatics
Carleton University
http://dumontierlab.com
On Fri, Sep 10, 2010 at 7:50 PM, Leonid Chepelev
<leonid_...@hotmail.com> wrote:
> Hi, nice to see you looking at it again after the long pause!
Yeah, with the ACS now behind me, and a few book chapters almost done, ....
>> Now, a SMILES does not adhere to these requirements...
>
> Just as an aside, could you use the actual labels on these? In Protege, you
> can actually set the preferences to show labels, so that we don't need to
> look up what CHEMINF_XXXXXX is.
I still need to look up where exactly I can find this setting... I
feel about Protege the way others feel about Git, I think...: it has
somewhat of a learning curve...
> Now, I do not quite see how a chemical descriptor, CHEMINF_000123 has to
> have a fixed length. One algorithm may modestly report a descriptor to two
> decimal places, while another may result in a value with five decimal
> places, all the while belonging to the same descriptor class.
The 'fixed length' refers to the number of numbers, not the precision...
mol1: 1.5678, 2.3, 4, 6, 8
mol2: 1.4, 2.765774, 4, 7, 8
These are both descriptors with the same, fixed length, even though
one has 6 atoms, the other has 7 atoms (4th column)... the number of
digits is indeed not important.
> As for each number in the descriptor having the same meaning, I don't really
> quite understand what you are trying to say here.
Each column must mean the same thing. So, a descriptor must not return
two numbers for one molecule, and four numbers for another. For each
'column' you must be able to calculate a meaningful distance...
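A minimal Python sketch of that point, using the two example rows given above. The function name is hypothetical, and Euclidean distance is only one possible choice of distance measure:

```python
import math

# The two example descriptor rows from above: same number of columns,
# and each column is assumed to mean the same thing for both molecules.
mol1 = [1.5678, 2.3, 4, 6, 8]
mol2 = [1.4, 2.765774, 4, 7, 8]


def euclidean_distance(a, b):
    """Column-wise Euclidean distance; only defined for equal-length rows."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same fixed length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


distance = euclidean_distance(mol1, mol2)
print(distance)
```

The length check is the 'fixed length' requirement; the column-wise subtraction is what requires each column to mean the same thing.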
>> A more appropriate place would be under CHEMINF_000035, where it
>> actually is available as CHEMINF_000020, but excluding the isomeric
>> and canonical subclasses.
>> Can I go ahead, and 'merge' CHEMINF_000018 with CHEMINF_000020 to be
>> placed as full tree under CHEMINF_000035?
>
> I really don't think so - one is a descriptor, and the other is format
> specification. Michel, Janna, correct me if I'm wrong, but the two have
> always been conceived as separate.
Ah, OK. Sorry about that. Yes, 000035 is the specification...
OK, I will look at it again... I thought it was falling under
'molecular descriptor', or something equivalent to that, which is not
correct...
Let me reboot back into a working machine, with Protege, and I will
explain myself more clearly... apologies for the lack of enough
detail.
> A descriptor adheres to a specification.
A SMILES is not a molecular descriptor: it does not have a fixed
length, and position 7 in the string does not have a well-defined
meaning; a SMILES cannot be used in computation.
SMILES is a line notation, and as such is a format specification.
>> A second thing is that the SMARTS line notation (CHEMINF_000021) is
>> now enlisted under CHEMINF_000035, which is factually incorrect... can
>> I add this entity:
>>
>> "molecular query information format specification" ?
>>
>> SMARTS would fall under that, as well as MQL.
>>
> OK, so what would the description of molecular query information format
> specification read?
Something like:
"An encoding specifying a molecular search query."
> Because the current description is "An atomic
> connectivity molecular structure encoding format specification is a
> molecular structure encoding format specification which specifies a format
> in which the connectivity of atoms within a molecule may be specified."
Indeed. And a query does not encode one molecule or one molecular
substructure; a SMARTS is a fuzzy description of what a substructure
may look like, but it does not encode a molecular structure, it
encodes a query.
> Could you please qualify your statements instead of asserting them?
Please explain.
>> (And it seems to me CHEMINF_000100 and CHEMINF_000058 are duplicates?)
>
> Again, NO, as far as I understand, these are entirely different, one is a
> format specification and the other is an actual file.
OK. I had considered that, but it was not clear to me.
I'll get back on this asap...
SMILES is a specification, but the SMILES descriptor is a value that
adheres to SMILES, and provides information about molecular structure.
m.
--
> Something like:
>
> "An encoding specifying a molecular search query."
>
> Indeed. And a query does not encode one molecule or one molecular
> substructure; a SMARTS is a fuzzy description of what a substructure
> may look like, but it does not encode a molecular structure, it
> encodes a query.
Hmm, that's an interesting way to look at it, and is probably a worthwhile
addition. Can we reach a consensus, Michel, Janna? You could also look at
SMARTS describing the structure of one generalized class of compounds.
Well... I'm sort of used to source code... not GUI clicking...
different worlds... really, I prefer looking at the OWL XML much more
than Protege...
> File>Preferences>Renderer(Tab)>Render entities using annotation values
Yeah, tried that again and again... but that labeling is not working
for me... it shows up for many items (rdfs:label), but fails for all
in CHEMINF... neither in 4.0 nor in 4.1 beta... no clue about that...
>> The 'fixed length' refers to the number of numbers, not the precision...
>>
>> mol1: 1.5678, 2.3, 4, 6, 8
>> mol2: 1.4, 2.765774, 4, 7, 8
>>
>> These are both descriptors with the same, fixed length, even though
>> one has 6 atoms, the other has 7 atoms (4th column)... the number of
>> digits is indeed not important.
>
> You are still not explaining how SMILES is not adhering to this.
c1ccccc1 (8 length descriptor)
CCO (3 length descriptor)
> If you mean
> we have SMILES that are fragmented and fragment order can change, to me it's
> a complex that the SMILES string refers to, and as such does not constitute
> an array of values.
Indeed... OK, I tried to make sense of my earlier observations...
here's the deal... 000123 is a 'chemical descriptor' and has
alternativeID BODO:Descriptor... now the latter is fixed length...
Now, there is also the equivalentClass 000065 ('molecular entity
descriptor'), which is 000123 about a 'molecular entity', and which also
has this fixed length hard-coded...
And as the 'molecular entity' does not imply this, my assumption
was that this was inherited from 000123...
> Other than that, I'm just not seeing how this is
> contrary to the definition of a descriptor. Besides, what you are citing is
> a complex descriptor, which we have been modeling as a collection of simple
> descriptors, each of which would have a value and a MEANING.
How/Where can I see this in the OWL?
>> Each column must mean the same thing. So, a descriptor must not return
>> two numbers for one molecule, and four numbers for another. For each
>> 'column' you must be able to calculate a meaningful distance...
>
> Again, I'm not seeing how that makes the existing classification invalid.
Why not? The first characters of two or more SMILES do not refer to
the same thing and cannot be compared, so a SMILES is unsuitable for
mathematical treatment... unlike 000065 and BODO:Descriptor...
So, what I am struggling with is the lack of clear distinction
(hierarchy) between mathematically/statistically usable descriptors
(BODO:Descriptor, CHEMINF:000065) and those that are not
(CHEMINF:000123)...
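A small Python illustration of the position problem, using only plain strings (no cheminformatics toolkit). "CCO" and "OCC" are both valid SMILES for ethanol; the variable names are made up for this sketch:

```python
# Two valid SMILES strings for ethanol, written from opposite ends,
# plus the benzene example used elsewhere in this thread.
ethanol_a = "CCO"
ethanol_b = "OCC"
benzene = "c1ccccc1"

# String length varies from molecule to molecule.
lengths = {s: len(s) for s in (ethanol_a, ethanol_b, benzene)}

# Position 0 is a carbon in one string and an oxygen in the other,
# although both strings describe the same molecule.
first_chars_match = ethanol_a[0] == ethanol_b[0]

print(lengths)
print(first_chars_match)
```

So neither the length nor the meaning of a given position is stable across SMILES strings, which is the property a column in a numerical descriptor table is expected to have.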
>> A SMILES is not a molecular descriptor: it does not have a fixed
>> length, and position 7 in the string does not have a well-defined
>> meaning; a SMILES cannot be used in computation.
>>
>> SMILES is a line notation, and as such is a format specification.
>
> I would second what Michel said regarding the relationship of a descriptor
> and the specification. Plus, I would second what I said just above, if we
> are referring to the SMILES descriptor with its xsd:string value.
Yes, got that and fully agree.
>> Something like:
>>
>> "An encoding specifying a molecular search query."
>>
>> Indeed. And a query does not encode one molecule or one molecular
>> substructure; a SMARTS is a fuzzy description of what a substructure
>> may look like, but it does not encode a molecular structure, it
>> encodes a query.
>
> Hmm, that's an interesting way to look at it, and is probably a worthwhile
> addition. Can we reach a consensus, Michel, Janna? You could also look at
> SMARTS describing the structure of one generalized class of compounds.
see below.
>> > Could you please qualify your statements instead of asserting them?
>>
>> Please explain.
>
> I was referring to your statement "A second thing is that the SMARTS line
> notation (CHEMINF_000021) is now enlisted under CHEMINF_000035, which is
> factually incorrect..."
The thing here is that they are supposed to apply to a 'chemical
entity' while SMARTS refers to a set of chemical entities.
> Asserting that something is "factually incorrect" is quite different and
> much less informative than qualifying your statement by explicitly
> specifying your arguments for or against a case. In other words, it's what
> makes the difference between absolutism and democracy.
OK :)
>> OK. I had considered that, but it was not clear to me.
>
> It could have been confusing because we only have one representative in that
> class. The idea was to grow it to all the file formats, say, that OpenBabel
> reads. Actually, come to think of it, this is extremely easy, and I could
> get to it soon (right after I finish this paper that I've said I would have
> by the end of this week).
The CDK defines quite a lot of formats too, to which you have
programmatic access...
>> I'll get back on this asap...
>
> May the force be with you. :D
Well, I'd rather have Protege cooperate a bit more :)
> InChI and SMILES can be used as queries too -- consider exact matches and
> similarity matches.
Any SMILES is actually a valid SMARTS, but... any search with InChI
would be a similarity match by default....
But that was not even what I was referring to, though an interesting
point... an InChI and a SMILES refer to one chemical entity
(polyatomic entity, to be precise), but MQL and SMARTS do not refer
to a single polyatomic entity, but a setOf...
> Something like MQL is definitely falling on the side of querying, but I
> suppose could be used to describe classes as well.
>
> But despite this ambiguity I am not against including a term for query
> encodings and putting SMARTS and MQL beneath, on the grounds that these
> formats are *mostly* used in that way.
In what way?
Can't get it more informative than what's on the screenshot...
E.
Fair, but there is still no mathematical means of comparing two
values... which is required for mathematical treatment, as required by
CHEMINF:000065...
>> here's the deal... 000123 is a 'chemical descriptor' and has
>> alternativeID BODO:Descriptor... now the latter is fixed length...
>> Now, there is also the equivalentClass 000065 ('molecular entity
>> descriptor') which is 000123 about a 'molecular entity', which also
>> has this fixed-length hard-coded...
>>
>> And as the 'molecular entity' does not imply this, so my assumption
>> was that this was inherited from 000123...
>
> Let's see, taking into consideration what I said above, I don't see any
> problem here.
And I do see a problem here.
>> > Other than that, I'm just not seeing how this is
>> > contrary to the definition of a descriptor. Besides, what you are citing
>> > is
>> > a complex descriptor, which we have been modeling as a collection of
>> > simple
>> > descriptors, each of which would have a value and a MEANING.
>>
>> How/Where can I see this in the OWL?
>
> That's what I recall from the discussions regarding CHEMINF design
> principles. If I'm wrong, please correct me Janna and Michel.
> We personally have been using a functional hasValue attribute.
Sorry, you lost me...
>> Why not? Each first character of two or more SMILES does not refer to
>> the same thing and cannot be compared, and therefore unsuitable for
>> mathematical treatment... like 000065 and BODO:Descriptor...
>> So, what I am struggling with is the lack of clear distinction
>> (hierarchy) between mathematically/statistically usable descriptors
>> (BODO:Descriptor, CHEMINF:000065) and those that are not
>> (CHEMINF:000123)...
>
> At the same time, each SMILES string can be compared to another SMILES
> string, as I said above, making both useable, and both fixed-length.
> However, I may be missing something: are you planning to refer to a
> character in a particular position in the SMILES string without tokenizing
> it first and constructing a chemical graph?
What other mathematical means did you have in mind of comparing two
character strings?
But this is the whole point... BODO:Descriptor and CHEMINF:000065 talk
about mathematical treatment... for a SMILES string this is undefined.
> My choice of OpenBabel is dictated by the ease of programmatic access of the
> formats with OpenBabel, and possibly the biggest collection of
> interconvertible format types, while CDK has only a handful of readers with,
> at times, clunky/fussy access.
The CDK defines 99 file formats... how many formats does OB define?
I've modified the definition of 'chemical descriptor' and 'molecular entity descriptor'
'A chemical descriptor is a data item (quantity or value) about a chemical entity that conforms to a specification for how it is calculated, measured or recorded.'
'a molecular entity descriptor is a chemical descriptor that provides information about a molecular entity.'
> >> here's the deal... 000123 is a 'chemical descriptor' and has
> >> alternativeID BODO:Descriptor... now the latter is fixed length...
> >> Now, there is also the equivalentClass 000065 ('molecular entity
> >> descriptor') which is 000123 about a 'molecular entity', which also
> >> has this fixed-length hard-coded...
> >>
> >> And as the 'molecular entity' does not imply this, so my assumption
> >> was that this was inherited from 000123...
> >
> > Let's see, taking into consideration what I said above, I don't see
> any
> > problem here.
>
> And I do see a problem here.
There is no 'fixed-length' requirement for CHEMINF descriptors. If this is the case for BODO:Descriptor, then they are not the same. I can add a "fixed-length" descriptor, if required.
> >> > Other than that, I'm just not seeing how this is
> >> > contrary to the definition of a descriptor. Besides, what you are
> citing
> >> > is
> >> > a complex descriptor, which we have been modeling as a collection
> of
> >> > simple
> >> > descriptors, each of which would have a value and a MEANING.
> >>
> >> How/Where can I see this in the OWL?
> >
> > That's what I recall from the discussions regarding CHEMINF design
> > principles. If I'm wrong, please correct me Janna and Michel.
> > We personally have been using a functional hasValue attribute.
>
> Sorry, you lost me...
The idea is that complex descriptors are chemical descriptors that are composed of two or more chemical descriptors.
I added the following axiom to 'chemical descriptor':
('has direct part' min 2 'chemical descriptor' or 'has value' some Literal)
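As a rough, closed-world reading of that axiom (OWL itself is open-world, so this is only an approximation), a Python sketch might look like the following. The class and method names are hypothetical and are not part of CHEMINF:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class ChemicalDescriptor:
    """Toy stand-in for a 'chemical descriptor' individual."""
    label: str
    value: Optional[Union[str, int, float]] = None  # 'has value' some Literal
    parts: List["ChemicalDescriptor"] = field(default_factory=list)  # 'has direct part'

    def satisfies_axiom(self) -> bool:
        # Either at least two direct descriptor parts, or a literal value.
        return len(self.parts) >= 2 or self.value is not None


# A simple descriptor carries a single literal value.
smiles = ChemicalDescriptor("SMILES descriptor", value="CCO")
atom_count = ChemicalDescriptor("atom count", value=9)

# A complex descriptor is composed of two or more descriptors.
combined = ChemicalDescriptor("complex descriptor", parts=[smiles, atom_count])

# Neither a value nor enough parts.
empty = ChemicalDescriptor("empty")
```

Under this reading a SMILES descriptor with its string value qualifies as a simple chemical descriptor, and a complex descriptor is just a collection of such simple ones.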
> >> Why not? Each first character of two or more SMILES does not refer
> to
> >> the same thing and cannot be compared, and therefore unsuitable for
> >> mathematical treatment... like 000065 and BODO:Descriptor...
> >> So, what I am struggling with is the lack of clear distinction
> >> (hierarchy) between mathematically/statistically usable descriptors
> >> (BODO:Descriptor, CHEMINF:000065) and those that are not
> >> (CHEMINF:000123)...
> >
> > At the same time, each SMILES string can be compared to another
> SMILES
> > string, as I said above, making both useable, and both fixed-length.
> > However, I may be missing something: are you planning to refer to a
> > character in a particular position in the SMILES string without
> tokenizing
> > it first and constructing a chemical graph?
>
> What other mathematical means did you have in mind of comparing two
> character strings?
>
> But this is the whole point... BODO:Descriptor and CHEMINF:000065 talk
> about mathematical treatment... for a SMILES string this is undefined.
Right - so I broadened this definition to contain SMILES strings.
m.
True, but the kind of mathematical treatment BODO:Descriptor is about,
is indeed not about graph theory, but about numerical, statistical
treatment.
>> >> here's the deal... 000123 is a 'chemical descriptor' and has
>> >> alternativeID BODO:Descriptor... now the latter is fixed length...
>> >> Now, there is also the equivalentClass 000065 ('molecular entity
>> >> descriptor') which is 000123 about a 'molecular entity', which also
>> >> has this fixed-length hard-coded...
>> >>
>> >> And as the 'molecular entity' does not imply this, so my assumption
>> >> was that this was inherited from 000123...
>> >
>> > Let's see, taking into consideration what I said above, I don't see any
>> > problem here.
>>
>> And I do see a problem here.
>
> One string, one value, an evenly sized descriptor. I don't know what I can
> add to make it more clear. Maybe Michel and Janna can help me here.
CCO - CCN = ??
Statistical modeling is what 'molecular descriptors' in cheminformatics
are about... I cannot make it more clear than that either.
There is no answer to that... there is no mathematical operator that
tells you the outcome of CCO - CCN (the distance between two
molecules).
That is the whole point I am trying to make...
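The same thing seen from a programming language, as a loose analogy rather than a proof: Python defines subtraction for numbers but not for character strings. The helper name is made up for this sketch:

```python
def subtract(a, b):
    """Return a - b, or None when the operation is not defined for the types."""
    try:
        return a - b
    except TypeError:
        return None


# Two numerical descriptor values (e.g. atom counts) can be subtracted.
numeric_result = subtract(6, 7)

# Two SMILES strings cannot: there is no built-in "CCO" - "CCN".
string_result = subtract("CCO", "CCN")

print(numeric_result)
print(string_result)
```

Any distance between the two SMILES has to come from somewhere else, for example from descriptors calculated from the structures they encode.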
> my point stands: a SMILES string is one solid value instead of an array of
> values, and there is no contradiction to BODO:Descriptor.
Yes, it is, but that is so because the description of BODO:Descriptor
has never been clear enough, apparently.
> Is there something else that you are not telling us? Like some parser
> somewhere that's hard-coded to not accept strings or something that you
> don't want to change over?
No, just the above. Just statistical modeling.
>> But this is the whole point... BODO:Descriptor and CHEMINF:000065 talk
>> about mathematical treatment... for a SMILES string this is undefined.
>
> Graph theory.
Which is not numerical, and which does not give you meaningful distances...
See my reply to Michel, about how we could proceed...
Thanx!
<snip>
> There is no 'fixed-length' requirement for CHEMINF descriptors. If this is the case for BODO:Descriptor, then they are not the same. I can add a "fixed-length" descriptor, if required.
Indeed. So, what I would like to see is a separate class of descriptors that
can be used directly in statistical modeling (numerical, or ordinal,
values, which are the equivalent of BODO:Descriptor), or perhaps a
role for that... And BODO:Descriptor must be removed from
CHEMINF:000123 as oboInOwl:hasAlternativeId, as a SMILES is not a
BODO:Descriptor.
From this discussion it has become clear to me that fixed length is
not the minimal requirement; as I hope my answer to Leonid just
minutes ago made clear, the ability to have a distance measure defined
for each value is needed too (for numerical values e.g. Euclidean, for
0/1 value sets e.g. Tanimoto...). Otherwise, statistical modeling is
not possible, which is the underlying idea behind BODO:Descriptor...
So, different indeed from CHEMINF:000123...
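For the 0/1 case, a minimal Python sketch of a Tanimoto distance on equal-length bit vectors. The function name and the example fingerprints are invented for illustration, and treating two all-zero vectors as identical is an assumption of this sketch:

```python
def tanimoto_distance(a, b):
    """1 - |a AND b| / |a OR b| for equal-length 0/1 vectors."""
    if len(a) != len(b):
        raise ValueError("bit vectors must have the same fixed length")
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    if either == 0:
        return 0.0  # assumption: two all-zero vectors count as identical
    return 1.0 - both / either


# Hypothetical 5-bit fingerprints; each position means the same thing in both.
fp1 = [1, 1, 0, 1, 0]
fp2 = [1, 0, 0, 1, 1]

print(tanimoto_distance(fp1, fp2))
```

As with a numerical distance, this only makes sense because position i in one vector refers to the same feature as position i in the other.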
>> >> > Other than that, I'm just not seeing how this is
>> >> > contrary to the definition of a descriptor. Besides, what you are
>> citing
>> >> > is
>> >> > a complex descriptor, which we have been modeling as a collection
>> of
>> >> > simple
>> >> > descriptors, each of which would have a value and a MEANING.
>> >>
>> >> How/Where can I see this in the OWL?
>> >
>> > That's what I recall from the discussions regarding CHEMINF design
>> > principles. If I'm wrong, please correct me Janna and Michel.
>> > We personally have been using a functional hasValue attribute.
>>
>> Sorry, you lost me...
>
> The idea is that complex descriptors are chemical descriptors that are composed of two or more chemical descriptors.
And one descriptor consists of one 'field', one 'column' ? OK.
So, that leaves the question how and if the BODO:Descriptor equivalent
can be put in somewhere...
I'm taking a few steps back...
On Fri, Sep 10, 2010 at 11:07 AM, Egon Willighagen
<egon.wil...@gmail.com> wrote:
> another question... I noted that SMILES (isomeric, canonical,
> CHEMINF_000018) is under CHEMINF_000123... now the idea behind the
> latter is that it can be used in mathematical modeling (directly), and
> in particular it must have these two criteria:
>
> * fixed length
> * each number in the descriptor must have the same meaning
This was a very lousy definition, I now understand... perhaps an
operational definition may be more appropriate here:
* must be usable as direct input to multilinear regression and/or
decision trees.
(I'll have to update the chapter for the upcoming book for the
www.pharmbio.org course too regarding this... and (without sarcasm)
thank you very much for pointing out the lousiness of the first
criteria I gave...)
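A crude Python proxy for that operational definition: check whether a set of descriptor rows forms a rectangular, purely numeric table, which is roughly what a regression or decision-tree routine expects as input. The function name is hypothetical and no actual model is fitted here:

```python
from numbers import Real


def usable_as_model_input(rows):
    """True if rows form a non-empty, rectangular, purely numeric table."""
    if not rows:
        return False
    width = len(rows[0])
    if width == 0:
        return False
    return all(
        len(row) == width and all(isinstance(v, Real) for v in row)
        for row in rows
    )


# Numeric descriptor rows of equal length pass the check.
numeric_rows = [[1.5678, 2.3, 4, 6, 8], [1.4, 2.765774, 4, 7, 8]]

# SMILES strings do not: the "rows" differ in length and hold characters.
smiles_rows = ["c1ccccc1", "CCO"]

print(usable_as_model_input(numeric_rows))
print(usable_as_model_input(smiles_rows))
```

This is deliberately weaker than the real criterion (it says nothing about whether the columns are meaningful), but it captures why a SMILES string cannot be handed to such a method as is.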
> Now, a SMILES does not adhere to these requirements...
>
> A more appropriate place would be under CHEMINF_000035, where it
> actually is available as CHEMINF_000020, but excluding the isomeric
> and canonical subclasses.
Here I was plain wrong, that should be clear now. That other group is
about formats, not about instances of those formats.
Greetings,
Well, I think I slightly disagree with Leonid on this... I believe the
cheminformatics community actually refers to these things as
'molecular descriptors'... I mean, they have a whole 'Handbook of
Molecular Descriptors' just about these things... or, as Wikipedia
writes: "The molecular descriptor is the final result of a logic and
mathematical procedure which transforms chemical information encoded
within a symbolic representation of a molecule into a useful number or
the result of some standardized experiment."
Anyway... I am not going to make a point out of this, as that was
never my intention...
I'd settle for 'numerical molecular descriptor'. I would prefer
something that reflects the ability to calculate distances with a
distance measure directly on the values, but I
cannot think of something right now...