That is impossible unless the model was built using those descriptors.
A model built with descriptors from software X version Y cannot be
used with descriptors calculated with software X2 version Z.
> It turns out that ChemSpider automatically calculates a number of these
> descriptors and for factors that would affect solubility in methanol I would
> think that logP, number of H-bonding sites, polar surface area are among
> those of interest.
I was told in the past (from the source) that these values are *not*
open data... so, you cannot use those in ONS. This may have changed
since then, though.
Antony, if you are listening in... please give an update on that.
> Which ones were you going to use?
> For example see:
> http://www.chemspider.com/RecordView.aspx?id=21105565
>
> It would be nice at some point to have these automatically generated on a
> large scale but it might be tricky to make all of them public. That may in
> fact be a deciding point on which descriptors we choose to work with to
> remain fully open at all times.
That's where CDK/JOELib comes in...
> But in the meantime, Tony agreed that we could manually take 30-50 values
> for the descriptors calculated on the ChemSpider pages. Maybe he can
> comment on the licensing restrictions beyond that. Still that should be
> enough to get us started here.
Ah, there's the confirmation of my above comment... :)
BTW, don't expect much from a QSAR/QSPR model of <100 values... or
hardly anything serious at all. So, I see a strong conflict here...
> So I created another column for a link to the ChemSpider entry and a column
> for LogP. I'm asking all of our UsefulChem lab members (or anyone with an
> interest in contributing) who are available to help fill out this table. It
> will require creating records in ChemSpider in most cases and that is pretty
> intuitive now.
I really suggest using the CDK/JOELib for this... possibly via Bioclipse:
http://chem-bla-ics.blogspot.com/2007/10/more-qsar-in-bioclipse-joelib-extension.html
> The master table is here:
> http://spreadsheets.google.com/pub?key=plwwufp30hfpUERhse9y5Kw
If Rajarshi does not do it before I do, I'll probably have time to do
this late next week.
But again, without at least some 100 compounds, these models will most
likely be practically meaningless... too much numerical freedom to
make a regression... let alone any meaningful correlation... not with
these kinds of general descriptors, anyway...
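
To illustrate the "numerical freedom" point, here is a minimal sketch using only random numbers (no real solubility data, and the function names are made up for this example). It picks the best-looking Pearson correlation between a random "response" and a set of random "descriptors"; with few compounds, a descriptor that means nothing can still look well correlated by chance.

```python
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_chance_correlation(n_compounds, n_descriptors, seed=0):
    """Largest |r| between a random response and purely random descriptors."""
    rng = random.Random(seed)
    response = [rng.gauss(0.0, 1.0) for _ in range(n_compounds)]
    best = 0.0
    for _ in range(n_descriptors):
        descriptor = [rng.gauss(0.0, 1.0) for _ in range(n_compounds)]
        best = max(best, abs(pearson_r(descriptor, response)))
    return best


if __name__ == "__main__":
    # Hypothetical sizes: 50 candidate descriptors, varying number of compounds.
    for n in (10, 30, 100, 300):
        print(n, round(best_chance_correlation(n, 50), 2))
```

The best chance correlation shrinks as the number of compounds grows, which is the reason to be wary of models built on a few dozen values.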
Egon
Theoretically they are... practically they are not. LogP in
particular is often itself a QSPR model.
Even surface area requires the use of (numerically) identical atomic
radii... and of course things like atom type perception, etc. One
simple minor bug in any of these will change the values.
This is why projects like the BODR and the chemoinformatics algorithm
ontology are important.
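
One way to guard against mixing descriptor sources is to store the provenance with the model and refuse anything else. A minimal sketch, where every name (`DescriptorSource`, `LinearModel`, the descriptor keys and the coefficients) is hypothetical and not taken from the CDK, JOELib or ChemSpider:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DescriptorSource:
    """Which program, and which version of it, produced the descriptor values."""
    software: str
    version: str


@dataclass
class LinearModel:
    """Toy linear model that remembers where its training descriptors came from."""
    source: DescriptorSource
    weights: dict  # descriptor name -> coefficient
    intercept: float = 0.0

    def predict(self, descriptors, source):
        # Refuse descriptors computed by other software or another version.
        if source != self.source:
            raise ValueError(
                f"model was built with {self.source}, got descriptors from {source}"
            )
        return self.intercept + sum(
            coef * descriptors[name] for name, coef in self.weights.items()
        )


if __name__ == "__main__":
    built_with = DescriptorSource("X", "Y")
    model = LinearModel(built_with, {"logP": -0.5, "tpsa": 0.01}, intercept=1.0)
    print(model.predict({"logP": 2.0, "tpsa": 50.0}, built_with))
```

This only checks the label, of course; it cannot tell whether two programs with different names happen to implement the same algorithm.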
> > BTW, don't expect much from a QSAR/QSPR model of <100 values... or
> > hardly anything serious at all. So, I see a strong conflict here...
>
> Well that's why I'm having the discussion with people who have experience.
> The experiments are fairly quick to perform once students get the hang of
> it. Hopefully we'll reach that 100 mark this term.
Indeed.
> > > So I created another column for a link to the ChemSpider entry and a
> > > column for LogP. I'm asking all of our UsefulChem lab members (or anyone
> > > with an interest in contributing) who are available to help fill out this table.
> > > It will require creating records in ChemSpider in most cases and that is
> > > pretty intuitive now.
> >
> > I really suggest using the CDK/JOELib for this... possibly via Bioclipse:
> >
> > http://chem-bla-ics.blogspot.com/2007/10/more-qsar-in-bioclipse-joelib-extension.html
>
> Awesome. Yes the more we can get calculations from CDK the better
It is possible to integrate Dragon too, but without funding I do not
have time for that.
They are covered by the same rules/copyright as any other data?
So, UsefulChem can set up an ACD/Labs LogP database for any compound
they are interested in, under an OpenData license?
That's useful for the CDK too, for benchmark purposes. Tony, what
about the experimental values? ACD/Labs extracted a database from
literature I assume?
Different logP prediction algorithms from different groups will give
different values. For the common molecules...octane, benzene, blah,
blah, they will all be within error but "meaningful compounds" will
likely be very different. Also, take care with logP versus logD
distinctions. Most people forget that. LogP numbers, whether from the
CDK or from ChemSpider (XlogP, AlogPS or ACD/LogP; we have THREE logP
predictors on there!), can be part of your descriptor feed, and it's
all a matter of whether you are after accurate predictions or
segregation. As I recall you are trying to derive segregation -
precipitates or not.
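
For anyone who wants the logP versus logD distinction spelled out, here is a minimal sketch of the usual textbook relation for a compound with a single ionizable group, assuming only the neutral form partitions into the organic phase. The function name and the example numbers are made up for illustration; this is not how any of the predictors mentioned above work internally.

```python
import math


def log_d(log_p, pka, ph, acid=True):
    """Estimate logD at a given pH from logP for a monoprotic compound.

    Assumes only the neutral species partitions into the organic phase.
    acid=True : logD = logP - log10(1 + 10**(pH - pKa))
    acid=False: logD = logP - log10(1 + 10**(pKa - pH))
    """
    exponent = (ph - pka) if acid else (pka - ph)
    return log_p - math.log10(1.0 + 10.0 ** exponent)


if __name__ == "__main__":
    # Hypothetical acid with logP = 2.0 and pKa = 4.0, at a few pH values.
    for ph in (1.0, 4.0, 7.0):
        print(ph, round(log_d(2.0, 4.0, ph), 2))
```

Well below the pKa an acid is mostly neutral and logD is close to logP; above it the ionized fraction dominates and logD drops by roughly one unit per pH unit.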
Solubility is not easy to predict. The pharma industry has been
looking for a good aqueous solubility predictor for years. I was
involved in the development of one for a few years -
http://www.acdlabs.com/products/phys_chem_lab/aqsol/
There are others out there but I am not aware of a good open source
one and certainly not one for methanol since there isn't a good
training set to work from...the measurement of solubility is
inherently problematic.
Since you are looking for soluble/insoluble it's easier but the
training set you have is small. I look forward to seeing Rajarshi's
progress with it. Since Greg is so close to you, you might be
interested in his opinion on this, since he spent years in the pharma
industry and has led the Aq Sol product development since joining
ACD/Labs. I promise you he won't be trying to sell you something...he
deeply cares about academia and was teaching at a local college in
Philly (and still teaches there once in a while) and will be more than
willing to help.