solubility QSPR


Rajarshi

Jan 12, 2008, 4:15:15 PM
to UsefulChem
I've built a number of models to predict the solubility class of the
30 compounds that were reported in the table.

The results are located at http://rguha.ath.cx/~rguha/cicc/ugi.qspr/

As expected, the models are not very reliable. Indeed for the linear
discriminant analysis models, it looks like they are more or less
chance correlations (based on the scrambling results).
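
For anyone unfamiliar with the scrambling check: the class labels are shuffled, the model is refit, and the scrambled scores are compared with the score on the real labels. Below is a minimal Python sketch of that idea. It assumes a toy nearest-centroid classifier, leave-one-out scoring, and made-up descriptor values; it is not the LDA code or the R script behind the results page. The 300 repeats match the number of scrambled runs per model mentioned later in this thread.

```python
import random


def nearest_centroid_predict(train_X, train_y, x):
    """Predict the class whose descriptor centroid is closest to x."""
    centroids = {}
    for label in set(train_y):
        rows = [r for r, l in zip(train_X, train_y) if l == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    # sorted() makes tie-breaking between classes deterministic
    return min(sorted(centroids), key=lambda label: dist2(centroids[label], x))


def loo_percent_correct(X, y):
    """Leave-one-out percent correct for the toy classifier."""
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_X, train_y, X[i]) == y[i]:
            hits += 1
    return 100.0 * hits / len(X)


def scrambling_test(X, y, n_scrambles=300, seed=0):
    """Compare the real score with scores obtained on shuffled labels."""
    rng = random.Random(seed)
    real = loo_percent_correct(X, y)
    scrambled = []
    for _ in range(n_scrambles):
        shuffled = y[:]
        rng.shuffle(shuffled)
        scrambled.append(loo_percent_correct(X, shuffled))
    # fraction of scrambled runs that do at least as well as the real labels
    p = sum(s >= real for s in scrambled) / n_scrambles
    return real, scrambled, p


# Hypothetical data: 30 compounds, two invented descriptors, two classes.
X = [[0.1 * i, 0.05 * i] for i in range(15)] + \
    [[5 + 0.1 * i, 5 + 0.05 * i] for i in range(15)]
y = ["soluble"] * 15 + ["insoluble"] * 15

real, scrambled, p = scrambling_test(X, y)
print("real % correct:", real)
print("mean scrambled % correct:", sum(scrambled) / len(scrambled))
print("fraction of scrambled runs >= real:", p)
```

If scrambled runs regularly match the real score, the model is probably picking up chance correlations; a fraction near zero suggests the descriptors carry some real signal.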

On the other hand, the RP (recursive partitioning) and RF (random forest)
models are a little better, though the small sample size does lead to a
lack of stability in the RF model.

Finally, using ensemble predictions gives us 100% correct predictions
- but this is a very over-optimistic result, based on the current
dataset. I wouldn't be too excited about this!
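
To show why an ensemble can look perfect on a small set, here is a minimal Python sketch. It assumes a simple majority vote over per-model class labels; the combination rule actually used for the results page may differ, and the labels and model outputs below are invented for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists (all the same length) by majority vote."""
    ensemble = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = max(counts.values())
        # break ties alphabetically so the result is deterministic
        ensemble.append(min(label for label, c in counts.items() if c == top))
    return ensemble


def percent_correct(predicted, observed):
    """Percentage of positions where predicted and observed labels agree."""
    hits = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * hits / len(observed)


# Hypothetical observed classes and three hypothetical models,
# each wrong on a different compound.
observed = ["soluble", "soluble", "insoluble", "insoluble", "soluble"]
model_1 = ["insoluble", "soluble", "insoluble", "insoluble", "soluble"]
model_2 = ["soluble", "insoluble", "insoluble", "insoluble", "soluble"]
model_3 = ["soluble", "soluble", "soluble", "insoluble", "soluble"]

ensemble = majority_vote([model_1, model_2, model_3])
for name, pred in [("model 1", model_1), ("model 2", model_2),
                   ("model 3", model_3), ("ensemble", ensemble)]:
    print(name, percent_correct(pred, observed))
```

When the individual errors do not overlap, the vote hides all of them. With so few compounds that can easily happen by luck, which is why the score needs checking on new compounds.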

Are the experiments or reagents very expensive or time consuming? If
not, it would be useful if someone were to make some more compounds
and observe their solubility. At the same time, I would make
predictions for those compounds. At the end we'd compare results.
Whatever the outcome, the new data would be used to rebuild the model,
hopefully making it better.

Egon Willighagen

Jan 13, 2008, 2:19:56 AM
to usefu...@googlegroups.com
On Jan 12, 2008 10:15 PM, Rajarshi <rajars...@gmail.com> wrote:
> I've built a number of models to predict the solubility class of the
> 30 compounds that were reported in the table.
>
> The results are located at http://rguha.ath.cx/~rguha/cicc/ugi.qspr/

In good nature of open notebook science... where are the R scripts?
Sweave is perfect for this. It principally yields a PDF, but might be
convertible to HTML too...

This could go into some central repository, so that I could easily
contribute patches...

> As expected, the models are not very reliable. Indeed for the linear
> discriminant analysis models, it looks like they are more or less
> chance correlations (based on the scrambling results).

Y_real vs Y_pred are really insightful for this... may I hereby request them?

Egon

--
----
http://chem-bla-ics.blogspot.com/

Rajarshi

Jan 13, 2008, 10:03:39 AM
to UsefulChem
On Jan 13, 2:19 am, "Egon Willighagen" <egon.willigha...@gmail.com>
wrote:
> On Jan 12, 2008 10:15 PM, Rajarshi <rajarshi.g...@gmail.com> wrote:
>
> > I've built a number of models to predict the solubility class of the
> > 30 compounds that were reported in the table.
>
> > The results are located at http://rguha.ath.cx/~rguha/cicc/ugi.qspr/
>
> In good nature of open notebook science... where are the R scripts?

Indeed. The page is updated with the script and the required package.

> This could go into some central repository, so that I could easily
> contribute patches...

Does the UsefulChem project have a VC repo?

> Y_real vs Y_pred are really insightful for this... may I hereby request them?

I've put up a file containing the predicted categories for the models
listed in the table (i.e., those with % correct > 95%). Are you also
asking for the scrambled predictions? Given that there are 300 of
those for each model, it might be easier to just run the scrambling
code and take a look at it by hand.

Egon Willighagen

Jan 13, 2008, 10:25:21 AM
to usefu...@googlegroups.com
On Jan 13, 2008 4:03 PM, Rajarshi <rajars...@gmail.com> wrote:
> I've put up a file containing the predicted categories for the models
> listed in the table (i.e., those with % correct > 95%). Are you also
> asking for the scrambled predictions? Given that there are 300 of
> those for each model, it might be easier to just run the scrambling
> code and take a look at it by hand.

Oh, sorry... was still thinking about the problem as a regression problem.

You could plot those 300 in a ROC plot... that would be interesting...

Rajarshi

Jan 13, 2008, 10:40:08 AM
to UsefulChem
Sure - but given the fact that I don't really think the LDA models are
that great, I didn't go into much detail. I should do the ROC curves
for the RF/RP models though

Egon Willighagen

Jan 13, 2008, 10:42:44 AM
to usefu...@googlegroups.com
On Jan 13, 2008 4:40 PM, Rajarshi <rajars...@gmail.com> wrote:

> Egon Willighagen wrote:
> > You could plot those 300 in a ROC plot... that would be interesting...
>
> Sure - but given the fact that I don't really think the LDA models are
> that great, I didn't go into much detail. I should do the ROC curves
> for the RF/RP models though

Well, a ROC curve for the LDA models would support your judgement on that too...

Egon Willighagen

Jan 13, 2008, 10:43:09 AM
to usefu...@googlegroups.com
On Jan 13, 2008 4:42 PM, Egon Willighagen <egon.wil...@gmail.com> wrote:
> Well, a ROC curve for the LDA models would support your judgement on that too...

Not that I am not trusting that... :)

Rajarshi

Jan 13, 2008, 6:32:07 PM
to UsefulChem


On Jan 13, 10:42 am, "Egon Willighagen" <egon.willigha...@gmail.com>
wrote:
>
> > Sure - but given the fact that I don't really think the LDA models are
> > that great, I didn't go into much detail. I should do the ROC curves
> > for the RF/RP models though
>
> Well, a ROC curve for the LDA models would support your judgement on that too...

I've put up ROC curves for the models (31 LDA models, 1 RF and 1 RP
model). I don't think there's much use in generating ROC curves for
scrambled models - what would they tell us that the percent correct
doesn't show?
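
For reference, a minimal Python sketch of how ROC points and the area under the curve can be computed. It assumes each model provides a continuous score per compound (for example a class probability) as well as a hard class label; the scores below are invented and are not taken from the posted models.

```python
def roc_points(scores, labels, positive):
    """Return (false positive rate, true positive rate) points,
    sweeping the decision threshold from high scores to low."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(1 for l in labels if l == positive)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # compounds with tied scores cross the threshold together
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == positive:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical scores: higher means "more likely soluble".
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = ["soluble", "soluble", "insoluble", "soluble", "insoluble", "insoluble"]

points = roc_points(scores, labels, positive="soluble")
print(points)
print("AUC:", auc(points))
```

The curve uses the ranking of the scores rather than a single cutoff, so two models with the same percent correct can still have different curves.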