error - trying to predict with an unstable model


wadet...@gmail.com

unread,
Dec 12, 2014, 4:29:32 PM12/12/14
to h2os...@googlegroups.com
I've been exploring the deep learning tool through the R interface and hitting unstable-model errors when I switch to cross-validation by setting the nfolds=v flag. I'm not an experienced neural net user, so I may be running into these issues because of my parameter settings, although I've tried a number of different ones. Running with validation on the test set didn't throw the error; it only appears when I switch to cross-validation.

This is for a small multinomial (3-class) model (down-sampled to n=1600 rows, with 45 to 200 features depending on whether I run feature selection beforehand). I've tried not down-sampling beforehand, setting balance_classes = TRUE, and varying the number of predictors included, but it still throws this error frequently, although not always. Suggestions on how to diagnose this would be appreciated.

Here's the model and error; unfortunately I can't share the data.

grid_search = h2o.deeplearning(x = xTrainIdx, y = yTrainIdx,
                               data = trainDat.hex,
                               nfolds = 10,
                               activation = "Rectifier",
                               hidden = list(c(50,50), c(100,100)),
                               epochs = c(0.5,1,2),
                               l1 = c(1e-5,1e-7),
                               classification = TRUE,
                               variable_importances = TRUE)

Polling fails:
<simpleError in .h2o.__poll(client, job_key): Got exception 'class java.lang.RuntimeException', with msg 'java.lang.RuntimeException: java.lang.UnsupportedOperationException: Trying to predict with an unstable model.'
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.UnsupportedOperationException: Trying to predict with an unstable model.
at hex.GridSearch.execImpl(GridSearch.java:36)
at water.Func.exec(Func.java:42)
at water.Job$3.compute2(Job.java:333)
at water.H2O$H2OCountedCompleter.compute(H2O.java:653)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:429)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Caused by: java.lang.RuntimeException: java.lang.UnsupportedOperationException: Trying to predict with an unstable model.
at water.MRTask2.getResult(MRTask2.java:280)
at water.MRTask2.doAll(MRTask2.java:221)
at water.MRTask2.doAll(MRTask2.java:212)
at water.MRTask2.doAll(MRTask2.java:211)
at water.Model.scoreImpl(Model.java:278)
at water.Model.score(Model.java:248)
at water.Model.score(Model.java:216)
at hex.deeplearning.DeepLearningModel.score(DeepLearningModel.java:1030)
at hex.deeplearning.DeepLearning.crossValidate(DeepLearning.java:1309)
at water.util.CrossValUtils.crossValidate(CrossValUtils.java:32)
at hex.deeplearning.DeepLearning.execImpl(DeepLearning.java:756)
... 8 more
Caused by: java.lang.UnsupportedOperationException: Trying to predict with an unstable model.
at hex.deeplearning.DeepLearningModel.score0(DeepLearningModel.java:1079)
at water.Model.score0(Model.java:482)
at water.Model$4.map(Model.java:284)
at water.MRTask2.compute2(MRTask2.java:404)
... 6 more

Any thoughts would be appreciated.

Thanks,
Wade

Sri Ambati

unread,
Dec 13, 2014, 2:41:37 AM12/13/14
to wadet...@gmail.com, h2os...@googlegroups.com
Wade,

As you rightly figured, you are running into something with n-fold cross-validation. I'll open a JIRA. It would be great to get some characteristics / "summary" of the dataset triggering this, if possible.

Thanks for the report,
Sri

wadet...@gmail.com

unread,
Dec 15, 2014, 10:02:51 AM12/15/14
to h2os...@googlegroups.com
Sri,

Thanks for the reply. I played around a bit with the data set and converted some of the highly skewed predictors (proportions) to binary flags. This indeed fixed the unstable-model problem for this analysis and now lets me predict within cross-validation, so it appears the error was warranted for the sloppy data I was testing with.

As a side note for comparison, I've been using caret to tune models (forest, glmnet, simple nnet) on this and similar datasets without converting to binary, which tends to produce seemingly stable models with consistent performance on holdout sets. At one point I compared these models to ones built on the binary-transformed data and saw marginally better performance with the non-transformed data, hence my preference to date. I can tune a GBM with h2o fine on the non-transformed data, but the deep learner appears to dislike the data in its original form.

I could provide a summary of the data if it'd still be helpful, just let me know.

Thanks again for the offer to help,
Wade




arno....@gmail.com

unread,
Dec 15, 2014, 11:16:04 AM12/15/14
to h2os...@googlegroups.com, wadet...@gmail.com
Hi Wade,

A few notes/suggestions:

1) With a small dataset of only a few thousand rows, training can be more prone to harmful numerical effects from multithreaded race conditions (we use the "Hogwild!" approach). I suggest trying "reproducible=TRUE" or "force_load_balance=FALSE" to disable or limit multi-threading, or, alternatively, not down-sampling to such a small dataset and keeping all threads active. One theory for the increased stability you observed is that your transformation to binary flags increased the number of weights into the first hidden layer and hence reduced race conditions.

2) If 1) didn't help, I suggest adding "max_w2=10, l2=1e-5" to your arguments (to keep the "Rectifier" from exploding), or, alternatively, switching to the "Tanh" activation function, which is naturally bounded.

3) Once it appears stable, I suggest also trying a "better" network, i.e., one with more hidden neurons and training for more epochs:

hidden = list(c(50,50), c(100,100), c(200,200), c(200,200,200)),
epochs = c(1,10,100,1000)

Note that you'll still need a grid search over epochs when doing cross-validation, as all CV models train up to the specified number of epochs; there is no early stopping based on validation error even if replace_with_best_model=TRUE, which could otherwise lead to overfitting on the validation set (if specified) for small data.

Hope this helps, please let me know if you have questions,
Arno

wadet...@gmail.com

unread,
Dec 15, 2014, 3:43:42 PM12/15/14
to h2os...@googlegroups.com, arno....@gmail.com
Hi Arno,

Thanks for the suggestions; it looks like your option 1) is dead on: if I set force_load_balance=FALSE or don't down-sample a priori, it runs fine. Adding the extra penalizations (max_w2, l2) didn't help; it still required force_load_balance=FALSE after down-sampling in order to run.

This brings up a related question: what's going on under the hood with up/down-sampling when balance_classes=TRUE? For example, if I down-sample to equal class frequencies and then run the model with and without balance_classes=TRUE, I would expect similar models since the data is already balanced, but it produces very different models. The R documentation doesn't list the default values for e.g. class_sampling_factors and max_after_balance_size, so I'm curious how to set these up properly.

Thanks for the help,
Wade

Arno Candel

unread,
Dec 15, 2014, 5:08:47 PM12/15/14
to wadet...@gmail.com, h2os...@googlegroups.com
Hi Wade,

Glad to hear that my suggestion was helpful. As for up/down-sampling the classes internally vs. externally: with balance_classes=TRUE, the predicted probabilities are corrected using the ratios between the prior class probabilities (the class distribution in the unsampled training data) and the up/down-sampled ones. If you sample externally, you need to account for that sampling manually, by modifying the predicted probabilities yourself before comparing with the results from balance_classes=TRUE; otherwise you're modeling a different problem (different data distribution -> different results). See http://gking.harvard.edu/files/0s.pdf Eq. (27) for more information.
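The correction described above can be sketched in a few lines (a conceptual illustration in Python with a hypothetical function name, not H2O's actual code): each predicted class probability is multiplied by the ratio of its prior probability to its sampled probability, then the result is renormalized.

```python
def correct_probabilities(pred, priors, sampled):
    """Undo the effect of class re-sampling on predicted probabilities.

    pred    -- class probabilities from a model trained on re-sampled data
    priors  -- class distribution in the original (unsampled) training data
    sampled -- class distribution after up/down-sampling
    """
    # Re-weight each class probability by its prior/sampled ratio.
    adjusted = [p * (pr / s) for p, pr, s in zip(pred, priors, sampled)]
    # Renormalize so the corrected probabilities sum to 1.
    total = sum(adjusted)
    return [a / total for a in adjusted]

# A model trained on 50/50 balanced data that outputs [0.5, 0.5] is really
# saying "no information beyond the priors"; the correction recovers the
# original 99/1 class distribution:
print(correct_probabilities([0.5, 0.5], [0.99, 0.01], [0.5, 0.5]))
# -> [0.99, 0.01]
```

This also makes clear why skipping the correction after external sampling gives different answers: the uncorrected probabilities describe the re-sampled distribution, not the original one.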

The default values and additional descriptions for the parameters can be seen in the Java code:


/**
* Desired over/under-sampling ratios per class (lexicographic order). Only when balance_classes is enabled. If not specified, they will be automatically computed to obtain class balance during training.
*/
@API(help = "Desired over/under-sampling ratios per class (lexicographic order).", filter = Default.class, dmin = 0, json = true, importance = ParamImportance.SECONDARY)
public float[] class_sampling_factors;



/**
* When classes are balanced, limit the resulting dataset size to the
* specified multiple of the original dataset size.
*/
@API(help = "Maximum relative size of the training data after balancing class counts (can be less than 1.0)", filter = Default.class, json = true, dmin=1e-3, importance = ParamImportance.EXPERT)
public float max_after_balance_size = 5.0f;


For example, if you have two classes with prior probabilities of 0.99 for class 0 and 0.01 for class 1, meaning that class 0 shows up 99 times as often as class 1, then the default class_sampling_factors will be 1 and 99: class 0 is left alone and class 1 is up-sampled 99-fold to have equal representation with class 0. The total dataset will then be roughly 2x the original size, and since that's less than max_after_balance_size (default: 5) times the original size, the sampling factors are left alone. Otherwise, all sampling factors would be reduced by a common scaling factor so that the final dataset ends up max_after_balance_size times as large as the original dataset.
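That logic can be made concrete with a small sketch (conceptual Python with a hypothetical function name, not H2O's implementation): every class is up-sampled to the majority count, and if the balanced dataset would exceed max_after_balance_size times the original size, all factors are scaled down proportionally.

```python
def balance_factors(class_counts, max_after_balance_size=5.0):
    """Compute per-class over/under-sampling factors for class balance.

    Each class is up-sampled to match the majority class; if the balanced
    dataset would exceed max_after_balance_size x the original size, all
    factors are scaled down proportionally.
    """
    majority = max(class_counts)
    factors = [majority / c for c in class_counts]
    total = sum(class_counts)
    balanced = sum(f * c for f, c in zip(factors, class_counts))
    cap = max_after_balance_size * total
    if balanced > cap:
        factors = [f * cap / balanced for f in factors]
    return factors

# The 99:1 two-class example: class 0 untouched, class 1 up-sampled 99x;
# the balanced set is ~2x the original, well under the default 5x cap.
print(balance_factors([990, 10]))   # -> [1.0, 99.0]
```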

Hope this helps,
Arno

wadet...@gmail.com

unread,
Dec 16, 2014, 9:43:54 AM12/16/14
to h2os...@googlegroups.com, wadet...@gmail.com, arno....@gmail.com
Hi Arno,

Thanks for the reply and the helpful info on sampling; that will help me pick through the code and match it up to King and Zeng.

Related to the original issue and potential race conditions: is there a rule of thumb for the dataset size at which Hogwild! can start causing numerical stability issues? For example, I attempted to fit a more complex model (your options 2 & 3 from the original post) using the original non-sampled dataset (only n~7000 observations) and still hit the "Trying to predict with an unstable model" error. Setting force_load_balance=FALSE fixes the problem, but then it only uses ~15-20% of my CPUs on average (8 total), so I'm not gaining much speed over single-threaded. This all assumes I have a sloppy dataset, since a bit of preprocessing of the highly skewed data can also fix the stability, as in my earlier post, but I'm curious to map out how h2o responds to the different scenarios I may encounter.

Thanks,
Wade

Arno Candel

unread,
Dec 16, 2014, 1:23:33 PM12/16/14
to wadet...@gmail.com, h2os...@googlegroups.com
Hi Wade,

Hogwild! is affected by both the network size and the dataset size. A small network (also affected by the number of columns in the dataset, i.e., input-layer neurons) means a small weight matrix and a higher risk of collisions between threads. A small dataset means the same rows get trained on over and over, which often results in updates to the same weights. Note that for the Rectifier activation function the derivative is often zero, and back-propagation only trains the weights reached through non-zero derivatives (which can be a similar subset across repeated epochs).
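The zero-derivative point can be seen directly (a minimal Python illustration, not H2O code): during back-propagation, a Rectifier unit passes no gradient when its pre-activation was negative, so the weights feeding that unit receive no update for that row.

```python
def relu(x):
    # Rectifier activation: max(0, x)
    return x if x > 0.0 else 0.0

def relu_grad(x):
    # Derivative of the Rectifier: 1 for positive inputs, 0 otherwise,
    # so back-propagation skips weights behind inactive units.
    return 1.0 if x > 0.0 else 0.0

# An inactive unit (negative pre-activation) blocks the gradient entirely:
upstream_gradient = 0.7
print(upstream_gradient * relu_grad(-2.3))  # -> 0.0
print(upstream_gradient * relu_grad(1.5))   # -> 0.7
```

With few distinct rows, the same units tend to be active each epoch, so the same subset of weights keeps getting hit by concurrent updates.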

You could try manually setting train_samples_per_iteration to a large value (say 100,000) to force many passes over the dataset per map phase. Of course, the number of epochs should then be sufficiently large as well (>>14).
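One way to read the ">>14" is an arithmetic sketch (in Python, assuming the n~7000 rows mentioned earlier in the thread): with train_samples_per_iteration=100,000, a single map phase already covers roughly 14 full passes over the data, so the total epoch count must be well above that to get more than one iteration.

```python
rows = 7000                        # approximate dataset size from the thread
train_samples_per_iteration = 100_000

# One map phase trains on this many full passes over the dataset:
epochs_per_iteration = train_samples_per_iteration / rows
print(round(epochs_per_iteration, 1))   # -> 14.3

# With epochs=100, training would run this many map/reduce iterations:
iterations = 100 * rows / train_samples_per_iteration
print(iterations)                       # -> 7.0
```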

There's no systematic study, other than a reproducibility (reproducible=T/F) comparison that runs as a JUnit test on a dataset with only 380 rows:

12-16 10:21:00.510 172.16.2.19:54323     83908  main      INFO: Reproducibility: on
12-16 10:21:00.510 172.16.2.19:54323     83908  main      INFO: Repeat # --> Validation Error
12-16 10:21:00.510 172.16.2.19:54323     83908  main      INFO: [0 --> 0.123943664
12-16 10:21:00.510 172.16.2.19:54323     83908  main      INFO:  1 --> 0.123943664
12-16 10:21:00.510 172.16.2.19:54323     83908  main      INFO:  2 --> 0.123943664
12-16 10:21:00.510 172.16.2.19:54323     83908  main      INFO:  3 --> 0.123943664
12-16 10:21:00.510 172.16.2.19:54323     83908  main      INFO:  4 --> 0.123943664
12-16 10:21:00.510 172.16.2.19:54323     83908  main      INFO:  5 --> 0.123943664]
12-16 10:21:00.510 172.16.2.19:54323     83908  main      INFO: 
12-16 10:21:00.510 172.16.2.19:54323     83908  main      INFO: Reproducibility: off
12-16 10:21:00.510 172.16.2.19:54323     83908  main      INFO: Repeat # --> Validation Error
12-16 10:21:00.510 172.16.2.19:54323     83908  main      INFO: [0 --> 0.16901408
12-16 10:21:00.510 172.16.2.19:54323     83908  main      INFO:  1 --> 0.16619718
12-16 10:21:00.510 172.16.2.19:54323     83908  main      INFO:  2 --> 0.16056338
12-16 10:21:00.510 172.16.2.19:54323     83908  main      INFO:  3 --> 0.17183098
12-16 10:21:00.510 172.16.2.19:54323     83908  main      INFO:  4 --> 0.12957746
12-16 10:21:00.510 172.16.2.19:54323     83908  main      INFO:  5 --> 0.14366198]
12-16 10:21:00.511 172.16.2.19:54323     83908  main      INFO: mean error: 0.15680750956137976

You can see a large variance for the multi-threaded (non-reproducible) case.

Hope this helps,
Arno

Wade Cooper

unread,
Dec 16, 2014, 1:37:49 PM12/16/14
to Arno Candel, h2os...@googlegroups.com
Hi Arno, 

Thanks again for your timely and thorough response, it's much appreciated!  This gives me more options to play with, along with learning more about building deep learners in general.

Wade
