Exception while scoring


vraman...@gmail.com

Sep 4, 2014, 6:26:20 PM
to h2os...@googlegroups.com
POST /2/Predict.json model=DeepLearning_a490309c5bfb0f2d9bcb2d3fedb64aae prediction=DeepLearningPredict_36edfb0345bb4bde87b65acf62d47598 data=....hex
03:21:46.891 FJ-0-63 WARN WATER: Numerical instability, predicted NaN.
java.lang.UnsupportedOperationException: Trying to predict with an unstable model.
at hex.deeplearning.DeepLearningModel.score0(DeepLearningModel.java:1044)
at water.Model.score0(Model.java:480)
at water.Model$4.map(Model.java:282)
at water.MRTask2.compute2(MRTask2.java:404)
at water.MRTask2.compute2(MRTask2.java:365)
at water.MRTask2.compute2(MRTask2.java:365)
at water.H2O$H2OCountedCompleter.compute(H2O.java:634)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:429)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:914)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:979)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
03:21:47.151 # Session ERRR WATER:
+ java.lang.RuntimeException: java.lang.UnsupportedOperationException: Trying to predict with an unstable model.
+ at water.MRTask2.getResult(MRTask2.java:280)
+ at water.MRTask2.doAll(MRTask2.java:221)
+ at water.MRTask2.doAll(MRTask2.java:212)
+ at water.MRTask2.doAll(MRTask2.java:211)
+ at water.Model.scoreImpl(Model.java:276)
+ at water.Model.score(Model.java:246)
at water.Model.score(Model.java:214)
+ at hex.deeplearning.DeepLearningModel.score(DeepLearningModel.java:996)
+ at water.api.Predict.serve(Predict.java:38)
+ at water.api.Request.serveGrid(Request.java:165)
+ at water.Request2.superServeGrid(Request2.java:481)
+ at water.Request2.serveGrid(Request2.java:402)
+ at water.api.Request.serve(Request.java:142)
+ at water.api.RequestServer.serve(RequestServer.java:479)
+ at water.NanoHTTPD$HTTPSession.run(NanoHTTPD.java:424)
+ at java.lang.Thread.run(Thread.java:724)
+ Caused by: java.lang.UnsupportedOperationException: Trying to predict with an unstable model.
+ at hex.deeplearning.DeepLearningModel.score0(DeepLearningModel.java:1044)
+ at water.Model.score0(Model.java:480)
+ at water.Model$4.map(Model.java:282)
+ at water.MRTask2.compute2(MRTask2.java:404)
+ at water.MRTask2.compute2(MRTask2.java:365)
+ at water.MRTask2.compute2(MRTask2.java:365)
+ at water.H2O$H2OCountedCompleter.compute(H2O.java:634)
+ at jsr166y.CountedCompleter.exec(CountedCompleter.java:429)
+ at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
+ at jsr166y.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:914)
+ at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:979)
+ at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
+ at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

arno....@gmail.com

Sep 4, 2014, 6:35:25 PM
to h2os...@googlegroups.com, vraman...@gmail.com
Hi Venkatesh,
This is a valid exception: it is "Trying to predict with an unstable model."
It is possible for H2O DeepLearning to end up with an unstable model; just like with any neural network, there can be exponential growth buildup from noise. If you inspect the model in the Web UI, you will see the following message:

"Job was aborted due to observed numerical instability (exponential growth)."
+ "\nTry a different initial distribution, a bounded activation function or adding"
+ "\nregularization with L1, L2 or max_w2 and/or use a smaller learning rate or faster annealing."

Try adding L1/L2/max_w2 regularization options. What were your parameters?
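For concreteness, here is a minimal sketch of those options using the current H2O Python API (H2ODeepLearningEstimator); the thread itself goes through the 2.x REST API, so the parameter names, file path, and response column below are modern equivalents and placeholders, not the exact 2014 calls.

import h2o
from h2o.estimators import H2ODeepLearningEstimator

h2o.init()
train = h2o.import_file("train.csv")                  # placeholder path
train["response"] = train["response"].asfactor()      # placeholder response column, made categorical for classification

dl = H2ODeepLearningEstimator(
    activation="RectifierWithDropout",
    hidden=[200, 200],
    epochs=2,
    input_dropout_ratio=0.2,
    l1=1e-5,        # L1 penalty (already used in the thread)
    l2=1e-5,        # add an L2 penalty as well
    max_w2=10       # bound the squared sum of incoming weights per neuron
)
dl.train(x=[c for c in train.columns if c != "response"], y="response", training_frame=train)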

Hope this helps.

Best regards,
Arno

vraman...@gmail.com

Sep 4, 2014, 6:43:06 PM
to h2os...@googlegroups.com, vraman...@gmail.com, arno....@gmail.com
Thanks, Arno. These are the parameters:

epochs=2&
activation=RectifierWithDropout&
score_training_samples=0&
classification=1&
train_samples_per_iteration=100000
&rho=0.99&
epsilon=1.0E-8&
rate=0.0050&
input_dropout_ratio=0.2&
l1=1.0E-5&
classification_stop=-1&
variable_importances=true&...

Arno Candel

Sep 4, 2014, 6:48:05 PM
to vraman...@gmail.com, h2os...@googlegroups.com
Venkatesh,

Is adaptive_rate=1 (i.e., left at the default value)? If so, the adaptive learning rate is used and the parameter 'rate' is ignored; otherwise, 'rho' and 'epsilon' are ignored (those apply only to the adaptive learning rate).

If you're running on multiple nodes, try reducing your 'train_samples_per_iteration'; that might help convergence. Of course, the right value depends on your dataset size (columns, rows) and the number of hidden layers. Can you share those numbers?
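For illustration, a minimal sketch of those two modes with the current H2O Python API (the names here are modern equivalents of the 2.x parameters above, not the exact 2014 call):

from h2o.estimators import H2ODeepLearningEstimator

# adaptive_rate=True (ADADELTA): rho and epsilon are used, rate is ignored.
dl_adaptive = H2ODeepLearningEstimator(adaptive_rate=True, rho=0.99, epsilon=1e-8,
                                       activation="RectifierWithDropout", epochs=2)

# adaptive_rate=False: the manual learning rate (plus rate_annealing) is used instead,
# and rho/epsilon are ignored.
dl_manual = H2ODeepLearningEstimator(adaptive_rate=False, rate=0.005, rate_annealing=1e-6,
                                     activation="RectifierWithDropout", epochs=2)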

Thanks,
Arno

vraman...@gmail.com

Sep 4, 2014, 7:00:31 PM
to h2os...@googlegroups.com, vraman...@gmail.com, arno....@gmail.com
Thanks, Arno.
-> I left adaptive_rate out, but the log says it's true:
04-Sep 10:44:54.145 10.190.125.185:54321 24363 FJ-0-23 INFO WATER: "adaptive_rate": "true",

->I'm running on single node
->2 hidden layers; 200,200
->dataset:
Training:
"num_cols": 980,
"num_rows": 14380546,

Test:
~ 3 million

Training finished successfully.
Thanks again.

Number of model parameters (weights/biases): 970,602

Arno Candel

Sep 4, 2014, 9:19:56 PM
to vraman...@gmail.com, h2os...@googlegroups.com
Venkatesh,

Everything looks fine. Are you using Tanh or Rectifier (Rectifier is the new default now)? Rectifier numerics can blow up more easily, since the activation is unbounded. But I've never seen this happen unless I manually set the initial weights or used a manually specified learning rate, momentum, etc.

Hmm, you said "training finished successfully"? So it works now? Or is the model still unstable? You can check model->model_info->unstable in the JSON response. If it says unstable=true, then the Job should have gotten cancelled, and it should report that somehow.
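As a sketch of that check: if you save the model's JSON response to a file (the file name and exact nesting here are assumptions; Arno's description gives the model -> model_info -> unstable path), you can walk it like this:

import json

with open("deeplearning_model.json") as f:   # placeholder: a saved copy of the model's JSON response
    model_json = json.load(f)

# Follow the path Arno describes; adjust the nesting if your H2O version differs.
unstable = model_json.get("model", {}).get("model_info", {}).get("unstable")
print("unstable:", unstable)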

Thanks,
Arno

vraman...@gmail.com

Sep 4, 2014, 10:54:56 PM
to h2os...@googlegroups.com, vraman...@gmail.com, arno....@gmail.com

I used Rectifier with Dropout.
I get the exception when I try to score a test set. Once it's stuck at this, I can never get a successful response.
I've seen this issue before, but I don't know a reproducible way to trigger it.

I'm going to cut down the dataset and try again.

Arno Candel

Sep 4, 2014, 11:07:06 PM
to vraman...@gmail.com, h2os...@googlegroups.com
Venkatesh,

Yeah, the Rectifier can sometimes blow up (not sure why for you, though). And then it's too late: model building terminates and the model has the unstable bit set, which prevents it from making predictions (they'd be useless). Are you saying it works sometimes? That might be due to intentional race conditions during multi-threaded training. Can you try to reduce train_samples_per_iteration (or use Tanh)?

Thanks,
Arno

vraman...@gmail.com

Sep 5, 2014, 12:40:40 AM
to h2os...@googlegroups.com, vraman...@gmail.com, arno....@gmail.com
Thanks, Arno. This is the first time I'm trying to score on this data; my training data didn't change.
I'll try both of your suggestions.
Thanks

vraman...@gmail.com

Sep 5, 2014, 1:25:42 PM
to h2os...@googlegroups.com, vraman...@gmail.com, arno....@gmail.com
I tried Tanh. It took 10 hours (vs. 2 hours) to train, but I don't get the exception on scoring.
Thanks a lot for the suggestion, Arno.

Arno Candel

Sep 5, 2014, 1:38:47 PM
to vraman...@gmail.com, h2os...@googlegroups.com
Venkatesh,

Yes, Tanh is slower - more back-propagation work, since its derivative is never zero. That also means it takes fewer Tanh neurons than Rectifier neurons to get the same job done. It depends on the dataset, though, and on what kind of non-linearities are needed for feature generation. For some datasets, large Rectifier networks beat large Tanh networks.
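A quick numeric illustration of that point, independent of H2O (plain NumPy):

import numpy as np

x = np.linspace(-5, 5, 101)
tanh_grad = 1.0 - np.tanh(x) ** 2      # tanh'(x) = 1 - tanh(x)^2 is > 0 for every finite x
relu_grad = (x > 0).astype(float)      # the Rectifier gradient is exactly 0 for x <= 0
print("min Tanh gradient:", tanh_grad.min())
print("fraction of zero Rectifier gradients:", (relu_grad == 0).mean())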

I'm still surprised that the Rectifier failed to converge; I cannot reproduce instabilities even with a manually chosen, really large learning rate and momentum. Did you try to reduce the number of training samples per iteration for Rectifier?

Thanks,
Arno

vraman...@gmail.com

Sep 5, 2014, 4:44:34 PM
to h2os...@googlegroups.com, vraman...@gmail.com, arno....@gmail.com
Thanks Arno.
I'm running now with Rectifier and 1,000 samples per iteration (vs. 100,000). Will let you know.
venkatesh

vraman...@gmail.com

Sep 8, 2014, 1:42:44 PM
to h2os...@googlegroups.com, vraman...@gmail.com, arno....@gmail.com

Hi Arno:
Using Rectifier resulted in an unstable model: same exception while scoring, but training finished, taking a couple of hours more than Tanh.
FYI
venkatesh

arno....@gmail.com

Sep 8, 2014, 2:15:30 PM
to h2os...@googlegroups.com, vraman...@gmail.com, arno....@gmail.com
Hi Venkatesh,

Thanks for trying this. Interesting: this is the first time I've seen Rectifier lead to an unstable model with close-to-default arguments. The reason it was so slow is the communication overhead (after every 1000 training points), which I was hoping would improve accuracy. The same test could have been achieved with 1 node (no model averaging, no communication overhead).

One thing that puzzles me: Why didn't the model get cancelled automatically? I have checks everywhere to abort training once the model is unstable. I will look into this.

One thing for you to try is setting epochs to a small value (even less than 1.0) and seeing whether the model is stable after only a little bit of training. Do you know how the accuracy changes as a function of epochs?
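A sketch of such an epoch sweep with the current H2O Python API (file paths and the response column are placeholders; the AUC call assumes a binomial classification problem, as in this thread):

import h2o
from h2o.estimators import H2ODeepLearningEstimator

h2o.init()
train = h2o.import_file("train.csv")     # placeholder paths
valid = h2o.import_file("valid.csv")
response = "response"                    # placeholder response column
train[response] = train[response].asfactor()
valid[response] = valid[response].asfactor()
predictors = [c for c in train.columns if c != response]

for ep in [0.1, 0.5, 1.0, 2.0]:
    dl = H2ODeepLearningEstimator(activation="RectifierWithDropout",
                                  hidden=[200, 200], epochs=ep)
    dl.train(x=predictors, y=response, training_frame=train, validation_frame=valid)
    print(ep, "epochs -> validation AUC:", dl.auc(valid=True))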

Without access to the data, I can’t do much more, unfortunately.

Thanks,
Arno
> > >...

vraman...@gmail.com

Sep 8, 2014, 2:38:57 PM
to h2os...@googlegroups.com, vraman...@gmail.com, arno....@gmail.com
Thanks, Arno. Here are some results; the AUC did drop a bit at 2 epochs (my termination criterion).

AUC vs. epochs:
0.001 epochs -> 80.5
0.5 epochs -> 87.26
1.0 epochs -> 87.79
1.5 epochs -> 87.85
2.0 epochs -> 87.28

I'll try with 0.5 epochs, as the improvement isn't significant after 0.5.

thanks
venkatesh

arno....@gmail.com

Sep 9, 2014, 2:03:35 AM
to h2os...@googlegroups.com, vraman...@gmail.com, arno....@gmail.com
Venkatesh,
Here's one more way to keep the Rectifier's activations bounded: set max_w2=10 instead of its default of infinity to limit the squared sum of incoming weights per neuron to 10.
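A conceptual sketch of what that constraint does (this is not H2O's internal code, just the idea): if the squared sum of a neuron's incoming weights exceeds max_w2, the weight vector is rescaled back onto the bound.

import numpy as np

def apply_max_w2(w_in, max_w2=10.0):
    sq_sum = float(np.sum(w_in ** 2))
    if sq_sum > max_w2:
        w_in = w_in * np.sqrt(max_w2 / sq_sum)   # after scaling, the squared sum equals max_w2
    return w_in

w = np.random.randn(980) * 0.5          # e.g., incoming weights of one hidden neuron (980 inputs, as above)
print(np.sum(apply_max_w2(w) ** 2))     # now at most 10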
Hope that helps,
Arno
> >...

vraman...@gmail.com

Sep 9, 2014, 1:23:08 PM
to h2os...@googlegroups.com, vraman...@gmail.com, arno....@gmail.com
Thanks Arno, I'll give it a try
> > ...
