levels NULL after applying as.factor

raj...@gmail.com

Nov 10, 2017, 10:57:55 AM
to H2O Open Source Scalable Machine Learning - h2ostream
Hi,

I'm applying the as.factor() function on the response column of an H2OFrame object but when I apply levels() on that I get a NULL value; however, is.factor() still gives TRUE. Would anyone know why this is happening?

Ultimately, the error I want to resolve is:

"Details: ERRR on field: _hidden: Model is too large: 184861614 parameters. Try reducing the number of neurons in the hidden layers (or reduce the number of categorical factors)"

My model has 67882 predictors and two hidden layers with 10 neurons each (and a categorical response with 4 levels), so I suspect the model is not really too big. Instead, I think the error relates to the number of categorical factors, which is somehow not being computed properly, as suggested by levels() returning NULL.

Any help would be much appreciated!

Thank you,
Rajat

Darren Cook

Nov 10, 2017, 11:52:03 AM
to h2os...@googlegroups.com
> My model has 67882 predictors and two hidden layers with 10 neurons each (and a categorical response with 4 levels), so I suspect the model is not really too big. Instead, I think the error relates to the number of categorical factors, which is somehow not being computed properly, as suggested by levels() returning NULL.

Before you call as.factor(), what does h2o.describe() show for your H2O
frame? The cardinality column tells you how big each factor is.
(Assuming you are using R; it is not there in Python.)

Darren

raj...@gmail.com

Nov 10, 2017, 1:36:59 PM
to H2O Open Source Scalable Machine Learning - h2ostream
Hi Darren,

Yes, I'm using R. The cardinality value for the response factor is indeed 4, and I checked that I'm getting the correct number in the Zeros column. Going back to the original error message (copied below), it looks like the issue is with the model size after all. Could you please comment on the number of parameters: is it unusually large, or more than is typically seen?

"Details: ERRR on field: _hidden: Model is too large: 184861614 parameters. Try reducing the number of neurons in the hidden layers (or reduce the number of categorical factors)"

Thanks very much,
Rajat

Darren Cook

Nov 10, 2017, 4:42:37 PM
to h2os...@googlegroups.com
> Yes I'm using R. The cardinality value for the response factor is indeed 4.

In an X x 10 x 10 x 4 network, you have only 100 + 40 weights in the last
two layers (plus biases). So almost all of your parameters must be between
the X inputs and the 10 first-layer neurons, implying X is about 18.5
million input neurons.

With 67,882 predictors you only need an average cardinality of 272 to
achieve that. But I'm guessing some have much more.
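
The arithmetic above can be checked with a short Python sketch. It assumes the reported total is one weight per connection plus one bias per hidden and output neuron; that is an assumption about how the number is counted, not something confirmed in this thread. The function name `input_width_from_params` is made up for illustration.

```python
def input_width_from_params(total_params, hidden, n_outputs):
    """Solve for the input width X of a dense X x hidden... x n_outputs network.

    Assumes total_params = weights + biases, with one bias per hidden
    and output neuron (an assumption, not confirmed H2O behaviour).
    """
    layers = list(hidden) + [n_outputs]
    # Weights between consecutive non-input layers, e.g. 10*10 + 10*4.
    later_weights = sum(a * b for a, b in zip(layers, layers[1:]))
    biases = sum(layers)
    # Whatever is left must be the X * hidden[0] input-to-first-layer weights.
    remaining = total_params - later_weights - biases
    return remaining / layers[0]


x = input_width_from_params(184_861_614, hidden=[10, 10], n_outputs=4)
avg_cardinality = x / 67_882

print(f"input neurons: {x:,.0f}")
print(f"average cardinality per predictor: {avg_cardinality:.1f}")
```

Under those assumptions the result is roughly 18.5 million input neurons and an average cardinality of about 272, in line with the figures quoted above.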

The h2o.describe() function will have told you cardinality for all the
columns. So just skim down it looking for any values over, say, 500.

Unless your domain knowledge tells you it is essential, I would exclude
any column with a cardinality above 500. (In fact, I'd want to be able
to justify including columns with cardinality above 100.)
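
As a minimal sketch of that screening step, the Python below flags columns above a cardinality threshold. The column names and cardinality values are hypothetical stand-ins for what you would read off the cardinality column of h2o.describe(), and `high_cardinality_columns` is an illustrative helper, not an H2O function.

```python
def high_cardinality_columns(cardinality, threshold=500):
    """Return names of columns whose cardinality exceeds threshold, largest first."""
    flagged = [(name, n) for name, n in cardinality.items() if n > threshold]
    flagged.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in flagged]


# Hypothetical values standing in for the cardinality column of h2o.describe().
cardinality = {
    "gene_a": 12,
    "gene_b": 4310,
    "gene_c": 87,
    "gene_d": 650,
    "response": 4,
}

to_drop = high_cardinality_columns(cardinality, threshold=500)
keep = [name for name in cardinality if name not in to_drop]

print("exclude:", to_drop)
print("keep:", keep)
```

Lowering `threshold` to 100 gives the stricter list of columns that would need a justification to stay in.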

BTW, if you get good results in the end, I'd be very curious about what
kind of domain it is, that you can extract useful information from
67,000 inputs with only 2 layers of 10 neurons. (I'm not saying it is
impossible, just that it goes against my intuition.)

Darren





--
Darren Cook, Software Researcher/Developer
My New Book: Practical Machine Learning with H2O:
http://shop.oreilly.com/product/0636920053170.do

raj...@gmail.com

Nov 24, 2017, 11:29:18 AM
to H2O Open Source Scalable Machine Learning - h2ostream

Thanks Darren. The average cardinality was that high because I was using continuous RPKMs as inputs; now I'm trying integer ceilings of the RPKMs. The model trains fine, but now I can't pass newdata properly when calling h2o.predict or confusionMatrix with the H2OFrame of the test object. Could you please have a look at my post about this at https://groups.google.com/forum/#!topic/h2ostream/0taDdN8b0uk when you get a chance?
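
To illustrate why continuous values inflate cardinality when a column is treated as a factor, here is a small Python sketch. The RPKM values are invented for the example; the point is only that every distinct continuous value becomes its own level, while integer ceilings merge nearby values.

```python
import math

# Hypothetical RPKM values for one predictor column.
rpkm = [0.0, 0.37, 1.92, 1.95, 2.08, 14.6, 14.61, 15.2]

# Treated as a factor, each distinct value would be a separate level.
levels_raw = len(set(rpkm))

# Integer ceilings collapse nearby values into a shared level.
levels_ceiled = len(set(math.ceil(v) for v in rpkm))

print("distinct raw values:", levels_raw)
print("distinct ceiling values:", levels_ceiled)
```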

I tried to run h2o.init with the ip and port manually specified, but I had to go with 127.0.0.1 because I'm in a cluster environment and the sys admin said that I don't have a fixed IP address (since there are many nodes available)...

Best regards,
Rajat

cand...@gmail.com

May 7, 2019, 9:18:14 PM
to H2O Open Source Scalable Machine Learning - h2ostream

I trained a stacked autoencoder for dimension reduction and got the same problem. The input data is a 5360x51000 table. I created the stacked autoencoder with (12000,6000,3000) layers, and it raised the same error. How should I handle this case? Did you solve the problem?

Darren Cook

May 8, 2019, 2:35:52 AM
to h2os...@googlegroups.com
>> I'm applying the as.factor()... Ultimately, the error I want to
>> resolve is: "Details: ERRR on field: _hidden: Model is too
>> large:... ...
>
> I trained stacked autoencoder for dimension reduction. I got the same
> problem. The input data is 5360x51000 table size. I create the
> stacked autoencoder with (12000,6000,3000) layers. It raised the same
> error. How to handle in this case? did you solve the problem?

Did the replies to the above question help?
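
For a rough size estimate, the same parameter arithmetic can be applied to the autoencoder described in the question. This Python sketch assumes all 51,000 columns are numeric (one input neuron each), that the autoencoder's output layer has the same width as its input, and that parameters are counted as weights plus one bias per non-input neuron. All of these are assumptions, and the smaller layer sizes are purely hypothetical.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a fully connected network with these layer widths."""
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])  # one bias per non-input neuron
    return weights + biases


n_inputs = 51_000  # assumes every column is numeric

# Hidden layers from the question, with an output layer as wide as the input.
total = dense_param_count([n_inputs, 12_000, 6_000, 3_000, n_inputs])

# Hypothetical smaller hidden layers, for comparison only.
smaller = dense_param_count([n_inputs, 1_000, 500, 250, n_inputs])

print(f"(12000, 6000, 3000): {total:,} parameters")
print(f"(1000, 500, 250):    {smaller:,} parameters")
```

Under these assumptions the (12000,6000,3000) network comes to about 855 million parameters, several times the 184,861,614 that was rejected earlier in this thread, and most of them sit in the first and last layers.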

As always, a reproducible example is good, or at least show some code
and/or some data (such as the summary of your model).

This kind of question is also well-suited to StackOverflow:
https://stackoverflow.com/questions/tagged/h2o (and
https://stackoverflow.com/help/mcve )