johnwa...@gmail.com
Apr 22, 2015, 4:18:34 PM
to h2os...@googlegroups.com
Hello,
First of all, I have just started playing with h2o, and thanks to the h2o team for this great software.
I have a question about the toy regression example below, where the Y variable is a function of X1 and X2 plus some noise. X1, X2, and epsilon are all standard normal draws (rnorm): Y = 0.05*X1 + 0.025*X2 + 0.05*X1*abs(X2) + epsilon
I split the data roughly in half, train a model on the first half, then predict for the second half. With OLS or deepnet, I get pretty similar results (at least for deepnet with sigmoid activation).
But with h2o's deeplearning, I find it hard to match the performance of OLS or deepnet.
Below I show the results for OLS, deepnet, and h2o.deeplearning, followed by the R code. For comparison, I use 2 hidden layers with 40 neurons each.
1) -- Summary of out-of-sample and in-sample R2s for a few setups:

    hidden  R2.out       R2.in        notes
A). OLS
    NA      0.008678     0.005985
B). deepnet: 5 different setups
    40,40   0.008600974  0.005916587  act="sigm"
    40,40   0.008635492  0.005937487  act="sigm", another run
    40,40   0.007608     0.008087     act="tanh"
    40,40   0.008037     0.00571      act="tanh", hidden dropout=0.5, visible dropout=0.5
    40,40   0.00755      0.005695     act="tanh", hidden dropout=0.75, visible dropout=0.75
C). h2o: 3 sample runs
    40,40   0.003824     0.01321      act="tanh", input_dropout_ratio = c(0), hidden_dropout_ratios = c(0.5,0.5)
    40,40   0.007317     0.008989     act="tanh", input_dropout_ratio = c(0.5), hidden_dropout_ratios = c(0.5,0.5)
    40,40   0.006079     0.009535     another run as above
From the above, OLS shows the best out-of-sample performance. This makes sense, as the data in this case is nearly linear (there is some small nonlinearity from the multiplication term, but I'm not sure the neural nets were able to model it).
With deepnet, at least with sigmoid activation, I could easily reproduce almost exactly the same results as OLS. With deepnet and tanh, the out-of-sample performance is slightly worse.
With h2o, however, we only have tanh (no sigmoid), and although I've tried many settings, it is very hard to match OLS.
In short, in this simple toy example, h2o is much harder to train to match the out-of-sample performance of deepnet or OLS. I wonder whether this is due to my choice of parameters, or whether something wasn't set up correctly?
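As a sanity check on the R2 values in the table, here is a minimal sketch of the population R2 that this data-generating formula implies. It assumes X1, X2, and epsilon are independent standard normals, as in the script; the variable names (`e_abs`, `var_signal`, `r2_linear`, `r2_ceiling`) are mine and only for illustration.

```r
# Population R2 implied by Y = b1*X1 + b2*X2 + b3*X1*abs(X2) + epsilon,
# assuming X1, X2, epsilon are independent standard normals.
b1 <- 0.05; b2 <- 0.025; b3 <- 0.05
sd_noise <- 1

e_abs <- sqrt(2 / pi)   # E|X2| for a standard normal

# Var(signal) = b1^2 + b2^2 + b3^2 * Var(X1*|X2|) + 2*b1*b3*Cov(X1, X1*|X2|),
# with Var(X1*|X2|) = 1 and Cov(X1, X1*|X2|) = E|X2|; cross terms with X2 are 0.
var_signal <- b1^2 + b2^2 + b3^2 + 2 * b1 * b3 * e_abs
var_y      <- var_signal + sd_noise^2

# Best linear predictor: slope on X1 is b1 + b3*E|X2|, slope on X2 is b2.
var_linear <- (b1 + b3 * e_abs)^2 + b2^2

r2_linear  <- var_linear / var_y   # ceiling for a model linear in X1 and X2
r2_ceiling <- var_signal / var_y   # ceiling for any function of X1 and X2

print(round(c(linear = r2_linear, ceiling = r2_ceiling), 5))
```

Under these assumptions the linear ceiling comes out near 0.0086, close to the OLS out-of-sample value in the table, and the overall ceiling near 0.0095, so the interaction term leaves only a small margin for a network to improve on OLS.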
2). Questions:
a). Does the h2o output unit also use a linear function (not tanh)?
By default, deepnet doesn't transform the output unit with a sigmoid for regression.
I would assume h2o does the same, but I haven't found this in its documentation.
b). What bothers me most is that, for h2o, the predicted values sometimes have pretty weird distributions (compared to the OLS or deepnet predictions). This can happen quite often, depending on the parameter settings.
For example, sometimes the distribution of predicted values is concentrated around the mean, and sometimes it has 2 peaks, as in the plot I tried to attach. For OLS and deepnet, the distribution of predicted values is usually smoother.
What causes h2o's uneven prediction distributions? Is it just random, and if so, why does it happen so often?
(OK, it seems it's not easy to attach a picture in Google Groups, so I'll just say that the OLS and deepnet predictions look roughly Gaussian with a peak around zero, while the h2o predictions in that particular run are not Gaussian at all, with 2 peaks at -0.1 and 0.1.)
c). Was it because of tanh vs. sigmoid? Why did tanh in deepnet also do worse than sigmoid?
For deepnet, I scaled my y variable into [0,1]. For h2o, I used the raw y variable (roughly mean 0 and std 1), but
according to the h2o documentation, it is automatically scaled to have mean 0 and unit std.
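To make the two target scalings concrete, here is a small sketch of min-max scaling to [0,1] (what I feed to deepnet) next to standardization to mean 0 and unit std (what I understand h2o to do internally, going by its documentation), together with the back-transforms. It runs on a synthetic vector, not on the data from this post, and all names (`y_train`, `y_01`, `y_z`, ...) are hypothetical.

```r
# Synthetic stand-in for the training target (not the data from this post).
set.seed(42)
y_train <- rnorm(1000)

# Min-max scaling to [0, 1], using statistics from the training rows only.
y_min   <- min(y_train)
y_range <- max(y_train) - y_min
y_01    <- (y_train - y_min) / y_range

# Standardization to mean 0 and unit standard deviation.
y_mean <- mean(y_train)
y_sd   <- sd(y_train)
y_z    <- (y_train - y_mean) / y_sd

# Back-transforms to the original scale, as needed after prediction.
y_from_01 <- y_01 * y_range + y_min
y_from_z  <- y_z * y_sd + y_mean
```

Both transforms are affine, so the two scaled targets are perfectly correlated with each other and with the raw y. Note that this sketch takes the scaling statistics from the training rows only, whereas my script below computes min and max on the full data before splitting.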
I would appreciate any comments on the difference between h2o and deepnet, and on how to tune h2o to match deepnet or OLS in this example.
Thanks,
John
3). ----- R code
# Toy regression example with OLS, deepnet, and h2o deeplearning methods.
###### 1). data preparation
### X1 and X2 are rnorm
### Y= 0.05*X1 + 0.025*X2 + 0.05*X1*abs(X2)+epsilon
set.seed(301)
N=5040 # number of rows
data <- as.data.frame( cbind(1:N, matrix(rnorm(2*N), ncol=2)) )
colnames(data) <- c("ID",paste("X",1:2,sep=""))
# noise
noise<- as.data.frame(rnorm(N))
names(noise)="noise"
b1=0.05
b2=0.025
b3=0.05
stdNoise=1
#yvar
yvar= as.data.frame(b1*data$X1+b2*data$X2+b3*data$X1*abs(data$X2)+stdNoise*noise)
names(yvar)=c("Y")
#combine all
data=cbind(data,yvar, noise)
#add scaled y such that it's between 0-1, for deepnet
min=min(data$Y)
max=max(data$Y)
scale=max-min
data$YScl<- (data$Y -min)/scale
# split data into train and test
dataTrain<- data[data$ID< 2600,]
dataTest<- data[data$ID>= 2600,]
summary(data)
# Output of summary(data):
#       ID               X1                  X2                   Y                 noise               YScl
# Min.   :   1   Min.   :-3.433766   Min.   :-3.201729   Min.   :-3.697055   Min.   :-3.580720   Min.   :0.0000
# 1st Qu.:1261   1st Qu.:-0.663225   1st Qu.:-0.683403   1st Qu.:-0.692370   1st Qu.:-0.673919   1st Qu.:0.4022
# Median :2520   Median :-0.027207   Median :-0.009168   Median : 0.017945   Median : 0.015181   Median :0.4972
# Mean   :2520   Mean   :-0.009551   Mean   : 0.000382   Mean   :-0.003449   Mean   :-0.002627   Mean   :0.4944
# 3rd Qu.:3780   3rd Qu.: 0.666797   3rd Qu.: 0.663464   3rd Qu.: 0.686218   3rd Qu.: 0.689008   3rd Qu.:0.5867
# Max.   :5040   Max.   : 3.886974   Max.   : 3.509467   Max.   : 3.774207   Max.   : 3.756700   Max.   :1.0000
##### 2). simple OLS
## train
fit1<-lm(Y ~ X1+X2,data=dataTrain)
summary(fit1)
predOLSIn=fit1$fitted
# test
predOLS<- predict.lm(fit1, newdata=dataTest)
plot(predOLS,dataTest$Y)
fit2<-lm(Y ~ predOLS,data=dataTest)
summary(fit2)
##### 3). deepnet: deep neural network by Xiao Rong
library(deepnet)
# deepnet wants x as matrix, also I feed scaled y to deepnet
xTrain <- as.matrix(model.matrix(~ X1+X2-1, dataTrain))
xTest <- as.matrix(model.matrix(~ X1+X2-1, dataTest))
yTrain <- dataTrain$YScl # note y is scaled
## deepnet uses sae( stacked auto-encoder) as pre-training for deep neural network
dnn <- sae.dnn.train(xTrain,
yTrain,
hidden = c(40,40), # 2 hidden layers are 40 neurons each
activationfun = "tanh", # can be sigm as well
learningrate = 0.01,
numepochs = 200,
hidden_dropout = 0.5,
visible_dropout = 0.5,
sae_output="linear", # output function of the SAE pre-training stage (default is linear); the final output unit is set by the separate 'output' argument
batchsize = 10)
## out-sample
## predict by dnn for test
predDN<- nn.predict(dnn, xTest)
# scale yScl back to original scale
predDN<-predDN*scale+min
plot(predDN,dataTest$Y)
fit2<-lm(Y ~ predDN,data=dataTest)
summary(fit2)
## insample
predDNIn<- nn.predict(dnn, xTrain)
predDNIn<-predDNIn*scale+min
plot(predDNIn,dataTrain$Y)
fit3<-lm(Y ~ predDNIn,data=dataTrain)
summary(fit3)
##### 4). h2o deep net
library(h2o)
localH2O <- h2o.init(ip = "localhost", port = 54321, startH2O = TRUE)
## Convert data into H2O
train_h2o <- as.h2o(localH2O, dataTrain)
test_h2o <- as.h2o(localH2O, dataTest)
#h2o deep net
model <- h2o.deeplearning(x=c(2:3), # x is the 2nd and 3rd columns (X1, X2)
y=4, # y is 4th column, un-scaled
data=train_h2o,
classification = FALSE,
hidden=c(40,40),
epochs=100,
activation="Tanh",
adaptive_rate=TRUE,
input_dropout_ratio = c(0.5), # % of inputs dropout
hidden_dropout_ratios = c(0.5,0.5), # % for nodes dropout
train_samples_per_iteration = -1 # use all data, not sampled
)
## outsample
## Converting H2O format into data frame
df_yhat_test <- as.data.frame(h2o.predict(model, test_h2o) )
predH2O<- df_yhat_test$predict
plot(predH2O,dataTest$Y)
fit2<-lm(Y ~ predH2O,data=dataTest)
summary(fit2)
## in sample
df_yhat_train <- as.data.frame(h2o.predict(model, train_h2o) )
predH2OIn<- df_yhat_train$predict
plot(predH2OIn,dataTrain$Y)
fit2<-lm(Y ~ predH2OIn,data=dataTrain)
summary(fit2)
pdf("hist_h2o.pdf")
par(mfrow=c(3,1))
hist(predOLSIn, nclass=100)
hist(predDNIn,nclass=100)
hist(predH2OIn,nclass=100)
dev.off()
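One note on how the R2 values above are obtained: `summary(lm(Y ~ pred))` reports the squared correlation between Y and the prediction, which ignores any offset or scale error in the prediction itself. The sketch below contrasts that with 1 - SSE/SST on synthetic data (not the data from this post); the helper names `r2_corr` and `r2_sse` are hypothetical, and I have not rerun my comparison with the second definition.

```r
# Squared correlation: the R2 reported by summary(lm(y ~ pred)).
r2_corr <- function(y, pred) cor(y, pred)^2

# 1 - SSE/SST: also penalizes predictions with the wrong offset or scale.
r2_sse <- function(y, pred) {
  1 - sum((y - pred)^2) / sum((y - mean(y))^2)
}

# Synthetic demo: weak linear signal plus unit-variance noise.
set.seed(1)
x        <- rnorm(1000)
y        <- 0.1 * x + rnorm(1000)
pred     <- 0.1 * x          # well-calibrated prediction
pred_off <- 3 * pred + 1     # same ordering, wrong scale and offset

print(c(corr = r2_corr(y, pred),     sse = r2_sse(y, pred)))
print(c(corr = r2_corr(y, pred_off), sse = r2_sse(y, pred_off)))
```

The squared correlation is identical for `pred` and `pred_off`, while 1 - SSE/SST drops sharply for the mis-scaled prediction. Applied to this post, the same helpers could be called as, for example, `r2_sse(dataTest$Y, predH2O)`, which would show whether a model is mis-calibrated rather than just less correlated with Y.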