How to get reproducible results from a deep learning model using the same input data


yali....@gmail.com

Aug 31, 2015, 2:38:41 PM
to H2O Open Source Scalable Machine Learning - h2ostream
Hi All,

I am trying to get exactly the same result from a deep learning model in H2O.
I first set the seed and set reproducible to True in the model parameters, but I still get very different results each time I run the model on the same input data. Is there any way to make it reproducible?

My H2O version is 3.0.0.26.

Could anyone help me with this? Thanks a lot.

Erin LeDell

Aug 31, 2015, 3:01:09 PM
to yali....@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream
Hi,
There is a logical argument called "reproducible" in h2o.deeplearning.
Set that to TRUE. It will be slow, but reproducible.

-Erin

Erin LeDell

Aug 31, 2015, 3:13:30 PM
to yali....@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream
Hi,
I failed to mention the key piece of info. You must be using a
single-threaded cluster in order for this to work. I will update the
docs to be more clear. Here is an R Demo:

library(h2o)
h2o.init(nthreads = 1)

# Import a sample binary outcome train/test set into R
train <-
read.table("http://www.stat.berkeley.edu/~ledell/data/higgs_10k.csv",
sep=",")
test <-
read.table("http://www.stat.berkeley.edu/~ledell/data/higgs_test_5k.csv", sep=",")


# Convert R data.frames into H2O parsed data objects
training_frame <- as.h2o(train)
validation_frame <- as.h2o(test)
y <- "V1"
x <- setdiff(names(training_frame), y)
family <- "binomial"
training_frame[,c(y)] <- as.factor(training_frame[,c(y)]) # Force binary classification
validation_frame[,c(y)] <- as.factor(validation_frame[,c(y)])

fit <- h2o.deeplearning(x = x, y = y, training_frame = training_frame,
reproducible = TRUE)
h2o.auc(fit)
#[1] 0.876274

fit2 <- h2o.deeplearning(x = x, y = y, training_frame = training_frame,
reproducible = TRUE)
h2o.auc(fit)
#[1] 0.876274

-Erin

Erin LeDell

Aug 31, 2015, 3:27:12 PM
to yali....@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream
OK, third and final answer (my previous email had a mistake in the code).
I realized that you don't actually need to explicitly set
`h2o.init(nthreads = 1)`; setting `reproducible = TRUE` will force H2O to
run single-threaded automatically.

However, what you are probably missing is setting the seed directly in
the h2o.deeplearning function (rather than using `set.seed(1)`).

This code shows the correct way to enforce reproducibility:

library(h2o)
h2o.init(nthreads = -1)

# Import a sample binary outcome train/test set into R
train <-
read.table("http://www.stat.berkeley.edu/~ledell/data/higgs_10k.csv",
sep=",")
test <-
read.table("http://www.stat.berkeley.edu/~ledell/data/higgs_test_5k.csv", sep=",")


# Convert R data.frames into H2O parsed data objects
training_frame <- as.h2o(train)
validation_frame <- as.h2o(test)
y <- "V1"
x <- setdiff(names(training_frame), y)
family <- "binomial"
training_frame[,c(y)] <- as.factor(training_frame[,c(y)]) # Force binary classification
validation_frame[,c(y)] <- as.factor(validation_frame[,c(y)])

fit <- h2o.deeplearning(x = x, y = y, training_frame = training_frame,
reproducible = TRUE, seed = 1)
h2o.auc(fit)
#[1] 0.873428

fit2 <- h2o.deeplearning(x = x, y = y, training_frame = training_frame,
reproducible = TRUE, seed = 1)
h2o.auc(fit2)
#[1] 0.873428
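
Rather than comparing the printed AUC values by eye, the same check can be scripted. The following is only a minimal sketch: `metrics_match` and `check_reproducible` are hypothetical helper names, not part of the h2o package, and the sketch assumes h2o is installed and `h2o.init()` has already been called. It relies only on `h2o.deeplearning` and `h2o.auc` as used in the demo above.

```r
# Hypothetical helper (not part of h2o): TRUE when two metric values
# agree within a small numeric tolerance.
metrics_match <- function(a, b, tol = 1e-8) {
  isTRUE(all.equal(a, b, tolerance = tol))
}

# Hypothetical helper: train the same model twice with reproducible = TRUE
# and an explicit seed, then compare the training AUC of the two fits.
# Assumes the h2o package is installed and a cluster is already running.
check_reproducible <- function(x, y, training_frame, seed = 1) {
  fit_once <- function() {
    h2o::h2o.deeplearning(x = x, y = y, training_frame = training_frame,
                          reproducible = TRUE, seed = seed)
  }
  auc_a <- h2o::h2o.auc(fit_once())
  auc_b <- h2o::h2o.auc(fit_once())
  metrics_match(auc_a, auc_b)
}

# Usage, with the objects defined in the demo above:
# check_reproducible(x, y, training_frame, seed = 1)
```

Passing the seed into the helper mirrors the point of the demo: the seed is supplied to `h2o.deeplearning` itself rather than set in the R session with `set.seed()`.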

yali....@gmail.com

Aug 31, 2015, 4:47:00 PM
to H2O Open Source Scalable Machine Learning - h2ostream
Hi Erin,

Thank you very much for your reply. It helps.
I had followed most of what you mentioned, except that I called set.seed() before calling the deep learning function rather than setting the seed in the parameters.
Now the results are reproducible.

Best,
Yali
