Deep Learning for binary classification


shaw38

May 19, 2016, 4:26:47 AM
to H2O Open Source Scalable Machine Learning - h2ostream
Hello H2O users! I am running H2O cluster version 3.8.1.3 from R version 3.2.2 on a 64-bit Windows Server 2008 R2 Standard machine. My dataset is in a comma-separated Excel sheet with 16384 attributes and 200 rows.
I am using H2O's Deep Learning function to predict mortality in a patient dataset, where the response variable is binary (i.e. the values are either '0' or '1'). The dataset has 16384 columns with no column names. All columns are numeric; there are no strings, characters, etc. When I execute the following code, I get mortality predictions as decimal values (shown below), whereas I need predictions that are either 0 or 1. How would I use Deep Learning to do this? Kindly help. Thanks a lot!

-> Part of the dataset:

0.131
0.297
0.633
0.492
0.704
0.747
0.491
0.698
0.738
0.481
0.771
0.532
0.311
0.496
0.001
0
0.638
0.009
0.991
0.44
0.414
0.009
0.021
0.999
0.773
0.01
0.032
0.01
0.006
0.042
0.988
0.993
1
0.549
0.577
0.99
0.719
0.534
0.028
0.008
0
0.569
0.983
0.985
0.025
0.022
0.6
0.374

-> Code:

library(h2o)
h2o.init(nthreads=-1)

patient.train <- h2o.importFile("C:\\Users\\patient\\patient_trainingset.csv")
patient.test <- h2o.importFile("C:\\Users\\patient\\patient_testset.csv")

dim(patient.train)
# [1]    83 16384
dim(patient.test)
# [1]    81 16384

y.dep <- 16384      # index of the response column
x.indep <- 1:16383  # indices of the predictor columns

system.time(dlearning.model3 <- h2o.deeplearning(y=y.dep, x=x.indep, training_frame=patient.train,
    activation="RectifierWithDropout", hidden=c(1200,50), epochs=100))
#  user   system  elapsed
#  6.42     0.25   668.32

h2o.performance(dlearning.model3)
# ** Reported on training data. **
# Description: Metrics reported on full training frame
#
# MSE:  0.01964181
# R2 :  0.851305
# Mean Residual Deviance :  0.01964181

# Predict on the test set and write the predictions to a CSV file
predict.dl2 <- as.data.frame(h2o.predict(dlearning.model3, patient.test))
submi_dlearning3 <- data.frame(Predicted_Mort=predict.dl2$predict)
write.csv(submi_dlearning3, file="submi_dlearning3.csv", row.names=FALSE)

-> Part of the current output:

Predicted_Mort
0.131749662
0.14698337
0.155728288
0.130509461
0.130420171
0.133914652
0.134027134
0.124258962
0.136126049
0.136019254
0.122301849
.
.
.

 

 

Darren Cook

May 19, 2016, 4:47:59 AM
to h2os...@googlegroups.com
> I am using H2O’s Deep Learning function to predict the mortality in a patient
> dataset ,where the response variable is binary (i.e values are either ‘0’ or
> ‘1’). My dataset has 16384 columns,with NO column-names. ...

To do a binomial classification you need to set your response variable
to be "enum" type. You can do this on Flow, after loading the dataset.

Or from R:

data = h2o.importFile("yourdata.csv")
data[,16384] = as.factor(data[,16384])

With no column names, you need to know the index of the response
variable column; I guessed it was the last one.
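
As a rough illustration of why the column type matters, here is a minimal base-R sketch. It uses a plain data.frame rather than an H2OFrame, and the toy values and the names `toy`, `V1`, `V2`, `V3` are made up for the example. The idea carries over: calling `as.factor()` on an H2OFrame column makes it an enum column, and H2O builds a classifier rather than a regression model when the response is an enum.

```r
# Toy stand-in for the real data: two numeric predictors and a 0/1 response
# in the last column (names and values are illustrative only).
toy <- data.frame(V1 = c(0.131, 0.297, 0.633, 0.492),
                  V2 = c(0.704, 0.747, 0.491, 0.698),
                  V3 = c(0, 1, 0, 1))

resp <- ncol(toy)                  # index of the response column (the last one)
is.numeric(toy[, resp])            # TRUE: read as a number, so a model would regress on it

toy[, resp] <- as.factor(toy[, resp])
is.factor(toy[, resp])             # TRUE: now categorical
levels(toy[, resp])                # the two classes, "0" and "1"
```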

Darren





Reena Shaw Muthalaly

May 19, 2016, 7:03:48 AM
to dar...@dcook.org, H2O Open Source Scalable Machine Learning - h2ostream

Hello Darren,

Thank you for the suggestion. I tried it in two variations, which differ in how `y` is set, and received the following error messages:

-> Commands:
(1) patient.train<- h2o.importFile("C:\\Users \\patient\\patient_trainingset.csv")

patient.test<-h2o.importFile("C:\\Users\\patient\\patient_testset.csv")


 x.indep<-1:16383 
 patient.train[,16384]=as.factor(patient.train[,16384])

system.time(dlearning.model5<-h2o.deeplearning(y=patient.train[,16384],x=x.indep,training_frame=patient.train,activation="RectifierWithDropout",hidden=c(1200,50),epoch=100))

Error in .verify_dataxy(training_frame, x, y, autoencoder) : 
  `y` must be a column name or index
Timing stopped at: 0.08 0 0.58 


(2) x.indep<-1:16383 
y.dep=as.factor(patient.train[,16384])
 
system.time(dlearning.model5<-h2o.deeplearning(y=y.dep,x=x.indep,training_frame=patient.train,activation="RectifierWithDropout",hidden=c(1200,50),epoch=100))

Error in .verify_dataxy(training_frame, x, y, autoencoder) : 
  `y` must be a column name or index
Timing stopped at: 0.06 0 0.22 


--> N.B.: Yes, the response variable is the 16384th column, whose first few rows in 'patient.train' look like:

0
0
0
0
0
0
0
0
0
0
0
0


.
.
.




Darren Cook

May 19, 2016, 7:33:20 AM
to Reena Shaw Muthalaly, H2O Open Source Scalable Machine Learning - h2ostream
> x.indep<-1:16383
> patient.train[,16384]=as.factor(patient.train[,16384])
>
> system.time(dlearning.model5<-h2o.deeplearning(y=patient.train[,16384],x=x.indep,training_frame=patient.train,activation="RectifierWithDropout",hidden=c(1200,50),epoch=100))
>
> Error in .verify_dataxy(training_frame, x, y, autoencoder) :
> `y` must be a column name or index

y should be `16384`, not `patient.train[,16384]`. I.e.

system.time(dlearning.model5<-h2o.deeplearning(y=16384,x=x.indep,training_frame=patient.train,activation="RectifierWithDropout",hidden=c(1200,50),epoch=100))

Darren

shaw38

May 24, 2016, 9:16:45 AM
to H2O Open Source Scalable Machine Learning - h2ostream, rms...@gmail.com
Hello,

--> I was able to convert column #16384 into a factor, but now I am stuck at the 'h2o.predict()' function. I don't understand why the 'h2o.predict' command fails now, when it worked perfectly earlier (as in the earlier code snippet: predict.dl2<-as.data.frame(h2o.predict(dlearning.model3,patient.test))).

--> I'm pretty sure I'm making a trivial mistake, but I can't see where I'm going wrong. Can you help me out? I'd appreciate it so much! I will also attach part of the test and training datasets in the next post, in case that helps.

--> Here is the entire code.

> library(h2o)
> localH2O=h2o.init(nthreads=-1)
> s_train<-h2o.importFile("C:\\Users\\Desktop\\snp_trainingset_70_13.csv")
> s_test<-h2o.importFile("C:\\Users\\Desktop\\snp_testset_69_12.csv")
> dim(s_train)
[1]    83 16384
> dim(s_test)
[1]    81 16384
> s_train[,16384]=as.factor(s_train[,16384])
> x.indep<-1:16383
> # now, y=16384
> system.time(dlearning.model6<-h2o.deeplearning(y=16384,x=x.indep,training_frame=s_train,activation="RectifierWithDropout",hidden=c(1200,50),epoch=100))
  |============================================| 100%
   user  system elapsed
   3.18    0.04   98.04

> h2o.performance(dlearning.model6)
H2OBinomialMetrics: deeplearning
** Reported on training data. **
Description: Metrics reported on full training frame

MSE:  6.163486e-16
R^2:  1
LogLoss:  3.458327e-09
AUC:  1
Gini:  1

Confusion Matrix for F1-optimal threshold:
        0  1    Error   Rate
0      70  0 0.000000  =0/70
1       0 13 0.000000  =0/13
Totals 70 13 0.000000  =0/83

Maximum Metrics: Maximum metrics at their respective thresholds
                      metric threshold    value idx
1                     max f1  1.000000 1.000000   1
2                     max f2  1.000000 1.000000   1
3               max f0point5  1.000000 1.000000   1
4               max accuracy  1.000000 1.000000   1
5              max precision  1.000000 1.000000   0
6                 max recall  1.000000 1.000000   1
7            max specificity  1.000000 1.000000   0
8           max absolute_MCC  1.000000 1.000000   1
9 max min_per_class_accuracy  1.000000 1.000000   1

Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

 


--> And here is where I get stuck, at the 'h2o.predict' function:

> pred6<-as.data.frame(h2o.predict(dlearning.model6,s_test))

Error message:

ERROR: Unexpected HTTP Status code: 404 Not Found (url = http://localhost:54321/4/Predictions/models/DeepLearning_model_R_1464053711040_1/frames/RTMP_sid_8367_4)

water.exceptions.H2OKeyNotFoundArgumentException
 [1] "water.api.ModelMetricsHandler.predict2(ModelMetricsHandler.java:236)"
 [2] "sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)"
 [3] "sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)"
 [4] "sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)"
 [5] "java.lang.reflect.Method.invoke(Method.java:606)"
 [6] "water.api.Handler.handle(Handler.java:62)"
 [7] "water.api.RequestServer.handle(RequestServer.java:653)"
 [8] "water.api.RequestServer.serve(RequestServer.java:594)"
 [9] "water.JettyHTTPD$H2oDefaultServlet.doGeneric(JettyHTTPD.java:616)"
[10] "water.JettyHTTPD$H2oDefaultServlet.doPost(JettyHTTPD.java:564)"
[11] "javax.servlet.http.HttpServlet.service(HttpServlet.java:755)"
[12] "javax.servlet.http.HttpServlet.service(HttpServlet.java:848)"
[13] "org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)"

Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page,  :

ERROR MESSAGE:

Object 'RTMP_sid_8367_4' not found in function: predict for argument: frame






shaw38

May 24, 2016, 9:38:08 AM
to H2O Open Source Scalable Machine Learning - h2ostream, rms...@gmail.com
 
 
Here is a part of the training and test datasets.

I really hope we can find a solution to this issue.

Thank you so much.
5_train.xlsx
5_test.xlsx

Lauren DiPerna

May 24, 2016, 7:14:14 PM
to shaw38, H2O Open Source Scalable Machine Learning - h2ostream
It looks like the response column in your training dataset is categorical, while the response column in your test set is continuous.

A quick look at the top of test shows the following:

> head(s_test[,16384])
  C16384
1  0.471
2  0.466
3  0.403
4  0.981
5  0.038

Whereas train shows:

> head(s_train[,16384])
  C16384
1      0
2      1
3      0
4      0
5      1

Is this what you expected? For h2o.predict() to work, you need the same type of label in both your train and test sets.

Do you get an error when you try the following? (If your response column is discrete, you'll need to convert the test response column to a factor as well.)

s_test[,16384]=as.factor(s_test[,16384])



Lauren DiPerna

May 25, 2016, 1:37:20 PM
to Reena Shaw Muthalaly, H2O Open Source Scalable Machine Learning - h2ostream
Just to double-check: for the files you attached, it looks like your training response is categorical and your test response is continuous. Is it possible these got mixed up, or perhaps I reversed them? If it were the reverse (train continuous, test encoded as 0/1), it should evaluate.

If your test response column is numeric, then you don't need to convert it with as.factor(). (as.factor() only converts integers to enum type; for example, if you had 0/1 in your response column, you could use as.factor() to indicate that 0 and 1 are two categorical classes.)

If the train response column is in fact continuous, remove the step where you do `s_train[,16384]=as.factor(s_train[,16384])`, since you would want this column to remain continuous. That should fix your issue.

When you run the following, do you get the same output or the opposite?



> head(s_test[,16384])
  C16384
1  0.471
2  0.466
3  0.403
4  0.981
5  0.038

Whereas train shows:

> head(s_train[,16384])
  C16384
1      0
2      1
3      0
4      0
5      1


On Wed, May 25, 2016 at 7:21 AM, Reena Shaw Muthalaly <rms...@gmail.com> wrote:
Hello Lauren,

Yes, this is what I expected: the 16384th column in the training set is continuous, while the response column in the test set should be categorical.

Also, I do get an error when executing the command s_test[,16384]=as.factor(s_test[,16384]). The error is as follows:
ERROR: Unexpected HTTP Status code: 400 Bad Request (url = http://localhost:54321/99/Rapids)

java.lang.IllegalArgumentException
 [1] "water.rapids.ASTColSlice.col_select(ASTColSlice.java:39)"                             
 [2] "water.rapids.ASTColSlice.apply(ASTColSlice.java:25)"                                  
 [3] "water.rapids.ASTExec.exec(ASTExec.java:46)"                                           
 [4] "water.rapids.ASTAsFactor.apply(ASTStrList.java:104)"                                  
 [5] "water.rapids.ASTExec.exec(ASTExec.java:46)"                                           
 [6] "water.rapids.ASTAppend.apply(ASTAssign.java:231)"                                     
 [7] "water.rapids.ASTAppend.apply(ASTAssign.java:225)"                                     
 [8] "water.rapids.ASTExec.exec(ASTExec.java:46)"                                           
 [9] "water.rapids.ASTTmpAssign.apply(ASTAssign.java:260)"                                  
[10] "water.rapids.ASTTmpAssign.apply(ASTAssign.java:253)"                                  
[11] "water.rapids.ASTExec.exec(ASTExec.java:46)"                                           
[12] "water.rapids.Session.exec(Session.java:56)"                                           
[13] "water.rapids.Exec.exec(Exec.java:63)"                                                 
[14] "water.api.RapidsHandler.exec(RapidsHandler.java:25)"                                  
[15] "sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)"                          
[16] "sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)"        
[17] "sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)"
[18] "java.lang.reflect.Method.invoke(Method.java:606)"                                     
[19] "water.api.Handler.handle(Handler.java:62)"                                            
[20] "water.api.RequestServer.handle(RequestServer.java:653)"                               
[21] "water.api.RequestServer.serve(RequestServer.java:594)"                                
[22] "water.JettyHTTPD$H2oDefaultServlet.doGeneric(JettyHTTPD.java:616)"                    
[23] "water.JettyHTTPD$H2oDefaultServlet.doPost(JettyHTTPD.java:564)"                       
[24] "javax.servlet.http.HttpServlet.service(HttpServlet.java:755)"                         
[25] "javax.servlet.http.HttpServlet.service(HttpServlet.java:848)"                         
[26] "org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)"               

Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page,  : 
  

ERROR MESSAGE:

Column must be an integer from 0 to 16382

--> How do we work around this?

Thank you so much.

--
Regards,
Reena Shaw

Darren Cook

May 25, 2016, 2:11:56 PM
to h2os...@googlegroups.com
> If it is the case that train response column is continuous, remove the step
> where you do `s_train[,16384]=as.factor(s_train[,16384])` ...

Though the thread subject says "binary classification". I wonder if it
is expected 0 to 0.5 are supposed to be 0, and 0.5 to 1.0 are supposed
to be 1?


If so (and apologies for thread-hijacking, if not) it seems there are
two approaches:

* Turn the continuous variable into 2 categories, then make a binary
classification model.

* Make a regression model, then take the predicted continuous value and
convert that to 0 or 1.
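
The second approach can be sketched in base R as follows. This is only an illustration under assumptions: the helper name `to_class` and its `threshold` argument are hypothetical (not part of H2O), and the 0.5 cut-off is the one guessed at above, not something the data dictates.

```r
# Map continuous predictions to 0/1 class labels using a cut-off.
# `to_class` and `threshold` are illustrative names, not part of H2O.
to_class <- function(pred, threshold = 0.5) {
  stopifnot(is.numeric(pred))
  as.integer(pred >= threshold)
}

# First few regression predictions posted earlier in the thread
pred <- c(0.131749662, 0.14698337, 0.155728288, 0.130509461)
to_class(pred)                 # all below the cut-off, so all class 0
to_class(c(0.2, 0.5, 0.93))    # a value exactly at the cut-off goes to class 1
```

Applied to the regression output from the first post, this would be something like `to_class(predict.dl2$predict)`. Whether values exactly at the threshold belong to class 0 or class 1 is a choice the modeller has to make.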

Is one approach clearly better than the other? (I have this exact
question in one of my projects, though with more than 2 categories, so
if the answer is "it depends", I'd be very interested in academic paper
references on this topic, etc.)

Thanks,
Darren

Lauren DiPerna

May 26, 2016, 6:38:58 PM
to Reena Shaw Muthalaly, H2O Open Source Scalable Machine Learning - h2ostream
Hi Reena,

You can convert your test file to have a binary response column (as you have in your train file) by using the h2o.ifelse() function in R. It can replace continuous response values with 0 for any value less than 0.5 and with 1 for any value greater than or equal to 0.5.

So the lines of code you need are:
s_train[,16384] <- as.factor(s_train[,16384]) #you still want this to be categorical, and it should convert because it was originally an integer
s_test[,16384] <- h2o.ifelse(s_test[,16384] >= 0.5, 1, 0) # for all values greater or equal to .5 replace them with 1, if not replace with 0
s_test[,16384] <- as.factor(s_test[,16384]) # convert the integer test column to a categorical column
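
To see what these three lines do without a running H2O cluster, here is a small base-R analogue. It works on plain numeric vectors rather than H2OFrame columns, reuses the head of each response column printed earlier in the thread, and the variable names (`train_resp`, `test_bin`, and so on) are made up for the sketch.

```r
# Head of the two response columns as printed earlier in the thread
train_resp <- c(0, 1, 0, 0, 1)
test_resp  <- c(0.471, 0.466, 0.403, 0.981, 0.038)

# Same rule as the h2o.ifelse() line: >= 0.5 becomes 1, everything else 0
test_bin <- ifelse(test_resp >= 0.5, 1, 0)

# Convert both to factors with an explicit, shared set of levels
train_f <- factor(train_resp, levels = c(0, 1))
test_f  <- factor(test_bin,  levels = c(0, 1))

identical(levels(train_f), levels(test_f))   # TRUE: both sides use the same classes
table(test_f)                                # class counts in the binarized test column
```

Passing `levels` explicitly is a defensive choice for the sketch: it keeps both factors on the same two classes even if one of the samples happens to contain only zeros.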

cheers,

Lauren

On Thu, May 26, 2016 at 6:01 AM, Reena Shaw Muthalaly <rms...@gmail.com> wrote:
Hi Lauren,
The last column in the training set has been made categorical (through the command that Darren suggested: s_train[,16384]=as.factor(s_train[,16384])). The last column in the training set contains only 0/1, as it denotes 'mortality', while the first 16383 columns contain continuous values.

The test set has 16384 columns filled with continuous values. There is no 0/1 in any column (because we need to predict it to be 0/1, i.e. we need to predict the 'mortality').

My original intention when starting out on this problem was summed up well by Darren in the previous reply: "* Turn the continuous variable into 2 categories, then make a binary classification model."

Though now, since I'm not able to move forward, I think I'd be okay with adopting his other approach: "Though the thread subject says "binary classification". I wonder if it is expected 0 to 0.5 are supposed to be 0, and 0.5 to 1.0 are supposed to be 1?" and "* Make a regression model, then take the predicted continuous value and convert that to 0 or 1."

So, to tweak my question:
1) How can I correctly load the test and train files? (Train file: the last (16384th) column is 0/1. Test file: all 16384 columns have continuous values. I do not wish to delete any columns in either file.)
2) How do I set a threshold for the prediction results from the regression model, so that 0 to 0.5 become 0 and 0.5 to 1.0 become 1?