Re: as.h2o function in R h2o package


Tom Kraljevic

Sep 28, 2015, 11:55:49 AM
to Hideyoshi Maeda, H2O Open Source Scalable Machine Learning - h2ostream

Technically yes, but it isn’t that easy to do.
There is a REST API called /3/PostFile.bin.
If you look at what h2o.uploadFile does, it uses this under the hood.

I recommend sticking to as.h2o() for small stuff and h2o.importFile() for big data.


Tom


On Sep 28, 2015, at 8:33 AM, Hideyoshi Maeda <hideyos...@gmail.com> wrote:

> Hi Tom,
>
> I am using the h2o package in R.
>
> And using the as.h2o() function. I noticed that this writes the data as a CSV to disk. Is there a way to transfer the data directly from memory in R to memory in h2o?
>
> Regards,
>
> Hiddi
>
>

Hideyoshi Maeda

Sep 28, 2015, 12:02:26 PM
to Tom Kraljevic, H2O Open Source Scalable Machine Learning - h2ostream
Thank you for your advice regarding h2o.importFile(). However, our files are large and take a long time to write to disk, which is why I wanted an in-memory route... Will investigate the /3/PostFile.bin REST API.

Would Spark also be an option, to move data from R -> Spark -> h2o? Or perhaps R -> Python -> h2o? Is there a path you would recommend?

Regards,

Hiddi





Tom Kraljevic

Sep 28, 2015, 12:06:34 PM
to Hideyoshi Maeda, H2O Open Source Scalable Machine Learning - h2ostream

what is the actual data source?

Tom Kraljevic

Sep 28, 2015, 12:07:20 PM
to Hideyoshi Maeda, H2O Open Source Scalable Machine Learning - h2ostream

can you be more specific about “large” and “long”?
how big is your h2o cluster?

Hideyoshi Maeda

Sep 29, 2015, 3:09:13 AM
to Tom Kraljevic, H2O Open Source Scalable Machine Learning - h2ostream
I have 200 million rows and 300 columns.

It takes several hours to write to disk, and much less time to upload to h2o.

My instance has 250 GB max memory and 128 cores.

Any suggestions on the other part of my question about the process?

Would Spark also be an option, to move data from R -> Spark -> h2o? Or perhaps R -> Python -> h2o? Is there a path you would recommend?

Thanks,

Hiddi

Tom Kraljevic

Sep 29, 2015, 10:28:57 AM
to Hideyoshi Maeda, H2O Open Source Scalable Machine Learning - h2ostream

is this one big machine or many machines?

where does the data actually live?

Sent from my iPhone

Hideyoshi Maeda

Sep 29, 2015, 10:31:39 AM
to Tom Kraljevic, H2O Open Source Scalable Machine Learning - h2ostream
One big machine, and the data is created and manipulated in R.

Tom Kraljevic

Sep 29, 2015, 10:44:33 AM
to Hideyoshi Maeda, H2O Open Source Scalable Machine Learning - h2ostream

i see.

then what you might try is creating a RAMdisk partition big enough to hold the resulting CSV file.

then it "looks like" you are writing to disk, and the toolchain is unchanged, but you are really writing to memory and it will be a lot faster (assuming that the disk is really the bottleneck and not R itself).

(if R itself turns out to be the bottleneck, then i'm not sure what you would do at that point...)
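Tom's RAMdisk suggestion can be sketched as follows. This is a minimal sketch assuming a Linux host where /dev/shm is already a tmpfs mount (so no root is needed, unlike a manual `mount -t tmpfs`); the staging directory and file names are hypothetical.

```shell
# Stage the CSV on a RAM-backed path instead of a real disk.
STAGE=/dev/shm/h2o_staging
mkdir -p "$STAGE"

# Stand-in for R's write.csv(): bytes written here live in RAM, not on disk.
cat > "$STAGE/frame.csv" <<'EOF'
x,y
1,a
2,b
EOF

# From R you would then call h2o.uploadFile("/dev/shm/h2o_staging/frame.csv")
# exactly as before -- the toolchain is unchanged.
ls -l "$STAGE/frame.csv"
```

The point of the trick is that nothing downstream has to know the "file" never touched a disk.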

tom

Sent from my iPhone

Hideyoshi Maeda

Sep 29, 2015, 11:09:02 AM
to Tom Kraljevic, H2O Open Source Scalable Machine Learning - h2ostream
Interesting solution,

Thanks for your suggestions.

I have not been able to find out how to use the /3/PostFile.bin REST API. Where could I find some documentation about it, and perhaps examples of how it is used?

Regards,

Hiddi

Tom Kraljevic

Sep 29, 2015, 11:25:45 AM
to Hideyoshi Maeda, H2O Open Source Scalable Machine Learning - h2ostream

The best documentation would be the code for h2o.uploadFile.  Here it is:



And here is how to upload a raw file as bytes:
//
// Here is an example of how to upload a file from the command line.
//
// curl -v -F "file=@allyears2k_headers.zip" "http://localhost:54321/PostFile.bin?destination_frame=a.zip"
//
// JSON Payload returned is:
// { "destination_frame": "key_name", "total_bytes": nnn }
//

Note that after uploading a raw file, it still needs to be parsed.

You might write an R program using h2o.uploadFile() and capture the REST API transactions with h2o.startLogging() and look at them.
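Pieced together from the curl example and the note about parsing above, the REST sequence h2o.uploadFile() drives looks roughly like the following. The /3/ParseSetup and /3/Parse endpoint names are assumptions best confirmed with h2o.startLogging() against your own build, and the sketch only writes the commands to a script rather than executing them, since they need a live H2O server on localhost:54321.

```shell
H2O="http://localhost:54321"

cat > upload_and_parse.sh <<EOF
# 1) upload the raw bytes; they land in the K/V store under destination_frame
curl -F "file=@frame.csv" "$H2O/3/PostFile.bin?destination_frame=frame.csv"
# 2) uploading is not parsing: ask the server to guess separators/column types
curl -X POST "$H2O/3/ParseSetup" --data-urlencode 'source_frames=["frame.csv"]'
# 3) trigger the actual parse, turning the raw bytes into a usable H2O frame
curl -X POST "$H2O/3/Parse" --data-urlencode 'destination_frame=frame.hex' ...
EOF

cat upload_and_parse.sh
```

The trailing `...` on the /3/Parse call is deliberate: the full parameter list (column types, separator, header flag) is exactly what capturing the logged transactions would show.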


Tom

Hideyoshi Maeda

Sep 29, 2015, 12:35:19 PM
to Tom Kraljevic, H2O Open Source Scalable Machine Learning - h2ostream
I haven't quite tried all of this yet, but in your curl example, the form argument (-F) is given a specific file. How do we use it for R objects in RAM, since those are not files that can be referenced?

curl -v -F "file=@allyears2k_headers.zip" "http://localhost:54321/PostFile.bin?destination_frame=a.zip"

i.e. in the above, I do not have an equivalent of allyears2k_headers.zip

Furthermore, even if I log the h2o.uploadFile() transactions, it would still be a file being uploaded, not the R object in memory.

Is my thinking correct?

Regards,

Hiddi

Tom Kraljevic

Sep 29, 2015, 12:58:23 PM
to Hideyoshi Maeda, H2O Open Source Scalable Machine Learning - h2ostream

On Sep 29, 2015, at 9:35 AM, Hideyoshi Maeda <hideyos...@gmail.com> wrote:

I haven't quite tried all of this yet, but in your curl example, the form argument (-F) is given a specific file…

It’s an example of how to use the REST API to send raw data to h2o.
The raw data might be in a file or might be streamed somehow.
In this example, curl knows how to take data from a file and send it.
In your case, you would need to figure out how to do that.


how do we use it for R objects in RAM, as this is not a file that can be referenced?

Sorry, I don’t know.


curl -v -F "file=@allyears2k_headers.zip" "http://localhost:54321/PostFile.bin?destination_frame=a.zip"

i.e. in the above, I do not have an equivalent of allyears2k_headers.zip

Furthermore, even if I log the h2o.uploadFile() transactions, it would still be a file being uploaded, not the R object in memory.

Is my thinking correct?

The RAMdisk approach would definitely be easier if it works.


Tom

Hideyoshi Maeda

Sep 30, 2015, 12:43:14 PM
to Tom Kraljevic, H2O Open Source Scalable Machine Learning - h2ostream
Just wanted to check: would a RAMdisk be created by doing the following...

sudo mount -t tmpfs -o size=2048M tmpfs /media/ramdisk

for 2 GB, so that up to 2 GB of data can be written as CSV to /media/ramdisk, and then h2o.uploadFile() can be run referencing that file?

Thanks,

Hiddi

Tom Kraljevic

Oct 1, 2015, 12:02:48 PM
to Hideyoshi Maeda, Tom Kraljevic, H2O Open Source Scalable Machine Learning - h2ostream

yes, that's the right idea.

Sent from my iPhone
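Before writing a large CSV to the new mount, it may be worth verifying that the target path really is tmpfs-backed. A small sketch, using /dev/shm as a stand-in path since Linux provides it as tmpfs out of the box; substitute /media/ramdisk once the mount above is in place.

```shell
# stat -f -c %T (GNU coreutils) prints the filesystem type of a path;
# for a RAM-backed path it should say "tmpfs".
fstype=$(stat -f -c %T /dev/shm)
echo "$fstype"

# df confirms the size roughly matches the size= you asked mount for
df -h /dev/shm
```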

Hideyoshi Maeda

Oct 5, 2015, 8:29:10 AM
to Tom Kraljevic, Tom Kraljevic, H2O Open Source Scalable Machine Learning - h2ostream
Hi Tom,

My early tests suggest that the write speed to the RAMdisk is not hugely different, perhaps marginally quicker.

I have been doing a bit more thinking, and was wondering if h2o can take a stream of data as an input, to avoid writing to disk. I.e., if the data is generated in R as a data.frame, could it be 'streamed' into h2o as an input, with an existing model then used to predict?

Is this possible? And are there any suggestions on how to do this?

Regards,

Hiddi


Hideyoshi Maeda

Oct 5, 2015, 8:49:52 AM
to Tom Kraljevic, Tom Kraljevic, H2O Open Source Scalable Machine Learning - h2ostream
Are there any plans to include .rds files that reference a data.frame or data.table as acceptable data types for h2o.uploadFile()?


Hideyoshi Maeda

Oct 5, 2015, 9:46:23 AM
to Tom Kraljevic, Tom Kraljevic, H2O Open Source Scalable Machine Learning - h2ostream
Perhaps the 'stream' package could be used to implement the input stream in R? https://cran.r-project.org/web/packages/stream/index.html

Hideyoshi Maeda

Oct 5, 2015, 10:05:12 AM
to Tom Kraljevic, Tom Kraljevic, H2O Open Source Scalable Machine Learning - h2ostream
Given that I have already trained my model, and it is now mainly being used for prediction, would getting the model POJO out be worthwhile? The aim would be to use the .java file as the model predictor, and feed the data directly as input to the model to get an output prediction, without writing to disk.

Tom Kraljevic

Oct 5, 2015, 3:31:02 PM
to Hideyoshi Maeda, H2O Open Source Scalable Machine Learning - h2ostream

Hi, sorry, I don’t have any helpful ideas for how to get data out of R faster.

From a computer science standpoint, the ramdisk would have done it if the I/O system was the bottleneck.
So R itself is by far the most likely bottleneck.

POJO probably won’t save you here; the majority of the cost would still be getting the data out of R and into the POJO.
But here is a self-contained POJO example for you to look at:


Tom

Hideyoshi Maeda

Oct 5, 2015, 4:00:01 PM
to Tom Kraljevic, H2O Open Source Scalable Machine Learning - h2ostream
I think making parallel curl requests might be easy enough, but I wonder how to make the web service itself parallel/multi-threaded/multi-core. Is that possible?
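For the "parallel curl requests" half, one common shell pattern is xargs -P. In this sketch `echo` stands in for the real curl call to the /predict endpoint so the pattern can be tried without a running server; the URL and row payloads are hypothetical.

```shell
# Fire 8 requests with at most 4 in flight at once. Swap `echo` for the
# real `curl` once a server is listening.
printf '%s\n' 1 2 3 4 5 6 7 8 \
  | xargs -P 4 -I{} echo curl -s "http://localhost:54321/predict?row={}" \
  > requests.log

# one line per (would-be) request; order may vary because of the parallelism
wc -l < requests.log
```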

hideyos...@gmail.com

Nov 3, 2015, 6:00:58 AM
to H2O Open Source Scalable Machine Learning - h2ostream, to...@h2o.ai, hideyos...@gmail.com
Hi Tom,

I have had a bit more time to look at the app-consumer-loan example you provided here: https://github.com/h2oai/app-consumer-loan

It looks good and probably very close to what I might need.

A few hopefully quick things:

How do I adjust build.gradle (or any other relevant file in the GitHub repo) so that it only builds the HTTP server to accept curl requests, without building the front-end web page that submits forms? And after that adjustment is made, I presume that the deployment of the WAR file would need to change, so what would be the new command instead of ./gradlew jettyRunWar?

Secondly, if I wanted to run several of the same models on different ports, how could I do that? This is so that I can send multiple curl requests (in parallel from R) to the different ports to get the predicted output for a specified input.

Any help would be much appreciated.

Regards,

Hiddi


Tom Kraljevic

Nov 3, 2015, 11:46:09 AM
to Hideyoshi Maeda, H2O Open Source Scalable Machine Learning - h2ostream
On Nov 3, 2015, at 3:08 AM, Hideyoshi Maeda <hideyos...@gmail.com> wrote:

Hi Tom,

I have had a bit more time to look at the app-consumer-loan example you provided here: https://github.com/h2oai/app-consumer-loan

It looks good and probably very close to what I might need.

A few hopefully quick things:

How do I adjust build.gradle (or any other relevant file in the GitHub repo) so that it only builds the HTTP server to accept curl requests, without building the front-end web page that submits forms? And after that adjustment is made, I presume that the deployment of the WAR file would need to change, so what would be the new command instead of ./gradlew jettyRunWar?

There is no front-end build, the files are all static.
You don’t need to change anything.


Secondly, if I wanted to run several of the same models on different ports, how could I do that? This is so that I can send multiple curl requests (in parallel from R) to the different ports to get the predicted output for a specified input.

Java web servlet containers are multi-threaded and re-entrant.
You wouldn’t expose multiple ports; instead you would just let the server have multiple threads in the thread pool.
(This is super-standard stuff going back to Java’s early days, nothing interesting related to H2O specifically.)


Tom

Hideyoshi Maeda

Nov 4, 2015, 4:47:25 AM
to Tom Kraljevic, H2O Open Source Scalable Machine Learning - h2ostream
Thanks Tom, 

The adjustments you made to refactor the code, and the timings you showed of how it can handle many requests at once, are pretty cool. Thank you.

I notice that in script.R (https://github.com/h2oai/app-consumer-loan/blob/master/script.R) you use h2o.gbm() and subsequently the use of:

import hex.genmodel.easy.prediction.BinomialModelPrediction;
import hex.genmodel.easy.prediction.RegressionModelPrediction;
import hex.genmodel.easy.*;

with
  static {
    BadLoanModel rawBadLoanModel = new BadLoanModel();
    badLoanModel = new EasyPredictModelWrapper(rawBadLoanModel);

    InterestRateModel rawInterestRateModel = new InterestRateModel();
    interestRateModel = new EasyPredictModelWrapper(rawInterestRateModel);
  }

and

  private BinomialModelPrediction predictBadLoan (RowData row) throws Exception {
    return badLoanModel.predictBinomial(row);
  }

  private RegressionModelPrediction predictInterestRate (RowData row) throws Exception {
    return interestRateModel.predictRegression(row);
  }




allows you to carry out a prediction.

In my own example, instead of using h2o.gbm(), I am using h2o.randomForest().

However, after building and running it, a curl request results in the following response:

</head>
<body><h2>HTTP ERROR 406</h2>
<p>Problem accessing /predict. Reason:
<pre>    Prediction type unsupported by model of category Binomial</pre></p>
<hr /><i><small>Powered by Jetty://</small></i>
</body>
</html>


So my question is: how do you adjust PredictServlet.java to allow for prediction with random forest models?

Regards,

Hiddi

Tom Kraljevic

Nov 4, 2015, 7:56:17 AM
to Hideyoshi Maeda, Tom Kraljevic, H2O Open Source Scalable Machine Learning - h2ostream

it should not be different.

for the binomial case, make sure in your script you force the Y variable to be a factor with as.factor().

Tom

Sent from my iPhone

Hideyoshi Maeda

Nov 5, 2015, 4:59:18 AM
to Tom Kraljevic, Tom Kraljevic, H2O Open Source Scalable Machine Learning - h2ostream
I have managed to get it working. Thank you for your help.

Just for my own understanding, in your opinion, should the process of using curl to carry out predictions be faster or slower than the standard method of uploading data to h2o and predicting?


Tom Kraljevic

Nov 5, 2015, 6:58:35 AM
to Hideyoshi Maeda, H2O Open Source Scalable Machine Learning - h2ostream

well, the method in the app-consumer-loan repo wraps each individual row prediction in a rest api call, plus shuffling of the data row through many java objects.

it's meant to be a simple example of how to easily deploy a pojo model at a rest api endpoint.

the raw pojo itself is only a small piece of that end-to-end work, so it's not an apples-to-apples comparison with in-h2o bulk scoring, which has almost no overhead on a per-row basis if you have a large row count.

i would expect the app-consumer-loan way to have lower latency for predicting one row, and the in-h2o batch scoring way to have higher throughput for predicting many rows.

tom

Sent from my iPhone

Hideyoshi Maeda

Nov 5, 2015, 1:57:28 PM
to Tom Kraljevic, H2O Open Source Scalable Machine Learning - h2ostream
Thanks for that.

Are there any thoughts on perhaps doing thousands of batches of, say, 200 rows? Which method (curl or standard batch processing) might be better?

In addition, I have noticed that h2o.importFile() accepts S3 links, as suggested here (https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/howto/H2O-DevS3Creds.md).

In terms of the underlying function, does this download the file to disk first and then upload the file to h2o?

Thanks,

Hiddi

Tom Kraljevic

Nov 5, 2015, 2:01:54 PM
to Hideyoshi Maeda, H2O Open Source Scalable Machine Learning - h2ostream
Are there any thoughts on perhaps doing thousands of batches of, say, 200 rows? Which method (curl or standard batch processing) might be better?

If you don’t care about latency for individual predictions, i expect the fastest would be reading the data in as one big file and doing batch scoring in h2o.


In addition, I have noticed that h2o.importFile() accepts S3 links, as suggested here (https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/howto/H2O-DevS3Creds.md).
In terms of the underlying function, does this download the file to disk first and then upload the file to h2o?

No, it does not download the file to disk.


Hideyoshi Maeda

Nov 9, 2015, 2:40:20 PM
to Tom Kraljevic, H2O Open Source Scalable Machine Learning - h2ostream
Thanks again, for your quick responses.

With regard to the h2o.importFile function when using S3: unfortunately, due to admin issues, I do not have access to the AWS access key and secret key needed for the path argument in h2o.importFile().

Would it be possible to use the EC2 instance profile (http://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2_instance-profiles.html) as the method of accessing S3? I see that it might be supported, from looking here (http://s3.amazonaws.com/h2o-release/h2o-classic/master/1761/docs-website/deployment/ec2_glossary.html).

If it is possible, using instance profiles, to avoid needing the AWS access key and secret key, please could you let me know how this would be done?

Regards,

Hiddi