Parallelization in R/Python when the H2O package is used

Yerriswamy cherry

Jul 19, 2018, 7:12:30 AM
to H2O Open Source Scalable Machine Learning - h2ostream
Hi all,

I am currently using H2O from Python and R to build gradient boosting models (GBM).

I have used doParallel and foreach (in R) and multiprocessing (in Python) to parallelize GBM, but my runtimes are longer than expected. I am not able to measure or observe how H2O is utilizing my system's cores and memory.

Below is the help I need:

  1. How does H2O perform parallelization?
  2. What does "H2O is parallel at algorithm level and not at model level" mean?
  3. GBM is a sequential model. Can I parallelize a sequential process? If yes, how can we do it in R/Python, and how can we verify that parallel processing is happening as intended?
  4. If I am building decision trees, will H2O make sure that all trees are built in parallel on different processors, or do I need to specify that using built-in functions?
Please give me a detailed explanation; I am unable to understand these concepts.

Regards,

Tom Kraljevic

Jul 19, 2018, 9:55:23 AM
to Yerriswamy cherry, H2O Open Source Scalable Machine Learning - h2ostream
  1. How does H2O perform parallelization?

The h2o R client sends a message over a socket to the H2O server, which is written in Java.
The server is multithreaded and handles the parallelization itself.
You don't need to do anything special.

  2. What does "H2O is parallel at algorithm level and not at model level" mean?

I'm not totally sure. H2O algorithms are written from scratch in Java to be parallel and distributed.
When you run an algorithm like GBM, H2O parallelizes the work of building that single model, and it builds one model at a time.
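
One way to picture "parallel within a single model" is the toy sketch below. It is not H2O's implementation, just a conceptual illustration under my own assumptions: each boosting round depends on the previous one, so the rounds run in order, but the work inside a round (here, averaging residuals) is split across data chunks and merged. The names `chunk_stats`, `parallel_mean`, and `boost` are made up for this example, and Python threads are used only to show the structure, not to demonstrate a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_stats(chunk):
    """Map step: partial sum and count of the residuals in one data chunk."""
    return sum(chunk), len(chunk)


def parallel_mean(values, pool, n_chunks=4):
    """Split the data into chunks, process chunks concurrently, then merge."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    partials = list(pool.map(chunk_stats, chunks))  # concurrent map step
    total = sum(s for s, _ in partials)             # reduce step
    count = sum(n for _, n in partials)
    return total / count


def boost(targets, rounds=20, learning_rate=0.5):
    """Toy boosting: every 'tree' is a single constant fitted to the residuals."""
    predictions = [0.0] * len(targets)
    with ThreadPoolExecutor(max_workers=4) as pool:
        for _ in range(rounds):  # sequential: round k needs the output of round k-1
            residuals = [t - p for t, p in zip(targets, predictions)]
            step = learning_rate * parallel_mean(residuals, pool)  # parallel inside the round
            predictions = [p + step for p in predictions]
    return predictions
```

Calling `boost([1, 2, 3, 4, 5, 6, 7, 8])` moves every prediction toward the mean of the targets, one sequential round at a time, while each round's statistics are gathered chunk by chunk.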

  3. GBM is a sequential model. Can I parallelize a sequential process? If yes, how can we do it in R/Python, and how can we verify that parallel processing is happening as intended?

H2O just does it. You don't need to do anything except call h2o.gbm().
In particular, do not use doParallel or any other R parallelism construct. R doesn't do any of the work; it just tells the Java server what to do.
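
For Python users, the same advice might look like the minimal sketch below: no multiprocessing, just one call to train the model. It assumes the `h2o` Python package and a Java runtime are installed. The function name `train_gbm` and its arguments `csv_path` and `response` are hypothetical and only for illustration. Passing `nthreads=-1` to `h2o.init()` asks the server to use all available cores, and `h2o.cluster().show_status()` prints the cluster's node, core, and memory summary. To watch per-core load while a model trains, the H2O Flow web UI (by default at http://localhost:54321) should offer a Water Meter view under its Admin menu.

```python
def train_gbm(csv_path, response, ntrees=50):
    """Train one GBM and let the H2O server do all of the parallel work."""
    # Imported lazily so this file can be loaded even where h2o is not installed.
    import h2o
    from h2o.estimators import H2OGradientBoostingEstimator

    # Start (or connect to) a local H2O server; -1 means "use every core".
    h2o.init(nthreads=-1)

    # The data is loaded into the Java server, not into the Python process.
    frame = h2o.import_file(csv_path)
    predictors = [col for col in frame.columns if col != response]

    # A single plain call; no multiprocessing or other client-side parallelism.
    model = H2OGradientBoostingEstimator(ntrees=ntrees)
    model.train(x=predictors, y=response, training_frame=frame)

    # Print the cluster summary (nodes, cores allowed, free memory).
    h2o.cluster().show_status()
    return model
```

A call such as `train_gbm("train.csv", "label")` (both values are placeholders) would then build the model with whatever cores the server was given.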


  4. If I am building decision trees, will H2O make sure that all trees are built in parallel on different processors, or do I need to specify that using built-in functions?
Please give me a detailed explanation; I am unable to understand these concepts.

You don't have to do anything. H2O handles it.

Tom


Erin LeDell

Jul 19, 2018, 4:36:36 PM
to Tom Kraljevic, Yerriswamy cherry, H2O Open Source Scalable Machine Learning - h2ostream

There is additional information on this topic in this Stack Overflow question: https://stackoverflow.com/questions/43444333/parallel-processing-in-r-with-h2o

-- 
Erin LeDell, Ph.D.
Chief Machine Learning Scientist | H2O.ai