Parallelize AutoML?

tszum...@gmail.com

Jul 29, 2018, 9:05:49 PM

to H2O Open Source Scalable Machine Learning - h2ostream

Is it possible to run candidate models in parallel across several workers with h2o AutoML? Wondering if I can leverage a cluster with something like joblib or Dask to run more models.

Tom Kraljevic

Jul 30, 2018, 1:40:52 AM

to tszum...@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream

although one model is built at a time, the creation of that model is a parallel operation.
perhaps a good future feature to add sometime would be a model concurrency factor.

if you use multiple API clients pointing to one cluster, the jobs from the different clients are actually processed in parallel. you just need to be careful, because nothing prevents the different jobs from, for example, causing each other to run out of memory.

thanks
tom
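
The multiple-clients approach described above could be sketched roughly as follows in Python. This is only a sketch under several assumptions: an H2O cluster is already running at a URL you supply, the training frame has already been loaded on the cluster under a known frame ID, and the `h2o` package is installed wherever the workers run. The names `train_one`, `run_clients`, and the keys of `spec` are hypothetical helpers for illustration, not part of H2O. Each worker process connects on its own, so it acts as a separate API client.

```python
from concurrent.futures import ProcessPoolExecutor


def train_one(spec):
    """Connect to the shared cluster as a separate client and run one AutoML job.

    `spec` is a plain dict so it can be pickled and sent to a worker process.
    """
    # Imported lazily so each worker process opens its own client connection.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.connect(url=spec["cluster_url"])      # attach to the existing cluster
    train = h2o.get_frame(spec["frame_id"])   # frame already loaded on the cluster
    aml = H2OAutoML(
        max_models=spec["max_models"],
        project_name=spec["project_name"],    # one project name per call
        seed=spec["seed"],
    )
    aml.train(y=spec["target"], training_frame=train)
    return spec["project_name"], aml.leader.model_id


def run_clients(specs, worker=train_one, max_workers=2,
                executor_cls=ProcessPoolExecutor):
    """Run one worker call per spec concurrently; results keep the order of specs."""
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(worker, specs))
```

Keeping `max_workers` small is the only guard here against the jobs starving each other of memory; nothing in this sketch measures cluster memory. joblib or Dask could presumably be swapped in for the executor, since the worker is an ordinary function of one picklable argument.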

tszum...@gmail.com

Jul 30, 2018, 8:30:12 AM

to H2O Open Source Scalable Machine Learning - h2ostream

I'm sorry I am not sure I followed. As I understand it, AutoML assesses multiple models and multiple hyperparameters per model, correct?

So when I kick off an AutoML call, does it currently parallelize across models, e.g. assess models in parallel, or hyperparameters in parallel?

And if I run multiple calls, they will all contribute to the same leaderboard, but run different models/params?

Tom Kraljevic

Jul 30, 2018, 9:30:20 AM

to tszum...@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream

> On Jul 30, 2018, at 5:30 AM, tszum...@gmail.com wrote:
>
> I'm sorry I am not sure I followed. As I understand it, AutoML assesses multiple models and multiple hyperparameters per model, correct?

yes

> So when I kick off an AutoML call, does it currently parallelize across models, e.g. assess models in parallel, or hyperparameters in parallel?

no.

as it moves through the <modeltype,hyperparams> search space it builds one model (leaderboard entry) at a time.

> And if I run multiple calls, they will all contribute to the same leaderboard, but run different models/params?

different leaderboards.

(i’m not sure what would happen if you tried piling them into the same leaderboard...)
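
Since separate calls end up with separate leaderboards, one option is to merge the results on the client side afterwards. A minimal sketch in plain Python, assuming each leaderboard has been exported as a list of rows holding a model ID and a metric; the row layout, the `merge_leaderboards` helper, and the metric name are made up for illustration and are not an H2O API.

```python
def merge_leaderboards(leaderboards, metric="logloss", lower_is_better=True):
    """Combine several leaderboards (lists of dict rows) into one ranked list.

    `leaderboards` maps a run name to its rows. Each merged row is tagged with
    the run it came from; a model ID seen twice keeps its first occurrence.
    """
    seen = set()
    merged = []
    for run_name, rows in leaderboards.items():
        for row in rows:
            if row["model_id"] in seen:
                continue
            seen.add(row["model_id"])
            merged.append({**row, "run": run_name})
    # Best model first: ascending for error-like metrics, descending otherwise.
    return sorted(merged, key=lambda r: r[metric], reverse=not lower_is_better)
```

A merged ranking like this is only meaningful if every run scored its models on the same data, for example the same holdout set.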

tszum...@gmail.com

Jul 30, 2018, 9:56:31 AM

to H2O Open Source Scalable Machine Learning - h2ostream

Thank you. So if I want to speed things up, I can split the search space across different calls, but I need to be careful about resource utilization.
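
Splitting the search space and capping how many calls run at once might look like the sketch below. It is plain Python with hypothetical helper names (`partition_algos`, `max_concurrent_calls`), and the per-call memory figure is a guess the user has to supply; nothing here queries an H2O cluster. Each group of algorithm names would then be handed to its own AutoML call, for instance through an algorithm include/exclude option if the installed H2O version offers one (an assumption to verify against its documentation).

```python
def partition_algos(algos, n_calls):
    """Deal algorithm names round-robin into n_calls groups, dropping empty groups."""
    if n_calls < 1:
        raise ValueError("n_calls must be at least 1")
    groups = [algos[i::n_calls] for i in range(n_calls)]
    return [g for g in groups if g]


def max_concurrent_calls(cluster_free_bytes, est_bytes_per_call):
    """Cap concurrency so the estimated memory of all running calls fits in free memory.

    Always allows at least one call, even if the estimate exceeds free memory.
    """
    if est_bytes_per_call <= 0:
        raise ValueError("est_bytes_per_call must be positive")
    return max(1, cluster_free_bytes // est_bytes_per_call)
```

For example, with 16 GiB free and an estimated 6 GiB per call, `max_concurrent_calls(16 * 1024**3, 6 * 1024**3)` allows two calls at a time, and the remaining groups would have to wait their turn.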

Erin LeDell

Jul 30, 2018, 2:18:56 PM

to tszum...@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream

Please please please do not double-post questions on Stack Overflow and Google Groups. This is very clearly written in the guidelines in our Google Groups description.

When you do this, it means that two separate people (who are very busy and have limited resources to spend answering questions) will spend their time answering the question. I just spent 30 mins researching (and talking to the H2O team), then writing up a response on Stack Overflow only to see that you have already been having another conversation here.

https://stackoverflow.com/questions/51583633/parallel-execution-for-h2o-automl

--

Erin LeDell, Ph.D.
Chief Machine Learning Scientist | H2O.ai

tszum...@gmail.com

Jul 30, 2018, 2:48:38 PM

to H2O Open Source Scalable Machine Learning - h2ostream

Erin,

My apologies. I posted to Stack Overflow but after reading a comment to my question over there, I discovered this group and realized it may be more appropriate to post here. I meant to close up the StackOverflow thread but it appears I didn't do it in time. I'll button that up now.

Some things to note as to why I (and perhaps other new users in the past) may have had some confusion:

(1) If you go to the google group from a mobile device, you do not see the clear, bold front-page announcement. It just lists the threads.

(2) Over at stackoverflow, I wasn't able to find a link to the google group. It came up through web searches. So coming from the stackoverflow entrance point, one may not know the rules laid out here.

(3) For h2o, there are currently 1045 active questions tagged with "h2o", but 373 of those tagged questions are marked as "unanswered" using the top nav filters. This gave me the impression that h2o questions belonged elsewhere.

I'll definitely stick to the guidelines moving forward!

-Tom

Erin LeDell

Jul 30, 2018, 7:02:44 PM

to tszum...@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream

Hi Tom,

Thanks for the explanation and feedback. Regarding your points below,

(1) I looked into why the “Welcome message” does not appear in google groups on mobile, but I don’t see any way to change that unfortunately.

(2) I have updated the Wiki section for the h2o tag on Stack Overflow (https://stackoverflow.com/tags/h2o/info) to include a lot more information about “best practices” for asking questions, including a link to the Google Group. (My edit needs to be peer reviewed, so it’s not visible yet.) This should help new users in the future.

(3) We try to answer all the good questions on Stack Overflow. Many are not well-written, not code-related, or do not contain reproducible examples, which is why 35% are unanswered. Stack Overflow is definitely the preferred venue because it has much nicer formatting and better discoverability than Google Groups. So unless a question is not code-related, and therefore not appropriate for SO, we prefer that people use SO.

Thanks for understanding!

Best,

Erin

P.S. Lastly, if you do write a script that uses joblib or Dask to train a bunch of models in parallel, we’d be interested in posting that somewhere so that other people can use it as well. Thanks!