Run coxph model for large data set with 300 columns (6 GB) in H2O Sparkling Water


divya....@gmail.com

Dec 2, 2019, 9:15:03 AM
to H2O Open Source Scalable Machine Learning - h2ostream
We are trying to fit a coxph model with h2o and rsparkling on a large data set (6 GB, 300 columns). Whatever Spark configuration we use, we run into memory issues.

As per the H2O guidance, the cluster only needs to be about 4 times the data size, but we tried even 4 worker nodes of 128 GB each plus a 128 GB master node, and it still raises memory issues.
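To make the sizing rule of thumb concrete, here is the arithmetic (a minimal sketch; the 4x multiplier is H2O's general guidance, not a hard limit):

```python
# H2O's rule of thumb: total cluster memory ~ 4x the on-disk data size.
data_gb = 6                          # size of the training data
recommended_cluster_gb = 4 * data_gb # ~24 GB total across all H2O nodes
print(recommended_cluster_gb)        # 24 -- a 4 x 128 GB cluster is far above this
```

By this estimate the clusters described above should be more than large enough, which suggests the problem is configuration or workload shape rather than raw capacity.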

Please help us choose the Spark configuration needed to run H2O with our current data set. The same code runs fine for 50,000 records.

We have 300 columns for x, 2 pairs of interaction terms, and offset and weights columns as well.

You can find sample code below, but it doesn't have the 300 columns. I don't know how to provide the exact input file and full code to replicate the issue; please let me know if you would prefer to see the actual code with 300 columns.


# Load the libraries used to analyze the data
library(survival)
library(MASS)
library(h2o)

h2o.init()  # connect to (or start) a local H2O cluster

# churn_hex is assumed to be an H2OFrame already imported,
# e.g. via h2o.importFile()

# Create H2O-based model
predictors <- c("HasPartner", "HasSingleLine", "HasMultipleLines",
                "HasPaperlessBilling", "HasAutomaticBilling", "MonthlyCharges",
                "HasOnlineSecurity", "HasOnlineBackup", "HasDeviceProtection",
                "HasTechSupport", "HasStreamingTV", "HasStreamingMovies")

h2o_model <- h2o.coxph(x = predictors,
                       event_column = "HasChurned",
                       stop_column = "tenure",
                       stratify_by = "Contract",
                       training_frame = churn_hex)

print(summary(h2o_model))


We tried multiple configurations; some of them are below:

conf$spark.executor.memory <- "192g"
conf$spark.executor.cores <- 5
conf$spark.executor.instances <- 9
conf$`sparklyr.shell.executor-memory` <- "32g"
conf$`sparklyr.shell.driver-memory` <- "32g"
conf$spark.yarn.am.memory <- "32g"
conf$spark.dynamicAllocation.enabled <- "false"
conf$spark.driver.memory <- "57.6g"
sc <- spark_connect(master = "yarn-client", version = "2.4.3", config = conf)
We have also tried this:

Sys.setenv(SPARK_HOME = "/usr/lib/spark")
conf <- spark_config()
conf$spark.executor.memory <- "44g"
conf$spark.executor.cores <- 8
conf$spark.executor.instances <- 5
conf$spark.dynamicAllocation.enabled <- "false"
sc <- spark_connect(master = "yarn-client", version = "2.4.3", config = conf)

Tom Kraljevic

Dec 2, 2019, 10:19:32 AM
to divya....@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream

Thank you for your earlier question on Stack Overflow, which has already gotten some attention:

https://stackoverflow.com/questions/59077793/run-coxph-model-for-large-data-set-with-300-columns-6-gb-in-h2o-sparkling-wat

Redirecting back to the question above to avoid duplicating work for the community...

Thanks
Tom




--
You received this message because you are subscribed to the Google Groups "H2O Open Source Scalable Machine Learning - h2ostream" group.
To unsubscribe from this group and stop receiving emails from it, send an email to h2ostream+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/h2ostream/a49fa53d-a533-4284-b18e-757404a940ac%40googlegroups.com.

divya....@gmail.com

Dec 9, 2019, 9:53:16 AM
to H2O Open Source Scalable Machine Learning - h2ostream
Hi, we have tried the suggestions from the Stack Overflow comments, but didn't get good results.

This exact code runs successfully in base R on a 16 GB laptop for the same data set. Yet with 5-10 worker nodes of 365 GB each, the nodes freeze up and crash. I have tried every possible configuration, including garbage-collection tuning and partitioning, and nothing works. JVMs with massive memory and many processors should not freeze and crash on large data sets, so we suspect a problem in the H2O Cox Proportional Hazards code when running big data files. We also hit issues where the predict function for the coxph model does not work with offsets/weights.

H2O says that if I post on Google Groups they will direct my question to an H2O engineer or data scientist to fix the problem. Can you confirm whether the h2o coxph model has any data size limits? Or at least confirm whether we really need a cluster bigger than 5-10 worker nodes of 365 GB. We have spent a lot of time on this issue. The model is being run for a medical application, so there is some urgency. It would be very helpful if someone could confirm it is not a bug and guide us further.

Tom Kraljevic

Dec 9, 2019, 2:55:25 PM
to divya....@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream

hi,


As per the replies on Stack Overflow:

I recommend you prep the data so you are only doing the data ingest and model training in a single big H2O Java process without Spark, and turn on GC logging. (If you start h2o-3 on Hadoop, this logging happens automatically.)

-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps

Then post the entire log output somewhere visible. It's not possible to debug these kinds of things without the log file.
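For example, a single standalone H2O node can be started with GC logging enabled like this (a hypothetical invocation; the heap size and jar path are illustrative and should be adjusted for your machine, and the -XX:+PrintGC* flags assume a Java 8 JVM):

```shell
# Start one large H2O JVM without Spark, with GC logging on.
# h2o.jar is part of the standard h2o-3 download.
java -Xmx100g \
     -XX:+PrintGCDetails \
     -XX:+PrintGCTimeStamps \
     -jar h2o.jar 1> h2o.out 2> h2o.err
```

The GC details then appear in the process output, which can be attached to a follow-up post.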

(Also, please be aware that while the H2O engineering team does quite a bit of work driven by feedback from the community, an enterprise support option is available for organizations that want closer attention and priority.)


thanks!
tom


