I am running a Sparkling Water cluster, and I am converting a Spark DataFrame to an H2O frame, but this takes very long. Does anyone know what is causing that? My DataFrame is only about 100 MB, but it contains sparse vectors of length over 30K. Do I have to prepare something to get better performance?
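One plausible explanation (an assumption, not confirmed behaviour of any specific Sparkling Water version) is that the sparse vector column gets densified during conversion, so the "100 MB" sparse frame balloons once every one of the 30K vector slots is materialized as a double. A quick sizing sketch:

```python
# Rough estimate of how a sparse vector column can blow up if the
# Spark -> H2O conversion densifies it. VECTOR_LEN matches the 30K-wide
# vectors described above; the row counts below are illustrative guesses.
VECTOR_LEN = 30_000
BYTES_PER_DOUBLE = 8

def dense_bytes_per_row(vector_len: int = VECTOR_LEN) -> int:
    """Bytes per row if every vector slot becomes a dense double."""
    return vector_len * BYTES_PER_DOUBLE

def dense_frame_gb(n_rows: int, vector_len: int = VECTOR_LEN) -> float:
    """Total dense size in GiB for n_rows rows."""
    return n_rows * dense_bytes_per_row(vector_len) / 1024**3

print(dense_bytes_per_row())          # 240_000 bytes = ~234 KB per row
print(round(dense_frame_gb(10_000), 1))   # ~2.2 GiB for just 10K rows
print(round(dense_frame_gb(100_000)))     # ~22 GiB for 100K rows
```

If the densified size dwarfs the combined heap of the H2O nodes (e.g. the 2-node, 2 GB setup below), the conversion will spill, GC-thrash, or crawl, which would match the symptoms described.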
Environment:
- Runs on a Databricks Spark cluster (Community Edition)
- I also tried a VM setup with 2 nodes, each with 1 core and 2 GB RAM
Thanks in advance.
--
You received this message because you are subscribed to the Google Groups "H2O Open Source Scalable Machine Learning - h2ostream" group.
So do you think it is more a configuration problem than a programming problem on my side?
Thanks in advance
I was about to post on this very topic when I saw this one.
I'm experiencing the same issue: it takes a long time to convert a Spark frame to an H2O frame. In fact, each time I've tried, I figured something must be wrong, so I killed the job.
I'm running this on an EMR cluster, so I have flexibility around RAM and the number of nodes. Are there any rules of thumb for minimizing the time the conversion takes? More nodes with less RAM, or fewer nodes with more RAM? The DataFrame I'm converting is only about 1 million rows, but it has about 4,000 columns and is sparse.
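As a starting point for sizing (assuming, as a worst case, that the sparse frame ends up as dense doubles after conversion; whether that happens depends on the Sparkling Water version), the 1M-row by 4,000-column case works out as follows:

```python
# Back-of-the-envelope heap sizing for a 1M x 4,000 frame if it is
# densified to doubles during conversion. This is a worst-case sketch,
# not the measured behaviour of any particular release.
BYTES_PER_DOUBLE = 8

def dense_gb(rows: int, cols: int) -> float:
    """Dense in-memory size in GiB, uncompressed."""
    return rows * cols * BYTES_PER_DOUBLE / 1024**3

print(round(dense_gb(1_000_000, 4_000)))  # ~30 GiB before any compression
```

Under that assumption, the combined H2O heap across all nodes (nodes x -Xmx) should comfortably exceed ~30 GiB regardless of whether you pick many small nodes or few large ones; H2O typically recommends total memory of roughly 3-4x the data size for modeling on top of that.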
I'm in the beginning stages of experimentation but I was hoping somebody out there has some advice.
Thanks
Andy
09-09 21:57:02.918 172.31.20.129:54321 5390 #r thread INFO: ----- H2O started -----
09-09 21:57:02.954 172.31.20.129:54321 5390 #r thread INFO: Build git branch: rel-turchin
09-09 21:57:02.954 172.31.20.129:54321 5390 #r thread INFO: Build git hash: 6f38021186c3619da42f6ced9d62974fffbea702
09-09 21:57:02.954 172.31.20.129:54321 5390 #r thread INFO: Build git describe: jenkins-rel-turchin-6
09-09 21:57:02.954 172.31.20.129:54321 5390 #r thread INFO: Build project version: 3.8.2.6
09-09 21:57:02.954 172.31.20.129:54321 5390 #r thread INFO: Built by: 'jenkins'
09-09 21:57:02.954 172.31.20.129:54321 5390 #r thread INFO: Built on: '2016-05-24 10:55:46'
09-09 21:57:02.955 172.31.20.129:54321 5390 #r thread INFO: Java availableProcessors: 4
09-09 21:57:02.955 172.31.20.129:54321 5390 #r thread INFO: Java heap totalMemory: 9.97 GB
09-09 21:57:02.955 172.31.20.129:54321 5390 #r thread INFO: Java heap maxMemory: 9.97 GB
09-09 21:57:02.955 172.31.20.129:54321 5390 #r thread INFO: Java version: Java 1.7.0_111 (from Oracle Corporation)
09-09 21:57:02.955 172.31.20.129:54321 5390 #r thread INFO: JVM launch parameters: [-XX:OnOutOfMemoryError=kill %p, -Xms10240m, -Xmx10240m, -verbose:gc, -XX:+PrintGCDetails, -XX:+PrintGCDateStamps, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=70, -XX:MaxHeapFreeRatio=70, -XX:+CMSClassUnloadingEnabled, -XX:OnOutOfMemoryError=kill -9 %p, -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1473457111953_0001/container_1473457111953_0001_01_000002/tmp, -Dspark.driver.port=43445, -Dspark.history.ui.port=18080, -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1473457111953_0001/container_1473457111953_0001_01_000002, -XX:MaxPermSize=256m]
09-09 21:57:02.955 172.31.20.129:54321 5390 #r thread INFO: OS version: Linux 4.4.11-23.53.amzn1.x86_64 (amd64)
09-09 21:57:02.955 172.31.20.129:54321 5390 #r thread INFO: Machine physical memory: 14.69 GB
09-09 21:57:02.955 172.31.20.129:54321 5390 #r thread INFO: X-h2o-cluster-id: 1473458222225
09-09 21:57:02.955 172.31.20.129:54321 5390 #r thread INFO: User name: 'yarn'
09-09 21:57:02.955 172.31.20.129:54321 5390 #r thread INFO: Opted out of sending usage metrics.
09-09 21:57:02.955 172.31.20.129:54321 5390 #r thread INFO: Possible IP Address: eth0 (eth0), fe80:0:0:0:c22:a6ff:fe75:29ff%2
09-09 21:57:02.955 172.31.20.129:54321 5390 #r thread INFO: Possible IP Address: eth0 (eth0), 172.31.20.129
09-09 21:57:02.955 172.31.20.129:54321 5390 #r thread INFO: Possible IP Address: lo (lo), 0:0:0:0:0:0:0:1%1
09-09 21:57:02.956 172.31.20.129:54321 5390 #r thread INFO: Possible IP Address: lo (lo), 127.0.0.1
09-09 21:57:02.956 172.31.20.129:54321 5390 #r thread INFO: Internal communication uses port: 54322
09-09 21:57:02.956 172.31.20.129:54321 5390 #r thread INFO: Listening for HTTP and REST traffic on http://172.31.20.129:54321/
09-09 21:57:03.027 172.31.20.129:54321 5390 #r thread INFO: H2O cloud name: 'sparkling-water-hadoop_-1299197974' on ip-172-31-20-129.ec2.internal/172.31.20.129:54321, discovery address /234.123.119.43:60027
09-09 21:57:03.027 172.31.20.129:54321 5390 #r thread INFO: If you have trouble connecting, try SSH tunneling from your local machine (e.g., via port 55555):
09-09 21:57:03.027 172.31.20.129:54321 5390 #r thread INFO: 1. Open a terminal and run 'ssh -L 55555:localhost:54321 ya...@172.31.20.129'
09-09 21:57:03.027 172.31.20.129:54321 5390 #r thread INFO: 2. Point your browser to http://localhost:55555
09-09 21:57:03.027 172.31.20.129:54321 5390 #r thread INFO: Log dir: '/var/log/hadoop-yarn/containers/application_1473457111953_0001/container_1473457111953_0001_01_000002'
09-09 21:57:03.028 172.31.20.129:54321 5390 #r thread INFO: Cur dir: '/mnt/yarn/usercache/hadoop/appcache/application_1473457111953_0001/container_1473457111953_0001_01_000002'
09-09 21:57:03.042 172.31.20.129:54321 5390 #r thread INFO: Using HDFS configuration from /etc/hadoop/conf
09-09 21:57:03.042 172.31.20.129:54321 5390 #r thread INFO: HDFS subsystem successfully initialized
09-09 21:57:03.043 172.31.20.129:54321 5390 #r thread INFO: S3 subsystem successfully initialized
09-09 21:57:03.043 172.31.20.129:54321 5390 #r thread INFO: Flow dir: '/var/lib/hadoop-yarn/h2oflows'
09-09 21:57:03.060 172.31.20.129:54321 5390 #r thread INFO: Cloud of size 1 formed [ip-172-31-20-129.ec2.internal/172.31.20.129:54321]
09-09 21:57:03.062 172.31.20.129:54321 5390 #r thread INFO: Registered 0 extensions in: 529mS
09-09 21:57:03.489 172.31.20.129:54321 5390 #r thread INFO: Registered: 124 REST APIs in: 427mS
09-09 21:57:04.034 172.31.20.129:54321 5390 #r thread INFO: Registered: 203 schemas in: 545mS
09-09 21:57:08.779 172.31.20.129:54321 5390 FJ-126-7 INFO: Cloud of size 10 formed [ip-172-31-20-126.ec2.internal/172.31.20.126:54321, ip-172-31-20-127.ec2.internal/172.31.20.127:54321, ip-172-31-20-128.ec2.internal/172.31.20.128:54321, ip-172-31-20-129.ec2.internal/172.31.20.129:54321, ip-172-31-20-130.ec2.internal/172.31.20.130:54321, ip-172-31-20-131.ec2.internal/172.31.20.131:54321, ip-172-31-20-132.ec2.internal/172.31.20.132:54321, ip-172-31-20-134.ec2.internal/172.31.20.134:54321, ip-172-31-20-135.ec2.internal/172.31.20.135:54321, ip-172-31-20-136.ec2.internal/172.31.20.136:54321]
09-09 21:59:36.038 172.31.20.129:54321 5390 #orker-31 INFO: Locking cloud to new members, because water.TaskGetKey