Spark DataFrame to H2O Frame (Sparkling Water) takes very long


oben...@gmail.com

Aug 15, 2016, 11:27:41 AM
to H2O Open Source Scalable Machine Learning - h2ostream
Hi,

I am running a Sparkling Water cluster, and I am converting a Spark DataFrame to an H2O frame, but this takes very long. Does anyone know what is causing that? My DataFrame is about 100 MB and has sparse vectors of length over 30K. Do I have to prepare something to get better performance?

Environment:
- Runs on a Databricks Spark cluster (Community Edition)
- I also tried a VM with 2 nodes, each with 1 core and 2 GB RAM

Thanks in advance.

Avkash Chauhan

Aug 15, 2016, 2:36:04 PM
to H2O Open Source Scalable Machine Learning - h2ostream, oben...@gmail.com
Would you please quantify how long such a transformation takes? Also, if you give the Sparkling Water instance more memory, how does the time improve?

Avkash

Tom Kraljevic

Aug 15, 2016, 2:51:27 PM
to Avkash Chauhan, H2O Open Source Scalable Machine Learning - h2ostream, oben...@gmail.com

fyi,


2 gb of ram is *really really small*.
i wouldn’t expect much of anything to work with 2 gb of ram.

h2o’s rows are stored sparsely (column-compressed store), but the columns are dense today.
so if your sparsity is really high, the memory utilization could be higher for h2o than if the original representation is sparse in both dimensions.
[ note: this is something we’re working on improving. ]
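As a rough illustration of Tom's point, here is a back-of-the-envelope comparison of a fully dense columnar store versus a two-dimensionally sparse one. The row count, 1% density, and per-entry byte costs below are illustrative assumptions, not measurements of H2O's actual chunk encoding:

```python
# Back-of-the-envelope memory estimate: dense columnar storage (every cell
# materialized) vs. sparse storage (index + value per non-zero only).
# Shape, density, and byte costs are hypothetical.

def dense_bytes(rows, cols, bytes_per_value=8):
    # every cell occupies space, zero or not
    return rows * cols * bytes_per_value

def sparse_bytes(rows, cols, density, bytes_per_entry=12):
    # only non-zeros stored: ~4-byte column index + 8-byte value each
    nnz = int(rows * cols * density)
    return nnz * bytes_per_entry

rows, cols, density = 10_000, 30_000, 0.01   # assumed shape and sparsity
dense = dense_bytes(rows, cols)              # 2.4 GB
sparse = sparse_bytes(rows, cols, density)   # 36 MB
print(f"dense:  {dense / 1e9:.1f} GB")
print(f"sparse: {sparse / 1e6:.0f} MB")
print(f"blow-up factor: {dense / sparse:.0f}x")
```

Under these assumptions the dense representation is roughly 67x larger, which is consistent with a ~100 MB sparse frame densifying into multiple GB and overwhelming a 2 GB heap.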


thanks,
tom


--
You received this message because you are subscribed to the Google Groups "H2O Open Source Scalable Machine Learning - h2ostream" group.
To unsubscribe from this group and stop receiving emails from it, send an email to h2ostream+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

oben...@gmail.com

Aug 15, 2016, 4:03:50 PM
to H2O Open Source Scalable Machine Learning - h2ostream, oben...@gmail.com

It takes some 2,000 seconds, i.e., around 30 minutes, for the MNIST data set. For now I have no way to get more memory because I am using the Databricks Community Edition (that means one node with 6 GB RAM).

So do you think it is more a configuration problem than a programming problem on my side?

Thanks in advance

oben...@gmail.com

Aug 15, 2016, 4:06:58 PM
to H2O Open Source Scalable Machine Learning - h2ostream, avk...@gmail.com, oben...@gmail.com
Thanks, Tom, for the quick response. So there is nothing to do except configure more RAM?

Andrew Agrimson

Sep 8, 2016, 6:21:50 PM
to H2O Open Source Scalable Machine Learning - h2ostream
Hello,

I was about to post on this very topic when I saw this one.

I'm experiencing the same issue: it takes a long time to convert a Spark frame to an H2O frame. In fact, each time I've tried, I figured something must be wrong, so I killed the job.

I'm running this on an EMR cluster, so I have flexibility around RAM and the number of nodes. Are there any rules of thumb for minimizing the time the conversion takes? More nodes with less RAM, or fewer nodes with more RAM? The data frame I'm converting is only about 1 million rows, but it has about 4,000 columns and is sparse.

I'm in the beginning stages of experimentation but I was hoping somebody out there has some advice.

Thanks
Andy

Tom Kraljevic

Sep 8, 2016, 10:18:50 PM
to Andrew Agrimson, H2O Open Source Scalable Machine Learning - h2ostream

Hi Andy,


This is too vague to really give good advice on.
But if you can share a sample of the data, we could take a look at it.

The shape of the cluster and shape of the data matters a lot.

You could also read in just a sample (say 10,000 rows) and get the H2O log files that show the breakdown of the data after H2O parsing (so after the asH2OFrame) and share those.


Thanks,
Tom

Andrew Agrimson

Sep 10, 2016, 9:56:17 AM
to H2O Open Source Scalable Machine Learning - h2ostream, andrewa...@gmail.com
Hi Tom,

Thanks for the suggestion of sampling rows of the frame before converting to an H2O frame. I have some interesting findings which I will share below. I also experimented with decreasing the number of columns just to see how that impacted the time to convert. The h2o log is pasted at the bottom of the message for reference.

Here are the details of the EMR cluster I was using:

emr-4.7.1
12 nodes  
m3.xlarge EC2 instances (15 GB RAM)   

The full data set is nearly 1 million rows and 4,000 columns. I tested scenarios sampling 1%, 10% and 100% of rows combined with 10%, 25% and 50% of columns. What I found was that the number of rows made almost no difference: the time to convert was approximately the same whether I used a 1% sample, a 10% sample, or all rows. What makes a dramatic difference, though, is the number of columns included. Doubling the number of columns more than doubles the time it takes to convert. In my specific scenario, for example: ~250 columns: 100 s; ~500 columns: 200 s; ~1,000 columns: 500 s; ~2,000 columns: 2,000 s. You can see that the time to convert increases at a faster rate as I include more and more columns. If I had to guess, converting the full data set with all 4,000 columns might be in the neighborhood of 8,000-10,000 seconds, which is an awfully long time.
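The measurements above can be summarized with a simple power-law fit between successive points. The snippet below is just arithmetic on the reported numbers; the 4,000-column figure is a rough extrapolation from the steepest observed trend, not a measurement:

```python
import math

# (columns, seconds) as reported in the experiments above
measured = [(250, 100), (500, 200), (1000, 500), (2000, 2000)]

# local scaling exponent k between successive points: t2/t1 = (c2/c1)**k
for (c1, t1), (c2, t2) in zip(measured, measured[1:]):
    k = math.log(t2 / t1) / math.log(c2 / c1)
    print(f"{c1} -> {c2} columns: exponent ~ {k:.2f}")

# extrapolate to 4,000 columns using the steepest observed exponent (k = 2)
est_4000 = 2000 * (4000 / 2000) ** 2
print(f"estimated time at 4,000 columns: ~{est_4000:.0f} s")
```

The exponent grows from ~1 at 250 columns to ~2 at 2,000 columns, so the conversion is superlinear in width; extrapolating at the last exponent gives roughly 8,000 s for the full 4,000-column frame.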

Hopefully this is enough information to give you a better picture of what's going on here. Any advice would be very much appreciated.

Thanks
Andy


H2O log:

09-09 21:57:02.918 172.31.20.129:54321   5390   #r thread INFO: ----- H2O started  -----
09-09 21:57:02.954 172.31.20.129:54321   5390   #r thread INFO: Build git branch: rel-turchin
09-09 21:57:02.954 172.31.20.129:54321   5390   #r thread INFO: Build git hash: 6f38021186c3619da42f6ced9d62974fffbea702
09-09 21:57:02.954 172.31.20.129:54321   5390   #r thread INFO: Build git describe: jenkins-rel-turchin-6
09-09 21:57:02.954 172.31.20.129:54321   5390   #r thread INFO: Build project version: 3.8.2.6
09-09 21:57:02.954 172.31.20.129:54321   5390   #r thread INFO: Built by: 'jenkins'
09-09 21:57:02.954 172.31.20.129:54321   5390   #r thread INFO: Built on: '2016-05-24 10:55:46'
09-09 21:57:02.955 172.31.20.129:54321   5390   #r thread INFO: Java availableProcessors: 4
09-09 21:57:02.955 172.31.20.129:54321   5390   #r thread INFO: Java heap totalMemory: 9.97 GB
09-09 21:57:02.955 172.31.20.129:54321   5390   #r thread INFO: Java heap maxMemory: 9.97 GB
09-09 21:57:02.955 172.31.20.129:54321   5390   #r thread INFO: Java version: Java 1.7.0_111 (from Oracle Corporation)
09-09 21:57:02.955 172.31.20.129:54321   5390   #r thread INFO: JVM launch parameters: [-XX:OnOutOfMemoryError=kill %p, -Xms10240m, -Xmx10240m, -verbose:gc, -XX:+PrintGCDetails, -XX:+PrintGCDateStamps, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=70, -XX:MaxHeapFreeRatio=70, -XX:+CMSClassUnloadingEnabled, -XX:OnOutOfMemoryError=kill -9 %p, -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1473457111953_0001/container_1473457111953_0001_01_000002/tmp, -Dspark.driver.port=43445, -Dspark.history.ui.port=18080, -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1473457111953_0001/container_1473457111953_0001_01_000002, -XX:MaxPermSize=256m]
09-09 21:57:02.955 172.31.20.129:54321   5390   #r thread INFO: OS version: Linux 4.4.11-23.53.amzn1.x86_64 (amd64)
09-09 21:57:02.955 172.31.20.129:54321   5390   #r thread INFO: Machine physical memory: 14.69 GB
09-09 21:57:02.955 172.31.20.129:54321   5390   #r thread INFO: X-h2o-cluster-id: 1473458222225
09-09 21:57:02.955 172.31.20.129:54321   5390   #r thread INFO: User name: 'yarn'
09-09 21:57:02.955 172.31.20.129:54321   5390   #r thread INFO: Opted out of sending usage metrics.
09-09 21:57:02.955 172.31.20.129:54321   5390   #r thread INFO: Possible IP Address: eth0 (eth0), fe80:0:0:0:c22:a6ff:fe75:29ff%2
09-09 21:57:02.955 172.31.20.129:54321   5390   #r thread INFO: Possible IP Address: eth0 (eth0), 172.31.20.129
09-09 21:57:02.955 172.31.20.129:54321   5390   #r thread INFO: Possible IP Address: lo (lo), 0:0:0:0:0:0:0:1%1
09-09 21:57:02.956 172.31.20.129:54321   5390   #r thread INFO: Possible IP Address: lo (lo), 127.0.0.1
09-09 21:57:02.956 172.31.20.129:54321   5390   #r thread INFO: Internal communication uses port: 54322
09-09 21:57:02.956 172.31.20.129:54321   5390   #r thread INFO: Listening for HTTP and REST traffic on http://172.31.20.129:54321/
09-09 21:57:03.027 172.31.20.129:54321   5390   #r thread INFO: H2O cloud name: 'sparkling-water-hadoop_-1299197974' on ip-172-31-20-129.ec2.internal/172.31.20.129:54321, discovery address /234.123.119.43:60027
09-09 21:57:03.027 172.31.20.129:54321   5390   #r thread INFO: If you have trouble connecting, try SSH tunneling from your local machine (e.g., via port 55555):
09-09 21:57:03.027 172.31.20.129:54321   5390   #r thread INFO:   1. Open a terminal and run 'ssh -L 55555:localhost:54321 ya...@172.31.20.129'
09-09 21:57:03.027 172.31.20.129:54321   5390   #r thread INFO:   2. Point your browser to http://localhost:55555
09-09 21:57:03.027 172.31.20.129:54321   5390   #r thread INFO: Log dir: '/var/log/hadoop-yarn/containers/application_1473457111953_0001/container_1473457111953_0001_01_000002'
09-09 21:57:03.028 172.31.20.129:54321   5390   #r thread INFO: Cur dir: '/mnt/yarn/usercache/hadoop/appcache/application_1473457111953_0001/container_1473457111953_0001_01_000002'
09-09 21:57:03.042 172.31.20.129:54321   5390   #r thread INFO: Using HDFS configuration from /etc/hadoop/conf
09-09 21:57:03.042 172.31.20.129:54321   5390   #r thread INFO: HDFS subsystem successfully initialized
09-09 21:57:03.043 172.31.20.129:54321   5390   #r thread INFO: S3 subsystem successfully initialized
09-09 21:57:03.043 172.31.20.129:54321   5390   #r thread INFO: Flow dir: '/var/lib/hadoop-yarn/h2oflows'
09-09 21:57:03.060 172.31.20.129:54321   5390   #r thread INFO: Cloud of size 1 formed [ip-172-31-20-129.ec2.internal/172.31.20.129:54321]
09-09 21:57:03.062 172.31.20.129:54321   5390   #r thread INFO: Registered 0 extensions in: 529mS
09-09 21:57:03.489 172.31.20.129:54321   5390   #r thread INFO: Registered: 124 REST APIs in: 427mS
09-09 21:57:04.034 172.31.20.129:54321   5390   #r thread INFO: Registered: 203 schemas in: 545mS
09-09 21:57:08.779 172.31.20.129:54321   5390   FJ-126-7  INFO: Cloud of size 10 formed [ip-172-31-20-126.ec2.internal/172.31.20.126:54321, ip-172-31-20-127.ec2.internal/172.31.20.127:54321, ip-172-31-20-128.ec2.internal/172.31.20.128:54321, ip-172-31-20-129.ec2.internal/172.31.20.129:54321, ip-172-31-20-130.ec2.internal/172.31.20.130:54321, ip-172-31-20-131.ec2.internal/172.31.20.131:54321, ip-172-31-20-132.ec2.internal/172.31.20.132:54321, ip-172-31-20-134.ec2.internal/172.31.20.134:54321, ip-172-31-20-135.ec2.internal/172.31.20.135:54321, ip-172-31-20-136.ec2.internal/172.31.20.136:54321]
09-09 21:59:36.038 172.31.20.129:54321   5390   #orker-31 INFO: Locking cloud to new members, because water.TaskGetKey

Andrew Agrimson

Sep 15, 2016, 12:29:53 PM
to H2O Open Source Scalable Machine Learning - h2ostream, oben...@gmail.com
Hello,

I would like to follow on to this post with a little bit more information from my experimentation with different cluster shapes and sizes. 

The main issue I'm having is the large amount of time it takes to convert a very wide Spark data frame to an H2O frame. The full frame is about 4,000 columns wide. I'm running Sparkling Water 1.6.5 on EMR 4.7.1 with Spark 1.6.1 installed. I've tried various cluster configurations, from more nodes with less RAM (16 nodes, 15 GB each) to fewer nodes with a lot of RAM (2 nodes, 120 GB each) and everything in between, and it doesn't seem to make a large difference in the time to convert. I should also note that I haven't yet successfully loaded the full frame; the conversion times I refer to are for data frames with sampled columns.

I'm not sure which direction I should go at this point. Given that I don't understand the underlying process very well I feel like I'm fumbling around in the dark a bit.

Any advice on the matter would be greatly appreciated. 

Thanks
Andy

hzhang...@gmail.com

Oct 9, 2017, 2:43:16 PM
to H2O Open Source Scalable Machine Learning - h2ostream
I guess my response is late; it is already a year after the original post. Anyway, I am posting my experience of handling wide data with H2O Sparkling Water here.
I have a really wide CSV file with more than 50K columns. My experience is that reading the data directly into an H2O frame and working on that frame is OK. However, every time I converted it to a Spark data frame, things slowed down.

Here is something that might be useful for understanding why Spark RDDs have problems handling wide data: https://issues.apache.org/jira/browse/SPARK-18016. Somehow, H2O avoids this bottleneck.