Feb4 Meetup, sparkling water citibikes, ...

Dan Bikle

Feb 6, 2015, 4:20:00 AM

to h2os...@googlegroups.com

helloworld,

I'm working with the Feb4 sparkling-water citibike demo.

I started by getting the data from here:

http://www.civicdata.com/dataset/nyc-bike-share-trip-data

I got the 2013-09 csv and put it here:

/tmp/citi.csv

I inspected it with
head /tmp/citi.csv

I cloned the repo:

cd /tmp/
git clone g...@github.com:h2oai/sparkling-water.git
cd sparkling-water
git log -1
commit 635547f2245845e03e573b7868f29886e8a49599
Author: mmalohlava <michal.m...@gmail.com>
Date:   Thu Feb 5 19:23:26 2015 -0800

I did this:

cd examples/src/main/scala/org/apache/spark/examples/h2o/

sed -i '/michal/s:/Users/michal/Devel/projects/h2o/repos/h2o2/bigdata/laptop/citibike-nyc/2013-09.csv:/tmp/citi.csv:' CitiBikeSharingDemo.scala

cd /tmp/sparkling-water

export SPARK_HOME="/home/dan/spark"
export MASTER="local-cluster[3,2,1024]"

./gradlew build -x test
./gradlew assemble

bin/run-example.sh CitiBikeSharingDemo

It seemed to do okay for a while, and then I saw many errors.

For example:

02-06 09:03:35.933 192.168.1.95:54329 9088 #UDP-Recv ERRR: UDP Receiver error on port 54330
java.lang.ArrayIndexOutOfBoundsException: 70

and this:

Exception in thread "main" java.lang.RuntimeException: Cloud size under 3
    at water.H2O.waitForCloudSize(H2O.java:874)

It continued spitting out exception messages and then eventually died.

I tried this:

pkill java
bin/run-example.sh CitiBikeSharingDemo

and it seemed to behave a little better and then issued this:

Exception in thread "main" org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: 'start_station_id,'start_station_id, tree:
'Aggregate [Days#15,'start_station_id], [Days#15,'start_station_id,COUNT(1) AS bikes#16L]
 Subquery brdd
  SparkLogicalPlan (ExistingRdd [tripduration#0,starttime#1,stoptime#2,start station id#3,start station name#4,start station latitude#5,start station longitude#6,end station id#7,end station name#8,end station latitude#9,end station longitude#10,bikeid#11,usertype#12,birth year#13,gender#14,Days#15], H2OSchemaRDD[6] at H2OSchemaRDD at H2OContext.scala:219)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:80)

This time it did not die.
It hung, so I killed it with Ctrl-C.

If you have any clues on how to debug and run this demo,
please send them.

I'm on Ubuntu 14.
Here is how Spark 1.2.0 sees my setup:
Using Scala version 2.10.4
(Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_60-ea)

Here is the commit I'm working with:

commit 635547f2245845e03e573b7868f29886e8a49599
Author: mmalohlava <michal.m...@gmail.com>
Date:   Thu Feb 5 19:23:26 2015 -0800

Thanks,
Dan

Michal Malohlava

Feb 6, 2015, 1:59:21 PM

to h2os...@googlegroups.com

Hi Dan,

Development is still in progress...

On 2/6/15 at 1:20 AM, Dan Bikle wrote:

It seemed to do okay for a while, and then I saw many errors.

For example:

02-06 09:03:35.933 192.168.1.95:54329 9088 #UDP-Recv ERRR: UDP Receiver error on port 54330
java.lang.ArrayIndexOutOfBoundsException: 70

You are running multiple H2O instances on your machine.

and this:

Exception in thread "main" java.lang.RuntimeException: Cloud size under 3
    at water.H2O.waitForCloudSize(H2O.java:874)

H2O did not cloud up on top of Spark. Possible reasons: a Spark node was restarted, or we were
not able to determine the number of Spark executors.
You can try calling:

new H2OContext(sc).start(<actual number of Spark executors>)
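One way to find the number to pass, sketched under two assumptions: MASTER has the local-cluster[workers,cores,memoryMB] form from the original post, and each worker hosts one executor for the application. The variable name `workers` is only illustrative.

```shell
# Assumption: MASTER uses the local-cluster[workers,cores,memoryMB] form.
MASTER="local-cluster[3,2,1024]"

# Extract the first number inside the brackets (the worker count).
workers=$(printf '%s\n' "$MASTER" | sed -n 's/^local-cluster\[\([0-9][0-9]*\),.*/\1/p')

echo "$workers"
```

With the MASTER value above this prints 3, which would be the value to try in the start(...) call. For other master URL forms the sed expression matches nothing and `workers` stays empty.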

It continued spitting out exception messages and then eventually died.

I tried this:

pkill java
bin/run-example.sh CitiBikeSharingDemo

and it seemed to behave a little better and then issued this:

Exception in thread "main" org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: 'start_station_id,'start_station_id, tree:
'Aggregate [Days#15,'start_station_id], [Days#15,'start_station_id,COUNT(1) AS bikes#16L]
 Subquery brdd
  SparkLogicalPlan (ExistingRdd [tripduration#0,starttime#1,stoptime#2,start station id#3,start station name#4,start station latitude#5,start station longitude#6,end station id#7,end station name#8,end station latitude#9,end station longitude#10,bikeid#11,usertype#12,birth year#13,gender#14,Days#15], H2OSchemaRDD[6] at H2OSchemaRDD at H2OContext.scala:219)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:80)

This time it did not die.
It hung, so I killed it with Ctrl-C.

For the demo I used a trick and renamed all columns in the dataset, replacing ' ' with '_',
since Spark SQL cannot handle column names with spaces.
As I mentioned, we are working heavily on H2O as well as Sparkling Water, so I expect
that these bugs will be fixed soon.
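The column renaming described above could be done with sed, as in this minimal sketch. It assumes the column names are on the first line of the CSV. The helper name `underscore_header` and the sample row are made up for illustration; the column names are taken from the error message.

```shell
# Replace spaces with underscores in the header line only; data rows are untouched.
underscore_header() {
  sed '1s/ /_/g' "$1" > "$2"
}

# Tiny made-up sample using column names that appear in the error message.
sample=$(mktemp)
fixed=$(mktemp)
printf '%s\n' 'tripduration,start station id,birth year' '600,293,1980' > "$sample"

underscore_header "$sample" "$fixed"
head -n 1 "$fixed"
```

The last command prints tripduration,start_station_id,birth_year. Applied to /tmp/citi.csv, the same function would write a copy whose header matches the start_station_id name the query expects. Station names in the data rows keep their spaces, because only line 1 is rewritten.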

michal
