Spark SQL DataFrame to H2OFrame


sunil v

Nov 6, 2015, 12:41:33 PM
to H2O Open Source Scalable Machine Learning - h2ostream

Hi,


I am trying to convert a Spark DataFrame to an H2OFrame, but I am getting the following exception. Shouldn't I be able to do an implicit conversion? How can I convert it to an H2OFrame?


Thanks.


scala> val bigDataFrame: H2OFrame = df
<console>:24: error: type mismatch;
 found   : org.apache.spark.sql.DataFrame
 required: org.apache.spark.h2o.H2OFrame
    (which expands to)  water.fvec.H2OFrame
       val bigDataFrame: H2OFrame = df
                                    ^


Michal Malohlava

Nov 6, 2015, 12:43:40 PM
to h2os...@googlegroups.com
Hi there,


On 11/6/15 9:41 AM, sunil v wrote:

Hi,


I am trying to convert a Spark DataFrame to an H2OFrame, but I am getting the following exception. Shouldn't I be able to do an implicit conversion? How can I convert it to an H2OFrame?

If you would like to use the expression `val bigDataFrame: H2OFrame = df`,
you have to do: `import h2oContext._`

or
you can simply call: `val bigDataFrame: H2OFrame = h2oContext.asH2OFrame(df)`
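For anyone curious why the import fixes the compile error: importing an object's members brings its implicit conversions into scope. A self-contained sketch of the mechanism, using stand-in types rather than the real Spark/Sparkling Water classes:

```scala
// Stand-ins for the real types; only the implicit-conversion mechanism is the point.
import scala.language.implicitConversions

case class DataFrame(numRows: Int)
case class H2OFrame(numRows: Int)

object h2oContext {
  // analogous to the implicit DataFrame -> H2OFrame conversion H2OContext provides
  implicit def asH2OFrameImplicit(df: DataFrame): H2OFrame = H2OFrame(df.numRows)
}

val df = DataFrame(100)

// without this import, `val bigDataFrame: H2OFrame = df` is a type mismatch,
// exactly like the error quoted above
import h2oContext._
val bigDataFrame: H2OFrame = df

println(bigDataFrame.numRows) // 100
```

The explicit `h2oContext.asH2OFrame(df)` form does the same work without relying on an implicit being in scope, which some codebases prefer for readability.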

Thank you!
michal




sunil v

Nov 6, 2015, 1:01:29 PM
to H2O Open Source Scalable Machine Learning - h2ostream, mic...@h2oai.com
Thanks Michal, I am importing h2oContext._ now.

Now I am getting this exception (I am trying it out with Spark 1.5.1):

java.lang.IllegalArgumentException: Unsupported type DecimalType(26,2)
    at org.apache.spark.h2o.H2OSchemaUtils$.dataTypeToVecType(H2OSchemaUtils.scala:119)
    at org.apache.spark.h2o.H2OContext$$anonfun$5.apply(H2OContext.scala:297)
    at org.apache.spark.h2o.H2OContext$$anonfun$5.apply(H2OContext.scala:295)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.immutable.Range.foreach(Range.scala:141)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.AbstractTraversable.map(Traversable.scala:105)
    at org.apache.spark.h2o.H2OContext$.toH2OFrame(H2OContext.scala:295)
    at org.apache.spark.h2o.H2OContext.asH2OFrame(H2OContext.scala:51)

sunil v

Nov 6, 2015, 4:59:49 PM
to H2O Open Source Scalable Machine Learning - h2ostream, mic...@h2oai.com
Looking at the code on GitHub, there is no case to handle DecimalType. I am not sure if this DataType is new to Spark 1.5.1.

def dataTypeToVecType(dt: DataType): Byte = dt match {
  case BinaryType    => Vec.T_NUM
  case ByteType      => Vec.T_NUM
  case ShortType     => Vec.T_NUM
  case IntegerType   => Vec.T_NUM
  case LongType      => Vec.T_NUM
  case FloatType     => Vec.T_NUM
  case DoubleType    => Vec.T_NUM
  case BooleanType   => Vec.T_NUM
  case TimestampType => Vec.T_TIME
  case StringType    => Vec.T_STR
  //case StructType => dt.
  case _ => throw new IllegalArgumentException(s"Unsupported type $dt")
}
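The shape of the fix is a single extra match arm. A self-contained sketch with stand-in types (the real fix would go in H2OSchemaUtils against Spark's actual DataType hierarchy and H2O's Vec constants):

```scala
// Minimal stand-ins for the Spark/H2O types, to illustrate the missing case only.
sealed trait DataType
case object DoubleType extends DataType
case object StringType extends DataType
case object BinaryType extends DataType
case class DecimalType(precision: Int, scale: Int) extends DataType

object Vec { val T_NUM: Byte = 3; val T_STR: Byte = 2 }

def dataTypeToVecType(dt: DataType): Byte = dt match {
  case DoubleType     => Vec.T_NUM
  case StringType     => Vec.T_STR
  case _: DecimalType => Vec.T_NUM // the missing arm: decimals map to a numeric vec
  case _              => throw new IllegalArgumentException(s"Unsupported type $dt")
}

println(dataTypeToVecType(DecimalType(26, 2))) // no longer throws
```

Note that `DecimalType` is a parameterized class rather than a case object, so it needs a `case _: DecimalType` type pattern, not a bare `case DecimalType`; mapping it to `T_NUM` loses the exact precision/scale, which is why casting to double (as below) is an equivalent workaround on the user side.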

Michal Malohlava

Nov 7, 2015, 7:05:54 PM
to sunil v, H2O Open Source Scalable Machine Learning - h2ostream
Sunil, thanks for that!

Great catch!

I will fix that in the next release of Sparkling Water.

Thank you!
Michal

sunil v

Nov 12, 2015, 12:43:02 AM
to H2O Open Source Scalable Machine Learning - h2ostream, suni...@gmail.com, mic...@h2oai.com
Thanks Michal. I am casting all DecimalTypes to double for now and this is working okay.

My code works fine in the Sparkling Shell. However, when I use spark-submit to run my app, I get the following exception (in H2OContext.asH2OFrame). H2OContext was initialized okay:
-----------------------------------
INFO H2OContext: Sparkling Water started, status of context:
Sparkling Water Context:
 * number of executors: 10
------------------------------------
org.apache.spark.SparkException: Job cancelled because SparkContext was shut down
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:703)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:702)
    at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
    at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:702)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1514)
    at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84)
    at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1438)
    at org.apache.spark.SparkContext$$anonfun$stop$7.apply$mcV$sp(SparkContext.scala:1724)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1185)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1723)
    at org.apache.spark.SparkContext$$anonfun$3.apply$mcV$sp(SparkContext.scala:587)
    at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
    at scala.util.Try$.apply(Try.scala:161)
    at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234)
    at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1835)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1912)
    at org.apache.spark.h2o.H2OContext$.toH2OFrame(H2OContext.scala:288)
    at org.apache.spark.h2o.H2OContext.asH2OFrame(H2OContext.scala:51)

sunil v

Nov 12, 2015, 1:17:03 PM
to H2O Open Source Scalable Machine Learning - h2ostream, suni...@gmail.com, mic...@h2oai.com
Found the issue in the h2o logs, so I wanted to update this post: there was not enough heap memory. I increased the driver memory and it's fine now.
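For readers hitting the same thing: driver memory has to be set at launch time, because the driver JVM is already running by the time application code reads SparkConf. A hypothetical spark-submit invocation (class name, jar, and sizes are illustrative, not from this thread):

```shell
# --driver-memory must be given at launch; setting spark.driver.memory inside
# the app is too late for the driver JVM in client mode.
spark-submit \
  --class com.example.MySparklingApp \
  --master yarn \
  --driver-memory 8g \
  --executor-memory 4g \
  my-sparkling-app.jar
```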

Hansu Gu

Jan 13, 2017, 6:34:16 PM
to H2O Open Source Scalable Machine Learning - h2ostream, suni...@gmail.com, mic...@h2oai.com
I would like to follow up on this issue although it's a year old. I've seen the exact same issue on Sparkling Water 2.0.3: DecimalType is not handled in SparkDataFrameConverter.scala. Would it be possible to add support for this?

Thanks,
Hansu

Avkash Chauhan

Jan 19, 2017, 1:51:20 AM
to H2O Open Source Scalable Machine Learning - h2ostream, suni...@gmail.com, mic...@h2oai.com
We have recently added DecimalType and DateType support to Sparkling Water. We expect the next release to include these changes very soon.

Thanks,
Avkash