How do I read in partitioned parquet files using R or Flow?


J. de Jager

May 25, 2017, 10:49:01 AM
to H2O Open Source Scalable Machine Learning - h2ostream
I used SparkR to do my heavy lifting and wrote out partitioned parquet files to Hadoop. I'm now trying to read them back in using R/Flow, but I'm not having much success.

Below is the error I'm getting when using Flow:

DistributedException from ... , caused by java.lang.IllegalStateException: We only accept parser readers backed by a Vec (no streaming support!).

h2o version = 3.10.3.3
R version = 3.3.2
Running on : CentOS Linux 7 (Core)

Erin LeDell

May 25, 2017, 2:58:48 PM
to J. de Jager, H2O Open Source Scalable Machine Learning - h2ostream

Hi,

Can you post the code that you are using to try to read them in?


-- 
Erin LeDell Ph.D.
Statistician & Machine Learning Scientist | H2O.ai

Erin LeDell

May 25, 2017, 2:59:17 PM
to J. de Jager, H2O Open Source Scalable Machine Learning - h2ostream

I realize you are using Flow, but since it's easier to debug with code, let's make sure it works in R (or Python) first.

J. de Jager

May 26, 2017, 2:33:51 AM
to H2O Open Source Scalable Machine Learning - h2ostream, jdejage...@gmail.com
df = h2o.importFile("/path/to/folder/with/parquetfiles")

The error message in R is :

DistributedException from ... , caused by java.lang.IllegalStateException: We only accept parser readers backed by a Vec (no streaming support!).

I'm trying to import 201 partitioned files. My guess is that H2O thinks I'm trying to stream in data, or that it has a limit on how many small files you can import at once. Is there perhaps a workaround? What is the maximum number of files you can import at once?

Thanks for your help!

Erin LeDell

May 26, 2017, 3:06:38 PM
to J. de Jager, H2O Open Source Scalable Machine Learning - h2ostream, Michal Kurka

Thanks for the extra information.  I am cc-ing Michal Kurka who implemented parquet support and might be able to provide more help.

Best,
Erin

Erin LeDell

May 26, 2017, 3:09:24 PM
to J. de Jager, H2O Open Source Scalable Machine Learning - h2ostream, Michal Kurka

Another thought --

If you are trying to go from SparkR -> parquet in Hadoop -> H2O, then there might be a more efficient way to achieve your goal. Have you looked into the rsparkling package (a connector package on top of the sparklyr package)? https://github.com/h2oai/rsparkling This will allow you to go from Spark -> H2O directly (using R).

-Erin

Michal Kurka

May 26, 2017, 3:11:15 PM
to Erin LeDell, J. de Jager, H2O Open Source Scalable Machine Learning - h2ostream, Michal Kurka
Hi Erin, J,

Version 3.10.3.3 is quite old; it had a bug that caused this behavior when parsing more than 128 parquet files.

The latest version should not suffer from this issue. My advice is to upgrade. A workaround would be to parse the data in two batches, each smaller than 128 files, and row-bind the partial frames together.

Michal Kurka
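
A minimal sketch of the batching workaround described above, in R. The helper names split_into_batches and import_parquet_in_batches and the batch size of 120 are illustrative and not part of H2O. The sketch assumes a running cluster (h2o.init()), that `paths` is a character vector naming the individual parquet files, and that h2o.importFile accepts a character vector of paths; if it does not in your version, import each batch from its own folder instead.

```r
# Split a vector of file paths into consecutive batches of at most
# `batch_size` elements (kept below the 128-file threshold mentioned above).
split_into_batches <- function(paths, batch_size = 120) {
  stopifnot(batch_size >= 1)
  if (length(paths) == 0) return(list())
  unname(split(paths, ceiling(seq_along(paths) / batch_size)))
}

# Import each batch as its own H2OFrame, then row-bind the partial frames.
# Assumption: the h2o package is installed and h2o.init() has been called.
import_parquet_in_batches <- function(paths, batch_size = 120) {
  batches <- split_into_batches(paths, batch_size)
  frames <- lapply(batches, function(b) h2o::h2o.importFile(path = b))
  # Reduce returns the single frame unchanged when there is only one batch.
  Reduce(h2o::h2o.rbind, frames)
}
```

With 201 files and the default batch size, this yields two batches (120 and 81 files), so each parse stays under the limit. Upgrading H2O remains the simpler fix.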


J. de Jager

May 30, 2017, 9:58:44 AM
to H2O Open Source Scalable Machine Learning - h2ostream, jdejage...@gmail.com, mic...@h2o.ai
I prefer SparkR for data processing because of sparklyr's caching policy.

I have used rsparkling quite a bit. It's the only way I know to set up H2O to run on my company's cluster (4 Hadoop nodes with 128 cores and 512G each). I would actually prefer setting everything up without having to rely on Spark, but I'm not entirely sure how to do that from R.

J. de Jager

May 30, 2017, 10:00:08 AM
to H2O Open Source Scalable Machine Learning - h2ostream, er...@h2o.ai, jdejage...@gmail.com, mic...@h2o.ai, mic...@0xdata.com
Thanks a lot. I'm super excited to upgrade to 3.12. I tested AutoML yesterday and it's great!

Thanks a lot for open sourcing it, it's amazing! 