parse and combine multiple files (also Scala)


Jeremy Davis

Jan 26, 2014, 12:59:53 PM
to h2os...@googlegroups.com
Hello!
Currently I have Hadoop writing out several part files (all with the same header), which I import into a bucket in S3.
Is it possible to parse all the files in a bucket into a single hex file (or use a wildcard, etc.)?

Also, I'm super excited about the Scala interface. I notice, however, that a lot of the web links are broken.

-JD

Sri

Jan 26, 2014, 2:12:23 PM
to Jeremy Davis, h2os...@googlegroups.com
Jeremy,
Sure - multiple files in a single folder/S3 bucket (regex-controlled on the parse page) can be mapped to a single key in H2O.

Thanks for the Scala call-out - do you have time to connect on your use case? Are you doing something along the lines of Scalding?

Thanks,
Sri
--
You received this message because you are subscribed to the Google Groups "H2O Users - h2ostream" group.
To unsubscribe from this group and stop receiving emails from it, send an email to h2ostream+...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Tom Kraljevic

Jan 26, 2014, 2:35:25 PM
to Jeremy Davis, h2os...@googlegroups.com

Hi Jeremy,


Yes, it is possible.


What you'll need to do is configure H2O so it can read from S3 through the HDFS layer (i.e., S3N).
If you use our EC2 scripts to start instances, things will just work for you.

After that, use the Import HDFS Web UI menu item (with s3n:// instead of hdfs://).

Or you can start from the R snippet below, which I modified from one of our HDFS unit tests.

Let me know if you have more questions.


(Note: I tested this on the latest top-of-tree master. Depending on which release you have,
we may need to tweak the R code if you're running from R.)


Thanks,
Tom




S3N Setup:

The following is adapted from these files, which we use to set up AWS instances:
    h2o/ec2/h2o-cluster-distribute-aws-credentials.sh
    h2o/ec2/ami/start-h2o-bg.sh

Command line options:
    java -Xmx1g -jar h2o.jar -hdfs_config core-site.xml




core-site.xml config file:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<!--
<property>
  <name>fs.default.name</name>
</property>
-->

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>${AWS_ACCESS_KEY_ID}</value>
</property>

<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>${AWS_SECRET_ACCESS_KEY}</value>
</property>

</configuration>
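The ${AWS_ACCESS_KEY_ID} and ${AWS_SECRET_ACCESS_KEY} placeholders above are filled in from your environment before H2O reads the file (the EC2 credential-distribution script handles this for you). A minimal standalone sketch of that substitution, with hypothetical file names and example credentials:

```shell
# Hypothetical sketch, not the actual H2O script: write a core-site.xml
# template, then substitute the AWS credentials from the environment.
cat > core-site.xml.template <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>${AWS_ACCESS_KEY_ID}</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>${AWS_SECRET_ACCESS_KEY}</value>
  </property>
</configuration>
EOF

export AWS_ACCESS_KEY_ID="AKIAEXAMPLEKEY"        # your real key here
export AWS_SECRET_ACCESS_KEY="exampleSecretKey"  # your real secret here

# Replace the ${...} placeholders with the exported values.
sed -e "s|\${AWS_ACCESS_KEY_ID}|$AWS_ACCESS_KEY_ID|" \
    -e "s|\${AWS_SECRET_ACCESS_KEY}|$AWS_SECRET_ACCESS_KEY|" \
    core-site.xml.template > core-site.xml
```

The resulting core-site.xml is what you pass to H2O with -hdfs_config.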




R Code:

Check out this example from:  h2o/R/tests/testdir_hdfs/runit_s3n_basic.R

library(h2o)
# 'conn' is an existing H2O connection object, e.g. returned by h2o.init()
s3n_iris_dir <- "0xdata-public/examples/h2o/R/datasets"
url2 <- sprintf("s3n://%s", s3n_iris_dir)
irisdir.hex <- h2o.importHDFS(conn, url2)



Jeremy Davis

Jan 26, 2014, 3:12:15 PM
to h2os...@googlegroups.com, Jeremy Davis
Hi Sri,
As always, I'm just trying things out; no immediate use case, but yes, something like Scalding.
I'm hoping Scala's REPL and functional style will be a more natural fit for me (vs. R).

-JD

Jeremy Davis

Jan 26, 2014, 4:57:43 PM
to h2os...@googlegroups.com, Jeremy Davis
Tom,
Thanks, that works great!