Hi,
I previously posted an almost identical question
about an error when using the h2o.importFolder function with the pattern
parameter. That problem was resolved on my MacBook once I upgraded h2o to
version 3.6.0.8 (I had been using version 3.0.0.25) and supplied a correct
regular expression for the pattern.
However, I now get a different error when running the same code in a
Cloudera Quickstart VM (CentOS release 6.7). The dependent packages were
all updated to match the versions on the MacBook.
The following sample code reproduces the error in the VM:
library(DMwR)
library(rmr2)
library(h2o)
data(algae)
# upload data to hdfs
algae.rh <- to.dfs(keyval(NULL, algae))
# this gives output files hdfs://localhost.localdomain:8020/user/cloudera/test1/part-00000
# and hdfs://localhost.localdomain:8020/user/cloudera/test1/part-00001
a <- mapreduce(input = algae.rh,
               input.format = 'native',
               map = function(k, v) { return(keyval(v$season, v)) },
               output = "hdfs://localhost.localdomain:8020/user/cloudera/test1",
               output.format = make.output.format("csv", sep = ","))
h2oInstance <- h2o.init(ip = "localhost.localdomain", port = 54321, nthreads = -1)
# this throws an error in the Cloudera VM (CentOS) but works on
# a Mac (after changing the path)
b.h2o <- h2o.importFolder(path = "hdfs://localhost.localdomain:8020/user/cloudera/test1",
                          pattern = 'part-[:digit:]{5}')
# this correctly imports the data in part-00000
b.h2o <- h2o.importFolder(path = "hdfs://localhost.localdomain:8020/user/cloudera/test1/part-00000")
# but even this throws an error
b.h2o <- h2o.importFolder(path = "hdfs://localhost.localdomain:8020/user/cloudera/test1",
                          pattern = 'part-00000')
There is probably a simple solution, as the final line of the
error message says:
water.exceptions.H2OParseSetupException: Column separator mismatch.
One file seems to use " " and the other uses ",".
Given this, it seems to me that the h2o.importFolder function is not
filtering out the non-matching file names and is therefore trying to
import other files in the folder along with the csv files, which is
probably the cause of the error.
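As a quick sanity check of the regular expression itself, independent of H2O, a pattern can be tried against a list of file names in plain R. This is only a sketch under assumptions: the file list below is hypothetical (it assumes the job leaves something like a _SUCCESS marker next to the part files), and it exercises R's own regex engine, not the H2O server, so it cannot confirm how h2o.importFolder filters. The range [0-9] is used because it means the same thing in R and in Java regular expressions, which is not true of POSIX classes such as [:digit:].

```r
# Hypothetical folder listing; the real contents of test1 may differ.
files <- c("part-00000", "part-00001", "_SUCCESS")

# [0-9] is used instead of a POSIX class so the pattern reads the same
# in R and in Java-style regular expressions.
pattern <- "part-[0-9]{5}"

# Keep only the names that match the pattern.
matched <- files[grepl(pattern, files)]
print(matched)
```

If names other than the part files survive this filter, the pattern is at fault. If only the part files remain, the filtering on the H2O side would be the next thing to look at.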
I am appending the full error message and the sessionInfo() output below.
Any help or advice would be much appreciated.
Thank you.
Anand
Full error message:
ERROR: Unexpected HTTP Status code: 500 Server Error (url = http://localhost.localdomain:54321/3/ParseSetup)
java.lang.RuntimeException
[1] "water.MRTask.getResult(MRTask.java:505)"
[2] "water.MRTask.doAll(MRTask.java:399)"
[3] "water.parser.ParseSetup.guessSetup(ParseSetup.java:212)"
[4] "water.api.ParseSetupHandler.guessSetup(ParseSetupHandler.java:34)"
[5] "sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)"
[6] "sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)"
[7] "sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)"
[8] "java.lang.reflect.Method.invoke(Method.java:606)"
[9] "water.api.Handler.handle(Handler.java:64)"
[10] "water.api.RequestServer.handle(RequestServer.java:644)"
[11] "water.api.RequestServer.serve(RequestServer.java:585)"
[12] "water.JettyHTTPD$H2oDefaultServlet.doGeneric(JettyHTTPD.java:617)"
[13] "water.JettyHTTPD$H2oDefaultServlet.doPost(JettyHTTPD.java:565)"
[14] "javax.servlet.http.HttpServlet.service(HttpServlet.java:755)"
[15] "javax.servlet.http.HttpServlet.service(HttpServlet.java:848)"
[16] "org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)"
Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, :
water.DException$DistributedException: from /172.17.135.120:54321; by class water.parser.ParseSetup$GuessSetupTsk; class water.exceptions.H2OParseSetupException: Column separator mismatch. One file seems to use " " and the other uses ",".
sessionInfo() output:
R version 3.2.2 (2015-08-14)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS release 6.7 (Final)
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rJava_0.9-6 forecast_6.2 timeDate_3012.100 Rcpp_0.12.2 plyr_1.8.3
[6] functional_0.6 h2o_3.6.0.8 statmod_1.4.22 xts_0.9-7 zoo_1.7-11
[11] data.table_1.9.6
loaded via a namespace (and not attached):
[1] rmr2_3.0.0 colorspace_1.2-4 lattice_0.20-33 quadprog_1.5-5 tools_3.2.2 nnet_7.3-10
[7] parallel_3.2.2 grid_3.2.2 tseries_0.10-34 bitops_1.0-6 RCurl_1.95-4.7 fracdiff_1.4-2
[13] jsonlite_0.9.19 chron_2.3-45