drRead.csv error on hdfs

Trenton Charles Pulsipher

Dec 18, 2015, 12:28:38 PM
to Ryan & Brandage Hafen, tesser...@googlegroups.com
I’m just starting to use a Hortonworks Hadoop system with RHIPE and datadr loaded onto it, and I’m now trying to read in data using drRead.csv(). Can you comment on or determine why I receive the following error? I have an 8 GB csv file that I moved to hdfs (using rhput()) that I’m trying to read into an hdfsConn; see my code below. I have set permissions for both the file and the hdfsConn to 777 using rhchmod(). I can’t figure out why it returns an error suggesting a localDiskConn be used. Thoughts?
Trenton


> rhput("/home/admtcp1/ABP/data/teaching_item_for_analysis.csv", "/user/admtcp1/ABP/data/teaching_item_for_analysis.csv")

> rhexists("/user/admtcp1/ABP/data/teaching_item_for_analysis.csv")

[1] TRUE


> hdfsTI = hdfsConn("/user/admtcp1/ABP/data/teachingItems/TI", autoYes = T, type = "text", reset = T)
> rhchmod("/user/admtcp1/ABP/data/teachingItems/TI", 777)

> hdfsTI

hdfsConn connection

  loc=/user/admtcp1/ABP/data/teachingItems/TI; type=text

> rhls("/user/admtcp1/ABP/data/teachingItems")

  permission   owner group size          modtime                                    file

1 drwxrwxrwx admtcp1  hdfs    0 2015-12-18 09:19 /user/admtcp1/ABP/data/teachingItems/TI

 

> f = drRead.csv("/user/admtcp1/ABP/data/teaching_item_for_analysis.csv",

+     output = hdfsTI, colClasses = "character", header = T)

Error in drRead.table(file = file, header = header, sep = sep, quote = quote,  : 

  output must be a localDiskConn object with input text file on disk


Ryan Hafen

Dec 18, 2015, 12:36:54 PM
to Trenton Charles Pulsipher, tesser...@googlegroups.com
hdfsTI should be your input.  Your output should be an empty hdfs connection.

Try something like:

f <- drRead.csv(hdfsTI, output = hdfsConn("/user/admtcp1/ABP/data/output"),
  colClasses = "character", header = T)

If you supply just a string as the input to drRead.csv, it will by default assume that the file lives on your local disk.
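
To make the distinction concrete, the two forms look roughly like this (the paths below are just placeholders, not your actual directories):

# input is a plain string -> datadr looks for the csv on the local filesystem,
# so the output has to be a localDiskConn
localDdf <- drRead.csv("/localDir/input/myfile.csv",
  output = localDiskConn("/localDir/output", autoYes = TRUE))

# input is an hdfsConn pointing at the directory that holds the csv -> read from HDFS,
# writing the resulting ddf to a separate, empty hdfsConn
inConn <- hdfsConn("/hdfsDir/input", type = "text", autoYes = TRUE)
hdfsDdf <- drRead.csv(inConn,
  output = hdfsConn("/hdfsDir/output", autoYes = TRUE))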

Trenton Charles Pulsipher

Dec 18, 2015, 12:39:40 PM
to Ryan Hafen, tesser...@googlegroups.com
So I guess I’m confused. hdfsTI is my output, but how do I point it to the csv file?
Trenton

Ryan Hafen

Dec 18, 2015, 12:43:13 PM
to Trenton Charles Pulsipher, tesser...@googlegroups.com
Sorry, maybe I’m confused.  It looks like you are copying your text file to hdfs at the location that hdfsTI points to.  If that’s the case, this is your input.  Your output will be a native R ddf that needs to go somewhere else on hdfs.  I apologize if I’m missing something.

Trenton Charles Pulsipher

Dec 18, 2015, 1:03:54 PM
to Ryan Hafen, tesser...@googlegroups.com
No, I think it’s me that’s confused. What would it normally look like? As I understand it, the csv file must reside in an hdfs folder (not a connection) so that drRead.csv() will output an hdfsConn. So what does that look like? Could you correct the following pseudo code?  Thanks for your help,
Trenton

hdfsConnInput = hdfsConn("/hdfsDir/input/", autoYes = T)
rhput("/localDir/textFile.txt", "/hdfsDir/input/textFile.txt")
hdfsConnOutput = hdfsConn("/hdfsDir/output")
out = drRead.table(hdfsConnInput, output = hdfsConnOutput)

Trenton Charles Pulsipher

Dec 18, 2015, 3:07:30 PM
to Ryan Hafen, tesser...@googlegroups.com

Here’s my code as it stands now. I saw Jeremiah’s comments on https://github.com/tesseradata/datadr/issues/73 and believe I’m implementing things as I should (believe is the key word).  Let me know if this helps at all. 

Trenton



library(Rhipe)
rhinit()
library(matlab)
library(datadr)
library(trelliscope)
library(ggplot2)
library(reshape2)
library(mapproj)

## create the HDFS connection ##
rhmkdir("/user/admtcp1/ABP/data/teachingItems", 777)
rhchmod("/user/admtcp1/ABP/data/teachingItems", 777)

# input #
hdfsTIcsv = hdfsConn("/user/admtcp1/ABP/data/teachingItems/TIcsv", autoYes = T, type = "text")

##  push teachingItems file to hdfs  ##
rhput("/home/admtcp1/ABP/data/teaching_item_for_analysis.csv",
    "/user/admtcp1/ABP/data/teachingItems/TIcsv/teaching_item_for_analysis.csv")
rhchmod("/user/admtcp1/ABP/data/teachingItems/TIcsv/teaching_item_for_analysis.csv", 777)

# output #
hdfsTI = hdfsConn("/user/admtcp1/ABP/data/teachingItems/TI", autoYes = T)

f = drRead.csv(hdfsTIcsv, output = hdfsConn("/user/admtcp1/ABP/data/teachingItems/TI", autoYes = T),
    colClasses = "character", header = T)

* Attempting to create directory... success
* Saving connection attributes
* Directory is empty... move some data in here
* testing read on a subset... ok
* Reading in existing 'ddo' attributes
Saving 11 parameters to /tmp/rhipe-temp-params-e4001e6269dfbac9d18bd89681c9bf23 (use rhclean to delete all temp files)
[Fri Dec 18 12:32:07 2015] Name:2015-12-18 12:32:06 Job: job_1447490888639_0124  State: PREP Duration: 0.277
URL: http://l13187.ldschurch.org:8088/proxy/application_1447490888639_0124/
       pct numtasks pending running complete killed failed_attempts killed_attempts
map      0        0       0       0        0      0               0               0
reduce   0        0       0       0        0      0               0               0
Waiting 5 seconds
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl,  : 
  java.io.FileNotFoundException: File does not exist: /user/admtcp1/ABP/data/teachingItems/TI
In addition: Warning message:
In Rhipe:::rhwatch.runner(job = job, mon.sec = mon.sec, readback = readback,  :
  Job failure, deleting output: /user/admtcp1/ABP/data/teachingItems/TI:

> traceback()
18: stop(list(message = "java.io.FileNotFoundException: File does not exist: /user/admtcp1/ABP/data/teachingItems/TI", 
        call = .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", 
            cl, .jcast(if (inherits(o, "jobjRef") || inherits(o, 
                "jarrayRef")) o else cl, "java/lang/Object"), .jnew("java/lang/String", 
                method), j_p, j_pc, use.true.class = TRUE, evalString = simplify, 
            evalArray = FALSE), jobj = <S4 object of class "jobjRef">))
17: .Call(RJavaCheckExceptions, silent)
16: .jcheck(silent = FALSE)
15: .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, 
        .jcast(if (inherits(o, "jobjRef") || inherits(o, "jarrayRef")) o else cl, 
            "java/lang/Object"), .jnew("java/lang/String", method), 
        j_p, j_pc, use.true.class = TRUE, evalString = simplify, 
        evalArray = FALSE)
14: .jrcall(x, name, ...)
13: rhoptions()$server$rhdel(folder)
12: rhdel(ofolder)
11: Rhipe:::rhwatch.runner(job = job, mon.sec = mon.sec, readback = readback, 
        ...)
10: rhwatch(setup = setup, map = map, reduce = reduce, input = rhfmt(locs, 
        type = types[1]), output = outFile, mapred = control$mapred, 
        readback = FALSE, parameters = params)
9: mrExecInternal.kvHDFSList(data, setup = setup, map = map, reduce = reduce, 
       output = output, control = control, params = params)
8: mrExecInternal(data, setup = setup, map = map, reduce = reduce, 
       output = output, control = control, params = params)
7: mrExec(ddo(file), map = map, reduce = 0, control = control, output = output, 
       overwrite = overwrite, params = c(params, parList), packages = packages)
6: inherits(conn, "ddo")
5: ddf(mrExec(ddo(file), map = map, reduce = 0, control = control, 
       output = output, overwrite = overwrite, params = c(params, 
           parList), packages = packages))
4: readTable.hdfsConn(file, rowsPerBlock, skip, header, hd, hdText, 
       readTabParams, postTransFn, output, overwrite, params, packages, 
       control)
3: readTable(file, rowsPerBlock, skip, header, hd, hdText, readTabParams, 
       postTransFn, output, overwrite, params, packages, control)
2: drRead.table(file = file, header = header, sep = sep, quote = quote, 
       dec = dec, fill = fill, comment.char = comment.char, ...)
1: drRead.csv(hdfsTIcsv, output = hdfsConn("/user/admtcp1/ABP/data/teachingItems/TI", 
       autoYes = T), colClasses = "character", header = T)

Trenton Charles Pulsipher

Dec 18, 2015, 5:33:42 PM
to Ryan Hafen, tesser...@googlegroups.com
Per the error posting guidelines, here is some additional information regarding the error I am experiencing:

Hadoop flavor is Hortonworks (I think HDP 2.3.2)
Attached is the testdump.rda file, generated by setting options(error = quote(dump.frames("testdump", TRUE))) before running the code.
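
In case it’s useful, the dump can be browsed with the standard R debugging workflow (nothing datadr-specific here):

load("testdump.rda")   # restores the 'testdump' object written by dump.frames(..., TRUE)
debugger(testdump)     # step through the call frames captured at the time of the error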

> library(Rhipe)
Loading required package: codetools
Loading required package: rJava
------------------------------------------------
| Please call rhinit() else RHIPE will not run |
------------------------------------------------
> rhinit()
Rhipe: Using Rhipe.jar file
Initializing Rhipe v0.75.1.4
2015-12-18 14:51:18,678 WARN  [main][NativeCodeLoader] Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-12-18 14:51:20,978 WARN  [main][DomainSocketFactory] The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
Initializing mapfile caches
> library(datadr)
> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server 7.1 (Maipo)

locale:
 [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                 
 [3] LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8       
 [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8      
 [7] LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8          
 [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8     
[11] LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] datadr_0.7.5.9   Rhipe_0.75.1.4   rJava_0.9-7      codetools_0.2-10

loaded via a namespace (and not attached):
 [1] assertthat_0.1   chron_2.3-47     data.table_1.9.6 DBI_0.3.1       
 [5] digest_0.6.8     dplyr_0.4.3      grid_3.1.3       hexbin_1.27.1   
 [9] lattice_0.20-33  magrittr_1.5     parallel_3.1.3   R6_2.1.1        
[13] Rcpp_0.12.1      tools_3.1.3     

testdump.rda

Ryan Hafen

Dec 19, 2015, 1:50:13 AM
to Trenton Charles Pulsipher, tesser...@googlegroups.com
Your code looks correct.  I think there may have been some Hadoop error.  Have you successfully run a RHIPE job on this cluster yet?  If not, you can run one with the following:

library(Rhipe)
rhinit()
library(testthat)
test_package("Rhipe", "simple")

This expects that you can write to /tmp on hdfs.
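
A quick way to check that from R, using the same RHIPE helpers as above (only run the chmod if you administer the cluster):

rhls("/")              # look at the permission bits on the /tmp entry
# rhchmod("/tmp", 777)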

If this doesn’t work, there’s a good chance the issue is not datadr but something with your configuration.  Let me know what the output of this job is and we can try to figure out what’s up with the config.




