Installing plyrmr and receiving error msg in mtcars tutorial

Ashok Kumar Harnal

Nov 11, 2013, 3:03:01 AM

to rha...@googlegroups.com

I installed plyrmr (successfully) but am unable to run mtcars tutorial.

A. This is how I updated rmr2:

As I had a version of rmr2 older than 2.3, I first updated rmr2 as follows:
# Download rmr2_2.3.0.tar.gz to a local directory, and install it as:
>install.packages("/home/ashokharnal/Downloads/rmr2_2.3.0.tar.gz", repos = NULL, type="source")

The update was successful, as mapreduce jobs ran without errors afterwards.

B. This is how I installed plyrmr

Step 1:
# First install devtools
install.packages("devtools")
require(devtools)

Step 2:
# As I am behind a proxy, the proxy server's address is needed for downloading from github
require(httr)
set_config(use_proxy(url="192.168.1.254", port=6588))
# Now install pryr
install_github("pryr", "hadley")

Step 3:
# Install two other needed dependencies
install.packages("R.methodsS3")
install.packages("hydroPSO")

Step 4: Finally install the downloaded plyrmr package.
# Download plyrmr_0.1.0.tar.gz and install it locally as:
install.packages("/home/ashokharnal/Downloads/plyrmr_0.1.0.tar.gz", repos = NULL, type="source")

Installation did not give any error message. My platform is CentOS 6.4.

C. The following are the steps taken to run mtcars tutorial:

>data(mtcars)

# to copy to hdfs system first write the data in a local file
> write.table(mtcars,"mtcars",quote=FALSE)

# Through a shell command, mtcars local file was written to /user/test/mtcars
# Here is the output of hdfs file:

> head(hdfs.cat("/user/test/mtcars"))
[1] "mpg cyl disp hp drat wt qsec vs am gear carb"
[2] "Mazda RX4 21 6 160 110 3.9 2.62 16.46 0 1 4 4"
[3] "Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4"
[4] "Datsun 710 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1"
[5] "Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1"
[6] "Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2"

# Next use plyrmr library and transform the file

> library(plyrmr)
> as.data.frame(transform(input("/user/test//mtcars"), carb.per.cyl = carb/cyl))

# The mapreduce output gives an error message. Its output is as below:

Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, :
     hadoop streaming failed with error code 1
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

DEPRECATED: Use of this script to execute hdfs command is deprecated.
    Instead use the hdfs command for it.

(Incidentally, other mapreduce jobs that do not use the plyrmr package run successfully,
and the following command gives the expected message:

> transform(input("/user/test/mtcars"), carb.per.cyl = carb/cyl)

[1] "Got it! To generate results call the functions output or as.data.frame on this object. Computation has been delayed at least in part." )

I will be grateful for help. Thanks.

Ashok Kumar Harnal

Nov 11, 2013, 3:31:24 AM

to rha...@googlegroups.com

Inadvertently in the line:

> as.data.frame(transform(input("/user/test//mtcars"), carb.per.cyl = carb/cyl))

there is a double forward slash in the path. But making it a single slash, as follows, gives the same error.

> as.data.frame(transform(input("/user/test/mtcars"), carb.per.cyl = carb/cyl))

Ashok Kumar Harnal

Antonio Piccolboni

Nov 11, 2013, 11:39:30 AM

to RHadoop Google Group

Could you report the messages in stderr of the failing process, per the debugging instructions? I know I sound like a broken record, but cutting and pasting what you see in the console doesn't really help unless you are running in standalone mode. Did you install plyrmr on all of your cluster's nodes? I suppose you know the drill from installing rmr2, but just to be sure. Thanks

Antonio


Ashok Kumar Harnal

Nov 11, 2013, 8:38:59 PM

to rha...@googlegroups.com

Dear Antonio

I am running Hadoop from CDH4. I have checked, and it is running in pseudo-distributed mode.
I am unable to attach the stderr file, so the outputs are pasted below:

stderr output from one attempt
___________________________________________________________________

Loading objects:
.Last
.Random.seed
binary
count
map
mydata
myformat
mylogit
Loading objects:
backend.parameters
combine
combine.file
combine.line
debug
default.input.format
default.output.format
in.folder
in.memory.combine
input.format
keyval.length
libs
map
Warning: namespace ‘plyrmr’ is not available and has been replaced
by .GlobalEnv when processing object ‘map’
map.file
map.line
out.folder
output.format
pkg.opts
preamble
profile.nodes
reduce
reduce.file
reduce.line
rmr.global.env
rmr.local.env
save.env
vectorized.reduce
verbose
work.dir
Loading required package: plyrmr
Warning in library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
there is no package called ‘plyrmr’
Warning in FUN(c("graphics", "grDevices", "utils", "datasets", "plyrmr", :
can't load plyrmr
Loading required package: hydroPSO
Warning in library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
there is no package called ‘hydroPSO’
Warning in FUN(c("graphics", "grDevices", "utils", "datasets", "plyrmr", :
can't load hydroPSO
Loading required package: R.methodsS3
Warning in library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
there is no package called ‘R.methodsS3’
Warning in FUN(c("graphics", "grDevices", "utils", "datasets", "plyrmr", :
can't load R.methodsS3
Loading required package: pryr
Warning in library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
there is no package called ‘pryr’
Warning in FUN(c("graphics", "grDevices", "utils", "datasets", "plyrmr", :
can't load pryr
Loading required package: rhdfs
Loading required package: methods
Loading required package: rJava
Error : .onLoad failed in loadNamespace() for 'rhdfs', details:
call: fun(libname, pkgname)
error: Environment variable HADOOP_CMD must be set before loading package rhdfs
Warning in FUN(c("graphics", "grDevices", "utils", "datasets", "plyrmr", :
can't load rhdfs
Loading required package: rmr2
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: bitops
Loading required package: digest
Loading required package: stringr
Loading required package: plyr

Attaching package: ‘plyr’

The following object is masked _by_ ‘.GlobalEnv’:

    count

Loading required package: reshape2
Error in map(keys(kv), values(kv)) : could not find function "safe.cbind"
Calls: <Anonymous> -> <Anonymous> -> as.keyval -> is.keyval -> map
Execution halted
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)

________________________________________________________________________

The stderr output from another attempt of the same job was identical to the one above.

The console output is as below:

> as.data.frame(transform(input("/user/test/mtcars"), carb.per.cyl = carb/cyl))

packageJobJar: [/tmp/RtmpeD4xQ1/rmr-local-env307310da66b6, /tmp/RtmpeD4xQ1/rmr-global-env307378551d85, /tmp/RtmpeD4xQ1/rmr-streaming-map30733d28e1c2, /tmp/hadoop-ashokharnal/hadoop-unjar1040037147401611408/] [] /tmp/streamjob7463591244593531676.jar tmpDir=null
13/11/12 06:54:06 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/11/12 06:54:06 INFO mapred.FileInputFormat: Total input paths to process : 1
13/11/12 06:54:06 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-ashokharnal/mapred/local]
13/11/12 06:54:06 INFO streaming.StreamJob: Running job: job_201311120553_0002
13/11/12 06:54:06 INFO streaming.StreamJob: To kill this job, run:
13/11/12 06:54:06 INFO streaming.StreamJob: /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=master:8021 -kill job_201311120553_0002
13/11/12 06:54:06 INFO streaming.StreamJob: Tracking URL: http://master:50030/jobdetails.jsp?jobid=job_201311120553_0002
13/11/12 06:54:07 INFO streaming.StreamJob: map 0% reduce 0%
13/11/12 06:54:56 INFO streaming.StreamJob: map 100% reduce 100%
13/11/12 06:54:56 INFO streaming.StreamJob: To kill this job, run:
13/11/12 06:54:56 INFO streaming.StreamJob: /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=master:8021 -kill job_201311120553_0002
13/11/12 06:54:56 INFO streaming.StreamJob: Tracking URL: http://master:50030/jobdetails.jsp?jobid=job_201311120553_0002
13/11/12 06:54:56 ERROR streaming.StreamJob: Job not successful. Error: NA
13/11/12 06:54:56 INFO streaming.StreamJob: killJob...
Streaming Command Failed!

Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, :
hadoop streaming failed with error code 1
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

rmr: DEPRECATED: Please use 'rm -r' instead.
Moved: 'hdfs://master:8020/tmp/RtmpeD4xQ1/file30731121949b' to trash at: hdfs://master:8020/user/ashokharnal/.Trash/Current

Thanks for help.

Ashok Kumar Harnal

Antonio Piccolboni

Nov 11, 2013, 8:50:52 PM

to RHadoop Google Group

It looks like plyrmr is not installed, and neither are some of the dependencies. Either you have multiple nodes and didn't finish the installation or you installed to a path that's not accessible to the account executing mapreduce jobs. The latter is more likely since you said you are working in pseudo distributed mode. Normally the mr account is different from the one people log in with, so it's possible you can load plyrmr interactively but Hadoop can't find it. Installing to a standard system location is the recommended solution.
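
As a rough illustration of the point about install location, here is a minimal base R sketch. The helper name in_system_library is hypothetical, and it assumes that "standard system location" means the libraries R reports as .Library or .Library.site, which may not match every setup.

```r
# Hypothetical helper: is `pkg` installed in a system-wide library
# (.Library / .Library.site) rather than a per-user one?
in_system_library <- function(pkg) {
  path <- system.file(package = pkg)
  if (!nzchar(path)) return(NA)  # package is not installed at all
  system_libs <- normalizePath(c(.Library, .Library.site), mustWork = FALSE)
  normalizePath(dirname(path)) %in% system_libs
}

print(.libPaths())                 # libraries this session searches, in order
print(in_system_library("stats"))  # a package that ships with R
```

Calling in_system_library("plyrmr") from the interactive session would show whether the package landed in a per-user library, which is the situation described above.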

Antonio

Ashok Kumar Harnal

Nov 12, 2013, 3:40:46 AM

to rha...@googlegroups.com

Dear Antonio,

I have reinstalled plyrmr from R running under my own user account. By "reinstalled" I mean
that I repeated the installation procedure for plyrmr without uninstalling anything first.

I have an .Rprofile file that gets loaded before I install plyrmr. The file is:

Sys.setenv(HADOOP_HOME="/usr/lib/hadoop")
Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
library(rmr2)
library(rhdfs)
hdfs.init()
#library(plyrmr)
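
A note on these settings: the stderr logs earlier in the thread show R on the task side reporting that HADOOP_CMD is unset, even though this .Rprofile sets it for the interactive session. A small guard along the following lines makes such a gap explicit before any library() call. This is only a sketch; require_env is a hypothetical helper and not part of rhdfs or rmr2.

```r
# Hypothetical helper: stop early if any named environment variable
# is unset or empty in the current R process.
require_env <- function(vars) {
  unset <- vars[!nzchar(Sys.getenv(vars))]
  if (length(unset) > 0) {
    stop("unset environment variable(s): ", paste(unset, collapse = ", "))
  }
  invisible(TRUE)
}

# Paths copied from the .Rprofile above
Sys.setenv(HADOOP_HOME = "/usr/lib/hadoop", HADOOP_CMD = "/usr/bin/hadoop")
require_env(c("HADOOP_HOME", "HADOOP_CMD"))
```

Sys.setenv only affects the R process that runs it, so a check like this says nothing about R processes started by Hadoop under another account.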

The installation steps are:

>install.packages("/home/ashokharnal/Downloads/rmr2_2.3.0.tar.gz", repos = NULL, type="source")

>install.packages("devtools")
>require(devtools)

>install_github("pryr", "hadley")

>install.packages("R.methodsS3")
>install.packages("hydroPSO")

>install.packages("/home/ashokharnal/Downloads/plyrmr_0.1.0.tar.gz", repos = NULL, type="source")

Installation did not give any error message, though installation of hydroPSO does give some warning messages.
Mapreduce not requiring plyrmr continues to run.

Thanks for help,

Ashok Kumar Harnal

Ashok Kumar Harnal

Nov 12, 2013, 3:44:06 AM

to rha...@googlegroups.com

Sorry, to complete the above message, I still get the same errors while running mtcars tutorial.
Thanks

Antonio Piccolboni

Nov 12, 2013, 12:26:55 PM

to RHadoop Google Group

From your previously submitted standard error log, it seems R can't find plyrmr. So as far as I understand these things, plyrmr could be written in Prolog and it wouldn't make a difference: R can't find it, so it can't load it, independent of its contents. As for repeating the installation steps, we need to change something; repetition alone will not help. If you have a single-node cluster (probably a secret, since you are not telling us), that eliminates one line of inquiry (did you forget a node?). The other is a user issue. Say you are installing as user analyst (just to make an example) but mapreduce jobs run as user "yarn". Since you did not install with superuser privileges, packages are under ~analyst/R/lib or some such. There is no way for R run by user yarn to see them, and hence no way for R as run in a mapreduce job. This is just a possibility; to figure this out, please enter at the R prompt

library(plyrmr)

system.file(package= "plyrmr")

and ask yourself if the returned directory is visible to every user.
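
The same check can be scripted. The following is a sketch in base R under stated assumptions: pkg_visibility is a hypothetical helper, and it only looks at the read and execute permission bits for "others" on the package directory itself, on a Unix-like system. Parent directories must also be traversable by the other account, which this does not verify.

```r
# Hypothetical helper: where is `pkg` installed, and does the package
# directory grant read + execute permission to "others"?
pkg_visibility <- function(pkg) {
  path <- system.file(package = pkg)
  if (!nzchar(path)) {
    return(list(installed = FALSE, path = NA_character_, world_readable = NA))
  }
  mode <- as.integer(file.info(path)$mode)
  # octal 005 = read (4) + execute (1) for "others"
  world <- bitwAnd(mode, 5L) == 5L
  list(installed = TRUE, path = path, world_readable = world)
}

str(pkg_visibility("stats"))
```

Running pkg_visibility("plyrmr") as the interactive user gives the directory to inspect; a path under a home directory is a hint that the account running mapreduce tasks may not reach it.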

Antonio

Ashok Kumar Harnal

Nov 13, 2013, 1:30:09 AM

to rha...@googlegroups.com

Dear Antonio,

Thanks for helping.

I have CDH4 installed on a single machine, working in pseudo-distributed mode. This is what I did:

a. Uninstalled R as root user. (To uninstall R, various R components had to be separately erased. And then rebooted machine.)
b. Re-installed R as root user
c. Re-installed RHadoop and plyrmr as root user
d. While in root, re-invoked R and executed mapreduce job that did not use plyrmr. The job executed normally but its stderr file reported ONE error. The stderr file output is as below:

stderr output when plyrmr is not in use but mapreduce functions normally:
----------------------------------------------------------------------------------------------------------------------------
Loading objects:
.Last
closingSum
map
myformat
openingSum

Loading objects:
backend.parameters
combine
combine.file
combine.line
debug
default.input.format
default.output.format
in.folder
in.memory.combine
input.format
keyval.length
libs
map

map.file
map.line
out.folder
output.format
pkg.opts
preamble
profile.nodes
reduce
reduce.file
reduce.line
rmr.global.env
rmr.local.env
save.env
vectorized.reduce
verbose
work.dir
Loading required package: plyrmr

Loading required package: Rcpp
Loading required package: rmr2
Loading required package: RJSONIO
Loading required package: methods

Loading required package: bitops
Loading required package: digest
Loading required package: stringr
Loading required package: plyr

Loading required package: reshape2
Loading required package: pryr
Loading required package: R.methodsS3
R.methodsS3 v1.5.2 (2013-10-06) successfully loaded. See ?R.methodsS3 for help.
Loading required package: hydroPSO
(C) 2011-2013 M. Zambrano-Bigiarini and R. Rojas (GPL >=2 license)
Type 'citation('hydroPSO')' to see how to cite this package

Attaching package: ‘plyrmr’

The following object is masked from ‘package:pryr’:

    where

The following object is masked from ‘package:rmr2’:

    gather

The following object is masked from ‘package:reshape2’:

    dcast

The following objects are masked from ‘package:plyr’:

    mutate, summarize

The following objects are masked from ‘package:base’:

    intersect, rbind, sample, union

Loading required package: rhdfs

Loading required package: rJava
Error : .onLoad failed in loadNamespace() for 'rhdfs', details:
call: fun(libname, pkgname)
error: Environment variable HADOOP_CMD must be set before loading package rhdfs

Warning in FUN(c("plyrmr", "hydroPSO", "R.methodsS3", "pryr", "rhdfs", "rJava", :
can't load rhdfs

----------------------------------------------------------------------------------------------------------------------------------------------------------

Then I ran transform job that would require plyrmr. The mapreduce job aborted with errors. The stderr message
listed TWO errors. These are as below:

stderr output when plyrmr is being used. mapreduce job terminated with error. There are now two error messages.
-----------------------------------------------------------------------------------------------------------------------------------------------------------

Loading objects:
.Last

Loading objects:
backend.parameters
combine
combine.file
combine.line
debug
default.input.format
default.output.format
in.folder
in.memory.combine
input.format
keyval.length
libs
map

map.file
map.line
out.folder
output.format
pkg.opts
preamble
profile.nodes
reduce
reduce.file
reduce.line
rmr.global.env
rmr.local.env
save.env
vectorized.reduce
verbose
work.dir
Loading required package: plyrmr

Loading required package: Rcpp
Loading required package: rmr2
Loading required package: RJSONIO
Loading required package: methods

Loading required package: bitops
Loading required package: digest
Loading required package: stringr
Loading required package: plyr

Loading required package: reshape2
Loading required package: pryr
Loading required package: R.methodsS3
R.methodsS3 v1.5.2 (2013-10-06) successfully loaded. See ?R.methodsS3 for help.
Loading required package: hydroPSO
(C) 2011-2013 M. Zambrano-Bigiarini and R. Rojas (GPL >=2 license)
Type 'citation('hydroPSO')' to see how to cite this package

Attaching package: ‘plyrmr’

The following object is masked from ‘package:pryr’:

    where

The following object is masked from ‘package:rmr2’:

    gather

The following object is masked from ‘package:reshape2’:

    dcast

The following objects are masked from ‘package:plyr’:

    mutate, summarize

The following objects are masked from ‘package:base’:

    intersect, rbind, sample, union

Loading required package: rhdfs

Loading required package: rJava
Error : .onLoad failed in loadNamespace() for 'rhdfs', details:
call: fun(libname, pkgname)
error: Environment variable HADOOP_CMD must be set before loading package rhdfs

Warning in FUN(c("plyrmr", "hydroPSO", "R.methodsS3", "pryr", "rhdfs", "rJava", :
can't load rhdfs
Error in eval(expr, envir, enclos) : object 'carb' not found
Calls: <Anonymous> ... transform.default -> transform.data.frame -> eval -> eval

Execution halted
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)

------------------------------------------------------------------------------------------------------------------------------------------------------------

Thanks for help.

Ashok Kumar Harnal

Antonio Piccolboni

Nov 13, 2013, 2:24:50 AM

to RHadoop Google Group

Great, now we've got the installation problems out of the way. I am not sure what you mean by the mtcars tutorial exactly, but if this is your code, I have some observations.

>data(mtcars)
# to copy to hdfs system first write the data in a local file
> write.table(mtcars,"mtcars",quote=FALSE)

So you are writing in csv format here.

# Through a shell command, mtcars local file was written to /user/test/mtcars
# Here is the output of hdfs file:
> head(hdfs.cat("/user/test/mtcars"))
[1] "mpg cyl disp hp drat wt qsec vs am gear carb"
[2] "Mazda RX4 21 6 160 110 3.9 2.62 16.46 0 1 4 4"
[3] "Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4"
[4] "Datsun 710 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1"
[5] "Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1"
[6] "Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2"

And in fact here it is in csv format

# Next use plyrmr library and transform the file
> library(plyrmr)
> as.data.frame(transform(input("/user/test//mtcars"), carb.per.cyl = carb/cyl))

Here you are not specifying the format, so plyrmr is assuming native format. It can't work. Try

input("/user/test/mtcars", format = "csv")

Keep the rest the same and see if it works. If you change the tutorial to perform your own experiments, that's great, but you've got to own up to the changes.
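
To see why the file layout matters, here is a base R sketch (no Hadoop needed) that recreates the local file from the first message and counts fields per line. The second half shows one possible way to write a rectangular, comma-separated file. The column name model and the comma separator are assumptions for illustration, not something the tutorial prescribes, and the reading side would still need a matching format specification.

```r
data(mtcars)

# Recreate the local file exactly as in the first message
raw <- tempfile()
write.table(mtcars, raw, quote = FALSE)

# The header line has 11 names, but each data line also carries an
# unquoted row name that may itself contain spaces, so counts differ.
fields <- count.fields(raw, sep = " ")
print(table(fields))

# One possible alternative: move row names into a real column and use
# a separator that does not occur in the data.
clean <- tempfile()
cars <- cbind(model = rownames(mtcars), mtcars)
write.table(cars, clean, sep = ",", quote = FALSE, row.names = FALSE)

clean_fields <- count.fields(clean, sep = ",")
print(table(clean_fields))
```

With the first file, a reader that splits on spaces cannot line up the values with the column names, independent of which format is requested on the Hadoop side.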

Antonio
